About the Author

Jim Winstead is a lead web developer with MySQL AB, and lives in Los Angeles, California.

We are always looking for interesting articles about MySQL! Have you written something and would like to it published here? Please contact us via feedback form »

Unicode and Other Funny Characters

This is the second of our on-going series of articles that explain some the new features in MySQL 4.1, which as of this writing is in the gamma phase of its development cycle, on the way to a production-ready release in the near future.

By Jim Winstead

One of the major new features in MySQL 4.1 is strong Unicode support, along with support for specifying character sets at many different levels. This makes it much simpler to handle content in a wide range of languages in your applications, as well as making it possible to handle content in multi-byte character encodings that were not supported in earlier versions of MySQL.

Character Encodings and Unicode

A character encoding is a way of mapping a character (the letter 'A') to an integer in a character set (the number 65 in the US-ASCII character set). With something as limited as the characters in the US-ASCII character set (the twenty-six letters of the English alphabet, both lowercase and uppercase, numbers from 0 to 9, and some punctuation), fitting this into a single byte is not a problem. But once you start to create a character set for languages like German, Swedish, Hungarian, and Japanese, you start to either hit the boundaries of the 8-bit byte when you try to create a character set to represent two of the languages, or even a single language like Japanese.

So throughout the history of computing, a number of different character encodings have been specified for mapping different characters to integers. For character sets that wouldn't fit in a single byte, double-byte character sets created, and so were multi-byte character sets that use a special character to signal a shift between single-byte and double-byte encoding.

The Unicode Consortium came together to create a specification for a character encoding that would be able to encompass the characters in all written languages (although contrary to what you may have heard, that does not yet include Klingon). The result was the Unicode character set, and some encodings. The two most common (and the two that MySQL 4.1 supports) are UCS-2, which encodes everything as two-byte characters, and UTF-8, which uses a multi-byte encoding scheme that extends US-ASCII.

ISO-8859-1 is the most common character set used for Western languages, and it is extended by the Windows-1252 character set to include some other characters, such as the euro (€) and trademark symbol (™). Because Windows-1252 is a superset of ISO-8859-1, the character set is known as latin1 to MySQL, and there is no distinct ISO-8859-1 character set. This matches the common behavior in web applications, which often treat the two interchangeably.

So why not just use UCS-2 or UTF-8 for everything? Well, if you're already working with a lot of data in a particular encoding, like Big-5 (often used for Chinese), you can avoid the processing overhead of converting into and out of UTF-8 by just storing the data in Big-5 encoding. UTF-8 encoding also tends to be larger (byte-wise) than more specific encodings, because characters outside of the normal ASCII range take at least two bytes. The string "d�ja v�" is only seven bytes in ISO-8859-1, but nine in UTF-8. The characters in scripts such as Chinese, Japanese, and Korean are each three bytes in UTF-8, but can be represented as two bytes in more specific encodings such as Big-5.

Collations

Example of Sorting Differences
Language	Swedish:	z < ö
Language	German:	ö < z
Usage (German)	Dictionary:	öf < of
Usage (German)	Telephone:	of < öf

Collation. The process of ordering units of textual information. Collation is usually specific to a particular language.
— Unicode Glossary

Sorting strings is a common action to take, but on top of not everyone using the same characters, not everyone even sorts the same characters the same way! A collation is a defined way of sorting strings, and it is often language-dependent. While both Swedish and German generally use the ISO-8859-1 encoding (or latin1), there are characters that are sorted differently in the two languages (or actually two different ways for German), as the "Example of Sorting Differences" table shows.

The Unicode Collation Algorithm (UCA) is a general purpose algorithm for sorting Unicode strings. There is also the Default Unicode Collation Element Table (DUCET), which supplies a default ordering for all Unicode characters.

MySQL 4.1 implements this general-purpose algorithm, which makes creating specific collations very simple relative to the amount of coding that would have been necessary without support for the UCA. For example, here's the specification of the ucs2_lithuanian_ci collation (from strings/ctype-uca.c in the MySQL 4.1 source code).

static const char lithuanian[]=
    "& C << ch <<< Ch <<< CH< \\u010D <<< \\u010C"
    "& E << \\u0119 <<< \\u0118 << \\u0117 <<< \\u0116"
    "& I << y <<< Y"
    "& S  < \\u0161 <<< \\u0160"
    "& Z  < \\u017E <<< \\u017D";

What 4.0 Did

MySQL 4.0 (and earlier versions) only supported what amounted to a combined notion of the character set and collation with single-byte character encodings, which was specified at the server level. The default was latin1, which corresponds to a character set of latin1 and collation of latin1_swedish_ci in MySQL 4.1.

No information about the character set (and collation) of the data is stored in the table, and there is no support in MySQL 4.0 for converting between character sets. To convert the data to another character set, you would need to change the server's character set, and then use an external tool to convert the data.

A common strategy for applications built using MySQL 4.0 that needed to handle data in multi-byte character encodings (such as UTF-8) that were not supported natively was to simply store the data in VARCHAR fields (or TEXT, CHAR, etc). The application would either ignore the fact that MySQL would simply sort the data incorrectly for strings that weren't really in the character set that the server thought they were, or do the sorting in the application. That strategy worked fine, you just had to handle all of the encoding issues in your application, and you also lost the ability to take full advantage of MySQL's full-text searching.

What 4.1 Can Do

MySQL 4.1 has much stronger support for dealing with varying character sets and collations — all the way from the column level to the server default, and each connection can even specify which character set it wants to use in transferring data.

The "Determining the Default Character Set and Collation" section of the manual explains the gory details for setting the character set and collation for each possible level (server, database, table, column, connection, string literal).

The easiest thing to do is tell the server what character set you intend to work with when connecting using a SET NAMES ... statement, and make sure your server is configured with the defaults that make the most sense for your application. (Or perhaps set different databases to have different defaults.)

Even if you are using applications built against the MySQL 4.0 (or earlier) client libraries, you can still take advantage of the new character set support in MySQL 4.1. Because the earlier versions were basically blind to the encoding of the data passed between the server and client, you can set the connection to UTF-8 or use UTF-8 string literals to pass data to the server. (Or if you are using a single-byte character set that is a superset of US-ASCII, you can use that directly.)

Migrating from 4.0 to 4.1

If you have been using MySQL 4.0 as I outlined above, storing UTF-8 data in string columns regardless of the default server character set, one of the things you will want to do after upgrading to MySQL 4.1 is actually let the server know the true character set of those columns. But if you simply do an ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8, the server will try to convert the data in the myColumn column from the server default character set to UTF-8. You need to do a two-step conversion to avoid this:

 ALTER TABLE myTable MODIFY myColumn BINARY(255);
 ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;

If you have multiple columns in the same table, you should do each step for all of the columns at the same time, to avoid having the entire table rebuilt for each step of converting each column. The "Converting 4.0 Character Columns to 4.1 Format" section of the manual provides an example of that.

If you were already using a specific server character set with MySQL 4.0, you can see how those character sets were mapped to the new character sets and collations in MySQL 4.1 in the "4.0 Character Sets and Corresponding 4.1 Character Set/Collation Pairs" section of the manual.

Working with UTF-8 on the Web

Ignoring older (and badly implemented) browsers for a second, handling UTF-8 data on the web is quite simple. You just need to indicate in the header and/or body of your document the character set, like so (using PHP):

<?php header("Content-type: text/html; charset=utf-8");?>
<html>
 <head>
  <meta http-equiv="Content-type" value="text/html; charset=utf-8">
  ...

If your HTML page contains a form, browsers will generally send the results back in the character set of the page. So if your page is sent in UTF-8, you will (usually) get UTF-8 results back. The default encoding of HTML documents is ISO-8859-1, so by default you will get form data encoded as ISO-8859-1, with one big exception: some browsers (including Microsoft Internet Explorer and Apple Safari) will actually send the data encoded as Windows-1252, which extends ISO-8859-1 with some special symbols, like the euro (€) and the curly quotes (“”).

It's those "usually" and "ignoring older (and badly implemented) browsers" qualifiers that make it a little bit tricky: if you want to make sure to catch these edge cases, you'll need to do a little bit of extra work. One thing you can do is add a hidden field to your form containing some data is likely to be corrupted if the client isn't handling the character set correctly:

 <input type="hidden" name="charset_check" value="ä™®">

You can also verify that you have gotten valid UTF-8 content with this regular expression published by the W3C.

If the data is not valid UTF-8, or you already know that you are dealing with data in another character set that you want to convert into UTF-8, PHP supports a few different ways of converting the data:

mbstring extension: mb_convert_encoding(string, to, from)
iconv extension: iconv(from, to, string)
recode extension: recode_string(request, string)
built-in function: utf8_encode(string) (converts from ISO-8859-1 to UTF-8)

So handling input might look something like this:

<?php
$test  = $_REQUEST['charset_check']; /* our test field */
$field = $_REQUEST['field']; /* the data field */

if (bin2hex($test) == "c3a4e284a2c2ae") { /* UTF-8 for "ä™®" */
  /* Nothing to do: it's UTF-8! */
} elseif (bin2hex($test) == "e499ae") { /* Windows-1252 */
  $field = iconv("windows-1252", "utf-8", $field);
} else {
  die("Sorry, I didn't understand the character set of the data you sent!");
}

mysql_query("INSERT INTO table SET field = _utf8'" . addslashes($field) . "'")
  or die("INSERT failed: " . mysql_error());

Going Further

I've just shown you how to get the data into MySQL — now you can dig into the "Character Set Support" section of the manual to figure out how to make better use of the character set featuers. The support for character sets and internationalized data will grow in future versions of MySQL: more collations (such as the Persian (ucs2_persian_ci) collation that will appear in MySQL 4.1.5), making full-text search more aware of how to break up words in languages like Japanese, and potentially even more legacy character sets.

A Note on Metadata

As of MySQL 4.1, the server now treats its metadata using the UTF-8 encoding, so functions like CURRENT_USER() will be in the UTF-8 character set by default. This also means that you can create users, tables, and databases using the full range of characters available in Unicode, although there may be lingering bugs in this functionality due to mismatches between the file system's handling of non-ASCII characters in filenames.