April 2003 newsletter

The Character Sets Of Version 4.1

Peter Gulutzan and Alexander Barkov

The MySQL version 4.1 alpha, now available in binary form, should please users who have been wanting multiple, or just non-English, character sets. The big new features are "many character sets per database / per server / per table", "many collations (sort orders) per character set", and "Unicode".

MANY CHARACTER SETS PER DATABASE / PER SERVER / PER TABLE
With version 4.0, you certainly have a choice of character sets ... but once you've made the choice, you have to stick with it. For example, with version 4.0 you can't say "this database has character set X (by default), but Table1.column1 will have character set Y while Table1.column2 will have character set Z." With version 4.1 it's a doddle:

CREATE DATABASE d CHARACTER SET latin1;
...
CREATE TABLE Table1 (
column1 CHAR(5) CHARACTER SET latin2,
column2 VARCHAR(777) CHARACTER SET latin5);

In the example above, we've made a database with a "default" character set of latin1 (a character set popular in Western Europe). But then we overrode the default by saying that Table1.column1 would have values in latin2 (a character set popular in Eastern Europe). Meanwhile Table1.column2 will have values in latin5, a Turkish character set.

The system looks imposing because there are so many defaults: you can specify the default character set at the server level, the database level, the table level, the column level, or the connection level. But it's easy to summarize: you can associate a different character set with any database object, or you can arrange it so that most objects have the default character set and a few objects have whatever else you choose.
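
As a rough sketch of how the levels nest, the statements below set a default at the database, table, and column levels and then pick a character set for the connection. The names d2 and Table3 are hypothetical, and because 4.1 is still alpha, the exact syntax shown is an assumption that may change.

```sql
-- Hypothetical names (d2, Table3); syntax as of the 4.1 alpha and subject to change.

-- Database level: objects created in d2 default to latin1.
CREATE DATABASE d2 DEFAULT CHARACTER SET latin1;
USE d2;

-- Table level: Table3 overrides the database default with latin2.
CREATE TABLE Table3 (
    column1 CHAR(5),                        -- no clause: inherits latin2 from the table
    column2 CHAR(5) CHARACTER SET latin5    -- column level: overrides the table default
) DEFAULT CHARACTER SET latin2;

-- Connection level: tell the server which character set this client sends and expects.
SET NAMES latin1;

-- Check which character set each column ended up with.
SHOW CREATE TABLE Table3;
```

The most specific setting wins: a column clause beats the table default, which beats the database default, which beats the server default.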

MANY COLLATIONS (SORT ORDERS) PER CHARACTER SET
Even if two strings are in the same character set, they might have different rules for sorting. This can happen within a language (for example, a phone book might have different sorting rules than a dictionary); however, the main concern is with differences in sorting rules among languages. We'll call a set of sorting rules a "collation" because that's the SQL Standard term. For example, Swedish and English have different collations. You probably won't notice this unless you use accented characters, but here's an example:

(Swedish)   (English)
FRY         FRÜZ
FRÜZ        FRY
ZENDA       ÖTZI
ÖTZI        ZENDA

What's especially surprising is that the Swedish collation, not the English one, is MySQL's default! But not to worry. Collations, like character sets, can be changed in several places. Here's one way:

CREATE TABLE Table2
(column1 CHAR(5),
column2 CHAR(5) COLLATE latin1_general_ci);

In the above example, we've allowed the database and table to have defaults. So unless the server was built or started with a non-default character set (the configure option --with-charset, or the mysqld option --default-character-set), we'll have a Table2.column1 with the default character set (latin1) and the default collation for that character set (latin1_swedish_ci). Table2.column2, on the other hand, will have a non-default collation: latin1_general_ci.

(By the way, the "ci" at the end of the collation name means the collation is case insensitive: A and a are treated as equal. A "cs" at the end of a collation name means the collation is case sensitive.)

Every character set has at least one collation, and some character sets, like latin1, have several collations.
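
To see which collations a character set offers, and to try one without changing the table definition, something like the following should work. It is a sketch only: the table Words is hypothetical, the non-ASCII literals assume the client connection uses latin1, and it assumes latin1_general_ci gives the English-style ordering.

```sql
-- List the collations available for latin1.
SHOW COLLATION LIKE 'latin1%';

-- Show the collation that each column of Table2 actually received.
SHOW FULL COLUMNS FROM Table2;

-- Hypothetical table for comparing sort orders.
CREATE TABLE Words (word CHAR(5) CHARACTER SET latin1);
INSERT INTO Words VALUES ('FRY'), ('FRÜZ'), ('ZENDA'), ('ÖTZI');

-- Sort with the default (Swedish) rules ...
SELECT word FROM Words ORDER BY word COLLATE latin1_swedish_ci;

-- ... and with a different collation, chosen for this query only.
SELECT word FROM Words ORDER BY word COLLATE latin1_general_ci;
```

The COLLATE clause in ORDER BY affects only that query; the column's declared collation stays as it was.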

UNICODE
Of special interest, and as a result of huge demand, MySQL 4.1 supports two new "Unicode" character sets: ucs2 and utf8. Both have the same repertoire of characters (about 40,000 of them). The difference between them is that ucs2 is a fixed-width character set (always 16 bits per character), while utf8 is a variable-width character set (between 8 and 24 bits per character).
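
One way to see the width difference is to compare a string's length in characters with its length in bytes. The sketch below assumes the CONVERT ... USING syntax, and that CHAR_LENGTH counts characters while LENGTH counts bytes.

```sql
-- ucs2: every character occupies 2 bytes, so 'abc' is 3 characters and 6 bytes.
SELECT CHAR_LENGTH(CONVERT('abc' USING ucs2)) AS chars,
       LENGTH(CONVERT('abc' USING ucs2))      AS bytes;

-- utf8: plain ASCII letters occupy 1 byte each, so 'abc' is 3 characters and 3 bytes.
-- Accented or non-Latin characters would take 2 or 3 bytes each.
SELECT CHAR_LENGTH(CONVERT('abc' USING utf8)) AS chars,
       LENGTH(CONVERT('abc' USING utf8))      AS bytes;
```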

The point of having Unicode character sets is that, with such a large repertoire available, MySQL can support strings from pretty well any language, or from all languages together. As well as saving you a lot of fiddling with different character sets, Unicode promises to be a major factor for new developments in XML, in other computer languages, and in international Internet connectivity.

In the early alpha releases for Windows there is a problem with Unicode and other complex character sets, but we're working on it.

HOW THE FEATURES COMPARE TO STANDARD SQL AND TO OTHER DBMS PRODUCTS

The new character set and collation features match the ANSI/ISO SQL Standard specifications (non-core Features F461 and F691). The flexibility of MySQL's defaults, and the ability to specify multiple character sets and collations within a single database or table, put MySQL on a par with Oracle and SQL Server, and ahead of Sybase 12 or DB2 7. The addition of Unicode will mean that some host languages which use Unicode as a base character set (such as Java) will have an easier time interacting with MySQL.

AVAILABILITY
You can download a copy of MySQL 4.1 now, from www.mysql.com. You'll have to wait a short time before the new documentation appears: there's so much to say that it will take a while to incorporate into the manual. Since enhancements and bug fixes are still going on, you should expect that some character-set names and some syntax details may change without notice.
