This set of Frequently Asked Questions derives from the experience of MySQL's Support and Development groups in handling many inquiries about CJK (Chinese-Japanese-Korean) issues.
Questions
22.11.1: I've upgraded to MySQL 5.4. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
22.11.2: Why do I get Data truncated error messages?
22.11.3: 
        Why do some LIKE and
        FULLTEXT searches with CJK characters fail?
      
22.11.4: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
22.11.5: Why do Japanese character set conversions fail?
22.11.6: Why are my supplementary characters rejected by MySQL?
22.11.7: 
        What should I do if I want to convert SJIS
        81CA to cp932?
      
22.11.8: What problems should I be aware of when working with the Big5 Chinese character set?
22.11.9: 
        I have inserted CJK characters into my table. Why does
        SELECT display them as
        “?” characters?
      
22.11.10: 
        How does MySQL represent the Yen (¥) sign?
      
22.11.11: Why don't CJK strings sort correctly in Unicode? (I)
22.11.12: Does MySQL allow CJK characters to be used in database and table names?
22.11.13: Shouldn't it be “CJKV”?
22.11.14: 
        Do MySQL plan to make a separate character set where
        5C is the Yen sign, as at least one other
        major DBMS does?
      
22.11.15: What CJK character sets are available in MySQL?
22.11.16: Why don't CJK strings sort correctly in Unicode? (II)
22.11.17: Where can I get help with CJK and related issues in MySQL?
22.11.18: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
22.11.19: 
        How do I know whether character X is
        available in all character sets?
      
22.11.20: Of what issues should I be aware when working with Korean character sets in MySQL?
Questions and Answers
22.11.1: I've upgraded to MySQL 5.4. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
In MySQL Version 4.0, there was a single “global” character set for both server and client, and the decision as to which character to use was made by the server administrator. This changed starting with MySQL Version 4.1. What happens now is a “handshake”, as described in Section 9.1.4, “Connection Character Sets and Collations”:
When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the
character_set_client,character_set_results, andcharacter_set_connectionsystem variables. In effect, the server performs aSET NAMESoperation using the character set name.
        The effect of this is that you cannot control the client
        character set by starting mysqld with
        --character-set-server=utf8.
        However, some of our Asian customers have said that they prefer
        the MySQL 4.0 behavior. To make it possible to retain this
        behavior, we added a mysqld switch,
        --character-set-client-handshake,
        which can be turned off with
        --skip-character-set-client-handshake.
        If you start mysqld with
        --skip-character-set-client-handshake,
        then, when a client connects, it sends to the server the name of
        the character set that it wants to use — however,
        the server ignores this request from the
        client.
      
        By way of example, suppose that your favorite server character
        set is latin1 (unlikely in a CJK area, but
        this is the default value). Suppose further that the client uses
        utf8 because this is what the client's
        operating system supports. Now, start the server with
        latin1 as its default character set:
mysqld --character-set-server=latin1
        And then start the client with the default character set
        utf8:
mysql --default-character-set=utf8
        The current settings can be seen by viewing the output of
        SHOW VARIABLES:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
Now stop the client, and then stop the server using mysqladmin. Then start the server again, but this time tell it to skip the handshake like so:
mysqld --character-set-server=utf8 --skip-character-set-client-handshake
        Start the client with utf8 once again as the
        default character set, then display the current settings:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                                 |
| character_set_connection | latin1                                 |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | latin1                                 |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
        As you can see by comparing the differing results from
        SHOW VARIABLES, the server
        ignores the client's initial settings if the
        --skip-character-set-client-handshake
        is used.
      
22.11.2: Why do I get Data truncated error messages?
        For illustration, we'll create a table with one Unicode
        (ucs2) column and one Chinese
        (gb2312) column.
mysql>CREATE TABLE ch->(ucs2 CHAR(3) CHARACTER SET ucs2,->gb2312 CHAR(3) CHARACTER SET gb2312);Query OK, 0 rows affected (0.05 sec)
        We'll try to place the rare character 汌 in
        both columns.
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec)
Ah, there is a warning. Use the following statement to see what it is:
mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------+
| Level   | Code | Message                                     |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)
        So it is a warning about the gb2312 column
        only.
mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch; +-------+--------------+--------+-------------+ | ucs2 | HEX(ucs2) | gb2312 | HEX(gb2312) | +-------+--------------+--------+-------------+ | A汌B | 00416C4C0042 | A?B | 413F42 | +-------+--------------+--------+-------------+ 1 row in set (0.00 sec)
There are several things that need explanation here.
The fact that it is a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.
              The 汌 character isn't in the
              gb2312 character set. We described that
              problem earlier.
            
Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (See Bug#9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.
              With SQL_MODE=TRADITIONAL, there would
              be an error message, but instead of error 2165 you would
              see: ERROR 1406 (22001): Data too long for column
              'gb2312' at row 1.
            
22.11.3: 
        Why do some LIKE and
        FULLTEXT searches with CJK characters fail?
      
        There is a very simple problem with
        LIKE searches on
        BINARY and
        BLOB columns: we need to know the
        end of a character. With multi-byte character sets, different
        characters might have different octet lengths. For example, in
        utf8, A requires one byte
        but ペ requires three bytes, as shown here:
+-------------------------+---------------------------+ | OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') | +-------------------------+---------------------------+ | 1 | 3 | +-------------------------+---------------------------+ 1 row in set (0.00 sec)
        If we don't know where the first character ends, then we don't
        know where the second character begins, in which case even very
        simple searches such as LIKE
        '_A%' fail. The solution is to use a regular CJK
        character set in the first place, or to convert to a CJK
        character set before comparing.
      
This is one reason why MySQL cannot allow encodings of nonexistent characters. If it is not strict about rejecting bad input, then it has no way of knowing where characters end.
        For FULLTEXT searches, we need to know where
        words begin and end. With Western languages, this is rarely a
        problem because most (if not all) of these use an
        easy-to-identify word boundary — the space character.
        However, this is not usually the case with Asian writing. We
        could use arbitrary halfway measures, like assuming that all Han
        characters represent words, or (for Japanese) depending on
        changes from Katakana to Hiragana due to grammatical endings.
        However, the only sure solution requires a comprehensive word
        list, which means that we would have to include a dictionary in
        the server for each Asian language supported. This is simply not
        feasible.
      
22.11.4: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
        Obtain a direct connection to the server using the
        mysql client (Windows:
        mysql.exe), and try the same query there. If
        mysql responds correctly, then the trouble
        may be that your application interface requires initialization.
        Use mysql to tell you what character set or
        sets it uses with the statement SHOW VARIABLES LIKE
        'char%';. If you are using Access, then you are most
        likely connecting with MyODBC. In this case, you should check
        Section 20.1.4, “Connector/ODBC Configuration”. If, for
        instance, you use big5, you would enter
        SET NAMES 'big5'. (Note that no
        ; is required in this case). If you are using
        ASP, you might need to add SET NAMES in the
        code. Here is an example that has worked in the past:
<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=server;uid=username;" \
               & "pwd=password;database=database;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject("ADODB.Connection")
Conn.Open strConnection
%>
        In much the same way, if you are using any character set other
        than latin1 with Connector/NET, then you must
        specify the character set in the connection string. See
        Section 20.2.5.1, “Connecting to MySQL Using Connector/NET”, for more
        information.
      
If you are using PHP, try this:
<?php
  $link = mysql_connect($host, $usr, $pwd);
  mysql_select_db($db);
  if( mysql_error() ) { print "Database ERROR: " . mysql_error(); }
  mysql_query("SET NAMES 'utf8'", $link);
?>
        In this case, we used SET NAMES to change
        character_set_client and
        character_set_connection and
        character_set_results.
      
        We encourage the use of the newer mysqli
        extension, rather than mysql. Using
        mysqli, the previous example could be
        rewritten as shown here:
<?php
  $link = new mysqli($host, $usr, $pwd, $db);
  if( mysqli_connect_errno() )
  {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
  }
  $link->query("SET NAMES 'utf8'");
?>
        Another issue often encountered in PHP applications has to do
        with assumptions made by the browser. Sometimes adding or
        changing a <meta> tag suffices to
        correct the problem: for example, to insure that the user agent
        interprets page content as UTF-8, you should
        include <meta http-equiv="Content-Type"
        content="text/html; charset=utf-8"> in the
        <head> of the HTML page.
      
If you are using Connector/J, see Section 20.3.4.4, “Using Character Sets and Unicode”.
22.11.5: Why do Japanese character set conversions fail?
        MySQL supports the sjis,
        ujis, cp932, and
        eucjpms character sets, as well as Unicode. A
        common need is to convert between character sets. For example,
        there might be a Unix server (typically with
        sjis or ujis) and a
        Windows client (typically with cp932).
      
        In the following conversion table, the ucs2
        column represents the source, and the sjis,
        cp932, ujis, and
        eucjpms columns represent the destinations
        — that is, the last 4 columns provide the hexadecimal
        result when we use CONVERT(ucs2)
        or we assign a ucs2 column containing the
        value to an sjis, cp932,
        ujis, or eucjpms column.
        
| Character Name | ucs2 | sjis | cp932 | ujis | eucjpms | 
|---|---|---|---|---|---|
| BROKEN BAR | 00A6 | 3F | 3F | 8FA2C3 | 3F | 
| FULLWIDTH BROKEN BAR | FFE4 | 3F | FA55 | 3F | 8FA2 | 
| YEN SIGN | 00A5 | 3F | 3F | 20 | 3F | 
| FULLWIDTH YEN SIGN | FFE5 | 818F | 818F | A1EF | 3F | 
| TILDE | 007E | 7E | 7E | 7E | 7E | 
| OVERLINE | 203E | 3F | 3F | 20 | 3F | 
| HORIZONTAL BAR | 2015 | 815C | 815C | A1BD | A1BD | 
| EM DASH | 2014 | 3F | 3F | 3F | 3F | 
| REVERSE SOLIDUS | 005C | 815F | 5C | 5C | 5C | 
| FULLWIDTH "" | FF3C | 3F | 815F | 3F | A1C0 | 
| WAVE DASH | 301C | 8160 | 3F | A1C1 | 3F | 
| FULLWIDTH TILDE | FF5E | 3F | 8160 | 3F | A1C1 | 
| DOUBLE VERTICAL LINE | 2016 | 8161 | 3F | A1C2 | 3F | 
| PARALLEL TO | 2225 | 3F | 8161 | 3F | A1C2 | 
| MINUS SIGN | 2212 | 817C | 3F | A1DD | 3F | 
| FULLWIDTH HYPHEN-MINUS | FF0D | 3F | 817C | 3F | A1DD | 
| CENT SIGN | 00A2 | 8191 | 3F | A1F1 | 3F | 
| FULLWIDTH CENT SIGN | FFE0 | 3F | 8191 | 3F | A1F1 | 
| POUND SIGN | 00A3 | 8192 | 3F | A1F2 | 3F | 
| FULLWIDTH POUND SIGN | FFE1 | 3F | 8192 | 3F | A1F2 | 
| NOT SIGN | 00AC | 81CA | 3F | A2CC | 3F | 
| FULLWIDTH NOT SIGN | FFE2 | 3F | 81CA | 3F | A2CC | 
Now consider the following portion of the table.
| ucs2 | sjis | cp932 | |
|---|---|---|---|
| NOT SIGN | 00AC | 81CA | 3F | 
| FULLWIDTH NOT SIGN | FFE2 | 3F | 81CA | 
        This means that MySQL converts the NOT SIGN
        (Unicode U+00AC) to sjis
        code point 0x81CA and to
        cp932 code point 3F.
        (3F is the question mark (“?”)
        — this is what is always used when the conversion cannot
        be performed.
      
22.11.6: Why are my supplementary characters rejected by MySQL?
        Before MySQL 6.0.4, MySQL does not support supplementary
        characters — that is, characters which need more than 3
        bytes — for UTF-8. We support only what
        Unicode calls the Basic Multilingual Plane / Plane
        0. Only a few very rare Han characters are
        supplementary; support for them is uncommon. This has led to
        reports such as that found in Bug#12600, which we rejected as
        “not a bug”. With utf8, we must
        truncate an input string when we encounter bytes that we don't
        understand. Otherwise, we wouldn't know how long the bad
        multi-byte character is.
      
        One possible workaround is to use ucs2
        instead of utf8, in which case the
        “bad” characters are changed to question marks;
        however, no truncation takes place. You can also change the data
        type to BLOB or
        BINARY, which perform no validity
        checking.
      
        As of MySQL 6.0.4, Unicode support is extended to include
        supplementary characters by means of additional Unicode
        character sets: utf16,
        utf32, and 4-byte utf8.
        These character sets support supplementary Unicode characters
        outside the Basic Multilingual Plane (BMP).
      
22.11.7: 
        What should I do if I want to convert SJIS
        81CA to cp932?
      
        Our answer is: “?”. There are serious complaints
        about this: many people would prefer a “loose”
        conversion, so that 81CA (NOT SIGN) in
        sjis becomes 81CA (FULLWIDTH NOT
        SIGN) in cp932. We are considering
        a change to this behavior.
      
22.11.8: What problems should I be aware of when working with the Big5 Chinese character set?
        MySQL supports the Big5 character set which is common in Hong
        Kong and Taiwan (Republic of China). MySQL's
        big5 is in reality Microsoft code page 950,
        which is very similar to the original big5
        character set. We changed to this
        
        character set starting with MySQL version 4.1.16 / 5.0.16 (as a
        result of Bug#12476). For example, the following statements work
        in current versions of MySQL, but not in old versions:
mysql>CREATE TABLE big5 (BIG5 CHAR(1) CHARACTER SET BIG5);Query OK, 0 rows affected (0.13 sec) mysql>INSERT INTO big5 VALUES (0xf9dc);Query OK, 1 row affected (0.00 sec) mysql>SELECT * FROM big5;+------+ | big5 | +------+ | 嫺 | +------+ 1 row in set (0.02 sec)
        A feature request for adding HKSCS extensions
        has been filed. People who need this extension may find the
        suggested patch for Bug#13577 to be of interest.
      
22.11.9: 
        I have inserted CJK characters into my table. Why does
        SELECT display them as
        “?” characters?
      
This problem is usually due to a setting in MySQL that doesn't match the settings for the application program or the operating system. Here are some common steps for correcting these types of issues:
Be certain of what MySQL version you are using.
            Use the statement SELECT VERSION(); to
            determine this.
          
Make sure that the database is actually using the desired character set.
            People often think that the client character set is always
            the same as either the server character set or the character
            set used for display purposes. However, both of these are
            false assumptions. You can make sure by checking the result
            of SHOW CREATE TABLE
             or —
            better — yet by using this statement:
tablename
SELECT character_set_name, collation_name
    FROM information_schema.columns
    WHERE table_schema = your_database_name
        AND table_name = your_table_name
        AND column_name = your_column_name;
Determine the hexadecimal value of the character or characters that are not being displayed correctly.
            You can obtain this information for a column
            column_name in the table
            table_name using the following
            query:
          
SELECT HEX(column_name) FROMtable_name;
            3F is the encoding for the
            ? character; this means that
            ? is the character actually stored in the
            column. This most often happens because of a problem
            converting a particular character from your client character
            set to the target character set.
          
            Make sure that a round trip possible — that
            is, when you select literal (or
            _introducer hexadecimal-value),
            you obtain literal as a
            result.
          
            For example, the Japanese
            Katakana character
            Pe (ペ')
            exists in all CJK character sets, and has the code point
            value (hexadecimal coding) 0x30da. To
            test a round trip for this character, use this query:
SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
            If the result is not also ペ, then the
            round trip has failed.
          
            For bug reports regarding such failures, we might ask you to
            follow up with SELECT HEX('ペ');. Then
            we can determine whether the client encoding is correct.
          
Make sure that the problem is not with the browser or other application, rather than with MySQL.
Use the mysql client program (on Windows: mysql.exe) to accomplish this task. If mysql displays correctly but your application doesn't, then your problem is probably due to system settings.
            To find out what your settings are, use the
            SHOW VARIABLES statement,
            whose output should resemble what is shown here:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.03 sec)
            These are typical character-set settings for an
            international-oriented client (notice the use of
            utf8 Unicode) connected to a server in
            the West (latin1 is a West Europe
            character set and a default for MySQL).
          
            Although Unicode (usually the utf8
            variant on Unix, and the ucs2 variant on
            Windows) is preferable to Latin, it is often not what your
            operating system utilities support best. Many Windows users
            find that a Microsoft character set, such as
            cp932 for Japanese Windows, is suitable.
          
            If you cannot control the server settings, and you have no
            idea what your underlying computer is, then try changing to
            a common character set for the country that you're in
            (euckr = Korea; gb2312
            or gbk = People's Republic of China;
            big5 = Taiwan; sjis,
            ujis, cp932, or
            eucjpms = Japan; ucs2
            or utf8 = anywhere). Usually it is
            necessary to change only the client and connection and
            results settings. There is a simple statement which changes
            all three at once: SET NAMES. For
            example:
          
SET NAMES 'big5';
            Once the setting is correct, you can make it permanent by
            editing my.cnf or
            my.ini. For example you might add lines
            looking like these:
[mysqld] character-set-server=big5 [client] default-character-set=big5
It is also possible that there are issues with the API configuration setting being used in your application; see Why does my GUI front end or browser not display CJK characters correctly...? for more information.
22.11.10: 
        How does MySQL represent the Yen (¥) sign?
      
        A problem arises because some versions of Japanese character
        sets (both sjis and euc)
        treat 5C as a reverse
        solidus (\ — also known as
        a backslash), and others treat it as a yen sign
        (¥).
      
        MySQL follows only one version of the JIS (Japanese Industrial
        Standards) standard description. In MySQL,
        5C is always the reverse solidus
        (\).
      
22.11.11: Why don't CJK strings sort correctly in Unicode? (I)
        Sometimes people observe that the result of a
        utf8_unicode_ci or
        ucs2_unicode_ci search, or of an
        ORDER BY sort is not what they think a native
        would expect. Although we never rule out the possibility that
        there is a bug, we have found in the past that many people do
        not read correctly the standard table of weights for the Unicode
        Collation Algorithm. MySQL uses the table found at
        http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
        This is not the first table you will find by navigating from the
        unicode.org home page, because MySQL uses the
        older 4.0.0 “allkeys” table, rather than the more
        recent 4.1.0 table. This is because we are very wary about
        changing ordering which affects indexes, lest we bring about
        situations such as that reported in Bug#16526, illustrated as
        follows:
mysql<CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);Query OK, 0 rows affected (0.05 sec) mysql>INSERT INTO tj VALUES ('が'),('か');Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM tj WHERE s1 = 'か';+------+ | s1 | +------+ | が | | か | +------+ 2 rows in set (0.00 sec)
        The character in the first result row is not the one that we
        searched for. Why did MySQL retrieve it? First we look for the
        Unicode code point value, which is possible by reading the
        hexadecimal number for the ucs2 version of
        the characters:
mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1   | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が  | 304C                        |
| か  | 304B                        |
+------+-----------------------------+
2 rows in set (0.03 sec)
        Now we search for 304B and
        304C in the 4.0.0 allkeys
        table, and find these lines:
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA 304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
        The official Unicode names (following the “#” mark)
        tell us the Japanese syllabary (Hiragana), the informal
        classification (letter, digit, or punctuation mark), and the
        Western identifier (KA or
        GA, which happen to be voiced and unvoiced
        components of the same letter pair). More importantly, the
        primary weight (the first hexadecimal
        number inside the square brackets) is 1E57 on
        both lines. For comparisons in both searching and sorting, MySQL
        pays attention to the primary weight only, ignoring all the
        other numbers. This means that we are sorting
        が and か correctly
        according to the Unicode specification. If we wanted to
        distinguish them, we'd have to use a non-UCA (Unicode Collation
        Algorithm) collation (utf8_bin or
        utf8_general_ci), or to compare the
        HEX() values, or use
        ORDER BY CONVERT(s1 USING sjis). Being
        correct “according to Unicode” isn't enough, of
        course: the person who submitted the bug was equally correct. We
        plan to add another collation for Japanese according to the JIS
        X 4061 standard, in which voiced/unvoiced letter pairs like
        KA/GA are distinguishable
        for ordering purposes.
      
22.11.12: Does MySQL allow CJK characters to be used in database and table names?
This issue is fixed in MySQL 5.1, by automatically rewriting the names of the corresponding directories and files.
        For example, if you create a database named
        楮 on a server whose operating system does
        not support CJK in directory names, MySQL creates a directory
        named @0w@00a5@00ae. which is just a fancy
        way of encoding E6A5AE — that is, the
        Unicode hexadecimal representation for the
        楮 character. However, if you run a
        SHOW DATABASES statement, you can
        see that the database is listed as 楮.
      
22.11.13: Shouldn't it be “CJKV”?
No. The term “CJKV” (Chinese Japanese Korean Vietnamese) refers to Vietnamese character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters.
Bug#4745 is a request for a specialized Vietnamese collation, which we might add in the future if there is sufficient demand for it.
22.11.14: 
        Do MySQL plan to make a separate character set where
        5C is the Yen sign, as at least one other
        major DBMS does?
      
This is one possible solution to the Yen sign issue; however, this will not happen in MySQL 5.1 or 6.0.
22.11.15: What CJK character sets are available in MySQL?
        The list of CJK character sets may vary depending on your MySQL
        version. For example, the eucjpms character
        set was not supported prior to MySQL 5.0.3 (see
        Changes in MySQL 5.0.3). However, since the name of the
        applicable language appears in the
        DESCRIPTION column for every entry in the
        INFORMATION_SCHEMA.CHARACTER_SETS
        table, you can obtain a current list of all the non-Unicode CJK
        character sets using this query:
mysql>SELECT CHARACTER_SET_NAME, DESCRIPTION->FROM INFORMATION_SCHEMA.CHARACTER_SETS->WHERE DESCRIPTION LIKE '%Chinese%'->OR DESCRIPTION LIKE '%Japanese%'->OR DESCRIPTION LIKE '%Korean%'->ORDER BY CHARACTER_SET_NAME;+--------------------+---------------------------+ | CHARACTER_SET_NAME | DESCRIPTION | +--------------------+---------------------------+ | big5 | Big5 Traditional Chinese | | cp932 | SJIS for Windows Japanese | | eucjpms | UJIS for Windows Japanese | | euckr | EUC-KR Korean | | gb2312 | GB2312 Simplified Chinese | | gbk | GBK Simplified Chinese | | sjis | Shift-JIS Japanese | | ujis | EUC-JP Japanese | +--------------------+---------------------------+ 8 rows in set (0.01 sec)
        (See Section 19.9, “The INFORMATION_SCHEMA CHARACTER_SETS Table”, for more
        information.)
      
        MySQL supports the two common variants of the
        GB (Guojia
        Biaozhun, or National
        Standard, or Simplified Chinese)
        character sets which are official in the People's Republic of
        China: gb2312 and gbk.
        Sometimes people try to insert gbk characters
        into gb2312, and it works most of the time
        because gbk is a superset of
        gb2312 — but eventually they try to
        insert a rarer Chinese character and it doesn't work. (See
        Bug#16072 for an example).
      
        Here, we try to clarify exactly what characters are legitimate
        in gb2312 or gbk, with
        reference to the official documents. Please check these
        references before reporting gb2312 or
        gbk bugs.
        
              For a complete listing of the gb2312
              characters, ordered according to the
              gb2312_chinese_ci collation:
              gb2312
            
              MySQL's gbk is in reality
              “Microsoft code page 936”. This differs from
              the official gbk for characters
              A1A4 (middle dot),
              A1AA (em dash),
              A6E0-A6F5, and
              A8BB-A8C0. For a listing of the
              differences, see
              http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h.
            
              For a listing of gbk/Unicode mappings,
              see
              http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.
            
              For MySQL's listing of gbk characters,
              see
              gbk.
            
22.11.16: Why don't CJK strings sort correctly in Unicode? (II)
        If you are using Unicode (ucs2 or
        utf8), and you know what the Unicode sort
        order is (see Section A.11, “MySQL 5.4 FAQ — MySQL Chinese, Japanese, and Korean
      Character Sets”), but MySQL still seems
        to sort your table incorrectly, then you should first verify the
        table character set:
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
        Since the character set appears to be correct, let's see what
        information the
        INFORMATION_SCHEMA.COLUMNS table
        can provide about this column:
mysql>SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME->FROM INFORMATION_SCHEMA.COLUMNS->WHERE COLUMN_NAME = 's1'->AND TABLE_NAME = 't';+-------------+--------------------+-----------------+ | COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME | +-------------+--------------------+-----------------+ | s1 | ucs2 | ucs2_general_ci | +-------------+--------------------+-----------------+ 1 row in set (0.01 sec)
        (See Section 19.3, “The INFORMATION_SCHEMA COLUMNS Table”, for more information.)
      
        You can see that the collation is
        ucs2_general_ci instead of
        ucs2_unicode_ci. The reason why this is so
        can be found using SHOW CHARSET, as shown
        here:
mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)
        For ucs2 and utf8, the
        default collation is “general”. To specify a
        Unicode collation, use COLLATE
        ucs2_unicode_ci.
      
22.11.17: Where can I get help with CJK and related issues in MySQL?
The following resources are available:
A listing of MySQL user groups can be found at http://dev.mysql.com/user-groups/.
You can contact a sales engineer at the MySQL KK Japan office using any of the following:
Tel: +81(0)3-5326-3133 Fax: +81(0)3-5326-3001 Email: dsaito@mysql.com
View feature requests relating to character set issues at http://tinyurl.com/y6xcuf.
Visit the MySQL Character Sets, Collation, Unicode Forum. We are also in the process of adding foreign-language forums at http://forums.mysql.com/.
22.11.18: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
A Simplified Chinese version of the Manual, current for MySQL 5.1.12, can be found at http://dev.mysql.com/doc/. The Japanese translation of the MySQL 4.1 manual can be downloaded from http://dev.mysql.com/doc/.
22.11.19: 
        How do I know whether character X is
        available in all character sets?
      
        The majority of simplified Chinese and basic nonhalfwidth
        Japanese Kana characters appear
        in all CJK character sets. This stored procedure accepts a
        UCS-2 Unicode character, converts it to all
        other character sets, and displays the results in hexadecimal.
DELIMITER //
CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN
CREATE TABLE tj
             (ucs2 CHAR(1) character set ucs2,
              utf8 CHAR(1) character set utf8,
              big5 CHAR(1) character set big5,
              cp932 CHAR(1) character set cp932,
              eucjpms CHAR(1) character set eucjpms,
              euckr CHAR(1) character set euckr,
              gb2312 CHAR(1) character set gb2312,
              gbk CHAR(1) character set gbk,
              sjis CHAR(1) character set sjis,
              ujis CHAR(1) character set ujis);
INSERT INTO tj (ucs2) VALUES (ucs2_char);
UPDATE tj SET utf8=ucs2,
              big5=ucs2,
              cp932=ucs2,
              eucjpms=ucs2,
              euckr=ucs2,
              gb2312=ucs2,
              gbk=ucs2,
              sjis=ucs2,
              ujis=ucs2;
/* If there is a conversion problem, UPDATE will produce a warning. */
SELECT hex(ucs2) AS ucs2,
       hex(utf8) AS utf8,
       hex(big5) AS big5,
       hex(cp932) AS cp932,
       hex(eucjpms) AS eucjpms,
       hex(euckr) AS euckr,
       hex(gb2312) AS gb2312,
       hex(gbk) AS gbk,
       hex(sjis) AS sjis,
       hex(ujis) AS ujis
FROM tj;
DROP TABLE tj;
END//
        The input can be any single ucs2 character,
        or it can be the code point value (hexadecimal representation)
        of that character. For example, from Unicode's list of
        ucs2 encodings and names
        (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt),
        we know that the Katakana
        character Pe appears in all CJK
        character sets, and that its code point value is
        0x30da. If we use this value as the argument
        to p_convert(), the result is as shown here:
mysql> CALL p_convert(0x30da)//
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)
        Since none of the column values is 3F —
        that is, the question mark character (?)
        — we know that every conversion worked.
      
22.11.20: Of what issues should I be aware when working with Korean character sets in MySQL?
        In theory, while there have been several versions of the
        euckr (Extended Unix Code
        Korea) character set, only one problem has been
        noted.
      
        We use the “ASCII” variant of EUC-KR, in which the
        code point 0x5c is REVERSE SOLIDUS, that is
        \, instead of the “KS-Roman”
        variant of EUC-KR, in which the code point
        0x5c is WON
        SIGN(₩). This means that you
        cannot convert Unicode U+20A9 to
        euckr:
mysql>SELECT->CONVERT('₩' USING euckr) AS euckr,->HEX(CONVERT('₩' USING euckr)) AS hexeuckr;+-------+----------+ | euckr | hexeuckr | +-------+----------+ | ? | 3F | +-------+----------+ 1 row in set (0.00 sec)
MySQL's graphic Korean chart is here: euckr.


User Comments
Add your own comment.