This set of Frequently Asked Questions derives from the experience of MySQL's Support and Development groups in handling many inquiries about CJK (Chinese-Japanese-Korean) issues.
Questions
22.11.1: I've upgraded to MySQL 5.0. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
22.11.2: Why do I get Data truncated error messages?
22.11.3:
Why do some LIKE
and
FULLTEXT
searches with CJK characters fail?
22.11.4: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
22.11.5: Why do Japanese character set conversions fail?
22.11.6: Why are my supplementary characters rejected by MySQL?
22.11.7:
What should I do if I want to convert SJIS
81CA
to cp932
?
22.11.8: What problems should I be aware of when working with the Big5 Chinese character set?
22.11.9:
I have inserted CJK characters into my table. Why does
SELECT
display them as
“?” characters?
22.11.10:
How does MySQL represent the Yen (¥
) sign?
22.11.11: Why don't CJK strings sort correctly in Unicode? (I)
22.11.12: Does MySQL allow CJK characters to be used in database and table names?
22.11.13: Shouldn't it be “CJKV”?
22.11.14:
Do MySQL plan to make a separate character set where
5C
is the Yen sign, as at least one other
major DBMS does?
22.11.15: What CJK character sets are available in MySQL?
22.11.16: Why don't CJK strings sort correctly in Unicode? (II)
22.11.17: Where can I get help with CJK and related issues in MySQL?
22.11.18: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
22.11.19:
How do I know whether character X
is
available in all character sets?
22.11.20: Of what issues should I be aware when working with Korean character sets in MySQL?
Questions and Answers
22.11.1: I've upgraded to MySQL 5.0. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
In MySQL Version 4.0, there was a single “global” character set for both server and client, and the decision as to which character to use was made by the server administrator. This changed starting with MySQL Version 4.1. What happens now is a “handshake”, as described in Section 9.1.4, “Connection Character Sets and Collations”:
When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the
character_set_client
,character_set_results
, andcharacter_set_connection
system variables. In effect, the server performs aSET NAMES
operation using the character set name.
The effect of this is that you cannot control the client
character set by starting mysqld with
--character-set-server=utf8
.
However, some of our Asian customers have said that they prefer
the MySQL 4.0 behavior. To make it possible to retain this
behavior, we added a mysqld switch,
--character-set-client-handshake
,
which can be turned off with
--skip-character-set-client-handshake
.
If you start mysqld with
--skip-character-set-client-handshake
,
then, when a client connects, it sends to the server the name of
the character set that it wants to use — however,
the server ignores this request from the
client.
By way of example, suppose that your favorite server character
set is latin1
(unlikely in a CJK area, but
this is the default value). Suppose further that the client uses
utf8
because this is what the client's
operating system supports. Now, start the server with
latin1
as its default character set:
mysqld --character-set-server=latin1
And then start the client with the default character set
utf8
:
mysql --default-character-set=utf8
The current settings can be seen by viewing the output of
SHOW VARIABLES
:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
Now stop the client, and then stop the server using mysqladmin. Then start the server again, but this time tell it to skip the handshake like so:
mysqld --character-set-server=utf8 --skip-character-set-client-handshake
Start the client with utf8
once again as the
default character set, then display the current settings:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
As you can see by comparing the differing results from
SHOW VARIABLES
, the server
ignores the client's initial settings if the
--skip-character-set-client-handshake
is used.
22.11.2: Why do I get Data truncated error messages?
For illustration, we'll create a table with one Unicode
(ucs2
) column and one Chinese
(gb2312
) column.
mysql>CREATE TABLE ch
->(ucs2 CHAR(3) CHARACTER SET ucs2,
->gb2312 CHAR(3) CHARACTER SET gb2312);
Query OK, 0 rows affected (0.05 sec)
We'll try to place the rare character 汌
in
both columns.
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec)
Ah, there is a warning. Use the following statement to see what it is:
mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)
So it is a warning about the gb2312
column
only.
mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch; +-------+--------------+--------+-------------+ | ucs2 | HEX(ucs2) | gb2312 | HEX(gb2312) | +-------+--------------+--------+-------------+ | A汌B | 00416C4C0042 | A?B | 413F42 | +-------+--------------+--------+-------------+ 1 row in set (0.00 sec)
There are several things that need explanation here.
The fact that it is a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.
The 汌
character isn't in the
gb2312
character set. We described that
problem earlier.
Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (See Bug#9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.
With SQL_MODE=TRADITIONAL
, there would
be an error message, but instead of error 2165 you would
see: ERROR 1406 (22001): Data too long for column
'gb2312' at row 1
.
22.11.3:
Why do some LIKE
and
FULLTEXT
searches with CJK characters fail?
There is a very simple problem with
LIKE
searches on
BINARY
and
BLOB
columns: we need to know the
end of a character. With multi-byte character sets, different
characters might have different octet lengths. For example, in
utf8
, A
requires one byte
but ペ
requires three bytes, as shown here:
+-------------------------+---------------------------+ | OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') | +-------------------------+---------------------------+ | 1 | 3 | +-------------------------+---------------------------+ 1 row in set (0.00 sec)
If we don't know where the first character ends, then we don't
know where the second character begins, in which case even very
simple searches such as LIKE
'_A%'
fail. The solution is to use a regular CJK
character set in the first place, or to convert to a CJK
character set before comparing.
This is one reason why MySQL cannot allow encodings of nonexistent characters. If it is not strict about rejecting bad input, then it has no way of knowing where characters end.
For FULLTEXT
searches, we need to know where
words begin and end. With Western languages, this is rarely a
problem because most (if not all) of these use an
easy-to-identify word boundary — the space character.
However, this is not usually the case with Asian writing. We
could use arbitrary halfway measures, like assuming that all Han
characters represent words, or (for Japanese) depending on
changes from Katakana to Hiragana due to grammatical endings.
However, the only sure solution requires a comprehensive word
list, which means that we would have to include a dictionary in
the server for each Asian language supported. This is simply not
feasible.
22.11.4: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
Obtain a direct connection to the server using the
mysql client (Windows:
mysql.exe), and try the same query there. If
mysql responds correctly, then the trouble
may be that your application interface requires initialization.
Use mysql to tell you what character set or
sets it uses with the statement SHOW VARIABLES LIKE
'char%';
. If you are using Access, then you are most
likely connecting with MyODBC. In this case, you should check
Section 20.1.4, “Connector/ODBC Configuration”. If, for
instance, you use big5
, you would enter
SET NAMES 'big5'
. (Note that no
;
is required in this case). If you are using
ASP, you might need to add SET NAMES
in the
code. Here is an example that has worked in the past:
<% Session.CodePage=0 Dim strConnection Dim Conn strConnection="driver={MySQL ODBC 3.51 Driver};server=server
;uid=username
;" \ & "pwd=password
;database=database
;stmt=SET NAMES 'big5';" Set Conn = Server.CreateObject("ADODB.Connection") Conn.Open strConnection %>
In much the same way, if you are using any character set other
than latin1
with Connector/NET, then you must
specify the character set in the connection string. See
Section 20.2.5.1, “Connecting to MySQL Using Connector/NET”, for more
information.
If you are using PHP, try this:
<?php $link = mysql_connect($host, $usr, $pwd); mysql_select_db($db); if( mysql_error() ) { print "Database ERROR: " . mysql_error(); } mysql_query("SET NAMES 'utf8'", $link); ?>
In this case, we used SET NAMES
to change
character_set_client
and
character_set_connection
and
character_set_results
.
We encourage the use of the newer mysqli
extension, rather than mysql
. Using
mysqli
, the previous example could be
rewritten as shown here:
<?php $link = new mysqli($host, $usr, $pwd, $db); if( mysqli_connect_errno() ) { printf("Connect failed: %s\n", mysqli_connect_error()); exit(); } $link->query("SET NAMES 'utf8'"); ?>
Another issue often encountered in PHP applications has to do
with assumptions made by the browser. Sometimes adding or
changing a <meta>
tag suffices to
correct the problem: for example, to insure that the user agent
interprets page content as UTF-8
, you should
include <meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
in the
<head>
of the HTML page.
If you are using Connector/J, see Section 20.3.4.4, “Using Character Sets and Unicode”.
22.11.5: Why do Japanese character set conversions fail?
MySQL supports the sjis
,
ujis
, cp932
, and
eucjpms
character sets, as well as Unicode. A
common need is to convert between character sets. For example,
there might be a Unix server (typically with
sjis
or ujis
) and a
Windows client (typically with cp932
).
In the following conversion table, the ucs2
column represents the source, and the sjis
,
cp932
, ujis
, and
eucjpms
columns represent the destinations
— that is, the last 4 columns provide the hexadecimal
result when we use CONVERT(ucs2)
or we assign a ucs2
column containing the
value to an sjis
, cp932
,
ujis
, or eucjpms
column.
Character Name | ucs2 | sjis | cp932 | ujis | eucjpms |
---|---|---|---|---|---|
BROKEN BAR | 00A6 | 3F | 3F | 8FA2C3 | 3F |
FULLWIDTH BROKEN BAR | FFE4 | 3F | FA55 | 3F | 8FA2 |
YEN SIGN | 00A5 | 3F | 3F | 20 | 3F |
FULLWIDTH YEN SIGN | FFE5 | 818F | 818F | A1EF | 3F |
TILDE | 007E | 7E | 7E | 7E | 7E |
OVERLINE | 203E | 3F | 3F | 20 | 3F |
HORIZONTAL BAR | 2015 | 815C | 815C | A1BD | A1BD |
EM DASH | 2014 | 3F | 3F | 3F | 3F |
REVERSE SOLIDUS | 005C | 815F | 5C | 5C | 5C |
FULLWIDTH "" | FF3C | 3F | 815F | 3F | A1C0 |
WAVE DASH | 301C | 8160 | 3F | A1C1 | 3F |
FULLWIDTH TILDE | FF5E | 3F | 8160 | 3F | A1C1 |
DOUBLE VERTICAL LINE | 2016 | 8161 | 3F | A1C2 | 3F |
PARALLEL TO | 2225 | 3F | 8161 | 3F | A1C2 |
MINUS SIGN | 2212 | 817C | 3F | A1DD | 3F |
FULLWIDTH HYPHEN-MINUS | FF0D | 3F | 817C | 3F | A1DD |
CENT SIGN | 00A2 | 8191 | 3F | A1F1 | 3F |
FULLWIDTH CENT SIGN | FFE0 | 3F | 8191 | 3F | A1F1 |
POUND SIGN | 00A3 | 8192 | 3F | A1F2 | 3F |
FULLWIDTH POUND SIGN | FFE1 | 3F | 8192 | 3F | A1F2 |
NOT SIGN | 00AC | 81CA | 3F | A2CC | 3F |
FULLWIDTH NOT SIGN | FFE2 | 3F | 81CA | 3F | A2CC |
Now consider the following portion of the table.
ucs2 | sjis | cp932 | |
---|---|---|---|
NOT SIGN | 00AC | 81CA | 3F |
FULLWIDTH NOT SIGN | FFE2 | 3F | 81CA |
This means that MySQL converts the NOT SIGN
(Unicode U+00AC
) to sjis
code point 0x81CA
and to
cp932
code point 3F
.
(3F
is the question mark (“?”)
— this is what is always used when the conversion cannot
be performed.
22.11.6: Why are my supplementary characters rejected by MySQL?
Before MySQL 6.0.4, MySQL does not support supplementary
characters — that is, characters which need more than 3
bytes — for UTF-8
. We support only what
Unicode calls the Basic Multilingual Plane / Plane
0. Only a few very rare Han characters are
supplementary; support for them is uncommon. This has led to
reports such as that found in Bug#12600, which we rejected as
“not a bug”. With utf8
, we must
truncate an input string when we encounter bytes that we don't
understand. Otherwise, we wouldn't know how long the bad
multi-byte character is.
One possible workaround is to use ucs2
instead of utf8
, in which case the
“bad” characters are changed to question marks;
however, no truncation takes place. You can also change the data
type to BLOB
or
BINARY
, which perform no validity
checking.
As of MySQL 6.0.4, Unicode support is extended to include
supplementary characters by means of additional Unicode
character sets: utf16
,
utf32
, and 4-byte utf8
.
These character sets support supplementary Unicode characters
outside the Basic Multilingual Plane (BMP).
22.11.7:
What should I do if I want to convert SJIS
81CA
to cp932
?
Our answer is: “?”. There are serious complaints
about this: many people would prefer a “loose”
conversion, so that 81CA (NOT SIGN)
in
sjis
becomes 81CA (FULLWIDTH NOT
SIGN)
in cp932
. We are considering
a change to this behavior.
22.11.8: What problems should I be aware of when working with the Big5 Chinese character set?
MySQL supports the Big5 character set which is common in Hong
Kong and Taiwan (Republic of China). MySQL's
big5
is in reality Microsoft code page 950,
which is very similar to the original big5
character set. We changed to this
character set starting with MySQL version 4.1.16 / 5.0.16 (as a
result of Bug#12476). For example, the following statements work
in current versions of MySQL, but not in old versions:
mysql>CREATE TABLE big5 (BIG5 CHAR(1) CHARACTER SET BIG5);
Query OK, 0 rows affected (0.13 sec) mysql>INSERT INTO big5 VALUES (0xf9dc);
Query OK, 1 row affected (0.00 sec) mysql>SELECT * FROM big5;
+------+ | big5 | +------+ | 嫺 | +------+ 1 row in set (0.02 sec)
A feature request for adding HKSCS
extensions
has been filed. People who need this extension may find the
suggested patch for Bug#13577 to be of interest.
22.11.9:
I have inserted CJK characters into my table. Why does
SELECT
display them as
“?” characters?
This problem is usually due to a setting in MySQL that doesn't match the settings for the application program or the operating system. Here are some common steps for correcting these types of issues:
Be certain of what MySQL version you are using.
Use the statement SELECT VERSION();
to
determine this.
Make sure that the database is actually using the desired character set.
People often think that the client character set is always
the same as either the server character set or the character
set used for display purposes. However, both of these are
false assumptions. You can make sure by checking the result
of SHOW CREATE TABLE
or —
better — yet by using this statement:
tablename
SELECT character_set_name, collation_name FROM information_schema.columns WHERE table_schema = your_database_name AND table_name = your_table_name AND column_name = your_column_name;
Determine the hexadecimal value of the character or characters that are not being displayed correctly.
You can obtain this information for a column
column_name
in the table
table_name
using the following
query:
SELECT HEX(column_name
) FROMtable_name
;
3F
is the encoding for the
?
character; this means that
?
is the character actually stored in the
column. This most often happens because of a problem
converting a particular character from your client character
set to the target character set.
Make sure that a round trip possible — that
is, when you select literal
(or
_introducer hexadecimal-value
),
you obtain literal
as a
result.
For example, the Japanese
Katakana character
Pe (ペ'
)
exists in all CJK character sets, and has the code point
value (hexadecimal coding) 0x30da
. To
test a round trip for this character, use this query:
SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
If the result is not also ペ
, then the
round trip has failed.
For bug reports regarding such failures, we might ask you to
follow up with SELECT HEX('ペ');
. Then
we can determine whether the client encoding is correct.
Make sure that the problem is not with the browser or other application, rather than with MySQL.
Use the mysql client program (on Windows: mysql.exe) to accomplish this task. If mysql displays correctly but your application doesn't, then your problem is probably due to system settings.
To find out what your settings are, use the
SHOW VARIABLES
statement,
whose output should resemble what is shown here:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.03 sec)
These are typical character-set settings for an
international-oriented client (notice the use of
utf8
Unicode) connected to a server in
the West (latin1
is a West Europe
character set and a default for MySQL).
Although Unicode (usually the utf8
variant on Unix, and the ucs2
variant on
Windows) is preferable to Latin, it is often not what your
operating system utilities support best. Many Windows users
find that a Microsoft character set, such as
cp932
for Japanese Windows, is suitable.
If you cannot control the server settings, and you have no
idea what your underlying computer is, then try changing to
a common character set for the country that you're in
(euckr
= Korea; gb2312
or gbk
= People's Republic of China;
big5
= Taiwan; sjis
,
ujis
, cp932
, or
eucjpms
= Japan; ucs2
or utf8
= anywhere). Usually it is
necessary to change only the client and connection and
results settings. There is a simple statement which changes
all three at once: SET NAMES
. For
example:
SET NAMES 'big5';
Once the setting is correct, you can make it permanent by
editing my.cnf
or
my.ini
. For example you might add lines
looking like these:
[mysqld] character-set-server=big5 [client] default-character-set=big5
It is also possible that there are issues with the API configuration setting being used in your application; see Why does my GUI front end or browser not display CJK characters correctly...? for more information.
22.11.10:
How does MySQL represent the Yen (¥
) sign?
A problem arises because some versions of Japanese character
sets (both sjis
and euc
)
treat 5C
as a reverse
solidus (\
— also known as
a backslash), and others treat it as a yen sign
(¥
).
MySQL follows only one version of the JIS (Japanese Industrial
Standards) standard description. In MySQL,
5C
is always the reverse solidus
(\
).
22.11.11: Why don't CJK strings sort correctly in Unicode? (I)
Sometimes people observe that the result of a
utf8_unicode_ci
or
ucs2_unicode_ci
search, or of an
ORDER BY
sort is not what they think a native
would expect. Although we never rule out the possibility that
there is a bug, we have found in the past that many people do
not read correctly the standard table of weights for the Unicode
Collation Algorithm. MySQL uses the table found at
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
This is not the first table you will find by navigating from the
unicode.org
home page, because MySQL uses the
older 4.0.0 “allkeys” table, rather than the more
recent 4.1.0 table. This is because we are very wary about
changing ordering which affects indexes, lest we bring about
situations such as that reported in Bug#16526, illustrated as
follows:
mysql<CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec) mysql>INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM tj WHERE s1 = 'か';
+------+ | s1 | +------+ | が | | か | +------+ 2 rows in set (0.00 sec)
The character in the first result row is not the one that we
searched for. Why did MySQL retrieve it? First we look for the
Unicode code point value, which is possible by reading the
hexadecimal number for the ucs2
version of
the characters:
mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1 | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が | 304C |
| か | 304B |
+------+-----------------------------+
2 rows in set (0.03 sec)
Now we search for 304B
and
304C
in the 4.0.0 allkeys
table, and find these lines:
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA 304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
The official Unicode names (following the “#” mark)
tell us the Japanese syllabary (Hiragana), the informal
classification (letter, digit, or punctuation mark), and the
Western identifier (KA
or
GA
, which happen to be voiced and unvoiced
components of the same letter pair). More importantly, the
primary weight (the first hexadecimal
number inside the square brackets) is 1E57
on
both lines. For comparisons in both searching and sorting, MySQL
pays attention to the primary weight only, ignoring all the
other numbers. This means that we are sorting
が
and か
correctly
according to the Unicode specification. If we wanted to
distinguish them, we'd have to use a non-UCA (Unicode Collation
Algorithm) collation (utf8_bin
or
utf8_general_ci
), or to compare the
HEX()
values, or use
ORDER BY CONVERT(s1 USING sjis)
. Being
correct “according to Unicode” isn't enough, of
course: the person who submitted the bug was equally correct. We
plan to add another collation for Japanese according to the JIS
X 4061 standard, in which voiced/unvoiced letter pairs like
KA
/GA
are distinguishable
for ordering purposes.
22.11.12: Does MySQL allow CJK characters to be used in database and table names?
This issue is fixed in MySQL 5.1, by automatically rewriting the names of the corresponding directories and files.
For example, if you create a database named
楮
on a server whose operating system does
not support CJK in directory names, MySQL creates a directory
named @0w@00a5@00ae
. which is just a fancy
way of encoding E6A5AE
— that is, the
Unicode hexadecimal representation for the
楮
character. However, if you run a
SHOW DATABASES
statement, you can
see that the database is listed as 楮
.
22.11.13: Shouldn't it be “CJKV”?
No. The term “CJKV” (Chinese Japanese Korean Vietnamese) refers to Vietnamese character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters.
Bug#4745 is a request for a specialized Vietnamese collation, which we might add in the future if there is sufficient demand for it.
22.11.14:
Do MySQL plan to make a separate character set where
5C
is the Yen sign, as at least one other
major DBMS does?
This is one possible solution to the Yen sign issue; however, this will not happen in MySQL 5.1 or 6.0.
22.11.15: What CJK character sets are available in MySQL?
The list of CJK character sets may vary depending on your MySQL
version. For example, the eucjpms
character
set was not supported prior to MySQL 5.0.3 (see
Section C.1.90, “Changes in MySQL 5.0.3 (23 March 2005 Beta)”). However, since the name of the
applicable language appears in the
DESCRIPTION
column for every entry in the
INFORMATION_SCHEMA.CHARACTER_SETS
table, you can obtain a current list of all the non-Unicode CJK
character sets using this query:
mysql>SELECT CHARACTER_SET_NAME, DESCRIPTION
->FROM INFORMATION_SCHEMA.CHARACTER_SETS
->WHERE DESCRIPTION LIKE '%Chinese%'
->OR DESCRIPTION LIKE '%Japanese%'
->OR DESCRIPTION LIKE '%Korean%'
->ORDER BY CHARACTER_SET_NAME;
+--------------------+---------------------------+ | CHARACTER_SET_NAME | DESCRIPTION | +--------------------+---------------------------+ | big5 | Big5 Traditional Chinese | | cp932 | SJIS for Windows Japanese | | eucjpms | UJIS for Windows Japanese | | euckr | EUC-KR Korean | | gb2312 | GB2312 Simplified Chinese | | gbk | GBK Simplified Chinese | | sjis | Shift-JIS Japanese | | ujis | EUC-JP Japanese | +--------------------+---------------------------+ 8 rows in set (0.01 sec)
(See Section 19.9, “The INFORMATION_SCHEMA CHARACTER_SETS
Table”, for more
information.)
MySQL supports the two common variants of the
GB (Guojia
Biaozhun, or National
Standard, or Simplified Chinese)
character sets which are official in the People's Republic of
China: gb2312
and gbk
.
Sometimes people try to insert gbk
characters
into gb2312
, and it works most of the time
because gbk
is a superset of
gb2312
— but eventually they try to
insert a rarer Chinese character and it doesn't work. (See
Bug#16072 for an example).
Here, we try to clarify exactly what characters are legitimate
in gb2312
or gbk
, with
reference to the official documents. Please check these
references before reporting gb2312
or
gbk
bugs.
For a complete listing of the gb2312
characters, ordered according to the
gb2312_chinese_ci
collation:
gb2312
MySQL's gbk
is in reality
“Microsoft code page 936”. This differs from
the official gbk
for characters
A1A4
(middle dot),
A1AA
(em dash),
A6E0-A6F5
, and
A8BB-A8C0
. For a listing of the
differences, see
http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h.
For a listing of gbk
/Unicode mappings,
see
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.
For MySQL's listing of gbk
characters,
see
gbk.
22.11.16: Why don't CJK strings sort correctly in Unicode? (II)
If you are using Unicode (ucs2
or
utf8
), and you know what the Unicode sort
order is (see Section A.11, “MySQL 5.0 FAQ — MySQL Chinese, Japanese, and Korean
Character Sets”), but MySQL still seems
to sort your table incorrectly, then you should first verify the
table character set:
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Since the character set appears to be correct, let's see what
information the
INFORMATION_SCHEMA.COLUMNS
table
can provide about this column:
mysql>SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
->FROM INFORMATION_SCHEMA.COLUMNS
->WHERE COLUMN_NAME = 's1'
->AND TABLE_NAME = 't';
+-------------+--------------------+-----------------+ | COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME | +-------------+--------------------+-----------------+ | s1 | ucs2 | ucs2_general_ci | +-------------+--------------------+-----------------+ 1 row in set (0.01 sec)
(See Section 19.3, “The INFORMATION_SCHEMA COLUMNS
Table”, for more information.)
You can see that the collation is
ucs2_general_ci
instead of
ucs2_unicode_ci
. The reason why this is so
can be found using SHOW CHARSET
, as shown
here:
mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)
For ucs2
and utf8
, the
default collation is “general”. To specify a
Unicode collation, use COLLATE
ucs2_unicode_ci
.
22.11.17: Where can I get help with CJK and related issues in MySQL?
The following resources are available:
A listing of MySQL user groups can be found at http://dev.mysql.com/user-groups/.
You can contact a sales engineer at the MySQL KK Japan office using any of the following:
Tel: +81(0)3-5326-3133 Fax: +81(0)3-5326-3001 Email: dsaito@mysql.com
View feature requests relating to character set issues at http://tinyurl.com/y6xcuf.
Visit the MySQL Character Sets, Collation, Unicode Forum. We are also in the process of adding foreign-language forums at http://forums.mysql.com/.
22.11.18: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
A Simplified Chinese version of the Manual, current for MySQL 5.1.12, can be found at http://dev.mysql.com/doc/. The Japanese translation of the MySQL 4.1 manual can be downloaded from http://dev.mysql.com/doc/.
22.11.19:
How do I know whether character X
is
available in all character sets?
The majority of simplified Chinese and basic nonhalfwidth
Japanese Kana characters appear
in all CJK character sets. This stored procedure accepts a
UCS-2
Unicode character, converts it to all
other character sets, and displays the results in hexadecimal.
DELIMITER // CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2) BEGIN CREATE TABLE tj (ucs2 CHAR(1) character set ucs2, utf8 CHAR(1) character set utf8, big5 CHAR(1) character set big5, cp932 CHAR(1) character set cp932, eucjpms CHAR(1) character set eucjpms, euckr CHAR(1) character set euckr, gb2312 CHAR(1) character set gb2312, gbk CHAR(1) character set gbk, sjis CHAR(1) character set sjis, ujis CHAR(1) character set ujis); INSERT INTO tj (ucs2) VALUES (ucs2_char); UPDATE tj SET utf8=ucs2, big5=ucs2, cp932=ucs2, eucjpms=ucs2, euckr=ucs2, gb2312=ucs2, gbk=ucs2, sjis=ucs2, ujis=ucs2; /* If there is a conversion problem, UPDATE will produce a warning. */ SELECT hex(ucs2) AS ucs2, hex(utf8) AS utf8, hex(big5) AS big5, hex(cp932) AS cp932, hex(eucjpms) AS eucjpms, hex(euckr) AS euckr, hex(gb2312) AS gb2312, hex(gbk) AS gbk, hex(sjis) AS sjis, hex(ujis) AS ujis FROM tj; DROP TABLE tj; END//
The input can be any single ucs2
character,
or it can be the code point value (hexadecimal representation)
of that character. For example, from Unicode's list of
ucs2
encodings and names
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt),
we know that the Katakana
character Pe appears in all CJK
character sets, and that its code point value is
0x30da
. If we use this value as the argument
to p_convert()
, the result is as shown here:
mysql> CALL p_convert(0x30da)//
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8 | big5 | cp932 | eucjpms | euckr | gb2312 | gbk | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379 | A5DA | ABDA | A5DA | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)
Since none of the column values is 3F
—
that is, the question mark character (?
)
— we know that every conversion worked.
22.11.20: Of what issues should I be aware when working with Korean character sets in MySQL?
In theory, while there have been several versions of the
euckr
(Extended Unix Code
Korea) character set, only one problem has been
noted.
We use the “ASCII” variant of EUC-KR, in which the
code point 0x5c
is REVERSE SOLIDUS, that is
\
, instead of the “KS-Roman”
variant of EUC-KR, in which the code point
0x5c
is WON
SIGN
(₩
). This means that you
cannot convert Unicode U+20A9
to
euckr
:
mysql>SELECT
->CONVERT('₩' USING euckr) AS euckr,
->HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+ | euckr | hexeuckr | +-------+----------+ | ? | 3F | +-------+----------+ 1 row in set (0.00 sec)
MySQL's graphic Korean chart is here: euckr.
User Comments
Add your own comment.