MySQL :: MySQL 5.5 Reference Manual :: 9.1.10.4 The utf8 Character Set (Three-Byte UTF-8 Unicode Encoding)

9.1.10.4. The `utf8` Character Set (Three-Byte UTF-8 Unicode Encoding)

UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. (An older standard for UTF-8 encoding, RFC 2279, describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)

The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle Eastern script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave, and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte or four-byte sequences.
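
The length classes listed above can be checked with a short Python sketch. It is only an illustration and does not involve MySQL itself: it relies on Python's built-in UTF-8 codec, and the sample characters and the helper name `utf8_length` are chosen here for the example.

```python
# Sample characters, written as escapes so the result does not depend on
# how this source file is saved.
samples = [
    ("Latin capital A", "\u0041"),            # Basic Latin
    ("Latin e with acute", "\u00e9"),         # extended Latin
    ("Cyrillic Zhe", "\u0416"),               # Cyrillic
    ("Hebrew Alef", "\u05d0"),                # Hebrew
    ("CJK ideograph U+4E2D", "\u4e2d"),       # inside the BMP
    ("CJK ideograph U+20000", "\U00020000"),  # supplementary plane
]


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


# Map each sample's description to the length of its UTF-8 sequence.
byte_lengths = {name: utf8_length(ch) for name, ch in samples}

for name, ch in samples:
    print(f"{name}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

The last sample is the only one that needs four bytes, because its code point lies above U+FFFF.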

The utf8 character set in MySQL 5.5 is unchanged from earlier versions and has the same characteristics:

No support for supplementary characters (Basic Multilingual Plane, or BMP, characters only)
A maximum of three bytes per multi-byte character
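
One way to see what these two restrictions mean for application data is to test a string on the client side before it is stored. The following is a minimal sketch, assuming that "BMP only" is equivalent to every code point being at most U+FFFF. The function name `fits_mysql_utf8` is hypothetical and is not part of MySQL or of any connector.

```python
def fits_mysql_utf8(text: str) -> bool:
    """Return True if every character lies in the BMP (code point <= U+FFFF).

    Hypothetical helper: it only mirrors the restriction described in the
    text and does not query a MySQL server.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_text = "caf\u00e9 \u4e2d"           # accented Latin plus a BMP ideograph
supplementary_text = "\U00020000"       # an ideograph outside the BMP

print(fits_mysql_utf8(bmp_text))            # BMP only
print(fits_mysql_utf8(supplementary_text))  # contains a supplementary character

# Longest UTF-8 sequence among the BMP sample's characters.
longest = max(len(ch.encode("utf-8")) for ch in bmp_text)
print(longest)
```

For the BMP sample, the longest per-character sequence is three bytes, which matches the stated maximum.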

Exactly the same set of characters is available in utf8 as in ucs2. That is, they have the same repertoire.
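
Sharing a repertoire does not mean sharing storage sizes. The sketch below compares the two encodings for a few BMP characters. It assumes that Python's `utf-16-be` codec can stand in for UCS-2, which holds for BMP characters because both use one big-endian 16-bit unit per character. The helper name `encoded_sizes` is made up for this example.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UCS-2 byte count) for one BMP character.

    Assumption: for BMP characters, "utf-16-be" yields the same bytes as
    UCS-2, so it is used here as a substitute.
    """
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ("\u0041", "\u00e9", "\u4e2d"):
    utf8_size, ucs2_size = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: utf8 = {utf8_size} byte(s), ucs2 = {ucs2_size} byte(s)")
```

Each of these characters can be represented in either character set, but only the UTF-8 length varies with the character.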
