This section contains a list of terms and definitions used in the context of collations.
allkeys.txt: An example of a series of collating-table entries as defined by UCA (Unicode Collation Algorithm).
Actually UCA says allkeys.txt is a “collation element table”, that is, it is the part of a collating table which shows collating elements.
COLLATING ELEMENT: The unit which linguistically-aware users perceive as the minimal building block in string comparisons.
Usually there is a one-to-one relation between characters and collating elements, for example in English there is a character “A” and a collating element for “A”. More rarely there is a many-to-one relation, for example in traditional Spanish the two-character combination “LL” is a single collating element.
Usually there is a one-to-many relation between collating elements and weights (because there are multiple levels); however, for an ignorable character, one collating element has zero weights.
COLLATING TABLE: A table which describes all the rules for a collation, including POSIX-like “Locale” declarations and a list of collating elements.
Here are entries for collating elements from two sources, ISO 14651 and allkeys.txt:
[From ISO 14651]
<U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN
<UFF04> <S2C4>;<BASE>;<WIDE>;<UFF04> % FULLWIDTH DOLLAR SIGN
<UFE69> <S2C4>;<BASE>;<SMALL>;<UFE69> % SMALL DOLLAR SIGN
[From allkeys.txt 4.0]
0024 ; [.0E0F.0020.0002.0024] # DOLLAR SIGN
FF04 ; [.0E0F.0020.0003.FF04] # FULLWIDTH DOLLAR SIGN; QQK
FE69 ; [.0E0F.0020.000F.FE69] # SMALL DOLLAR SIGN; QQK
Clearly these are the same thing, but ISO 14651 uses names (e.g. “BASE”) where allkeys.txt uses numbers (e.g. 0020). So ISO 14651 had to define earlier in its table BASE = 0020; MIN = 0002; WIDE = 0003; SMALL = 000F; etc.
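The correspondence between the two notations can be sketched in Python. This is only an illustration: the values for BASE, MIN, WIDE and SMALL are the ones quoted above, the value for S2C4 is inferred by comparing the two DOLLAR SIGN entries (an assumption), and the function name iso_to_allkeys is hypothetical.

```python
# Symbol values quoted in the text above. S2C4 = 0E0F is an assumption,
# inferred by comparing the two DOLLAR SIGN entries.
SYMBOLS = {
    "S2C4": "0E0F",
    "BASE": "0020",
    "MIN": "0002",
    "WIDE": "0003",
    "SMALL": "000F",
}


def iso_to_allkeys(iso_entry: str) -> str:
    """Rewrite one ISO 14651-style entry in allkeys.txt-style notation."""
    body, _, comment = iso_entry.partition("%")
    code_point, weights = body.split(None, 1)
    # "<U0024>" -> "0024"
    cp = code_point.strip("<>").lstrip("U")
    fields = []
    for field in weights.strip().split(";"):
        name = field.strip("<>")
        # Symbolic names are looked up; "<Uxxxx>" fields become bare hex.
        fields.append(SYMBOLS.get(name, name.lstrip("U")))
    return f"{cp} ; [.{'.'.join(fields)}] # {comment.strip()}"


converted = iso_to_allkeys("<U0024> <S2C4>;<BASE>;<MIN>;<U0024> % DOLLAR SIGN")
```

With the mapping above, the converted DOLLAR SIGN entry has the same four weights as the allkeys.txt 4.0 entry.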
COLLATION ELEMENT: Do not use. Use collating element.
COLLATION TABLE: Do not use. Use collating table.
COLLATING TABLE ENTRY: A line in a collating table, representing one fact.
Each “line” in allkeys.txt (which is a subset of a collating table) is an entry for one collating element.
CONTRACTION: A mapping from N characters to less-than-N collating elements.
Contraction is rare; normally one character maps to one collating element, for example the character “C” has one collating element “C”. But take an example from traditional Spanish: “LL” is a single collating element between “L” and “M”. Contraction also occurs when there has been decomposition. For example here are two collating element entries (from allkeys.txt 5.0.0):
0622 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
0627 0653 ; [.15E2.0020.0002.0622] # ARABIC LETTER ALEF WITH MADDA ABOVE
Notice that there is one collating element labelled 0627 0653, which clearly is the result of mapping from two characters, U+0627 ARABIC LETTER ALEF and U+0653 ARABIC MADDAH ABOVE, with the same weights as the composed character U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE.
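One way to picture contraction is as a longest-match lookup: at each position, try the longest sequence of characters that has its own entry before falling back to a single character. The sketch below assumes a simple greedy match over contiguous characters, which is a simplification and not a full description of UCA matching. The table holds only the two entries quoted above, and the function name is hypothetical.

```python
# Keys are tuples of code points, values are the weights from the two
# allkeys.txt 5.0.0 entries quoted above.
TABLE = {
    (0x0622,): (0x15E2, 0x0020, 0x0002, 0x0622),
    (0x0627, 0x0653): (0x15E2, 0x0020, 0x0002, 0x0622),
}


def split_into_collating_elements(text, table):
    """Greedy longest-match split of text into collating-element keys."""
    code_points = [ord(ch) for ch in text]
    longest = max(len(key) for key in table)
    elements = []
    i = 0
    while i < len(code_points):
        # Try the longest candidate first so that contractions win.
        for size in range(min(longest, len(code_points) - i), 0, -1):
            key = tuple(code_points[i:i + size])
            # A single character is always accepted as a fallback,
            # even if this small table has no entry for it.
            if key in table or size == 1:
                elements.append(key)
                i += size
                break
    return elements


composed = split_into_collating_elements("\u0622", TABLE)
decomposed = split_into_collating_elements("\u0627\u0653", TABLE)
```

Here the composed string yields one collating element from one character, the decomposed string yields one collating element from two characters, and both look up the same weights.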
EXPANSION: A one-to-many mapping from collating element to weighting elements.
For example, German Sharp S may be treated as “ss”, so the allkeys.txt entry for collating element 00DF (Sharp S) is:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S;
The entry for “s” alone is:
0073 ; [.11AF.0020.0002.0073] # LATIN SMALL LETTER S
Since 0000 means ignorable, two-level weight strings are:
11AF 11AF 0020 0199 0020 /* for SHARP S */
11AF 11AF 0020 0020 /* for 'ss' */
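The two weight strings above can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the level-1 and level-2 weights from the SHARP S and “s” entries; the function name two_level_weights is hypothetical.

```python
# Level-1 and level-2 weights of each weighting element, copied from the
# allkeys.txt entries for 00DF (SHARP S) and 0073 (s) shown above.
SHARP_S = [(0x11AF, 0x0020), (0x0000, 0x0199), (0x11AF, 0x0020)]
SMALL_S = [(0x11AF, 0x0020)]


def two_level_weights(weighting_elements):
    """Collect level-1 weights, then level-2 weights, skipping 0000."""
    collected = []
    for level in range(2):
        for element in weighting_elements:
            weight = element[level]
            if weight != 0:  # 0000 means ignorable at this level
                collected.append(weight)
    return " ".join(f"{weight:04X}" for weight in collected)


for_sharp_s = two_level_weights(SHARP_S)
for_ss = two_level_weights(SMALL_S + SMALL_S)
```

The strings share their level-1 weights and differ only at level 2, where SHARP S contributes the extra weight 0199.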
IGNORABLE CHARACTER: A character which has one collating element which has no significance for comparison.
One ignorable character has one collating element, but zero weights at all levels. For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
This is ignorable for three levels but not four levels. Therefore it is an “ignorable character” when you produce a weight string for one, two, or three levels. “Ignorable at level 1” means the level-1 weight is ignorable, as represented by 0000 in allkeys.txt. “Fully ignorable” means ignorable for all levels.
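The dependence on the number of levels can be expressed as a small check. The sketch below assumes weights are held as a tuple ordered by level; the function name is_ignorable is hypothetical.

```python
# Weights for U+0591 HEBREW ACCENT ETNAHTA, levels 1 to 4, from allkeys.txt.
ETNAHTA = (0x0000, 0x0000, 0x0000, 0x0591)


def is_ignorable(weights, levels):
    """True if every weight in the first `levels` levels is 0000."""
    return all(weight == 0 for weight in weights[:levels])


ignorable_by_level = {n: is_ignorable(ETNAHTA, n) for n in (1, 2, 3, 4)}
```

For this entry the check holds for one, two, or three levels and fails for four, because the level-4 weight is 0591.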
ISO 14651: The ISO/IEC 14651 “International String Ordering” standard.
Draft documents:
LEVEL: A prioritization order for weights.
Each level has a name “level + number”, for example “level 1”, “level 2”, “level 3”, “level 4”. (Do not use, or rarely use, equivalent terms “primary”, “secondary”, “tertiary”, “quaternary”.) Typically level 1 is the character-differs level for WHERE clauses; levels 2 and following are accent-differs, case-differs, or other something-minor-differs levels which might be useful for ORDER BY clauses. For example, from allkeys.txt 5.0.0:
0061 ; [.0FD0.0020.0002.0061] # LATIN SMALL LETTER A
24D0 ; [.0FD0.0020.0006.24D0] # CIRCLED LATIN SMALL LETTER A;
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
There are four levels here. Level 1 is always 0FD0 for “A”. Level 2 is always 0020. Level 3 is 0002 for SMALL, 0006 for CIRCLED, 0008 for CAPITAL. Level 4 is the same as the Unicode code point value. Do not confuse “weight level” with “weighting level”.
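The effect of choosing a number of levels can be illustrated by comparing the three entries above on a prefix of their weights. This is a sketch only: the table holds just these three entries, and the helper name equal_at is hypothetical.

```python
# Weights for levels 1 to 4, copied from the allkeys.txt 5.0.0 entries above.
ENTRIES = {
    "a": (0x0FD0, 0x0020, 0x0002, 0x0061),       # LATIN SMALL LETTER A
    "\u24D0": (0x0FD0, 0x0020, 0x0006, 0x24D0),  # CIRCLED LATIN SMALL LETTER A
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),       # LATIN CAPITAL LETTER A
}


def equal_at(levels, x, y):
    """Compare two characters using only the first `levels` weights."""
    return ENTRIES[x][:levels] == ENTRIES[y][:levels]


same_at_level_1 = equal_at(1, "a", "A")
same_at_level_2 = equal_at(2, "a", "A")
same_at_level_3 = equal_at(3, "a", "A")

# Sorting on three levels separates the characters by their level-3 weight.
ordered = sorted(["A", "\u24D0", "a"], key=lambda ch: ENTRIES[ch][:3])
```

With one or two levels the three characters compare equal; with three levels they sort as small, circled, capital.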
ORDERING KEY: Do not use. Use “weight string”.
SORTKEY: Do not use. Use “weight string”.
SUBKEY: A sequence of weights for a single level.
UCA: Unicode Collation Algorithm as described in Unicode Technical Standard #10, http://www.unicode.org/reports/tr10.
WEIGHT: A positive numeric value used for comparisons.
Weights come from collating tables and go to weight strings. Often a weight appears as a 4-digit hexadecimal number in collating tables. For example (from allkeys.txt):
0062 ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B
This is the entry for collating element 0062, and there are 4 weights: 0FE6, 0020, 0002, and 0062.
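Reading weights out of an entry like the one above is mostly a matter of splitting on punctuation. The sketch below assumes the line layout seen in these excerpts (code points, a semicolon, one or more bracketed groups, an optional comment); the regular expression and the function name parse_entry are illustrative, not part of any standard tool.

```python
import re

# Code point(s), a semicolon, then one or more bracketed weight groups.
ENTRY_RE = re.compile(r"^\s*([0-9A-Fa-f ]+?)\s*;\s*((?:\[[^\]]*\])+)")


def parse_entry(line):
    """Return (code_points, weight_groups) for one allkeys.txt-style line."""
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError(f"not a collating element entry: {line!r}")
    code_points = [int(cp, 16) for cp in match.group(1).split()]
    groups = []
    for body in re.findall(r"\[([^\]]*)\]", match.group(2)):
        # body looks like ".0FE6.0020.0002.0062": drop the leading marker
        # character, then split the remaining text on dots.
        groups.append([int(weight, 16) for weight in body[1:].split(".")])
    return code_points, groups


code_points, groups = parse_entry(
    "0062 ; [.0FE6.0020.0002.0062] # LATIN SMALL LETTER B"
)
```

For the LATIN SMALL LETTER B entry this yields one code point and one group of four weights.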
WEIGHT STRING: A binary string, sometimes called a “sortkey” or an “ordering key”, produced by taking a series of weights from a collating table for a certain number of levels, ordering them by level and then by position within the string, and outputting the result.
For example: starting with a character string ABC, and knowing that the number of levels is 2, look up the collating elements for A and B and C in allkeys.txt 5.0.0:
0041 ; [.0FD0.0020.0008.0041] # LATIN CAPITAL LETTER A
0042 ; [.0FE6.0020.0008.0042] # LATIN CAPITAL LETTER B
0043 ; [.0FFE.0020.0008.0043] # LATIN CAPITAL LETTER C
Result: 0FD0.0FE6.0FFE.0020.0020.0020.
MySQL's weight_string() function produces a weight string.
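As a rough illustration of the “binary string” in the definition, the ABC example can be packed into bytes. The sketch below assumes each weight is stored as a 2-byte big-endian value, which is a choice made for this example and not a statement about how any particular server stores weights; the function name make_weight_string is hypothetical and is not MySQL's function.

```python
import struct

# Weights for levels 1 to 4, copied from the allkeys.txt 5.0.0 entries above.
TABLE = {
    "A": (0x0FD0, 0x0020, 0x0008, 0x0041),
    "B": (0x0FE6, 0x0020, 0x0008, 0x0042),
    "C": (0x0FFE, 0x0020, 0x0008, 0x0043),
}


def make_weight_string(text, levels):
    """Emit all level-1 weights, then all level-2 weights, and so on."""
    weights = []
    for level in range(levels):      # outer loop: level
        for ch in text:              # inner loop: position in the string
            weight = TABLE[ch][level]
            if weight != 0:          # 0000 contributes nothing
                weights.append(weight)
    # Assumption: 2 bytes per weight, big-endian.
    return struct.pack(f">{len(weights)}H", *weights)


result = make_weight_string("ABC", 2)
readable = result.hex(".", 2).upper()
```

Printing the bytes in groups of two, as in the last line, gives the same dotted form as the result shown above. Comparing two such byte strings with an ordinary binary comparison then orders the original strings by level 1 first and level 2 second.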
WEIGHTING ELEMENT: A sequence of weights, in ascending order by level.
For example, from allkeys.txt 5.0.0:
00DF ; [.11AF.0020.0004.00DF][.0000.0199.0004.00DF][.11AF.0020.001F.00DF] # LATIN SMALL LETTER SHARP S
There are three weighting elements in this example, each surrounded by square brackets:
[.11AF.0020.0004.00DF]
[.0000.0199.0004.00DF]
[.11AF.0020.001F.00DF]
Often one collating element has only one weighting element (which has many weights), but SHARP S is an example of expansion.
ZERO WEIGHTS: The meaning is “an empty sequence of weights” (the ISO 14651 definition), not “weights with value 0000” (the UCA definition).
For example (from allkeys.txt):
0591 ; [.0000.0000.0000.0591] # HEBREW ACCENT ETNAHTA
There are three “empty sequences of weights” here, all of which look like 0000, which we interpret as code for “empty”.