Internationalization Topics

Collation

Collation is a set of locale-dependent rules used to sort characters and words. US English sorts text by the letters A-Z, but other languages and locales follow different rules:

  • The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
  • Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
  • Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".
  • Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z", at the end of the alphabet following "Æ" and "Ø".
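
Two of the rules above can be sketched with the standard java.text classes. This is a minimal illustration, not a complete tailoring: the class name, method names, and the rule string are hypothetical, and the rule string covers only unaccented lowercase letters. It declares "ch" as a letter of its own between "c" and "d", and separately uses PRIMARY strength so that accent differences are ignored.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollationRulesSketch {

    // Hypothetical, simplified rule string (lowercase letters only).
    // "ch" is listed as a single collation element between "c" and "d".
    static final String CH_AS_LETTER_RULES =
        "< a < b < c < ch < d < e < f < g < h < i < j < k < l < m < n"
        + " < o < p < q < r < s < t < u < v < w < x < y < z";

    // Sorts a copy of the list, treating "ch" as one letter after "c".
    static List<String> sortWithChAsLetter(List<String> words) {
        try {
            Collator collator = new RuleBasedCollator(CH_AS_LETTER_RULES);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            // The rule string is a constant, so this indicates a coding error.
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    // Compares base letters only. PRIMARY strength ignores accent
    // differences (and case differences as well).
    static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "cosa" sorts before "chico" because "ch" follows plain "c".
        System.out.println(sortWithChAsLetter(Arrays.asList("dama", "chico", "cosa")));
        // \u00e9 is "é"; written as an escape to avoid source-encoding issues.
        System.out.println(sameBaseLetters("e", "\u00e9"));
    }
}
```

In practice a collator obtained for the target locale is usually preferable to hand-written rules; the explicit rule string is used here only to make the "ch" behavior visible.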

Ideographic scripts (such as Chinese Han characters and Japanese Kanji) have different collation rules altogether. Ideographic text may be collated according to pronunciation, radical, structure, or stroke count.

Collation of numeric, time, and date data presents no international issues for commercial applications, provided the values are compared as typed values rather than as locale-formatted strings.

Text displayed to the user must be collated in a locale-sensitive manner. Collation should preferably be carried out in the database (see Oracle Collation for more information). If data must be collated in the program logic layer, programmers should rely on locale-specific sorting facilities rather than plain string comparison. In such cases, locale-specific collation should be carried out by either a C/C++ Collator or a Java Collator.
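
As a rough sketch of collation in the program logic layer, the following uses java.text.Collator from the Java standard library. The class and method names are hypothetical, and the locale is assumed to be supplied by the caller (for example, from the user's session). The collator is passed to the sort as a comparator, so an accented word such as "éclair" is placed next to the other words beginning with "e" instead of after "z", which is where a plain code-point comparison would put it.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LocaleSortSketch {

    // Returns a sorted copy of the items using the collation rules
    // the Java runtime provides for the given locale.
    static List<String> sortForLocale(List<String> items, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(items);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // \u00e9clair is "éclair"; written as an escape to avoid
        // source-encoding issues.
        List<String> words = Arrays.asList("zoo", "\u00e9clair", "eagle");

        // Locale-sensitive order: the accented word sorts with the "e" words.
        System.out.println(sortForLocale(words, Locale.ENGLISH));

        // Plain String ordering compares code units, so the accented
        // word ends up after "zoo".
        List<String> naive = new ArrayList<>(words);
        naive.sort(null);
        System.out.println(naive);
    }
}
```

The exact ordering for a given locale depends on the collation data shipped with the Java runtime, so results for less common locales should be checked against the target environment.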