Internationalization Topics

ICU Collator for C/C++

The ICU Collator compares and sorts strings according to the collation rules defined for a locale. US English sorts text in plain A-Z order, but other languages follow different rules:

  • The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
  • Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
  • Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".
  • Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts after "Z", at the end of the alphabet.
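
The second rule above can be sketched without ICU. The following is a toy illustration only, not how ICU implements collation: it assumes lowercase ASCII words, and the helper names next_weight and toy_spanish_compare are hypothetical. Each letter gets an even weight, and the digraph "ch" gets the odd weight just above "c", which places it between "c" and "d".

```c
#include <stddef.h>

/* Hypothetical helper: returns the weight of the next collation element
   and advances *p past it. Plain letters get even weights; the digraph
   "ch" gets the odd weight just above 'c', so it sorts between c and d. */
static int next_weight(const char **p)
{
    const char *s = *p;
    if (s[0] == 'c' && s[1] == 'h') {
        *p = s + 2;                 /* consume both characters as one unit */
        return ('c' - 'a') * 2 + 1;
    }
    *p = s + 1;
    return (s[0] - 'a') * 2;
}

/* Toy comparator for lowercase ASCII words only (an assumption of this
   sketch). Returns a negative, zero, or positive value, like strcmp. */
static int toy_spanish_compare(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        int wa = next_weight(&a);
        int wb = next_weight(&b);
        if (wa != wb) {
            return wa - wb;
        }
    }
    /* When one word is a prefix of the other, the shorter one sorts first. */
    return (*a != '\0') - (*b != '\0');
}
```

With this sketch, toy_spanish_compare("chico", "cosa") returns a positive value, whereas strcmp("chico", "cosa") returns a negative one, because a byte-wise comparison sees "h" before "o". A real collator handles such cases through locale data rather than hard-coded weights.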

The following code illustrates the ICU Collator by using it as the comparison step of a simple bubble sort:

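/* Requires <unicode/ucol.h>. */
UChar *s[] = { /* list of Unicode strings */ };
uint32_t listSize = size_of_the_list;
UErrorCode status = U_ZERO_ERROR;
/* Open a collator for the US English locale. */
UCollator *coll = ucol_open("en_US", &status);
uint32_t i, j;
if(U_SUCCESS(status)) {
  /* Bubble sort. Counting i down from listSize avoids unsigned
     wrap-around when the list is empty. */
  for(i=listSize; i>1; i--) {
    for(j=0; j<i-1; j++) {
      /* -1 tells ICU that the strings are NUL-terminated.
         Swapping when s[j] sorts before s[j+1] leaves the list in
         descending order; test for UCOL_GREATER to sort ascending. */
      if(ucol_strcoll(coll, s[j], -1, s[j+1], -1) == UCOL_LESS) {
        UChar *tmp = s[j];
        s[j] = s[j+1];
        s[j+1] = tmp;
      }
    }
  }
  ucol_close(coll);
}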