Internationalization Topics

Collation in Java

If database collation is not acceptable (as is the case with some database products, such as MySQL), Java's Collator class lets the application perform string comparisons for different languages. You invoke the Collator.compare method to perform a locale-sensitive string comparison. The compare method returns an integer less than, equal to, or greater than zero when the first string argument is less than, equal to, or greater than the second string argument.
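As a minimal sketch of that return-value contract, the following reduces each result of compare to -1, 0, or 1. The class name CompareSignSketch and the helper sign are illustrative names, not part of the JDK, and the non-ASCII letter is written as a Unicode escape so the example does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.Locale;

public class CompareSignSketch {
  // Reduce a compare() result to -1, 0, or 1 so the sign is easy to read.
  static int sign(Collator collator, String a, String b) {
    return Integer.signum(collator.compare(a, b));
  }

  public static void main(String[] args) {
    Collator german = Collator.getInstance(Locale.GERMAN);

    // "\u00e4pple" is the word written with a-umlaut as its first letter.
    System.out.println(sign(german, "\u00e4pple", "banan")); // -1: first argument sorts earlier
    System.out.println(sign(german, "banan", "banan"));      //  0: arguments are equal
    System.out.println(sign(german, "orange", "banan"));     //  1: first argument sorts later
  }
}
```

Only the sign of the result is meaningful; the magnitude is not specified, which is why the sketch normalizes it.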

Use this class to build searching and sorting routines for natural language text.
Collator is an abstract base class. Subclasses implement specific collation strategies. One subclass, RuleBasedCollator, is currently provided with the JDK and is applicable to a wide set of languages. Other subclasses may be created to handle needs that are more specialized.
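For specialized needs, one option that avoids writing a new subclass is to construct a RuleBasedCollator from your own rule string. The sketch below assumes a deliberately artificial rule set that orders "c" before "b" before "a"; the class name CustomRulesSketch and the rule string are hypothetical and only meant to show the mechanism.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class CustomRulesSketch {
  // Hypothetical rules: each "<" introduces a primary difference,
  // so this orders c before b before a.
  static final String REVERSED_ABC = "< c < b < a";

  static RuleBasedCollator reversedAbc() {
    try {
      return new RuleBasedCollator(REVERSED_ABC);
    } catch (ParseException e) {
      // The constructor rejects malformed rule strings.
      throw new IllegalStateException("Invalid collation rules", e);
    }
  }

  public static void main(String[] args) {
    RuleBasedCollator collator = reversedAbc();
    System.out.println(collator.compare("c", "a") < 0); // true: c sorts before a
    System.out.println(collator.compare("a", "b") > 0); // true: a sorts after b
  }
}
```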

Like other locale-sensitive classes, Collator provides a static factory method, getInstance, that returns the appropriate Collator object for a given locale. You only need to look at the subclasses of Collator if you need to understand the details of a particular collation strategy or if you need to modify that strategy.
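The factory methods can be sketched as follows. The class name CollatorFactorySketch and the helper collatorFor are illustrative; which locales are reported by getAvailableLocales depends on the locale data installed with the runtime, so no particular count is assumed here.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorFactorySketch {
  // Obtain a collator for an explicit locale.
  static Collator collatorFor(Locale locale) {
    return Collator.getInstance(locale);
  }

  public static void main(String[] args) {
    // Collator for a specific locale.
    Collator german = collatorFor(Locale.GERMAN);

    // Collator for the JVM's default locale.
    Collator byDefault = Collator.getInstance();

    // Locales for which the runtime reports localized collators.
    Locale[] available = Collator.getAvailableLocales();

    System.out.println("German collator class: " + german.getClass().getName());
    System.out.println("Default collator class: " + byDefault.getClass().getName());
    System.out.println("Locales reported: " + available.length);
  }
}
```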

You can set a Collator's strength property to determine the level of difference considered significant in comparisons. Four strengths are provided: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The exact assignment of strengths to language features is locale dependent. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences, and "e" and "e" are identical.
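The effect of the strength setting can be sketched as below. This is an illustration under assumptions: it uses the US English collator rather than Czech, and "e" with an acute accent (written as the escape \u00e9) stands in for the accented letter. The class name StrengthSketch and the helper sameAt are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthSketch {
  // Returns true when the two strings compare as equal at the given strength.
  static boolean sameAt(int strength, String a, String b) {
    Collator collator = Collator.getInstance(Locale.US);
    collator.setStrength(strength);
    return collator.compare(a, b) == 0;
  }

  public static void main(String[] args) {
    // Different base letters differ even at the weakest setting.
    System.out.println(sameAt(Collator.PRIMARY, "e", "f"));

    // An accent is ignored at PRIMARY but significant at SECONDARY.
    System.out.println(sameAt(Collator.PRIMARY, "e", "\u00e9"));
    System.out.println(sameAt(Collator.SECONDARY, "e", "\u00e9"));

    // Case is ignored at SECONDARY but significant at TERTIARY.
    System.out.println(sameAt(Collator.SECONDARY, "e", "E"));
    System.out.println(sameAt(Collator.TERTIARY, "e", "E"));

    // Identical strings are equal at every strength.
    System.out.println(sameAt(Collator.IDENTICAL, "e", "e"));
  }
}
```

Creating a new Collator on every call keeps the sketch short; real code would normally configure one instance and reuse it.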

Java Collation Example

Look at the following three strings: äpple, banan, and orange. They are listed in the correct order under German collation rules. An uninformed programmer might try to sort these strings using the following program:

public class IncorrectSort {
  public static void main(String [] argv) {
    String fruit[] = { "orange", "äpple", "banan"   };
    String tmp;

    for (int i = 0; i < fruit.length; i++) {
      for (int j = i + 1; j < fruit.length; j++) {
        if ( fruit[i].compareTo( fruit[j] ) > 0 ) {
          // Swap fruit[i] and fruit[j]
          tmp = fruit[i];
          fruit[i] = fruit[j];
          fruit[j] = tmp;
        }
      }
    }

    for (int k = 0; k < fruit.length; k++)
      System.out.println(fruit[k]);
  }
}

The program sorts the strings incorrectly as banan, orange, äpple. It does this because String.compareTo compares the numeric values of the characters, and the encoded value of "ä" (U+00E4) is greater than the values of "b" and "o".
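To make the encoded values concrete, here is a small sketch that prints them. The class name CodePointSketch and the helper firstCharValue are illustrative, and "ä" is written as the escape \u00e4 so the result does not depend on the source file encoding.

```java
public class CodePointSketch {
  // Numeric value of the first character of a string.
  static int firstCharValue(String s) {
    return s.charAt(0);
  }

  public static void main(String[] args) {
    System.out.println(firstCharValue("\u00e4pple")); // 228
    System.out.println(firstCharValue("banan"));      // 98
    System.out.println(firstCharValue("orange"));     // 111

    // compareTo returns the difference of the first differing characters:
    // 228 - 98 = 130, a positive value, so the a-umlaut word sorts last.
    System.out.println("\u00e4pple".compareTo("banan")); // 130
  }
}
```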

Below is the correct way to sort these strings:

import java.util.Locale;
import java.text.Collator;

public class CorrectSort {
  public static void main(String [] argv) {
    String fruit[] = { "orange", "äpple", "banan"   };
    String tmp;
    Collator collate = Collator.getInstance(Locale.GERMAN);

    for (int i = 0; i < fruit.length; i++) {
      for (int j = i + 1; j < fruit.length; j++) {
        if ( collate.compare( fruit[i], fruit[j] ) > 0 ) {
          // Swap fruit[i] and fruit[j]
          tmp = fruit[i];
          fruit[i] = fruit[j];
          fruit[j] = tmp;
        }
      }
    }

    for (int k = 0; k < fruit.length; k++)
      System.out.println(fruit[k]);
  }
}

In this example, the strings properly sort as äpple, banan, and orange.
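Because Collator implements java.util.Comparator, the hand-written loops are not strictly necessary. A shorter sketch, assuming the same three strings, passes the collator to Arrays.sort; the class name LibrarySortSketch and the helper sortedCopy are illustrative names.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class LibrarySortSketch {
  // Return a copy of the input sorted by the collation rules of the given locale.
  static String[] sortedCopy(String[] words, Locale locale) {
    String[] copy = words.clone();
    // A Collator can be passed wherever a Comparator is expected.
    Arrays.sort(copy, Collator.getInstance(locale));
    return copy;
  }

  public static void main(String[] args) {
    // "\u00e4pple" is the word that starts with a-umlaut.
    String[] fruit = { "orange", "\u00e4pple", "banan" };
    for (String word : sortedCopy(fruit, Locale.GERMAN)) {
      System.out.println(word);
    }
  }
}
```

With the German collator this yields the same order as the CorrectSort program, and the original array is left unchanged.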

The following example shows how to compare two strings using the Collator for the default locale:

// Compare two strings in the default locale
Collator myCollator = Collator.getInstance();
if( myCollator.compare("abc", "ABC") < 0)
  System.out.println("abc is less than ABC");
else
  System.out.println("abc is greater than or equal to ABC");

The following shows how both case and accents could be ignored for US English:

// Get the Collator for US English and set its strength to PRIMARY
Collator usCollator = Collator.getInstance(Locale.US);
usCollator.setStrength(Collator.PRIMARY);
if(usCollator.compare("abc", "ABC") == 0) {
  System.out.println("Strings are equivalent");