Internationalization Topics

Locales in C/C++

A locale identifier is a string composed of two or three elements: a language, the region in which that language is used, and an optional variant. In a somewhat oversimplified view, an application uses the language portion to display text appropriate to the locale, and the region portion to format dates, times, currency, and so on. Examples of locales are:

Locale      Language  Region          Variant
en_US       English   USA             None
de_AT       German    Austria         None
en_GB       English   United Kingdom  None
fr_FR       French    France          None
fr_FR_Euro  French    France          The variant specifies that currency is to be displayed as Euro, rather than Francs

Abbreviations for the language portion of the locale, always written in lower case, are defined by ISO 639. Abbreviations for the region (country) portion of the locale, written in upper case, are defined by ISO 3166. The variant field is defined by the run-time environment and differs between IBM, Sun, and Microsoft. For example, IBM and Sun support the _Euro variant but Microsoft does not. Oracle uses its own language and region designators rather than standard locale designators.
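To make the language_REGION_Variant structure concrete, the following is a minimal sketch of splitting an identifier into its parts. The LocaleParts type and the parseLocaleId function are hypothetical names introduced here for illustration; they are not part of any platform or ICU API, and the sketch assumes the underscore-separated form used in the table above.

```cpp
#include <string>

// Hypothetical helper type: the three parts of a locale identifier.
struct LocaleParts {
    std::string language;  // e.g. "fr"
    std::string region;    // e.g. "FR"; empty when absent
    std::string variant;   // e.g. "Euro"; empty when absent
};

// Split an identifier such as "fr_FR_Euro" on underscores.
// Everything after the second underscore is kept whole as the variant.
LocaleParts parseLocaleId(const std::string& id) {
    LocaleParts parts;
    std::size_t first = id.find('_');
    parts.language = id.substr(0, first);  // whole string if no underscore
    if (first == std::string::npos) {
        return parts;
    }
    std::size_t second = id.find('_', first + 1);
    if (second == std::string::npos) {
        parts.region = id.substr(first + 1);
    } else {
        parts.region = id.substr(first + 1, second - first - 1);
        parts.variant = id.substr(second + 1);
    }
    return parts;
}
```

Calling parseLocaleId("fr_FR_Euro") would give the language "fr", the region "FR", and the variant "Euro". Real locale libraries apply more validation than this sketch does.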

For applications that can be fully converted to Unicode, Lingoport recommends the use of the ICU Locale. Applications relying on the platform character set should use the ANSI C locale.

ANSI C Locale

According to the glibc documentation, the default locale is available to all the objects in a program. If you set a new default locale for one section of code, it can affect the entire program. Application programs should not set the default locale as a way to request an international object. The default locale is set to be the system locale on that platform.

A C program inherits the locale environment variables upon startup. However, by default, these variables do not control the locale used by library functions. To make the library use these environment variables, you must call setlocale with an empty locale name:

setlocale (LC_ALL, "");
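The effect of that call can be observed by querying the locale before and after it. The sketch below uses the standard setlocale function through the C++ header <clocale>; the two wrapper functions, currentLocaleName and adoptEnvironmentLocale, are hypothetical names chosen for this example.

```cpp
#include <clocale>
#include <string>

// Return the name of the locale currently in effect for all categories.
// Passing a null pointer makes setlocale a pure query; nothing is changed.
std::string currentLocaleName() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Adopt the locale described by the environment variables.
// Returns false if the environment names a locale that is not available,
// in which case the current locale is left unchanged.
bool adoptEnvironmentLocale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

At program startup currentLocaleName() reports the "C" locale regardless of the environment. Only after adoptEnvironmentLocale() succeeds does the name reflect the user's settings, and the exact string returned then depends on the platform.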

The following locale categories can be passed to setlocale. Each name is also the name of an environment variable:

Category      Description
LC_COLLATE    This category applies to collation of strings.
LC_CTYPE      This category applies to classification and conversion of characters.
LC_MONETARY   This category applies to formatting monetary values.
LC_NUMERIC    This category applies to formatting numeric values that are not monetary.
LC_TIME       This category applies to formatting date and time values.
LC_ALL        This is not a category; it is only a macro that you can use with setlocale to set a single locale for all purposes. As an environment variable, it overrides the other LC_* variables and LANG.
LANG          This is an environment variable only, not a category. If it is defined, its value specifies the locale to use for all purposes except as overridden by the variables above.
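Because a setlocale call affects the whole program, it can help to change a single category for a limited scope and then restore the previous value. The following is a minimal sketch of that idea; the ScopedCategory class and the decimalPoint function are hypothetical names introduced for this example, and the sketch assumes no other thread changes the locale at the same time.

```cpp
#include <clocale>
#include <string>

// Decimal separator of the current LC_NUMERIC locale, read via localeconv().
std::string decimalPoint() {
    return std::localeconv()->decimal_point;
}

// Hypothetical RAII guard: sets one locale category on construction and
// restores the previous setting on destruction.
class ScopedCategory {
public:
    ScopedCategory(int category, const char* name) : category_(category) {
        // Copy the old name first; the returned buffer may be overwritten
        // by the next setlocale call.
        const char* old = std::setlocale(category, nullptr);
        saved_ = old ? old : "C";
        ok_ = std::setlocale(category, name) != nullptr;
    }
    ~ScopedCategory() { std::setlocale(category_, saved_.c_str()); }

    ScopedCategory(const ScopedCategory&) = delete;
    ScopedCategory& operator=(const ScopedCategory&) = delete;

    // False when the requested locale is not available on this system.
    bool ok() const { return ok_; }

private:
    int category_;
    std::string saved_;
    bool ok_ = false;
};
```

A caller might construct ScopedCategory guard(LC_NUMERIC, "C") around code that must always parse or print a period as the decimal separator, leaving the other categories untouched.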

Visual C++

In Visual C++, the locale is a unique combination of language, country/region, and code page. It is specified using the setlocale() function for an MBCS application, _wsetlocale() for a UTF-16 Unicode application, or _tsetlocale() for a generic application, where the _MBCS and _UNICODE compiler flags determine which function is actually called. For example, using the wide-character _wsetlocale() function:

#include <locale.h>

const wchar_t *fr_FR = L"French_France";

_wsetlocale(LC_ALL, fr_FR);      /* set every locale category */
_wsetlocale(LC_NUMERIC, fr_FR);  /* set only the numeric category */

This code sets the locale to fr_FR. The first call to _wsetlocale() sets all locale information to fr_FR, while the second sets only the numeric locale information to fr_FR. The second call is redundant after the first and is shown only to illustrate setting a single category.

Subsequent calls to wprintf() format numeric output according to the locale set by _wsetlocale().

See the MSDN Library for the list of language and country strings that can be used to specify the locale.

ICU Locale for C

In ICU for C, a locale is a character string. For example, to set a locale based upon Belgian French with a Euro currency convention:

const char *loc = "fr_BE_EURO";

The locale string will be used by the ICU components to perform various locale-based formatting activities. For example, the following creates various number formatters for the German locale:

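#include <unicode/unum.h>

UErrorCode status = U_ZERO_ERROR;
UNumberFormat *nf;

/* Arguments: style, pattern and its length (unused for these styles),
   locale, optional parse-error output, status */
nf = unum_open(UNUM_DEFAULT, NULL, 0, "de_DE", NULL, &status);
unum_close(nf);
nf = unum_open(UNUM_CURRENCY, NULL, 0, "de_DE", NULL, &status);
unum_close(nf);
nf = unum_open(UNUM_PERCENT, NULL, 0, "de_DE", NULL, &status);
unum_close(nf);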

More information can be found at the ICU User Guide website.

Standard C++ Locale

In C++, the locale class is an abstraction that manages locale facets, which are separate classes that each encapsulate a specific piece of internationalization functionality. More information on using the standard locale for C++ can be found at cantrip.org. The following describes the capabilities of the standard C++ facets:

  • Code Conversion: The facet codecvt<internT,externT,stateT> is used when converting from one encoding scheme to another, such as from the multibyte encoding JIS to the wide-character encoding Unicode. The main member functions are in() and out().
  • Collate: The facet collate<charT> provides features for string collation, including a compare() function used for string comparison.
  • Ctype: The facet ctype<charT> encapsulates the Standard C++ Library ctype features for character classification, like tolower(), toupper(), is(ctype_base::space,...) etc.
  • Messages: The facet messages<charT> implements message retrieval. It provides facilities to access message catalogues via open() and close(catalog), and to retrieve messages via get(..., int msgid,...).
  • Monetary: The facets money_get<charT,InputIterator> and money_put<charT,OutputIterator> handle formatting and parsing of monetary values. They provide get() and put() member functions that parse or produce a sequence of digits representing a count of the smallest unit of the currency. For example, the sequence $1,056.23 in a common US locale would yield 105623 units, or the character sequence "105623". The facet moneypunct<charT,bool International> handles monetary formats and punctuation, just as the facet numpunct<charT> handles numeric formats and punctuation. It comes with functions like curr_symbol(), etc.
  • Numeric: The facets num_get<charT,InputIterator> and num_put<charT,OutputIterator> handle numeric formatting and parsing. The facets provide get() and put() member functions for values of type long, double, etc. The facet numpunct<charT> specifies numeric formats and punctuation. It provides functions like decimal_point(), thousands_sep(), etc.
  • Time: The facets time_get<charT,InputIterator> and time_put<charT,OutputIterator> handle date and time formatting and parsing. They provide functions like put(), get_time(), get_date(), get_weekday(), etc.
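As an illustration of how the facets listed above are reached from a locale object, the following sketch uses std::use_facet to obtain the Ctype and Collate facets. The functions toUpper and compareStrings are hypothetical wrappers written for this example. It is only exercised here with the classic "C" locale, since which named locales are installed varies by system.

```cpp
#include <locale>
#include <string>

// Upper-case a string through the ctype facet of the given locale.
std::string toUpper(std::string text, const std::locale& loc) {
    const std::ctype<char>& ct = std::use_facet<std::ctype<char>>(loc);
    // The range overload converts the characters in [low, high) in place.
    ct.toupper(&text[0], &text[0] + text.size());
    return text;
}

// Compare two strings through the collate facet of the given locale.
// The result is negative, zero, or positive, like strcmp.
int compareStrings(const std::string& a, const std::string& b,
                   const std::locale& loc) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

For example, toUpper("locale", std::locale::classic()) yields "LOCALE". Passing a different std::locale object changes the behavior without changing the calling code, which is the point of the facet design.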

More information about the C++ locale and facets can be found here.
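A locale can also be assembled from an existing locale plus a replacement facet. The following minimal sketch derives from numpunct<char> to hard-code German-style punctuation and imbues it into a stream. GermanStylePunct and formatGrouped are hypothetical names for this example. A real application would normally construct a named locale instead, but named locales depend on what is installed on the system.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: German-style numeric punctuation, hard-coded.
struct GermanStylePunct : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    // "\3" means: group digits in threes, repeating.
    std::string do_grouping() const override { return "\3"; }
};

// Format a value with two decimals using the custom punctuation.
std::string formatGrouped(double value) {
    std::ostringstream out;
    // The new locale takes ownership of the facet and deletes it later.
    out.imbue(std::locale(std::locale::classic(), new GermanStylePunct));
    out << std::fixed << std::setprecision(2) << value;
    return out.str();
}
```

With this facet in place, formatGrouped(1234567.89) produces "1.234.567,89". The num_put facet does the digit formatting and consults numpunct for the separators.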

ICU for C++

In ICU for C++, locales are represented by the Locale class. Locale objects can be created according to a user's preference and then passed as arguments to functions that require locale-sensitive processing. Note that ICU locales do not specify the character encoding used by the operating system.

For example, to set a locale based upon Belgian French with a Euro currency convention:

Locale loc("fr", "BE", "EURO");

The locale object will be used by the ICU components to perform various locale-based formatting activities. For example, the following creates various number formatters for the "Germany" locale:

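#include <unicode/locid.h>
#include <unicode/numfmt.h>
using namespace icu;

UErrorCode status = U_ZERO_ERROR;
NumberFormat *nf;

// Each factory call returns a new formatter that the caller must delete.
nf = NumberFormat::createInstance(Locale::getGermany(), status);
delete nf;
nf = NumberFormat::createCurrencyInstance(Locale::getGermany(), status);
delete nf;
nf = NumberFormat::createPercentInstance(Locale::getGermany(), status);
delete nf;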