Internationalization Topics

Unicode

Unicode is a character set that provides a unified representation for most of the world's characters. It is developed, maintained, and promoted by the Unicode Consortium, a nonprofit computer industry organization. The Unicode character set can be encoded with several different schemes, such as:

  • fixed-width, 2-byte encoding (UCS-2, extended by UTF-16 to cover characters outside the original 2-byte range; commonly used on Microsoft Windows platforms)
  • fixed-width, 4-byte encoding (UTF-32, a.k.a. UCS-4; commonly used on GNU systems)
  • variable-width, multibyte encoding (UTF-8, which uses one to four 8-bit bytes per character and is commonly used on UNIX platforms, or UTF-7, which is used for compatibility with legacy email systems).
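
As a rough illustration of the fixed-width versus variable-width distinction, the following sketch (written in Python purely for brevity; the codec names are the standard Python identifiers) encodes the same short string with each scheme and prints the resulting byte counts.

    # Sketch: byte counts for the same text under different Unicode encodings.
    # The sample string mixes ASCII, an accented Latin character, and a CJK character.
    text = "Ab\u00e9\u4e2d"  # 'A', 'b', 'é', '中' -- four characters

    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(codec)
        print(f"{codec:10s} {len(data):2d} bytes  {data.hex(' ')}")

    # Expected output shape:
    #   utf-8       7 bytes  (1 + 1 + 2 + 3 bytes per character)
    #   utf-16-le   8 bytes  (2 bytes per character; all four are in the 2-byte range)
    #   utf-32-le  16 bytes  (4 bytes per character, always)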

Unicode maps code points to characters, but does not actually specify how the data will be represented in memory, in a database, or on a Web page. This is where the actual encoding of Unicode data comes into play. Some common encodings are:

  • UCS-2: The fixed-width, 2-byte form of the Universal Character Set, which maps the character set definition directly to 16-bit code units. UCS-2 is the main Unicode encoding used by Microsoft Windows NT® 4.0, Microsoft SQL Server version 7.0, and Microsoft SQL Server 2000. UCS-2 allows for the encoding of 65,536 different code points. All information that is stored in Unicode (via NCHAR, NVARCHAR, and NTEXT) in SQL Server 2000 is stored in this encoding, which uses 2 bytes for every character, regardless of the character being stored.
  • UTF-16: A UCS Transformation Format, which transforms a UCS representation so that the data can be passed reliably through specific environments and extends the repertoire beyond the 65,536-code-point limit of UCS-2. UTF-16 is identical to the 16-bit encoding form of Unicode (UCS-2) for the first 65,536 code points, but it can also map code points beyond that range by using encoded pairs of 16-bit values known as surrogate pairs (see the sketch after this list). It is the primary Unicode encoding scheme used by Microsoft Windows 2000.
  • UTF-32: Also known as UCS-4, this encoding uses 32 bits per character and therefore covers all of ISO 10646. It is the primary Unicode encoding scheme used by GNU systems running on UNIX/Linux platforms.
  • UTF-8: Many ASCII and other byte-oriented systems that require 8-bit encodings (such as mail servers) must span a vast array of computers that use different encodings, byte orders, and languages. UTF-8 represents Unicode data as a sequence of bytes, independent of the byte ordering of the computer. Keep in mind that when Microsoft speaks of Unicode, it means UCS-2 and UTF-16; Microsoft considers UTF-8 to be just another multibyte character set, whereas much of the rest of the industry uses the word Unicode and UTF-8 nearly interchangeably.
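
The practical difference between UCS-2 and UTF-16 shows up only for code points beyond the first 65,536. The following sketch (again in Python, for illustration only) encodes one such character and shows that UTF-16 represents it as a surrogate pair of two 16-bit code units, which a UCS-2-only system would treat as two separate characters.

    import struct

    # U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the 2-byte range,
    # so UCS-2 cannot represent it directly.
    ch = "\U00010400"

    data = ch.encode("utf-16-le")           # 4 bytes: one surrogate pair
    high, low = struct.unpack("<2H", data)  # two little-endian 16-bit code units

    print(f"code point   U+{ord(ch):04X}")            # U+10400
    print(f"UTF-16 units 0x{high:04X} 0x{low:04X}")   # 0xD801 0xDC00

    # A UCS-2 consumer sees two unrelated code units (and counts two "characters");
    # a UTF-16 consumer recombines the pair back into the single code point U+10400.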

Unicode Support within Current Technologies

Many other database systems (such as Oracle and Sybase SQL Server) support Unicode using UTF-8 storage. Depending on a server's implementation, this can be technically easier for a database engine to implement, because the existing text-management code on the server that is designed to deal with data one byte at a time does not require major changes.
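
One reason for this is that UTF-8 is a strict superset of ASCII: every byte of a multibyte sequence has its high bit set, so ASCII delimiters, terminators, and search patterns can never appear inside an encoded character. The following Python sketch (illustrative only; it does not reflect any particular engine's internals) shows byte-at-a-time processing working unchanged on UTF-8 data.

    # Sketch: byte-oriented operations remain valid on UTF-8 data because a
    # continuation byte can never be mistaken for an ASCII delimiter such as ','.
    row = "M\u00fcller,K\u00f6ln,42".encode("utf-8")  # "Müller,Köln,42"; ü and ö take 2 bytes each

    # Splitting on the single ASCII byte b"," still finds exactly the right fields.
    fields = row.split(b",")
    print([f.decode("utf-8") for f in fields])        # ['Müller', 'Köln', '42']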

In the Windows environment, UTF-8 storage has these disadvantages:

  • The Component Object Model (COM) supports UTF-16/UCS-2 in its APIs and interfaces. Storing data as UTF-8 therefore requires converting it to UTF-16 whenever it is passed through a COM interface. This issue only applies when COM is used, as the SQL Server database engine does not typically call COM interfaces.
  • The Windows NT and Windows 2000 kernels are both Unicode; they use UCS-2 and UTF-16, respectively. Once again, a UTF-8 storage format requires conversions to UTF-16 (see the conversion sketch after this list). As with the previous note on COM, this would not result in a conversion hit in the SQL Server database engine, but it could affect many client-side operations.
  • UTF-8 can be slower for many string operations. Sorting, comparing, and virtually any string operation can be slowed by the fact that characters do not have a fixed width. However, UTF-8 keeps this as simple as possible by indicating the number of bytes in a multibyte sequence with the first byte of the sequence (see the lead-byte sketch after this list).
  • UTF-8 requires one byte for most Latin-based characters, two bytes for the characters of most Middle Eastern scripts, and three bytes for most Asian characters (the lead-byte sketch after this list shows one example of each). Overall, it is reasonably efficient with respect to storage.
  • XML's default encoding is UTF-8, as is the default for Oracle databases configured to handle multilingual data within a single database instance. To include localized strings in an XML document, convert the file and strings to UTF-8. Recent revisions of the XML standard suggest always declaring UTF-8 explicitly in the encoding declaration even though it is the default encoding (a minimal example follows this list). Microsoft XML parsers since Internet Explorer 4.0 write out XML as UTF-8.
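
To make the conversion cost mentioned in the first two items concrete, here is a minimal Python sketch of the round trip (a Windows implementation would more likely call an API such as MultiByteToWideChar, but the principle is the same).

    # Sketch: the extra step a UTF-8 storage format implies when a UTF-16/UCS-2
    # consumer (COM, the Windows kernel) needs the data.
    utf8_bytes = "Z\u00fcrich".encode("utf-8")                      # 7 bytes on disk ("Zürich")

    # Every trip to a UTF-16 consumer decodes the UTF-8 data and re-encodes it.
    utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")    # 12 bytes in memory

    print(len(utf8_bytes), len(utf16_bytes))                        # 7 12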
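
The lead-byte property and the per-script byte counts mentioned above can be illustrated with the following Python sketch; the helper function is hypothetical and only decodes the length information carried by the first byte of a sequence.

    # Sketch: the lead byte of a UTF-8 sequence encodes the sequence length,
    # and typical byte counts vary by script (one for basic Latin, two for many
    # Middle Eastern scripts, three for most Asian characters).

    def utf8_sequence_length(lead: int) -> int:
        """Return the number of bytes in a UTF-8 sequence, given its lead byte."""
        if lead < 0x80:
            return 1          # 0xxxxxxx: ASCII
        if lead >= 0xF0:
            return 4          # 11110xxx: four-byte sequence
        if lead >= 0xE0:
            return 3          # 1110xxxx: three-byte sequence
        if lead >= 0xC0:
            return 2          # 110xxxxx: two-byte sequence
        raise ValueError("continuation byte, not a lead byte")

    for ch in ("A", "\u05d0", "\u4e2d"):   # Latin 'A', Hebrew alef, CJK '中'
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s); "
              f"lead byte 0x{encoded[0]:02X} implies {utf8_sequence_length(encoded[0])}")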
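
Finally, for the XML point, the following Python sketch writes out a small document containing localized strings with an explicit UTF-8 declaration; the element names are invented for illustration.

    # Sketch: emitting an XML document with an explicit UTF-8 encoding declaration.
    import io
    import xml.etree.ElementTree as ET

    root = ET.Element("greetings")
    ET.SubElement(root, "greeting", attrib={"lang": "de"}).text = "Gr\u00fc\u00df Gott"
    ET.SubElement(root, "greeting", attrib={"lang": "ja"}).text = "\u3053\u3093\u306b\u3061\u306f"

    buf = io.BytesIO()
    # xml_declaration=True forces the <?xml ... encoding='UTF-8'?> prolog to be
    # written even though UTF-8 is already XML's default encoding.
    ET.ElementTree(root).write(buf, encoding="UTF-8", xml_declaration=True)
    print(buf.getvalue().decode("utf-8"))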