Internationalization Topics

Email

Internet email was originally a 7-bit-only data transfer protocol, defined by RFC 822. RFC 822 requires email systems to support only 7-bit ASCII for messages and headers, and 7-bit ASCII remains the baseline for email delivery over the Internet. RFCs 2045-2049 introduced Multipurpose Internet Mail Extensions (MIME), a standard that extends RFC 822. MIME solved the problem of delivering email whose text uses non-ASCII character sets (for example, text encoded as Shift-JIS or UTF-8). See RFCs 2045-2049 for additional information.

The character set should be specified in the MIME header. For the best compatibility with mail readers, a native encoding should be used: convert the text from the Unicode-compliant text repository to a native character encoding, and set the MIME charset parameter to match. For example:

charset=utf-8
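
The conversion step described above can be sketched in Python with the standard library email package. This is only an illustration under stated assumptions: the sample text and the choice of ISO-8859-1 as the native encoding are hypothetical, and a real application would pick the charset that suits its audience.

```python
from email.message import EmailMessage

# Sample Unicode text taken from the text repository (hypothetical content).
text = "Grüße aus Köln"

# Assumed native encoding for the intended readers.
native_charset = "iso-8859-1"

msg = EmailMessage()
# set_content() encodes the text to the given charset and sets the
# charset parameter of Content-Type to match. The cte argument selects
# a transfer encoding that keeps the message 7-bit clean.
msg.set_content(text, charset=native_charset, cte="quoted-printable")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
```

If the text contains characters that the chosen native encoding cannot represent, set_content() raises a UnicodeEncodeError, so a caller may want to fall back to UTF-8 in that case.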

MIME provides a mechanism called "Content Transfer Encoding" for converting 8-bit data to a 7-bit clean format. The two choices relevant here are Quoted-Printable and Base-64.

For text that is mainly ASCII with some extended characters (e.g. most ISO-Latin-1 text), Quoted-Printable will be more compact. Base-64 will be more compact for text that does not contain much ASCII, such as most Asian texts.
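
The size trade-off can be checked with a small Python sketch that uses the standard quopri and base64 modules. The helper name encoded_sizes and both sample strings are assumptions made for this illustration.

```python
import base64
import quopri


def encoded_sizes(text, charset):
    """Return (quoted_printable_length, base64_length) for the text."""
    raw = text.encode(charset)
    qp = quopri.encodestring(raw)   # each non-ASCII byte becomes =XX
    b64 = base64.b64encode(raw)     # every 3 bytes become 4 characters
    return len(qp), len(b64)


# Mostly ASCII text with a few extended characters.
latin_qp, latin_b64 = encoded_sizes(
    "Résumé attached, see you in Zürich", "iso-8859-1")

# Text with no ASCII characters at all.
japanese_qp, japanese_b64 = encoded_sizes(
    "日本語のテキストです", "shift_jis")

print("Latin-1:   QP", latin_qp, "Base-64", latin_b64)
print("Shift-JIS: QP", japanese_qp, "Base-64", japanese_b64)
```

For the first string Quoted-Printable gives the shorter result, and for the second Base-64 does.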

Here is an example of the header information that would indicate data originally in ISO-Latin-1 and then encoded using the Base-64 algorithm for transport:

Content-Type: text/plain; charset=iso-8859-1
Content-transfer-encoding: base64

A user-agent (email client program) that conforms to MIME would reverse the Base-64 encoding and treat the resulting data as ISO-Latin-1 text. Note that many user agents do not support Unicode, which is the reason for using native encodings. The algorithms for encoding data using either Base-64 or Quoted-Printable are specified in detail in RFC 2045 and are therefore not reproduced here.
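
What a conforming user agent does on receipt can be sketched with the Python standard library. The message below is a made-up example built in the code itself; the sample text and variable names are assumptions for illustration only.

```python
import base64
from email import message_from_string

# Build a tiny sample message: ISO-Latin-1 text, Base-64 encoded.
original = "Café crème"
payload = base64.b64encode(original.encode("iso-8859-1")).decode("ascii")
raw_message = (
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + payload + "\n"
)

msg = message_from_string(raw_message)

# Step 1: reverse the transfer encoding named in the header.
body_bytes = msg.get_payload(decode=True)

# Step 2: interpret the bytes using the declared character set.
charset = msg.get_content_charset()
text = body_bytes.decode(charset)

print(charset)
print(text)
```

The two steps are kept separate on purpose: the transfer encoding only yields bytes, and the charset parameter is what says how those bytes map to characters.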

Email header information is what enables characters outside of the ASCII range. Content-Type describes which character set the contained message uses (e.g. the ISO 8859-1 character set). Content-transfer-encoding describes which mechanism encodes the message body (e.g. Quoted-Printable or Base-64).