File and Path Functions

Related Links

Wide Character File and Path Functions
Windows Generic File and Path Functions

Internationalization (I18n) Discussion:

This class of functions accepts single-byte or multibyte string arguments. In an ANSI UTF-16/32 or Windows UTF-16 Unicode application, use the equivalent wide-character function if one is available. In a Windows Generic application, call the equivalent generic function and define either _MBCS or _UNICODE so that it maps to the correct multibyte or wide-character function.
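
As a rough illustration of the generic mapping, the sketch below imitates what the TCHAR, _T, and _tfopen definitions in <tchar.h> do on Windows. The names app_tchar, APP_T, app_tfopen, and open_generic are hypothetical; they exist only so that the example also compiles on non-Windows systems, where it falls back to the narrow functions.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, _T() and _tfopen from <tchar.h>. */
#if defined(_WIN32) && defined(_UNICODE)
typedef wchar_t app_tchar;          /* wide-character build */
#define APP_T(s) L##s
#define app_tfopen _wfopen
#else
typedef char app_tchar;             /* single-byte or multibyte build */
#define APP_T(s) s
#define app_tfopen fopen
#endif

/* Opens a file through whichever function the build selected.
   Returns NULL on failure, exactly like fopen or _wfopen. */
FILE *open_generic(const app_tchar *path, const app_tchar *mode)
{
    assert(path != NULL && mode != NULL);
    return app_tfopen(path, mode);
}
```

A caller writes open_generic(APP_T("data.txt"), APP_T("r")) once, and the same source line compiles to a narrow or a wide call depending on the defines in effect.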

Internationalized Paths

This class of functions accepts path names as arguments. Path names may differ between localized versions of the operating system, so every hardcoded path name must be evaluated carefully in the context of internationalization. One possible solution is a locale-based path structure, in which the files for each locale live under a directory named for that locale. In some internationalized designs, it may be prudent to externalize the path name.
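
One way to picture a locale-based path structure with an externalized root is the sketch below. The environment variable APP_RESOURCE_ROOT, the default directory "resources", and both function names are assumptions made for illustration; a real application might read the root from a configuration file instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical environment variable and default directory. */
#define RESOURCE_ROOT_ENV     "APP_RESOURCE_ROOT"
#define RESOURCE_ROOT_DEFAULT "resources"

/* Returns the externalized root directory, or the default if unset. */
const char *resource_root(void)
{
    const char *root = getenv(RESOURCE_ROOT_ENV);
    return (root != NULL && root[0] != '\0') ? root : RESOURCE_ROOT_DEFAULT;
}

/* Builds "<root>/<locale>/<file_name>" into out.
   Returns 0 on success, -1 if the locale name is empty, contains a
   path separator, or the result does not fit in out_size bytes. */
int build_resource_path(char *out, size_t out_size, const char *root,
                        const char *locale, const char *file_name)
{
    int n;
    assert(out != NULL && root != NULL && locale != NULL && file_name != NULL);
    if (locale[0] == '\0' || strchr(locale, '/') != NULL)
        return -1;
    n = snprintf(out, out_size, "%s/%s/%s", root, locale, file_name);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For example, build_resource_path(buf, sizeof buf, resource_root(), "ja_JP", "help.txt") yields a path under the ja_JP directory without any locale-specific name appearing in the source code.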

Pathnames

In an internationalized application, it is important to understand the relationship between the target system's support of non-ASCII filenames and the C/C++ functions that are used to create and access files and paths.

For example, a Windows 95/98/ME system does not support UTF-16 Unicode natively, but it does support filenames containing non-ASCII characters from the system's multibyte code page. When a Windows Unicode application runs on such a system, each Unicode filename is converted to a multibyte string, and characters that the code page cannot represent may be lost. For a Windows MBCS application, the multibyte code page used by the application should be the same as the one used by the OS so that the character set is fully supported.

A Windows NT/2K/XP system supports Unicode natively, so a Windows Unicode application has the full range of characters available when creating or accessing filenames and paths. For a Windows MBCS application, however, the system's code page is used to convert each MBCS filename to a UTF-16 encoded filename. If the MBCS application uses non-ASCII filenames, its code page must therefore match the system's code page for those filenames to be converted correctly.

On Linux/Unix platforms, the filesystem often supports UTF-8, so filenames and paths should work as expected once wide-character strings have been converted to UTF-8 strings.
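
The conversion from a wide-character string to a UTF-8 filename is usually done with wcstombs under a UTF-8 locale. To stay independent of the installed locales, the sketch below encodes by hand instead. It assumes that each wchar_t holds one Unicode code point, which is typical on Linux but not on Windows, and the function name wide_to_utf8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Encodes src as UTF-8 into dst (including the terminating NUL).
   Assumes each wchar_t is one Unicode code point.
   Returns the number of bytes written, not counting the NUL, or
   (size_t)-1 on an invalid code point or if dst is too small. */
size_t wide_to_utf8(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return (size_t)-1;
    for (; *src != L'\0'; ++src) {
        unsigned long cp = (unsigned long)*src;
        unsigned char tmp[4];
        size_t len;
        if (cp < 0x80UL) {
            tmp[0] = (unsigned char)cp;
            len = 1;
        } else if (cp < 0x800UL) {
            tmp[0] = (unsigned char)(0xC0 | (cp >> 6));
            tmp[1] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000UL) {
            if (cp >= 0xD800UL && cp <= 0xDFFFUL)
                return (size_t)-1;      /* lone surrogate, not a character */
            tmp[0] = (unsigned char)(0xE0 | (cp >> 12));
            tmp[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            tmp[2] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 3;
        } else if (cp <= 0x10FFFFUL) {
            tmp[0] = (unsigned char)(0xF0 | (cp >> 18));
            tmp[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            tmp[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            tmp[3] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 4;
        } else {
            return (size_t)-1;          /* outside the Unicode range */
        }
        if (out + len >= dst_size)      /* keep room for the NUL */
            return (size_t)-1;
        memcpy(dst + out, tmp, len);
        out += len;
    }
    dst[out] = '\0';
    return out;
}
```

The resulting byte string can then be passed to narrow functions such as fopen or stat.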

File I/O

An important consideration in an internationalized application is how it handles reading and writing non-ASCII data to files.

Windows Platforms
On Windows platforms, file I/O operations take place in one of two translation modes, text or binary, depending on the mode in which the file is opened. The wide-character stream functions assume that a file contains multibyte characters when it is in text mode, and UTF-16 Unicode characters when it is in binary mode. The default mode is text, though this can be changed by setting the global variable _fmode in the program. Alternatively, binary mode can be requested for an individual file by passing the appropriate argument to a file-open function such as _open, fopen, freopen, or _fsopen, which overrides the current _fmode setting. The stdin, stdout, and stderr streams open in text mode by default, and this default can also be overridden. Use _setmode to change the translation mode of a file that is already open, using its file descriptor.
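
The effect of the translation mode can be observed by writing the same text and measuring the file afterwards. The following is a minimal sketch; the function name write_and_measure is hypothetical, and the comments about text mode describe the Windows behavior discussed above (on Linux/Unix the "b" flag has no effect).

```c
#include <assert.h>
#include <stdio.h>

/* Writes text to path using the given fopen mode ("w" for text,
   "wb" for binary), then reopens the file in binary mode and returns
   its size in bytes, or -1 on error. */
long write_and_measure(const char *path, const char *mode, const char *text)
{
    FILE *f;
    long size;
    assert(path != NULL && mode != NULL && text != NULL);

    f = fopen(path, mode);
    if (f == NULL)
        return -1;
    if (fputs(text, f) < 0) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    /* Binary mode here so that the size reflects the bytes on disk. */
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    if (fseek(f, 0L, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    size = ftell(f);
    fclose(f);
    return size;
}
```

Writing "a\nb\n" with mode "wb" stores four bytes on any platform. With mode "w" on Windows, each linefeed is expanded to CR-LF, so the same call would be expected to store six.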

In a Windows Unicode application, if the stream I/O routines, such as fwprintf, fwscanf, fgetwc, fputwc, fgetws, and fputws, operate on a file that is open in the default text mode, there are two kinds of character conversion that will take place:

    Unicode-to-MBCS or MBCS-to-Unicode conversion. As mentioned above, when a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).
    Carriage return/linefeed (CR-LF) translation. On input, each CR-LF pair is translated to a single linefeed character before the MBCS-to-Unicode conversion; on output, each linefeed character is translated back to a CR-LF combination after the Unicode-to-MBCS conversion.
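
The order of the two conversions on input can be imitated in portable C. The sketch below is not the run-time library's implementation; text_mode_input is a hypothetical function that collapses each CR-LF pair in the raw bytes and then converts with mbtowc, using whatever multibyte encoding the current LC_CTYPE locale selects.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Imitates text-mode wide input: CR-LF pairs in the raw multibyte
   data become a single linefeed, and each multibyte character is then
   converted to a wide character as if by mbtowc.
   Returns the number of wide characters stored (excluding the
   terminator), or (size_t)-1 on a conversion error or if dst is full. */
size_t text_mode_input(const char *raw, wchar_t *dst, size_t dst_len)
{
    size_t count = 0;
    size_t remaining;
    assert(raw != NULL && dst != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    remaining = strlen(raw);
    (void)mbtowc(NULL, NULL, 0);            /* reset the shift state */
    while (remaining > 0) {
        wchar_t wc;
        int n;
        /* Step 1: CR-LF translation on the bytes, before conversion. */
        if (remaining >= 2 && raw[0] == '\r' && raw[1] == '\n') {
            ++raw;                          /* drop the CR, keep the LF */
            --remaining;
        }
        /* Step 2: multibyte to wide-character conversion. */
        n = mbtowc(&wc, raw, remaining);
        if (n <= 0 || count + 1 >= dst_len)
            return (size_t)-1;
        dst[count++] = wc;
        raw += n;
        remaining -= (size_t)n;
    }
    dst[count] = L'\0';
    return count;
}
```

Output works in the reverse order: each wide character is converted as if by wctomb, and each linefeed in the result is then expanded to CR-LF.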

When a file on a Windows platform is open in binary mode, it is assumed to be UTF-16 Unicode, so no CR-LF translation or character conversion occurs during input or output. To use wcin (the global wide input stream) or wcout (the global wide output stream) correctly, call
_setmode(_fileno(stdin), _O_BINARY) or
_setmode(_fileno(stdout), _O_BINARY), respectively.
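
The _setmode calls can be wrapped so that the same source builds on other platforms. In this sketch the wrapper name set_binary_mode is hypothetical; _setmode, _fileno, and _O_BINARY are the Windows run-time names quoted above, and the non-Windows branch simply does nothing because there is no translation mode to change.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>   /* _O_BINARY */
#include <io.h>      /* _setmode */
#endif

/* Switches an open stream to binary translation mode.
   Returns 0 on success and -1 on failure. */
int set_binary_mode(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    return _setmode(_fileno(stream), _O_BINARY) == -1 ? -1 : 0;
#else
    return 0;        /* no text/binary distinction to change */
#endif
}
```

Calling set_binary_mode(stdin) and set_binary_mode(stdout) at program start, before any I/O on those streams, corresponds to the two calls listed above.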

ANSI Platforms
As on Windows, the orientation of a stream needs to be set properly to handle either UTF-8 Unicode characters or UTF-16/32 wide characters. For UTF-8 multibyte characters, a narrow orientation is required; for wide characters, a wide orientation is required. The orientation of a stream is set in one of three ways:

    By making a call to one of the narrow I/O function calls: this will set the stream orientation to narrow.
    By making a call to one of the wide-character I/O function calls: this will set the stream orientation to wide.
    By calling fwide(stream, mode) to set the orientation to narrow (pass in negative mode), or wide (pass in positive mode). Pass in 0 for mode to query the stream's current orientation.
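
The three cases above can be checked with fwide itself. The helpers below are a small sketch; stream_orientation and request_wide are hypothetical names wrapping the standard fwide function from <wchar.h>.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Queries the orientation without changing it.
   Returns 1 for wide, -1 for narrow, 0 if not yet set. */
int stream_orientation(FILE *stream)
{
    int r;
    assert(stream != NULL);
    r = fwide(stream, 0);           /* mode 0 only queries */
    return (r > 0) - (r < 0);
}

/* Asks for wide orientation. Returns 1 if the stream is wide
   afterwards, 0 if it was already narrow and therefore stays narrow. */
int request_wide(FILE *stream)
{
    assert(stream != NULL);
    return fwide(stream, 1) > 0;
}
```

A freshly opened stream reports 0. After the first narrow call such as fputs it reports -1, and a later request_wide on that stream returns 0, because fwide does not change an orientation that has already been set.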

Never mix wide and narrow operations on the same stream, as the behavior is unpredictable. In addition, for wide-character stream I/O, set the locale for the LC_CTYPE category before opening the stream. Unless the character set is specified via the mode argument (using the ccs=CHARSET string) when the stream is opened, it is taken from the LC_CTYPE category of the current locale, and the associated conversion functions to and from the internal wchar_t characters are loaded at that point. Once loaded, the conversion functions do not change even if the locale selected for the LC_CTYPE category is changed later.
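
The required ordering (locale first, then open, then wide I/O only) might look like the sketch below. The function name write_wide_text is hypothetical, and the locale name is passed in by the caller because the names available differ from system to system.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Writes text to path as a wide-oriented stream.
   Note: this changes the LC_CTYPE locale of the whole process.
   Returns 0 on success, -1 if the locale is unavailable or I/O fails. */
int write_wide_text(const char *path, const char *locale_name,
                    const wchar_t *text)
{
    FILE *f;
    int ok;
    assert(path != NULL && locale_name != NULL && text != NULL);

    /* 1. Select the character set before the stream is opened. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* 2. Open the stream and make it wide before any other I/O. */
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fwide(f, 1) > 0;

    /* 3. Use only wide-character functions on this stream. */
    if (ok && fputws(text, f) < 0)
        ok = 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Passing "" as locale_name selects the locale from the user's environment, which is the usual choice in an internationalized application.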

Functions in this class:

access/_access

basename

canonicalize_file_name

chdir/_chdir

chmod/_chmod

chown

chroot

creat/_creat

creat64

ctermid

dirname

fdopen/_fdopen

_findfirst/_findfirst64/_findfirsti64

_findnext/_findnext64/_findnexti64

fopen

fopen64

freopen

freopen64

_fsopen

_fullpath

fwide

_get_current_dir_name

getcwd/_getcwd

_getdcwd

getwd

lchown

link

lstat

lstat64

lutimes

_makepath

mkdir/_mkdir

mkdtemp

mkstemp

mktemp/_mktemp

open/_open

open64

opendir

pathconf

popen/_popen

readlink

realpath

remove

rename

rmdir/_rmdir

_searchenv

sopen/_sopen

_splitpath

stat/_stat

stat64/_stat64

_stati64

symlink

tempnam/_tempnam

tmpnam

tmpnam_r

truncate

truncate64

ttyname

ttyname_r

unlink/_unlink

utime/_utime

_utime64

utimes

 
