Internationalization and localization tools


Locale-Sensitive Length Functions

Related Links

Link to Locale-Sensitive Wide Character Length Functions.
Link to Locale-Sensitive Multibyte Length Functions.
Link to Locale-Sensitive Windows Generic Length Functions.

Internationalization (I18n) Issue:

All of these functions operate on single-byte (7-bit ASCII or 8-bit) characters; that is, they take arguments or return values of type char. They are typically replaced when migrating to a multibyte, wide character, or Windows Generic application. With this particular set of functions, special attention must be paid to the size argument when migrating to any of these three application types.

I18n Solution:

Avoid passing the result of the sizeof operator where a function expects a number of characters, and re-examine every size parameter when migrating from these functions.

I18n Discussion:

When migrating to wchar_t Wide Characters:

The size of the wchar_t datatype is platform dependent: it is typically 2 bytes where wide strings are encoded as UTF-16 (for example, on Windows) and 4 bytes where they are encoded as UTF-32 (for example, on most Unix and Linux systems).

In a single byte environment, the number of bytes and the number of characters in a character string are the same. This is not the case with wchar_t wide character strings, where the number of bytes in a string is 2 or 4 times the number of wchar_t elements.

For example, in a single byte application, something like the following would work fine:

char buffer[16];
char buffer2[16];
strncpy(buffer, buffer2, sizeof(buffer)); // singlebyte

This code works because sizeof(char) is 1, so the size of buffer in bytes is the same as the number of characters it can hold. If the code is quickly modified to use wchar_t as follows:

wchar_t buffer[16];
wchar_t buffer2[16];
wcsncpy(buffer, buffer2, sizeof(buffer)); // bad, overflow

This code will not work because wcsncpy expects the number of wide characters in the buffer, and the sizeof operator instead returns the number of bytes, which is 2 or 4 times too large. This is the type of problem that needs to be considered with this set of functions. The following is a safer way to handle these length-sensitive functions:

#define BUFLEN 16
wchar_t buffer[BUFLEN];
wchar_t buffer2[BUFLEN];
wcsncpy(buffer, buffer2, BUFLEN); // good, # of wchar_t's
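
When the buffer length is not available as a named constant, the element count can be derived from the array itself. The following is a minimal sketch of that idea, not part of the original guidance: COUNT_OF and copy_wide are hypothetical names introduced here for illustration, and COUNT_OF only works on true arrays, not on pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper macro: number of elements in a true array.
   Do not use it on a pointer; it would divide the pointer size instead. */
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical helper: copy src into dst, where dst_count is the capacity
   of dst in wchar_t elements (not bytes). The result is always
   null-terminated. Returns the number of wide characters now in dst,
   not counting the terminator. */
size_t copy_wide(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    assert(dst != NULL && src != NULL && dst_count > 0);
    wcsncpy(dst, src, dst_count - 1);  /* count is in wchar_t units */
    dst[dst_count - 1] = L'\0';        /* wcsncpy does not terminate on truncation */
    return wcslen(dst);
}
```

A call such as copy_wide(buffer, COUNT_OF(buffer), buffer2) passes 16 for a 16-element array, whereas sizeof(buffer) would pass 32 or 64 depending on the platform.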

When migrating to Multibyte Characters like UTF-8 or SJIS:

A multibyte character is a sequence of one or more bytes. An example of a multibyte encoding is UTF-8. Such encodings use a varying number of bytes per character, so single-byte and multiple-byte characters can be mixed within one string.

In a single byte environment, the number of bytes and the number of characters in a character string are the same. This is not the case with multibyte character strings, where the number of bytes in a string may be, but is not necessarily, the same as the number of characters.

However, since multibyte strings are terminated with a single-byte null character, and these particular functions are concerned primarily with length in bytes, the single-byte versions of the functions may continue to be used. Although Windows often provides a multibyte version of these functions, it is not necessary.
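
The difference between byte count and character count can be seen with a short sketch. It assumes the string is valid UTF-8, and utf8_char_count is a hypothetical helper written for this illustration, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in a UTF-8
   string. Assumes valid UTF-8. Continuation bytes have the bit pattern
   10xxxxxx, so every byte that is NOT a continuation byte starts a new
   character. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF-8 string "caf\xC3\xA9" (the word "café"), strlen reports 5 bytes while utf8_char_count reports 4 characters. The byte count is the one that matters when sizing a buffer for the single-byte length functions.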

For example, in a single byte application, something like the following would work fine:

char buffer[16];
char buffer2[16];
strncpy(buffer, buffer2, sizeof(buffer)); // singlebyte

This code works because the size of buffer in bytes is the same as the number of char elements it holds. As it turns out, this code would work fine with multibyte strings as well in this particular case. However, the buffers may not always be the same size, and using sizeof is dangerous if you choose to migrate to wide characters some time in the future. Therefore we recommend that you use strlen instead, after making sure that the string and its terminator fit in the destination buffer.

char buffer[16];
char buffer2[16];
strncpy(buffer, buffer2, strlen(buffer2) + 1); // multibyte

Note that the Microsoft platforms sometimes have special _mbs functions. However, for this particular set of functions, whose only problems relate to length, they are not necessary, and for many of these functions they are not available.
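
A length taken from the source string says nothing about the capacity of the destination, so a copy based on strlen should first check that the string fits. The following is one possible sketch under that assumption. copy_if_fits is a hypothetical helper, and it uses memcpy rather than strncpy because the exact byte count is already known.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a multibyte string only if it fits.
   dst_bytes is the capacity of dst in bytes. Returns 1 on success and
   0 if dst is too small, in which case dst is left untouched. Refusing
   the copy avoids cutting a multibyte character in half. */
int copy_if_fits(char *dst, size_t dst_bytes, const char *src)
{
    size_t needed;
    assert(dst != NULL && src != NULL);
    needed = strlen(src) + 1;   /* bytes, including the terminator */
    if (needed > dst_bytes)
        return 0;
    memcpy(dst, src, needed);
    return 1;
}
```

Whether to reject an oversized string, as here, or to truncate it at a character boundary is a design choice that depends on the application.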

When migrating to Microsoft Generic Characters:

When a Generic function call is used, the _MBCS or _UNICODE compiler switch determines whether it maps to the multibyte or wide version of the function. For example, _tcsnccpy maps to _mbsncpy if _MBCS is defined, to wcsncpy if _UNICODE is defined, and to strncpy if neither is defined.

The arguments and return values for these functions are also Generic and dependent on the _MBCS/_UNICODE switch for their mapping to either narrow or wide-character data.

See the MSDN Library's Using Generic Text Mappings for more information.
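
The mapping mechanism can be imitated in portable C to show how one source line ends up calling either a narrow or a wide function. This is only a sketch of the idea and not the contents of the Microsoft headers: USE_WIDE, XCHAR, XT, xcslen, xcsncpy, XBUFLEN and generic_copy are all hypothetical names, and the sketch models only the narrow and wide cases, not _MBCS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* USE_WIDE is a hypothetical switch standing in for _UNICODE. */
#ifdef USE_WIDE
typedef wchar_t XCHAR;
#define XT(s)   L##s
#define xcslen  wcslen
#define xcsncpy wcsncpy
#else
typedef char XCHAR;
#define XT(s)   s
#define xcslen  strlen
#define xcsncpy strncpy
#endif

#define XBUFLEN 16   /* capacity in XCHAR elements, never in bytes */

/* Hypothetical helper: copy src into a buffer of XBUFLEN elements and
   always null-terminate. Returns the resulting length in XCHAR units. */
size_t generic_copy(XCHAR *dst, const XCHAR *src)
{
    assert(dst != NULL && src != NULL);
    xcsncpy(dst, src, XBUFLEN - 1);
    dst[XBUFLEN - 1] = 0;
    return xcslen(dst);
}
```

Because every size in this sketch is an element count, the same source compiles correctly with or without USE_WIDE defined.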

In this particular case we should use an approach that works correctly for both multibyte and wide character applications. As shown in the two previous sections, the sizeof operator should be avoided. Instead we choose a method based on a generic number of characters, so our code will work for single byte, multibyte, and wide character applications:

TCHAR buffer[16];
TCHAR buffer2[16];
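_tcsnccpy(buffer, buffer2, _tcsclen(buffer2) + 1);

Note that we use _tcsnccpy, which always expects a number of characters to be copied, in conjunction with _tcsclen, which always returns the number of characters in the source string. The 1 is added so that the terminating null character is copied as well. As before, the source string and its terminator must fit in the destination buffer.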

The functions covered by this topic are:

bcmp

bcopy

bzero

memccpy/_memccpy

memchr

memcmp

memcpy

memmove

mempcpy

memset

stpncpy

strncat

_strncnt

strncpy

strndup

strndupa

strnlen

strnlen_l

strnset/_strnset

 
