C++98 has two native character types: char and wchar_t. The latter is purportedly used for manipulating Unicode strings. In reality, however, wchar_t is unsuitable for this purpose. The C++09 standard is about to solve this problem by adding two new character datatypes that will enable you to use Unicode portably and easily. Learn how to create Unicode literal characters and strings and how to detect whether your implementation supports UTF-16 and UTF-32.
You want to manipulate Unicode strings and characters in C++, but wchar_t and its related facilities such as wstring and wcout just don’t cut it.
Familiarize yourself with the new _Char16_t and _Char32_t native datatypes and their related Standard Library facilities.
Windows C++ programmers are used to treating wchar_t arrays as Unicode strings. The Standard Library also defines wchar_t versions of its stream and string facilities (wifstream, wcin, wstreambuf, wstring, and others). However, wchar_t has an implementation-defined size, which makes it non-portable. While Windows represents wchar_t as a 16-bit integral type, certain Unix implementations represent the same type as a 32-bit type. To add to the confusion, wchar_t may be signed or unsigned. For these reasons and others, wchar_t is unsuitable for handling Unicode strings portably and reliably.
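To see what your own implementation does, you can query wchar_t at compile time. The following is a minimal sketch rather than a definitive test; the function names (wchar_bits, wchar_is_signed, wchar_holds_any_code_point) are made up for this illustration, and the code assumes a compiler that supports constexpr.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <limits>    // std::numeric_limits

// Width of wchar_t in bits on this implementation.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Whether wchar_t is a signed type on this implementation.
constexpr bool wchar_is_signed() {
    return std::numeric_limits<wchar_t>::is_signed;
}

// Whether one wchar_t can hold the largest Unicode code point (U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```

Printing these three values from main() on different platforms is a quick way to confirm that code relying on a particular wchar_t layout is not portable.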
Know your Unicode
Originally, Unicode was defined as a 16-bit encoding system. Later it was extended to 32 bits. Today, the Unicode standard consists of three major encoding systems: UTF-8, UTF-16, and UTF-32. The first one enables you to store Unicode characters as a sequence of 8-bit bytes. UTF-8 is free from the hassles of byte ordering. Additionally, it fits into good old char and is backward compatible with 7-bit ASCII. The problem is that it isn’t compatible with extended 8-bit codesets or with EBCDIC (although there are conversion routines), and a single character may occupy several bytes. UTF-16 is a 16-bit encoding system which can represent most of the modern scripts and symbols in a single 16-bit unit. UTF-32 uses 32 bits. The main advantage of UTF-32 is that you can represent every Unicode symbol in one character. However, it wastes space and many implementations and programming languages don’t support it yet.
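The space trade-off between the three encodings can be made concrete by counting how many code units a given code point needs in each one. This is only a sketch: the function names are invented for the example, the code uses the char32_t spelling accepted by C++11 compilers, and it assumes the argument is a valid code point (at most U+10FFFF and not a surrogate).

```cpp
// Number of 8-bit code units UTF-8 needs for one code point.
constexpr int utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of 16-bit code units UTF-16 needs: one for most characters,
// two (a surrogate pair) for code points above U+FFFF.
constexpr int utf16_length(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit code unit.
constexpr int utf32_length(char32_t) {
    return 1;
}
```

For instance, the euro sign (U+20AC) takes three bytes in UTF-8 but a single unit in UTF-16, whereas a code point above U+FFFF needs four bytes in UTF-8 and two units in UTF-16.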
Today, with the recent ratification of Unicode 4.0.0, the Unicode standard is quite stable and nearly complete. To enable C++ programmers to manipulate Unicode strings and characters, two new datatypes will be added to the language: _Char16_t and _Char32_t. These names will become reserved keywords in C++09. Fortunately, typedefs with friendlier names may be used instead of these ugly names. char16_t is a new typedef name for _Char16_t and char32_t is a new typedef name for _Char32_t. The underlying types of _Char16_t and _Char32_t are uint_least16_t and uint_least32_t, respectively. The following table summarizes this information:
new type and keyword    new typedef name    underlying type
_Char16_t               char16_t            uint_least16_t
_Char32_t               char32_t            uint_least32_t
You’re probably wondering why the _Char16_t and _Char32_t types and keywords are needed in the first place when the typedefs uint_least16_t and uint_least32_t are already available. The main problem that the new types solve is overloading. It’s now possible to overload functions that take _Char16_t and _Char32_t arguments, and create specializations such as std::basic_string<_Char16_t> that are distinct from std::basic_string
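The overloading argument can be illustrated with a small overload set. This is a hedged sketch, not code from the proposal: Kind and classify are hypothetical names, and the example uses the char16_t and char32_t spellings, which C++11 compilers accept directly as distinct built-in types.

```cpp
#include <cstdint>      // std::uint_least16_t
#include <type_traits>  // std::is_same

// The character type is distinct from the integer typedef it is modeled on.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is a distinct type, not an alias of uint_least16_t");

// Hypothetical tag describing which overload was selected.
enum class Kind { utf16_unit, utf32_unit, integer16 };

// Three overloads that could not coexist if the character types
// were mere typedefs of the integer types.
constexpr Kind classify(char16_t)            { return Kind::utf16_unit; }
constexpr Kind classify(char32_t)            { return Kind::utf32_unit; }
constexpr Kind classify(std::uint_least16_t) { return Kind::integer16; }
```

Calling classify(u'a') selects the first overload and classify(std::uint_least16_t(97)) selects the third, even though both arguments hold the same numeric value.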
You’re already familiar with char and wchar_t literals such as the following:
char c1='a';
wchar_t wc1=L'a';
Literals of type _Char16_t take a lowercase u prefix, for example u'a'.
Here, the literal u'a' represents a constant integral value whose type is _Char16_t. The size of the literal constant equals sizeof(_Char16_t).
A literal of type _Char32_t takes an uppercase U prefix, for example U'a'.
As you can see, u and U have different meanings in this context.
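The difference between the two prefixes can be checked at compile time. The sketch below assumes a C++11 compiler, where the types are spelled char16_t and char32_t; the variable names c16 and c32 are illustrative only.

```cpp
#include <type_traits>  // std::is_same

// The lowercase u prefix yields a 16-bit character literal...
static_assert(std::is_same<decltype(u'a'), char16_t>::value,
              "u'a' has type char16_t");
// ...while the uppercase U prefix yields a 32-bit character literal.
static_assert(std::is_same<decltype(U'a'), char32_t>::value,
              "U'a' has type char32_t");

// The size of each literal equals the size of its type.
static_assert(sizeof(u'a') == sizeof(char16_t), "size matches char16_t");
static_assert(sizeof(U'a') == sizeof(char32_t), "size matches char32_t");

constexpr char16_t c16 = u'a';  // UTF-16 code unit for 'a'
constexpr char32_t c32 = U'a';  // UTF-32 code unit for 'a'
```

If any of these assertions fails to compile, the implementation does not treat the prefixes the way the text describes.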
To define _Char16_t and _Char32_t string literals, use the following prefixes:
const _Char16_t *utf16msg= u"hello";
const _Char32_t *utf32msg= U"hello";
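You can verify the array type and length of such string literals at compile time. The following is a minimal sketch assuming a C++11 compiler and the char16_t and char32_t spellings; code_units is a hypothetical helper written for this example, not a Standard Library function.

```cpp
#include <cstddef>  // std::size_t

// Returns the number of elements in a string literal's array,
// including the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N;
}

// A string literal converts to a pointer to its first element.
const char16_t* utf16msg = u"hello";
const char32_t* utf32msg = U"hello";

// Five characters plus the terminator gives an array of six elements.
static_assert(code_units(u"hello") == 6, "u\"hello\" has six char16_t elements");
static_assert(code_units(U"hello") == 6, "U\"hello\" has six char32_t elements");
```

Note that the element count is the same for both literals, while the storage they occupy differs because sizeof(char16_t) and sizeof(char32_t) differ.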
The type of the string literal u"hello" is "array of n const _Char16_t", and the literal has static storage duration. Here n is the size of the string, defined as the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating u'