Prepare Yourself for the Unicode Revolution

C++98 has two native character types: char and wchar_t. The latter is purportedly used for manipulating Unicode strings. In reality, however, wchar_t is unsuitable for this purpose. The C++09 standard is about to solve this problem by adding two new character datatypes that will enable you to use Unicode portably and easily. Learn how to create Unicode literal characters and strings and how to detect whether your implementation supports UTF-16 and UTF-32.


You want to manipulate Unicode strings and characters in C++, but wchar_t and its related facilities such as wstring and wcout just don’t cut it.


Familiarize yourself with the new _Char16_t and _Char32_t native datatypes and their related Standard Library facilities.

Bad Character
Windows C++ programmers are used to treating wchar_t arrays as Unicode strings. The Standard Library also defines wchar_t versions of its stream and string facilities (wifstream, wcin, wstreambuf, wstring, and others). However, wchar_t has an implementation-defined size, which makes it non-portable. While Windows represents wchar_t as a 16-bit integral type, certain Unix implementations represent the same type as a 32-bit type. To add to the confusion, wchar_t may be signed or unsigned. For these reasons and others, wchar_t is unsuitable for handling Unicode strings portably and reliably.
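To see what your own implementation does, a minimal sketch such as the following can help. The helper names wchar_bits and wchar_is_signed are hypothetical; they only wrap standard facilities (sizeof, CHAR_BIT, and std::is_signed), and the block assumes a C++11 or later compiler.

```cpp
#include <climits>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: width of wchar_t in bits on this implementation.
// The standard does not fix this value; 16 and 32 are both common.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if wchar_t is a signed type here.
constexpr bool wchar_is_signed() {
    return std::is_signed<wchar_t>::value;
}
```

Printing both values on two different platforms is a quick way to confirm that code relying on a particular wchar_t layout is not portable.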

Know your Unicode
Originally, Unicode was defined as a 16-bit encoding system. Later its code space was extended beyond 16 bits, to the range U+0000 through U+10FFFF. Today, the Unicode standard consists of three major encoding forms: UTF-8, UTF-16, and UTF-32. The first one enables you to store each Unicode character as a sequence of one to four 8-bit bytes. UTF-8 is free from the hassles of byte ordering. Additionally, it fits into good old char and is backward compatible with ASCII: every ASCII character keeps its single-byte value in UTF-8. It isn’t compatible with the EBCDIC codesets, however (although there are conversion routines). UTF-16 uses 16-bit code units. A single unit can represent most of the modern scripts and symbols; the remaining characters take two units, known as a surrogate pair. UTF-32 uses 32-bit code units. The main advantage of UTF-32 is that you can represent every Unicode code point in one unit. However, it wastes space and many implementations and programming languages don’t support it yet.
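The difference between UTF-16 and UTF-32 shows up for code points above U+FFFF, which need two 16-bit units in UTF-16 but a single unit in UTF-32. The following is a sketch only: to_utf16 is a hypothetical helper, it assumes its argument is a valid code point outside the surrogate range, and it uses the char16_t and char32_t spellings that C++11 compilers accept.

```cpp
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is in U+0000..U+10FFFF and is not itself a surrogate.
std::vector<char16_t> to_utf16(char32_t cp) {
    std::vector<char16_t> out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the 20 remaining bits across a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For instance, to_utf16(0x0531) yields one unit, whereas to_utf16(0x1D11E) yields the pair 0xD834, 0xDD1E.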

Today, with the recent ratification of Unicode 4.0.0, the Unicode standard is quite stable and nearly complete. To enable C++ programmers to manipulate Unicode strings and characters, two new datatypes will be added to the language: _Char16_t and _Char32_t. These names will become reserved keywords in C++09. Fortunately, typedefs with friendlier names may be used instead of these ugly names. char16_t is a new typedef name for _Char16_t and char32_t is a new typedef name for _Char32_t. The underlying types of _Char16_t and _Char32_t are uint_least16_t and uint_least32_t, respectively. The following table summarizes this information:

| New type and keyword | New typedef name | Underlying type |
|---|---|---|
| _Char16_t | char16_t | uint_least16_t |
| _Char32_t | char32_t | uint_least32_t |

You’re probably wondering why the _Char16_t and _Char32_t types and keywords are needed in the first place when the typedefs uint_least16_t and uint_least32_t are already available. The main problem that the new types solve is overloading. A typedef is only an alias for an existing integer type, so it can’t be told apart from that type in overload resolution. Because the new types are distinct, it’s now possible to overload functions that take _Char16_t and _Char32_t arguments, and create specializations such as std::basic_string<_Char16_t> that are distinct from the std::basic_string specializations for every other type.
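The overloading point can be sketched as follows. The function name kind is hypothetical, and the block uses the char16_t and char32_t spellings, which compilers implementing C++11 or later treat as distinct built-in types; the _Char16_t and _Char32_t spellings are not assumed to compile.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// The character types are not aliases for the integer typedefs.
static_assert(!std::is_same<char16_t, std::uint_least16_t>::value,
              "char16_t is a distinct type");
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value,
              "char32_t is a distinct type");

// Hypothetical overload set: each overload is selected by the argument's type.
std::string kind(char)                { return "char"; }
std::string kind(wchar_t)             { return "wchar_t"; }
std::string kind(char16_t)            { return "char16_t"; }
std::string kind(char32_t)            { return "char32_t"; }
std::string kind(std::uint_least16_t) { return "uint_least16_t"; }
```

Calling kind(u'a') selects the char16_t overload, while kind(static_cast<std::uint_least16_t>(97)) selects the integer one, even though both arguments hold the same value.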

Character Literals
You’re already familiar with char and wchar_t literals such as the following:

char c1='a';
wchar_t wc1=L'a';

Literals of type _Char16_t look like this:

_Char16_t uc1=u'a'; 

Here, the literal u'a' represents a constant integral value whose type is _Char16_t. The size of the literal constant equals sizeof(_Char16_t).

A literal of type _Char32_t looks like this:

_Char32_t uc2=U'a';

As you can see, u and U have different meanings in this context.
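A compile-time check can confirm which type each prefix produces. This is a sketch that assumes a C++11 or later compiler and the char16_t and char32_t spellings; the variable names uc1 and uc2 follow the examples above.

```cpp
#include <type_traits>

// The prefix alone decides the type of a character literal.
static_assert(std::is_same<decltype('a'),  char>::value,     "no prefix: char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value,  "L prefix: wchar_t");
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "u prefix: char16_t");
static_assert(std::is_same<decltype(U'a'), char32_t>::value, "U prefix: char32_t");

// The size of each literal matches the size of its type.
static_assert(sizeof(u'a') == sizeof(char16_t), "size of a u literal");
static_assert(sizeof(U'a') == sizeof(char32_t), "size of a U literal");

constexpr char16_t uc1 = u'a';
constexpr char32_t uc2 = U'a';
```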

String Literals
To define _Char16_t and _Char32_t string literals, use the following prefixes:

const _Char16_t utf16msg[]= u"hello";
const _Char32_t utf32msg[]= U"hello";

The type of the literal string u"hello" is array of n const _Char16_t and has static storage duration, where n is the size of the string, defined as follows: the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating u'\0'.
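The array size rule can be checked directly. In this sketch, array_length is a hypothetical helper that recovers n from the array type, and the declarations use the char16_t and char32_t spellings accepted by C++11 and later compilers.

```cpp
#include <cstddef>

const char16_t utf16msg[] = u"hello";
const char32_t utf32msg[] = U"hello";

// Hypothetical helper: number of elements in a built-in array,
// deduced from the array's type at compile time.
template <typename T, std::size_t N>
constexpr std::size_t array_length(const T (&)[N]) {
    return N;
}

// Five letters plus the terminating null character.
static_assert(sizeof(utf16msg) / sizeof(utf16msg[0]) == 6, "n is 6 for u\"hello\"");
```

Note that sizeof(utf16msg) reports bytes rather than elements, which is why the check divides by the element size.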

Universal Character Names
Universal character names in the form \unnnn and \Unnnnnnnn contain hexadecimal values of Unicode symbols defined in Annex C of ISO 10646-1. For example, '\u0531' is the universal character name of the first letter of the Armenian alphabet. At present, the support of universal character names is implementation-defined: some C++ compilers accept them, while others don’t.
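Both forms can be combined with the new literal prefixes. The sketch below assumes a compiler that accepts universal character names in literals, plus the char16_t and char32_t spellings; the variable names ayb and clef are hypothetical.

```cpp
// Four hex digits: U+0531, the first letter of the Armenian alphabet.
constexpr char16_t ayb = u'\u0531';

// Eight hex digits: U+1D11E lies above U+FFFF, so the long form is needed.
const char32_t clef[] = U"\U0001D11E";
```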

A new standard header called <cuchar> will be added to C++, which includes the definitions of the typedefs char16_t and char32_t. If the macro __STDC_UTF_16__ is defined in <cuchar>, values of type _Char16_t shall have UTF-16 encoding, as defined by ISO 10646. Similarly, if the macro __STDC_UTF_32__ is defined in <cuchar>, values of type _Char32_t shall have UTF-32 encoding, as defined by ISO 10646.
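Detecting the two macros might look like the sketch below. It assumes the header in question is <cuchar>, the name used by C++11, and includes it only when the compiler reports that it exists, because some standard libraries do not ship it. The function names char16_is_utf16 and char32_is_utf32 are hypothetical.

```cpp
// Include <cuchar> only if the implementation provides it (assumption:
// the compiler supports __has_include; otherwise the header is skipped).
#ifdef __has_include
#  if __has_include(<cuchar>)
#    include <cuchar>
#  endif
#endif

// Hypothetical helper: true if char16_t values are guaranteed to be UTF-16.
constexpr bool char16_is_utf16() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: true if char32_t values are guaranteed to be UTF-32.
constexpr bool char32_is_utf32() {
#ifdef __STDC_UTF_32__
    return true;
#else
    return false;
#endif
}
```

Some compilers predefine these macros themselves, so the helpers may return true even when the header is unavailable.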

The Standard Library will also provide _Char16_t and _Char32_t typedefs, in analogy to the typedefs wstring, wcout, etc., for the following standard classes:

· filebuf, streambuf, streampos, streamoff
· ios, istream, ostream
· fstream, ifstream, ofstream
· stringstream, istringstream, ostringstream
· string
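Of these, the string typedefs are available in C++11 as std::u16string and std::u32string in <string>; the sketch below relies only on those and makes no assumption about the stream typedefs. The function count_code_points is a hypothetical helper, and it assumes its input is well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// Assumes well-formed input, so every low surrogate follows a high one.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        // Low surrogates (0xDC00..0xDFFF) complete a pair that was
        // already counted at its high surrogate, so skip them.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

The helper shows why size() on a UTF-16 string reports code units rather than characters: a string holding 'a' followed by U+1D11E has three units but two code points.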

Improvements in the Pipeline
In conclusion, C++ is about to be rid of another major embarrassment: the lack of native Unicode support. The proposal is now being added to the Working Paper, which means that it will be incorporated into C++09. It’s time to get ready!
