Prepare Yourself for the Unicode Revolution





Bad Character
Windows C++ programmers are used to treating wchar_t arrays as Unicode strings. The Standard Library also defines wchar_t versions of stream objects (wifstream, wcin, wstreambuf, wstring, and others). However, wchar_t has an implementation-defined size, which makes it non-portable. While Windows represents wchar_t as a 16-bit integral type, certain Unix implementations represent the same type as a 32-bit type. To add to the confusion, wchar_t may be signed or unsigned. For these reasons and others, wchar_t is unsuitable for handling Unicode strings portably and reliably.

Know your Unicode
Originally, Unicode was defined as a 16-bit encoding system. Later it was extended to 32 bits. Today, the Unicode standard defines three major encoding forms: UTF-8, UTF-16, and UTF-32. The first stores Unicode characters as variable-length sequences of 8-bit bytes. UTF-8 is free from the hassles of byte ordering, and it fits into good old char. It is backward compatible with ASCII (every ASCII text is already valid UTF-8), though not with the EBCDIC codeset (although there are conversion routines). UTF-16 is a 16-bit encoding system that can represent most of the modern scripts and symbols in a single 16-bit unit. UTF-32 uses 32 bits. The main advantage of UTF-32 is that you can represent every Unicode symbol in one character. However, it wastes space, and many implementations and programming languages don't support it yet.

Today, with the recent ratification of Unicode 4.0.0, the Unicode standard is quite stable and nearly complete. To enable C++ programmers to manipulate Unicode strings and characters, two new datatypes will be added to the language: _Char16_t and _Char32_t. These names will become reserved keywords in C++09. Fortunately, typedefs with friendlier names may be used instead of these ugly names: char16_t is a new typedef name for _Char16_t, and char32_t is a new typedef name for _Char32_t. The underlying types of _Char16_t and _Char32_t are uint_least16_t and uint_least32_t, respectively. The following table summarizes this information:

new type and keyword | new typedef name | underlying <cstdint> type
_Char16_t            | char16_t         | uint_least16_t
_Char32_t            | char32_t         | uint_least32_t

You're probably wondering why the _Char16_t and _Char32_t types and keywords are needed in the first place when the typedefs uint_least16_t and uint_least32_t are already available. The main problem that the new types solve is overloading. It's now possible to overload functions that take _Char16_t and _Char32_t arguments, and to create specializations such as std::basic_string<_Char16_t> that are distinct from std::basic_string<wchar_t>.
