A Developer's Guide to Python 3.0: Numbers, Strings, and Data : Page 5

Python 3.0 makes critical, backward-incompatible changes to data types. Find out how these changes will affect your code.





All Strings Are Now Unicode

In Python 2.x there are two types of strings: byte strings (str) and Unicode strings (unicode). Byte strings contain bytes (usually interpreted by Python based on your default locale). Unicode strings, of course, contain Unicode characters:

>>> s = 'hello'
>>> u = u'\u05e9\u05dc\u05d5\u05dd'
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>

Both str and unicode derive from a common base class called basestring:

>>> unicode.__bases__
(<type 'basestring'>,)
>>> unicode.__base__
<type 'basestring'>
>>> str.__bases__
(<type 'basestring'>,)

In Python 3.0, all strings are Unicode. The str type has the same semantics as unicode in Python 2.x, and there is no separate unicode type. The basestring base class is gone as well:

>>> s = '\u05e9\u05dc\u05d5\u05dd'
>>> type(s)
<class 'str'>
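As a rough illustration of what "all strings are Unicode" means in practice, the sketch below compares the character count of a str with the length of its UTF-8 encoding, and checks that basestring is no longer a built-in. The variable names are chosen for this example only; the choice of UTF-8 is an assumption, since the article does not name a particular encoding here.

```python
import builtins

# In Python 3, a str is a sequence of Unicode characters (code points).
greeting = '\u05e9\u05dc\u05d5\u05dd'  # the same four Hebrew letters used above

# len() counts characters, not bytes.
char_count = len(greeting)

# Encoding to UTF-8 (an assumption for this example) yields more bytes
# than characters, because each of these letters needs two bytes.
utf8_length = len(greeting.encode('utf-8'))

# Code that tested isinstance(x, basestring) in 2.x can test against str.
is_text = isinstance(greeting, str)

# basestring no longer exists as a built-in name.
has_basestring = hasattr(builtins, 'basestring')

print(char_count, utf8_length, is_text, has_basestring)
```

The difference between the two lengths is one reason code that assumed "one character equals one byte" in Python 2.x may need attention when it is ported.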

Python 2.x's byte string is replaced by two types: bytes, which is immutable, and bytearray, which is mutable. The bytes type supports a large number of string-like methods, as shown below:

>>> dir(bytes)
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__',
 '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
 '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
 '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 'find',
 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
 'translate', 'upper', 'zfill']

The bytearray type also has the following mutating methods: extend(), insert(), append(), reverse(), pop(), and remove().

Both bytes and bytearray support the + and * operators (using the same semantics as strings), and the bytearray type also supports the in-place operators += and *=.
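The following sketch exercises the mutating methods and operators just described. It is only an illustration: the variable names and sample values are made up for this example. Note that individual elements of a bytes or bytearray object are integers in range(256), which is why ord() is used when adding a single element.

```python
immutable = bytes(b'abc')
mutable = bytearray(b'abc')

# Mutating methods exist only on bytearray.
mutable.append(ord('d'))        # bytearray(b'abcd')
mutable.extend(b'ef')           # bytearray(b'abcdef')
mutable.insert(0, ord('_'))     # bytearray(b'_abcdef')
mutable.reverse()               # bytearray(b'fedcba_')
last = mutable.pop()            # removes and returns the final element
mutable.remove(ord('a'))        # removes the first matching element

# + and * work on both types and return new objects.
joined = immutable + b'def'
repeated = immutable * 2

# += and *= modify a bytearray in place.
mutable += b'!'
mutable *= 2

# bytes objects reject item assignment.
try:
    immutable[0] = ord('x')
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False

print(mutable, joined, repeated, bytes_is_mutable)
```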

You can't convert between str and bytes or bytearray without an encoding step, because bytes and bytearray objects hold raw byte values with no encoding attached, while str objects hold Unicode characters. If you pass a bytes or bytearray object directly to str(), you get the same result as repr(). To convert, you must use the decode() method:

>>> a = bytearray(range(48, 58))
>>> a
bytearray(b'0123456789')
>>> s = str(a)
>>> s
"bytearray(b'0123456789')"
>>> s = a.decode()
>>> s
'0123456789'

To convert from a string to bytes or bytearray you must use the string's encode() method or provide an encoding to the constructor of the bytes or bytearray object:

>>> s = '1234'
>>> s
'1234'
>>> b = bytes(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
>>> b = s.encode()
>>> b
b'1234'
>>> b = bytes(s, 'ascii')
>>> b
b'1234'
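The ASCII-only example above hides the fact that the choice of encoding matters. The sketch below uses non-ASCII text to show a round trip through UTF-8 and a failed attempt to encode as ASCII. The variable names are illustrative, and UTF-8 is picked here as an assumption rather than something the article prescribes.

```python
text = '\u05e9\u05dc\u05d5\u05dd'

# Three equivalent ways to get from str to a bytes-like object.
utf8_bytes = text.encode('utf-8')
same_bytes = bytes(text, 'utf-8')
buffer = bytearray(text, 'utf-8')

# And two ways back: decode(), or str() called with an encoding.
round_trip = utf8_bytes.decode('utf-8')
from_buffer = str(buffer, 'utf-8')

# An encoding that cannot represent the characters raises an error.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, round_trip == text, ascii_ok)
```

Passing an encoding to str() makes it decode the bytes instead of falling back to the repr()-style result shown earlier.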

The string representation has changed too. In Python 2.x the return type of repr() was str, which was an ASCII-based string. In Python 3.0 the return type is still str, but it's now a Unicode string, so printable non-ASCII characters can appear in it unescaped. How the string representation is encoded when it is written out is determined by the output device.
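A small sketch of the difference, using the built-in ascii() function, which Python 3 provides for cases where an escaped, ASCII-only representation is still wanted. The variable names are for illustration only.

```python
word = '\u05e9\u05dc\u05d5\u05dd'

# repr() returns a Unicode str and leaves printable characters unescaped.
unicode_repr = repr(word)

# ascii() returns a similar representation with non-ASCII characters
# written as backslash escapes.
escaped_repr = ascii(word)

print(len(unicode_repr), escaped_repr)
```

Here repr() produces a six-character string (the four letters plus the surrounding quotes), while ascii() produces the longer escaped form.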

Author's Note: To explore this topic in more detail, the relevant PEPs are PEP-358, PEP-3118, PEP-3137 and PEP-3138.
