All Strings Are Now Unicode
In Python 2.x there are two types of strings: byte strings (str) and Unicode strings (unicode). Byte strings contain bytes (usually interpreted by Python based on your default locale). Unicode strings, of course, contain Unicode characters:
>>> s = 'hello'
>>> u = u'\u05e9\u05dc\u05d5\u05dd'
>>> type(s)
<type 'str'>
>>> type(u)
<type 'unicode'>
Both str and unicode derive from a common base class called basestring:
>>> unicode.__bases__
(<type 'basestring'>,)
>>> unicode.__base__
<type 'basestring'>
>>> str.__bases__
(<type 'basestring'>,)
In Python 3.0, all strings are Unicode. The str type has the same semantics as unicode in Python 2.x, and there is no separate unicode type. The basestring base class is gone as well:
>>> s = '\u05e9\u05dc\u05d5\u05dd'
>>> type(s)
<class 'str'>
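One practical consequence is that a str counts Unicode characters, not bytes; byte counts only appear once the string is encoded. The following is a minimal sketch of that distinction (the variable names are illustrative and not part of the original article):

```python
import builtins

# A str is a sequence of Unicode code points, independent of any encoding.
s = '\u05e9\u05dc\u05d5\u05dd'          # four Hebrew letters

code_points = len(s)                     # counts characters
utf8_bytes = len(s.encode('utf-8'))      # each of these letters takes 2 bytes in UTF-8

print(code_points, utf8_bytes)

# Python 3 has no built-in unicode type and no basestring base class.
has_unicode = hasattr(builtins, 'unicode')
has_basestring = hasattr(builtins, 'basestring')
print(has_unicode, has_basestring)
```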
Instead of Python 2.x's byte string there are now two types: bytes, which is immutable, and bytearray, which is mutable. The bytes type supports a large number of string-like methods, as shown below:
>>> dir(bytes)
['__add__', '__class__', '__contains__',
'__delattr__', '__doc__', '__eq__', '__format__',
'__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__gt__', '__hash__', '__init__',
'__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__rmul__',
'__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'capitalize', 'center', 'count', 'decode', 'endswith',
'expandtabs', 'find', 'fromhex', 'index', 'isalnum',
'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip',
'partition', 'replace', 'rfind', 'rindex', 'rjust',
'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines',
'startswith', 'strip', 'swapcase', 'title',
'translate', 'upper', 'zfill']
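To make the listing above more concrete, here is a short sketch of a few of those string-like methods in use. It also illustrates two points that tend to surprise people coming from Python 2.x: indexing a bytes object yields an integer, and bytes objects cannot be modified in place. The sample data and variable names are assumptions chosen for illustration:

```python
data = b'hello world'

# String-like methods work on bytes and return bytes (or lists of bytes).
upper = data.upper()        # b'HELLO WORLD'
parts = data.split()        # [b'hello', b'world']

# Indexing returns an int (the byte value); slicing returns bytes.
first = data[0]             # 104, the byte value of 'h'
piece = data[0:1]           # b'h'

# bytes is immutable, so item assignment raises TypeError.
try:
    data[0] = 72
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

print(upper, parts, first, piece, assignment_allowed)
```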
The bytearray type also has the following mutating methods: extend(), insert(), append(), reverse(), pop(), and remove(). Both bytes and bytearray support the + and * operators (using the same semantics as strings), and bytearray additionally supports in-place += and *=.
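The mutating methods and operators described above can be sketched as follows. This is only an illustrative walk-through with made-up values; note that the individual elements are integers in the range 0 to 255, not one-character strings:

```python
buf = bytearray(b'abc')

buf.append(100)         # add one byte (100 is 'd')   -> bytearray(b'abcd')
buf.extend(b'ef')       # add several bytes           -> bytearray(b'abcdef')
buf.insert(0, 95)       # insert '_' at the front     -> bytearray(b'_abcdef')
buf.remove(95)          # remove first byte equal 95  -> bytearray(b'abcdef')
last = buf.pop()        # remove and return last byte -> 102 ('f')
buf.reverse()           # reverse in place            -> bytearray(b'edcba')

# + builds a new object; += and *= modify the bytearray in place.
joined = bytes(buf) + b'!'
buf += b'??'
buf *= 2

print(last, joined, buf)
```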
You can't convert to or from str without an explicit encoding step, because bytes and bytearray objects hold raw byte values and carry no encoding information, while str objects hold Unicode characters. If you pass a bytes or bytearray object directly to str() without an encoding, you get its repr() as the result. To convert, you must use the decode() method (which defaults to UTF-8 when no encoding is given):
>>> a = bytearray(range(48, 58))
>>> a
bytearray(b'0123456789')
>>> s = str(a)
>>> s
"bytearray(b'0123456789')"
>>> s = a.decode()
>>> s
'0123456789'
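The example above decodes plain ASCII digits, so any common encoding would work. With non-ASCII data the choice of codec matters. The sketch below, using illustrative sample data, shows an explicit codec, the equivalent str() call with an encoding argument, and what happens when the wrong codec is used:

```python
# Eight bytes: the UTF-8 encoding of four Hebrew letters.
raw = '\u05e9\u05dc\u05d5\u05dd'.encode('utf-8')

# Decode by naming the codec explicitly.
text = raw.decode('utf-8')

# str() also decodes when it is given an encoding.
via_constructor = str(raw, 'utf-8')

# The wrong codec fails loudly instead of silently producing garbage.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# An error handler can substitute U+FFFD for each undecodable byte.
lenient = raw.decode('ascii', errors='replace')

print(len(raw), len(text), ascii_ok, len(lenient))
```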
To convert from a string to bytes or bytearray you must use the string's encode() method or provide an encoding to the constructor of the bytes or bytearray object:
>>> s = '1234'
>>> s
'1234'
>>> b = bytes(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
>>> b = s.encode()
>>> b
b'1234'
>>> b = bytes(s, 'ascii')
>>> b
b'1234'
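For a pure-ASCII string such as '1234' every common encoding produces the same bytes. A string with a non-ASCII character makes the difference visible. The following sketch uses an illustrative sample string and shows that the same text yields different bytes under different codecs, and that an unsuitable codec raises an error:

```python
s = 'caf\u00e9'                      # 'cafe' with an acute accent on the e

utf8 = s.encode('utf-8')             # e-acute becomes two bytes: 0xC3 0xA9
latin1 = s.encode('latin-1')         # e-acute becomes one byte: 0xE9

# ASCII cannot represent the accented character at all.
try:
    s.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The bytearray constructor accepts an encoding in the same way as bytes.
buf = bytearray(s, 'utf-8')

print(utf8, latin1, ascii_ok, buf)
```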
The string representation has changed too. In Python 2.x the return type of repr() was str, which was an ASCII-based string. In Python 3.0 the return type is still str, but it's now a Unicode string. When the representation is written out, its encoding is determined by the output device.
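As a brief sketch of what this means in practice: repr() of a string can now contain printable non-ASCII characters as they are, while the built-in ascii() function gives an escaped, ASCII-only representation similar to what Python 2.x produced. The sample string is illustrative:

```python
s = '\u05e9\u05dc\u05d5\u05dd'      # four Hebrew letters

r = repr(s)     # a str: the letters themselves, wrapped in quotes
a = ascii(s)    # a str: non-ASCII characters are written as \u escapes

r_keeps_letters = (r == "'" + s + "'")
a_is_ascii_only = all(ord(ch) < 128 for ch in a)

print(type(r).__name__, r_keeps_letters, a_is_ascii_only)
```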
Author's Note: To explore this topic in more detail, the relevant PEPs are PEP-358, PEP-3118, PEP-3137 and PEP-3138.