In my last article, “Drill-down on Three Major New Modules in Python 2.5 Standard Library,” I discussed how the ctypes, pysqlite, and ElementTree modules can save you time and aggravation. In this, my third and final article on the new 2.5 version of Python, I’ll go over some additional language enhancements and modules, each of which adds an important ingredient for some smaller subset of the Python community. I will also cover performance improvements, porting your code from previous versions of Python, and some other odds and ends that will be important to anyone who is ready to adopt the latest release of Python.
Absolute and Relative Imports
To get started, I’ll run through some of the basics of Python language organization.
Python software is organized in modules (.py files) stored in packages. The modules may be pre-compiled (.pyc) or could be extension modules. Python packages are usually just directories that appear in sys.path. Sub-packages are sub-directories of a package directory (or other sub-package) that contain an __init__.py. If the __init__.py doesn’t exist then the sub-directory is ignored by Python’s import mechanism.
Python locates modules that you import by searching a list of directories (or zip files) stored in sys.path. This list is initialized with the directory of the running program, the contents of the PYTHONPATH environment variable, and a list of platform-dependent directories. Programs may modify sys.path at runtime to control the import behavior.
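A quick way to see where Python will look, and to extend the search path at runtime, is something like the following sketch (the plugins directory name is just an illustrative assumption):

```python
import os
import sys

# Show the first few places Python will search for imports
for entry in sys.path[:3]:
    print(entry)

# Extend the search path at runtime with a hypothetical plugin directory
plugin_dir = os.path.join(os.getcwd(), 'plugins')
if plugin_dir not in sys.path:
    sys.path.append(plugin_dir)
```

Any modules placed in that directory become importable from this point on.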
Prior to Python 2.5 imports were always relative to your sys.path. The algorithm was very simple:
When importing ‘aaa.a‘, scan through sys.path and try to load aaa/a.py relative to each entry in turn.
There were two problems with this algorithm:
- Local modules might shadow library modules with identical names. This becomes more of a problem as the standard library grows.
- Modules inside nested packages had to use the full path to import modules from a sibling package or parent package.
Python 2.5 added a __future__ option to change the import behavior in order to address these problems. I created a little package and a couple of helper modules to demonstrate the import behavior in Python 2.5:
aaa (package)
|-- __init__.py
|-- a.py
|-- aa.py
Here is the content of the modules:
__init__.py
-----------
print 'aaa/__init__ here'

a.py
----
from __future__ import absolute_import
print 'aaa/a here'
import aa

aa.py
-----
print 'aaa/aa here'
I “installed” the package by copying it to Lib/site-packages (the location of third-party Python packages).
In addition I created two modules in the site-packages directory.
import_test.py
--------------
print 'import_test here'
import aaa.a

aa.py
-----
print 'aa here'
Each module just prints its package (if in a package) and its name. It all starts with import_test.py, which imports aaa.a. This results in the automatic import of aaa/__init__.py and then aaa/a.py. The latter, aaa/a.py, is the interesting piece. It uses the new absolute_import feature and imports aa. A module named aa.py exists both in a.py‘s directory (aaa) and in the site-packages directory. Without absolute_import the local aaa/aa.py would have been imported; with it, the “absolute” aa.py in site-packages is imported instead. Here is the output of running import_test.py:
aaa/__init__ here
aaa/a here
aa here
If I comment out the __future__ line, the local aaa/aa.py module is imported by aaa/a.py instead:
aaa/__init__ here
aaa/a here
aaa/aa here
What if you want to import both the local aa and the absolute aa? Prior to Python 2.5 you would have had to play tricks and dynamically modify your sys.path (and hopefully remember to restore it afterwards). With Python 2.5 you can use the new dot notation:
aaa/a.py
--------
from __future__ import absolute_import
print 'aaa/a here'
import aa
from . import aa
Output (of import_test.py):
import_test here
aaa/__init__ here
aaa/a here
aa here
aaa/aa here
The ‘.’ allows you to import from the current package. A double dot (‘..’) can be used to import from a parent package, in relative path notation.
There is one caveat. The relative import syntax works inside packages only. If you try to use it in a main module you will get the following exception:
ValueError: Attempted relative import in non-package
The ‘__index__’ Method
Slicing is an operation performed on sequences that allows you to extract a subset of the elements. The syntax is: sequence[start:stop:step]. All three parts are optional: ‘start‘ defaults to the beginning of the sequence, ‘stop‘ defaults to the end, and ‘step‘ defaults to 1. When you slice a sequence, Python starts from the ‘start‘ index and returns another sequence (of the same type as the original) that contains all the elements from ‘start‘ up to (but not including) ‘stop‘, in increments of ‘step‘.
You can use a negative number to count from the end of the sequence too. The step may also be negative, but in this case the start index must be bigger than the stop index resulting in a reverse slice. This used to be the only way to reverse a sequence before the reversed() built-in function was introduced in Python 2.4 (yes, you get some history for the same price).
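The reversing slice and the reversed() built-in agree, as this small sketch shows (the list() calls are added so the snippet also runs on Python 3, where range() and reversed() return iterators rather than lists):

```python
bread = list(range(1, 11))

# old way: a slice with a negative step
print(bread[::-1])

# new way (since Python 2.4): the reversed() built-in
print(list(reversed(bread)))

# both print [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```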
Now let’s break some bread and slice it too.
# prepare a list called bread with 10 integers
bread = range(1, 11)
print bread

# plain slice
print bread[1:10:2]

# slice using negative indices
print bread[-9:-7]

# old way of reversing a sequence
print bread[::-1]
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[2, 4, 6, 8, 10]
[2, 3]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
The indices for start, stop, and step used to be integers or long integers only. This is fine for almost everybody. Why would you want to index into a collection using a different type? You wouldn’t. After all, the meaning of an index is a specific location inside the sequence, and locations are always integers. However, NumPy, which is the leading Python scientific computing package, requires it. NumPy is a Python extension that provides a lightning fast multi-dimensional array and various functions, linear algebra operations, and transformations to act on it.
I can hear you thinking: “What’s the big deal about arrays? Didn’t we have them back in the day in BASIC for the Dragon32?”. Well, you didn’t have THAT kind of array. Multi-dimensional arrays (tensors) are a crucial building block for many scientific computations. NumPy is a very important and influential package that single-handedly made Python a great success in the scientific community. As evidence of its importance, NumPy is slated for inclusion in the standard Python library at some point.
NumPy uses its own data types (remember the ctypes data types?) to represent integers with higher fidelity than Python’s native int and long. These types were not usable for slicing, which is very common in NumPy. The most viable solution was to allow arbitrary types to be used as slicing indices if they define an __index__ method whose return value is int or long. In the following code I defined a few classes with __index__ methods that I use to dice and slice a poor ‘bread.’
class One(object):
    def __index__(self):
        return 1

class Two(object):
    def __index__(self):
        return 2

class Ten(object):
    def __index__(self):
        return 10

print bread[One():Ten():Two()]

one = One()
two = Two()
ten = Ten()
print bread[one:ten:two]
Output:
[2, 4, 6, 8, 10]
[2, 4, 6, 8, 10]
My Name Is __missing__, dict.__missing__
The dict __missing__ method is a neat addition to the arsenal of useful tools. It addresses a common problem of returning a default value from a failed lookup on a dictionary.
Suppose your program needs to store securely the code names of British secret agents. You are aware of course that these code names all start with double zero and end with a positive integer. After careful analysis of the problem domain you decide to use a 100×100 sparse matrix (a matrix that contains mostly zeros) to store the code names. Your input is a list of tuples. The first and second elements are the row and column (two-dimensional index), and the third element is the integer that follows the mandatory ’00’. You can represent such a matrix using a plain (non-sparse) dictionary:
sparse_matrix = {}
for row in range(100):
    for col in range(100):
        sparse_matrix[(row, col)] = 0

for i in (5, 4, 8), (88, 33, 7), (99, 99, 9):
    sparse_matrix[i[:2]] = i[2]

print '%d%d%d %s' % (sparse_matrix[(1, 1)],
                     sparse_matrix[(14, 61)],
                     sparse_matrix[(88, 33)],
                     'licensed to kill')
Output:
007 licensed to kill
That works, but it’s not very smart or sparse. A huge dictionary of 10,000 entries is required to identify just three agents, and it takes a while to initialize this huge dictionary with zeros. A much better solution is to keep just the non-zero elements. The problem is what to do when someone accesses a zero entry (one missing from the dictionary). The dictionary throws a KeyError exception:
Traceback (most recent call last):
  File "/Users/gsayfan/Documents/docs/Publications/DevX/Python 2.5 - Fresh from the Oven/part_3.py", line 57, in <module>
    print '%d%d%d %s' % (sparse_matrix[(1,1)],
KeyError: (1, 1)
There were several cumbersome solutions prior to Python 2.5. All of them required the caller to handle the missing value. One way was to wrap every access to the dictionary in a try-except block; another way was to use the get() method and pass in a default value to return; and the last way was to use the setdefault() method, which is similar to get() but also sets the default value in the dictionary for posterity.
x = {1:1, 2:2, 3:3}

# This is just ugly
try:
    print x[0]
except KeyError:
    print 8

# This just gets the default value without modifying the dict
print x.get(0, 8)
print 'x has %d entries' % len(x)

# This actually adds the entry 0:8 to the dict
print x.setdefault(0, 8)
print 'x has %d entries' % len(x)
Output:
8
8
x has 3 entries
8
x has 4 entries
In Python 2.5 there is an elegant way to handle this situation. The dict type has a new hook function called __missing__. It is called whenever you try to access a missing key. The default implementation is to raise the infamous KeyError exception, but you can subclass dict and override the __missing__ method in your subclass to do whatever you want. This is much better because the caller is not responsible for handling default values. Sometimes the returned value should be based on dynamic calculation and the caller doesn’t even know what the proper default value is. Note the dict size remains the same even when accessing non-existing elements.
class SparseDict(dict):
    def __missing__(self, key):
        return 0

sparse_matrix = SparseDict()
for i in (5, 4, 3), (88, 33, 7), (99, 99, 99):
    sparse_matrix[i[:2]] = i[2]

print '%d%d%d %s' % (sparse_matrix[(1, 1)],
                     sparse_matrix[(14, 61)],
                     sparse_matrix[(88, 33)],
                     'licensed to kill')
print len(sparse_matrix)
print sparse_matrix
Output:
007 licensed to kill
3
{(88, 33): 7, (5, 4): 3, (99, 99): 99}
This solution is elegant and allows full flexibility (you even have the requested key to base your return value on, if you want it). Nonetheless, it feels a little intrusive to write a subclass for every dictionary with a default, especially if you have multiple dictionaries with different defaults. Have no fear. Python 2.5 comes with a default dict, which is almost as flexible as implementing __missing__ yourself.
The defaultdict class lives in the collections module, and it accepts a default_factory callable in its constructor. Whenever a non-existing key is accessed, the default_factory is invoked to produce the proper value. Don’t worry, you don’t need to start writing factory classes or functions now. Most of Python’s types are also factory functions, and in most cases this is exactly what you want. For example, Python’s int is a factory function that returns 0 when invoked without arguments. This is exactly what we need for our sparse matrix. Note that accessing non-existing entries sets them in the dictionary, just like calling setdefault().
import collections

sparse_matrix = collections.defaultdict(int)
for i in (5, 4, 3), (88, 33, 7), (99, 99, 99):
    sparse_matrix[i[:2]] = i[2]

print '%d%d%d %s' % (sparse_matrix[(1, 1)],
                     sparse_matrix[(14, 61)],
                     sparse_matrix[(88, 33)],
                     'licensed to kill')
print len(sparse_matrix)
print sparse_matrix
Output:
007 licensed to kill
5
defaultdict(<type 'int'>, {(88, 33): 7, (5, 4): 3, (99, 99): 99, (14, 61): 0, (1, 1): 0})
More Modules
hashlib
Hashlib is a new module that provides various secure hash algorithms. The supported algorithms (always available) are: MD5, SHA-1, SHA-224, SHA-256, SHA-384, and SHA-512. Other algorithms may be present and you can try to instantiate them. Secure hash algorithms are used for protocols and standards such as SSH, SSL, PGP, TLS, and S/MIME.
Hashlib has a uniform simple interface for all the algorithms and it’s very easy to use. You create a hash object. You call the update() method one or more times to add text. Finally you call the digest() or hexdigest() methods to get the hash value.
import hashlib

x = hashlib.sha256()
x.update('Yeah, ')
x.update('it ')
x.update('works!!!')
d1 = x.digest()
print x.hexdigest()

x = hashlib.sha256()
x.update('Yeah, it works!!!')
d2 = x.digest()
print x.hexdigest()

assert d1 == d2

x = hashlib.sha224()
x.update('Yeah, it works!!!')
print x.hexdigest()

x = hashlib.sha1()
x.update('Yeah, it works!!!')
print x.hexdigest()

x = hashlib.md5()
x.update('Yeah, it works!!!')
print x.hexdigest()
Output:
d17380061dff0857ad21450c1206feceb3ada7196b8ef8109fb8b460761241b4
d17380061dff0857ad21450c1206feceb3ada7196b8ef8109fb8b460761241b4
3bb9c0cbc9edb898f5eeefad262d1bafa9edf84bf63c2020ca33dab0
98f74ddd62bcecfa3a63df80f97d6933aedefc17
92928a72f3bb2c4069300dcb64492ea0
You can call the update() method with one big string or multiple times with consecutive sub-strings. If the sub-strings add up to the big string you will get the same hash value (from the same algorithm). The hash values differ in the number of bits they return: md5 returns 128 bits, sha-1 returns 160 bits, and the sha-xxx algorithms return xxx bits respectively.
The digest() method returns a raw buffer of bytes that might contain non-printable characters, so I just used the result for comparison. The hexdigest() method returns a stringified hexadecimal representation of the digest value that you may safely print.
If OpenSSL is installed, hashlib binds to it dynamically, and additional algorithms may be available. To access these algorithms, if present, call hashlib.new() and pass the algorithm name as a string.
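The relationship between digest(), hexdigest(), and the new() entry point can be sketched like this (byte-string literals are used so the snippet also runs on modern Pythons, which require bytes input):

```python
import binascii
import hashlib

h = hashlib.sha256()
h.update(b'Yeah, it works!!!')

raw = h.digest()       # 32 raw bytes for SHA-256
print(len(raw))        # prints 32
print(h.hexdigest())   # 64 printable hex characters

# hexdigest() is just the hex encoding of digest()
assert binascii.hexlify(raw).decode('ascii') == h.hexdigest()

# The same algorithm can be requested by name via new()
h2 = hashlib.new('sha256')
h2.update(b'Yeah, it works!!!')
assert h2.digest() == raw
```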
Hashlib deprecates the md5 and sha1 modules. These modules were available in earlier versions, and the new hashlib borrows their simple interface. The modules are still available as a backward compatibility gesture, but they actually use hashlib under the covers.
wsgiref
WSGI stands for the Web Server Gateway Interface. It is a standard interface for web servers, web applications, and middleware. The idea is that web applications that comply with WSGI will be able to utilize lots of WSGI-compliant middleware components (e.g. authentication, session, compression) and deploy to any WSGI-compliant web server. The WSGI specification is not aimed at the web application developer, but at the web framework developer.
Some history … Python has an enormous number of web frameworks. The (speculative) reason is that it is so easy to write a web framework in Python that people preferred to roll their own rather than use an existing one. Existing frameworks were often incomplete, poorly documented, and targeted at the specific needs of some other developer. There is an ongoing debate in the community about whether this proliferation is beneficial.
In the last year, this debate grew more intense as Ruby on Rails blazed new trails and the Python community got nervous. By comparison to Rails, Python’s Tower of Babel of web frameworks looked like a bad idea. There were various proposals to unify web frameworks or to pick one web framework.
In the end, two web frameworks emerged as de facto leaders: Django and TurboGears. (It is debatable whether these are really the leading Python web frameworks, but they are definitely the only two with a book.)
Editor’s Note: The author is a co-author of the TurboGears book, but he does not earn royalties.
Back to WSGI, the fragmentation of the Python web framework scene worried a few people, including Philip J. Eby. Eby, a serious mover and shaker in the Python community, is known for his PEAK initiative, which was supposed to provide a pythonic J2EE. Somewhere along the way PEAK started to split and spin off important standalone projects and other ideas such as generic functions and setuptools. Eby decided to do something about the web framework situation and wrote Python Enhancement Proposal (PEP) 333, “Python Web Server Gateway Interface v1.0,” and followed up with a reference implementation called (you guessed it) wsgiref.
The core of wsgiref is so simple that you can write a fully functioning web application in a few lines of code (I’ll provide it soon) that you can effortlessly deploy on any WSGI-compliant server with any WSGI middleware. Before I start showcasing wsgiref, I want to stress that you SHOULD NOT develop web applications from scratch. It is very simple and possible, but there is no need. There are lots of excellent Python web frameworks out there and they all support WSGI, so go ahead and use them for real projects.
Now I’ll get down to writing some code. A WSGI web application is a callable (i.e. a function, a class constructor, or any object with a __call__ method). That’s it. You pass this callable to a WSGI-compliant server and that’s your deployment. WSGI is based on HTTP’s request-response model. Whenever a request comes in, the server invokes your application callable and passes two arguments: the environment and the start_response callable. In your code you can query the environment, which contains the request’s path (URL), the HTTP headers, query parameters, and other relevant data. You call the start_response callable and pass the response status and headers; then you return the response body as a list of strings. It’s really simple.
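The protocol is easy to demonstrate without even starting a server. The application below is a hypothetical minimal example, and calling it by hand shows exactly what a WSGI server would do (the byte-string body is what modern wsgiref expects; on Python 2, plain strings work too):

```python
def hello_app(environ, start_response):
    # Inspect the request path from the environment
    path = environ.get('PATH_INFO', '/')
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [('Hello from %s' % path).encode('utf-8')]

# Simulate what a WSGI server does: call the application callable
def fake_start_response(status, headers):
    print(status)

body = hello_app({'PATH_INFO': '/test'}, fake_start_response)
print(body[0].decode('utf-8'))   # prints: Hello from /test
```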
I’m a very creative guy, so when I thought about a cool web application a “Quote of the Day” immediately jumped to mind. I carefully Googled it to make sure no one thought about it before me, and I was immediately rewarded with 134,000,000 results :-). Being strong-willed I decided to keep going with the original plan. My QOTD web application is highly sophisticated and chock-full of buzzwords. It has a dynamic back-end web service that aggregates an RSS feed from another web application (http://brainyquote.com), parses the RSS using ElementTree, extracts the quotes of the day, and stores the quotes in a lightning fast in-memory database (a simple Python list). When a request comes in, the QOTD web application selects a random quote, wraps it in minimal HTML markup, and returns it to the caller. The full code consists of 10 lines of code plus three lines of import statements.
I start by importing all the necessary modules. The quotes list is initialized to the empty list. The qotd_app function is the actual web application callable (yes, three lines of code). It gets a random quote from the quotes “database” (the quotes list will already be populated by the time qotd_app is called for the first time). It calls start_response with the ok status and a content type HTTP header of text/html. Finally, it returns the body of the response, which is the quote itself wrapped in a minimal HTML markup.
The code after the function definition is the main initialization code. It downloads the QOTD RSS feed using urllib2.urlopen, then parses it using ElementTree.XML and finds all the quote elements in the feed.
To test it save the code to a Python file and run it, then browse with your favorite browser to http://localhost:8888 and check out the pearls of wisdom it will emit. Every refresh will bring a new random quote from the list. The code is shown in Listing 1.
WSGI web applications are cool but not really useful in professional applications. Besides WSGI web applications there are two other pieces to WSGI: web servers and middleware. You probably don’t want to write a new web server, but if you are serious about WSGI you’d do well to investigate the WSGI middleware path. These are components that sit between the web application and web server. You can compose them freely and, because they all expose the same callable API, you can wrap any web application with multiple WSGI middleware.
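As a sketch of the middleware idea, here is a hypothetical wrapper that adds one response header to any WSGI application it wraps; because middleware exposes the same callable interface as an application, you can stack as many of these layers as you like:

```python
def add_header(app, name, value):
    """Middleware: wrap a WSGI app and append one response header."""
    def middleware(environ, start_response):
        def patched_start_response(status, headers):
            return start_response(status, headers + [(name, value)])
        return app(environ, patched_start_response)
    return middleware

def inner_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']

# Compose freely: each layer contributes its own header
app = add_header(add_header(inner_app, 'X-Outer', '1'), 'X-Inner', '2')
```

Passing `app` to any WSGI server deploys the whole stack at once.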
Optimizations and Internal Changes
Python is not the fastest language out there. Most of the time it’s fast enough, and if you need more performance you can write performance-critical code in C or C++ extensions and call it from Python. Nevertheless, sometimes you will wish Python were a little faster so you could develop more applications in pure Python. Your wish is the Python developers’ command: Python 2.5 introduces multiple performance enhancements.
Py_ssize_t as index
Python used to store various counts in variables of the C type int. This is a 32-bit type, which meant that lists or tuples couldn’t have more than 2,147,483,647 elements. On 32-bit systems you couldn’t fit more than that in the entire 32-bit addressable memory anyway. On 64-bit systems you have much more addressable memory, so this number isn’t so big anymore.
Python 2.5 uses the Py_ssize_t typedef for indices and counts, which is 64 bits wide on 64-bit systems and allows you to fully utilize their memory. This change mostly affects C extension writers. Read PEP-353 if you want all the gory details.
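From pure Python you can observe the limit via sys.maxsize (added in Python 2.6 and still present in Python 3), which is documented as the largest value a Py_ssize_t can hold:

```python
import sys

# sys.maxsize mirrors the largest value a Py_ssize_t variable can hold:
# 2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit builds
print(sys.maxsize)
```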
Memory Functions
Python, as a runtime virtual machine, does a lot of memory management on behalf of your code. Small objects are allocated in 256KB arenas. When you allocate a small object, the memory is taken either from an existing arena with available space or from a new 256KB arena. This arrangement amortizes the cost of frequent memory allocations at the cost of inconsistent allocation time, a reasonable tradeoff for a language like Python. Nevertheless, Python 2.4 never released empty arenas. Thus, if, at the beginning of your program, you allocated lots of small objects and then your program switched to a state that used only a few objects, all the arenas that were allocated initially just sat there and hogged memory. Python 2.5 addresses this issue: empty arenas are de-allocated and the memory is returned to the operating system.
This change resulted in different types of memory functions in the Python C API. Prior to Python 2.5 the various memory function families were all reduced to the system malloc. Now, some functions use obmalloc and some use the plain malloc. This means that it is important to free memory using the correct function. This should concern only extension writers.
The Need for Speed Sprint
The NeedForSpeed sprint was a privately sponsored event that took place from May 21 to 28, 2006, in Reykjavik, Iceland. Several prominent Python hackers were flown in and spent a week improving Python’s performance. The results were integrated into Python 2.5. The major successes were significant improvements to repeated function calls (by caching the associated frame object), huge gains in string performance and string-to-int conversions, reduced interpreter startup time, and faster exceptions. The event produced several orders of magnitude performance improvements!
Author’s Note: The orders-of-magnitude improvements apply to Psyco only. Psyco is a dynamic just-in-time compiler; it is not part of standard Python and it doesn’t work on the Mac.
Metadata for Python packages
The only chink in Python’s armor is its relatively weak support for installation, deployment, and updates of large systems with many dependencies. It might not be important for the typical utility or administration script, but Python is used more and more for developing large-scale systems. The distutils module is the official way of creating and distributing Python packages. It is based on a setup script that can create source and binary distributions, including metadata, for different platforms. Until Python 2.5 it lacked any notion of dependencies between packages.
Python 2.5 added a few metadata fields (based on PEP-314): ‘requires’, ‘obsoletes’, and ‘download_url’. Python also has an online repository for packages called the cheeseshop, which contains an index of downloadable packages. Unfortunately, it seems the new metadata fields don’t really solve the dependency issues because there are no semantics attached to these fields and no tool support.
Python’s salvation may be the setuptools project, again by the prolific Philip J. Eby. This project aims to enhance the distutils module while remaining compatible with it. It is the de facto standard for distributing and installing Python packages. It is at version 0.6c3 and quite usable, but it’s not perfect yet.
Balance and Traction
Python 2.5 is a mostly backward compatible and balanced release. It introduced multiple language enhancements, several new and improved modules in the standard library, and lots of performance enhancements. Best of all, it created healthy traction and innovation without disrupting its growing user base.
Python is well poised to target larger and more complicated systems, while preserving its essential simplicity and the friendliness that attracted so many developers in the first place.