Drill-down on Three Major New Modules in Python 2.5 Standard Library

This article is the second part in a three-part series about Python 2.5. The first part discussed the major changes and enhancements to the Python language itself. This part introduces the main modules that were added to the Python standard library. The third part will discuss a whole bag of smaller improvements and changes that are relevant to specific subsets of the Python community.

Python has a vibrant community that produces lots of useful packages and modules. The best ones (the ones that have proven themselves in the field) sometimes get included in the standard Python library. This is important for several reasons:

  1. High availability: People who deploy large Python-based systems that rely only on standard modules have an easy time with installation, deployment, and upgrades.
  2. High visibility: Being included in the standard library means that the module will be documented in the official Python documentation as well as in Python books. Example programs and articles are more likely to use standard modules because they don’t require special installation (see point 1).
  3. Blessed status: If multiple modules provide the same functionality, the one picked for inclusion in the standard library has evidently been deemed the better choice.

There are three modules recently included in the standard library that I’ll discuss in this article: ctypes, pysqlite (included under the name sqlite3), and ElementTree.

  • ctypes allows calling C functions in dynamic/shared libraries without writing extensions.
  • pysqlite (the sqlite3 module) provides Python bindings for SQLite, a great embedded database.
  • ElementTree is a pythonic and efficient set of XML processing tools.

Arguably, these modules are the most important for the majority of Python users. I’ll discuss the hashlib and wsgiref modules, which are also important, in a third article (coming soon).

Module No. 1: ctypes
Python is slow. Most of the time that doesn’t matter. You might use it simply to write small scripts that finish before you even blink, or you might use it to glue together some tools. You can even write decent games in Python that perform well using a library like PyGame. However, if you develop core parts of a large-scale system in Python you might find out that Python is too slow. In this case you can always write the critical parts in C or C++ and wrap them with an extension module. But of course, this process is not slick and streamlined like pure Python development.

There are many ways to automate it and make it less painful (e.g. SWIG and Boost.Python). ctypes offers a simpler approach. It allows you to call C functions in dynamic libraries directly.

Dynamic libraries use platform-specific mechanisms. ctypes tries very hard to operate at a higher abstraction level, but in some cases it is just impossible. Some libraries may be available only on a certain platform and the library itself may have a different name. In this article, I will use libc for all the examples because it is so ubiquitous. I use Mac OS X, but the examples should work on every Linux/Unix OS. I will also refer to Windows from time to time because there are important capabilities that are available on Windows only.

Finding and Loading Libraries
Before you can start calling those great C functions, you need to locate and load the dynamic library that contains them. There are two ways to locate a library:

  1. You can call the ctypes.util.find_library() function
  2. You can just know where it is

In both cases you end up with a path to the dynamic library.

Here is how to use find_library to find the path to the libc library:

Python 2.5 (r25:51918, Sep 19 2006, 08:49:13)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from ctypes.util import find_library
>>> find_library('c')
'/usr/lib/libc.dylib'

In this code, find_library() is doing its best to shield you from OS-specific details. Note that I didn’t have to specify the extension (.dylib on Mac) or the ‘lib’ prefix.

Once you locate the dynamic library you can load it. There are several loader classes, and which one you use depends on the dynamic library type, the calling convention, the platform, and interaction with the Python C API. On Linux/Mac OS X you should use the CDLL class to load a dynamic library with the C (cdecl) calling convention. On Windows you should use WinDLL for dynamic libraries that use the standard (stdcall) calling convention and OleDLL for COM-style libraries whose stdcall functions return HRESULT codes.

Here is how to load libc on Linux/Mac OSX:

>>> from ctypes import CDLL
>>> from ctypes.util import find_library
>>> libc = CDLL(find_library('c'))
>>> libc
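The locate-then-load steps can be wrapped in a small helper. This is only a sketch: load_libc is a hypothetical name, a POSIX system is assumed, and the fallback to CDLL(None) (the symbols already loaded into the running process) is my own addition for systems where find_library() returns None. It is written so that it also runs on a current Python 3 interpreter.

```python
import ctypes
from ctypes.util import find_library


def load_libc():
    """Return a CDLL handle for the C library (POSIX systems assumed)."""
    path = find_library('c')
    # find_library() may return None on stripped-down systems. In that case
    # fall back to the symbols already loaded into this process.
    if path:
        return ctypes.CDLL(path)
    return ctypes.CDLL(None)


libc = load_libc()
# abs() takes and returns a C int, which matches the ctypes defaults.
print(libc.abs(-5))
```

abs() is a convenient first test because ctypes’ default assumptions (int arguments, int result) happen to be right for it.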

Calling Functions
Python has support for random number generation (well, pseudo-random numbers). The random module provides a bunch of functions to generate anything you want. The problem is that the function most people reach for, random.randint(min, max), is not a simple way to generate a random integer between 0 and X, which is almost always what I want. You have to provide two numbers for min and max, and then you need to know that the random number you get is in the closed range [min, max], which means min <= x <= max. This is not intuitive to me because in computers (and often in math, too) half-open ranges [min, max), which means min <= x < max, are the norm. Even Python’s own range function returns the half-open range. (random.randrange(X) does return a value from the half-open range [0, X), but it is easy to overlook.)

Here is what I have to do to get a random number in the range [0,4) :

>>> import random
>>> random.randint(0, 3)
2

So, I don’t like random.randint(). Luckily, ctypes comes to the rescue by giving me access to libc’s rand() function. rand() takes no arguments and always returns a pseudo-random integer between 0 and RAND_MAX. Converting it to the range [0, 4) is as simple as this:

>>> libc.rand() % 4
1
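The modulo trick generalizes to any upper bound. A minimal sketch, assuming a POSIX system; rand_below is a hypothetical helper name, and the CDLL(None) fallback is my own addition for systems where find_library() finds nothing:

```python
import ctypes
from ctypes.util import find_library

# Load the C library (POSIX assumed); fall back to the current process.
path = find_library('c')
libc = ctypes.CDLL(path) if path else ctypes.CDLL(None)


def rand_below(n):
    """Return a pseudo-random integer in the half-open range [0, n)."""
    # rand() takes no arguments and returns a non-negative C int,
    # so the ctypes defaults need no adjustment here.
    return libc.rand() % n


samples = [rand_below(4) for _ in range(20)]
print(samples)
```

Taking the remainder is slightly biased whenever n does not divide RAND_MAX + 1 evenly, so treat this as a demonstration of calling into libc and not as a high-quality random source.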

Ok, let’s try some math. What is the square root of 1?

>>> libc.sqrt(1)
1

Cool, that works. Let’s try some more:

>>> libc.sqrt(4)
1

Oops. That’s not good. As I recall, the square root of 4 should be 2. What happened? CDLL objects assume that all functions return an int unless you tell them otherwise. That means the return value of sqrt, which is actually a double-precision floating-point number, is read as if it were a C int and handed back as a Python int (types.IntType). The argument has the same problem: the Python int I passed goes in as a C int, not as the double that sqrt expects. That is presumably why everything I tried to feed to sqrt returned 1 (or an overflow error).

The way to fix it is to tell the sqrt function that it should return a double and not an int. ctypes provides a bunch of type factories designed to make it easy to map native C types to Python types. The full list can be found here: http://docs.python.org/dev/lib/node452.html.

Here is how to tell sqrt to return double:

>>> from ctypes import c_double
>>> libc.sqrt.restype = c_double

Now, sqrt also expects a double parameter. You probably think that you can pass in a Python float just like you passed an int, but you would be wrong. Only the following types are converted automatically to C types:

  • None becomes a NULL pointer
  • int and long become the default C int type (exact type depends on platform)
  • strings and unicode strings become char * or wchar_t * respectively.

In order to call a C function that accepts a double you must use one of ctypes type constructors. It’s as simple as calling a function and passing a Python value:

>>> import ctypes
>>> ctypes.c_double(8.65)
c_double(8.6500000000000004)

The small error is an artifact of the way floating point numbers are represented in modern computers and is not a bug. Don’t be alarmed.

So, let’s get the sqrt() function going already:

>>> libc.sqrt(ctypes.c_double(4))
2.0

Yay, it works. What happens if you try to pass a plain Python float? Nothing good, that’s for sure:

>>> libc.sqrt(4.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ctypes.ArgumentError: argument 1: <type 'exceptions.TypeError'>: Don't know how to convert parameter 1

ctypes.ArgumentError is the exception ctypes raises if it can’t convert the object you passed in.
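Besides restype, ctypes functions also have an argtypes attribute. Once it is set, ctypes converts plain Python values to the declared C types on every call, so the explicit c_double wrapper is no longer needed. The sketch below assumes a POSIX system and is written to run on a current Python 3 interpreter as well; load_math_library is a hypothetical helper, needed because on Linux sqrt usually lives in libm and not in libc.

```python
import ctypes
from ctypes.util import find_library


def load_math_library():
    """Return a library handle that exports sqrt (POSIX systems assumed)."""
    for name in ('m', 'c'):
        path = find_library(name)
        if path:
            lib = ctypes.CDLL(path)
            # Attribute lookup fails if the symbol is not exported.
            if hasattr(lib, 'sqrt'):
                return lib
    # Last resort: symbols already loaded into the running process.
    return ctypes.CDLL(None)


libm = load_math_library()
libm.sqrt.restype = ctypes.c_double      # result is a C double, not an int
libm.sqrt.argtypes = [ctypes.c_double]   # convert the argument to a C double

print(libm.sqrt(4.0))
```

Declaring both restype and argtypes up front also lets ctypes reject arguments that cannot be converted, instead of silently passing the wrong thing to C.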

Look Who’s Tokenning!
Let’s take it up a notch and call a relatively complicated function such as strtok. strtok is unusual because it keeps state between consecutive calls. It takes a character buffer (char *) as its first argument and a separator string (const char *) as its second argument. Here is the C signature of strtok:

char *strtok(char *s1, const char *s2);

The strtok() function gets the next token from string s1, where tokens are strings separated by characters from s2. To get the first token from s1, strtok() is called with s1 as its first parameter. Remaining tokens from s1 are obtained by calling strtok() with a null pointer for the first parameter.

Great, a very complicated way to do s1.split(). It has its merits though if you need to process a huge buffer of text that you don’t mind destroying and you don’t want to pay for all the copies that are done implicitly by split(). Anyway, the main point here is how to make ctypes call this function from Python. The process with any function is similar:

  1. Observe the return type and argument types
  2. Wrap argument types that are not supported automatically
  3. Call the function
  4. Get the results

In the case of strtok, s2 can be a regular Python string. The return type, however, is not an int, so it needs wrapping, and s1 needs to be a writable character array. A Python string won’t do for s1. ctypes provides the create_string_buffer() function for creating character arrays from Python strings (or just empty ones with a given size).

The following sample code creates a character array from the Python string ‘123 456 789’, sets the return type to be ctypes.c_char_p and then calls strtok() repeatedly and prints each token until it returns NULL/None:

p = ctypes.create_string_buffer('123 456 789')
libc.strtok.restype = ctypes.c_char_p
x = libc.strtok(p, ' ')
while x:
    print x
    x = libc.strtok(None, ' ')

Output:

123
456
789
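The same four steps (observe the types, wrap what is not converted automatically, call, collect results) can be packaged into a function. A minimal sketch, assuming a POSIX system; tokenize is a hypothetical helper name. It uses bytes literals so that it also runs on a current Python 3 interpreter, where C strings map to bytes and not to str.

```python
import ctypes
from ctypes.util import find_library

# Load the C library (POSIX assumed); fall back to the current process.
path = find_library('c')
libc = ctypes.CDLL(path) if path else ctypes.CDLL(None)

# Step 1: declare the signature char *strtok(char *, const char *).
libc.strtok.restype = ctypes.c_char_p
libc.strtok.argtypes = [ctypes.c_char_p, ctypes.c_char_p]


def tokenize(text, separators=b' '):
    """Split a bytes string with strtok and return the list of tokens."""
    # Step 2: strtok writes into its first argument, so give it a mutable copy.
    buf = ctypes.create_string_buffer(text)
    tokens = []
    # Step 3: the first call passes the buffer, later calls pass NULL (None).
    token = libc.strtok(buf, separators)
    while token is not None:
        # Step 4: a c_char_p result arrives as a Python bytes object.
        tokens.append(token)
        token = libc.strtok(None, separators)
    return tokens


print(tokenize(b'123 456 789'))
```

Because strtok keeps hidden state between calls, this helper is not safe to use from several threads at once.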

Arrays, Pointers and Callbacks
ctypes supports even more constructs that, together, cover everything you may want to do with C function parameters and results. It allows you to define pointers to any data type, arrays of arbitrary data types, and custom structs and unions. It even allows you to define callback functions in Python. That means that a C function that accepts a C function pointer as an argument and calls it during its execution will be able to call your Python function. I will use the venerable qsort function to demonstrate arrays, pointers, and a callback function. You will have to take my word about structs and unions or try it yourself.

Pointers are really easy with ctypes. You can create a pointer type for any ctypes type using the ctypes.POINTER factory function. To create a pointer from an existing variable use the pointer function. To access the value of a pointer px you can use px.contents.value or simply px[0]. Note that ctypes variables are always mutable.

x  = ctypes.c_int(888)
px = ctypes.pointer(x)
print 'x.value=', x.value
print 'px[0]=', px[0]
px.contents.value = 444
print 'x.value=', x.value
print 'px[0]=', px[0]

Output:

x.value= 888
px[0]= 888
x.value= 444
px[0]= 444
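The relationship between POINTER (which builds a pointer type) and pointer (which builds a pointer instance) is easy to confuse. The following sketch shows both side by side; the variable names are illustrative, and it is written so that it also runs on Python 3.

```python
import ctypes

# POINTER() builds a pointer *type*; pointer() builds a pointer *instance*.
IntPointer = ctypes.POINTER(ctypes.c_int)

x = ctypes.c_int(888)
px = ctypes.pointer(x)

# Both spellings read the value the pointer refers to.
before = (x.value, px.contents.value, px[0])

# Writing through the pointer changes the original variable.
px.contents.value = 444
after = (x.value, px[0])

print(before)
print(after)
```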

Arrays are even easier. You need to create an array type for each element type and size. An array of three integers is a different type from an array of five integers. You create an array type by multiplying a ctypes type by an integer n. Once you have an appropriate array type you create an instance by using the array type as a factory function and passing in the elements of the array. You can access the array elements using standard indexing or standard iteration (for loop).

garbled_song = ('mo', 'mini', 'ini', 'miny')
# Create an array type of char pointers the size of the garbled song
StringArrayType = ctypes.c_char_p * len(garbled_song)
# Create an instance of this array and assign it the garbled song
strings = StringArrayType(*garbled_song)
print ' '.join(strings)
# Modify an element of the array
strings[1] = 'used_to_be_mini'
print ' '.join(strings)

Output:

mo mini ini miny
mo used_to_be_mini ini miny
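The point that each element type and size yields a distinct array type is easiest to see with plain integers. A short sketch (the names FiveInts and ThreeInts are illustrative only):

```python
import ctypes

# Multiplying a ctypes type by n creates an array type of n elements.
FiveInts = ctypes.c_int * 5
ThreeInts = ctypes.c_int * 3

# The array type doubles as a factory for instances.
numbers = FiveInts(5, 1, 4, 2, 3)

# Arrays are mutable and support indexing, len() and iteration.
numbers[0] = 50
total = sum(numbers)

print(list(numbers))
print(total)
```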

Now, I’ll put everything together and show you how to implement callback functions in Python (called from C). The children’s song/expression “ini mini miny mo” has the nice property of being an alphabetically sorted sequence of words. I’ll use the indispensable qsort function to sort the garbled sentence: “mo mini ini miny.”

qsort is very convenient since it’s available in libc and it operates on arrays using pointers. Here is the C signature of qsort:

void qsort(void *base, size_t length, size_t width,
           int (*compare)(const void *, const void *));

This means that qsort is a function that returns nothing and accepts an array of arbitrary items as its first argument, the number of elements in the array as the second argument, the size (in bytes) of each element as its third argument, and finally a comparison function pointer as its fourth argument. The comparison function should accept two (const) pointers to array elements and return a negative number if the first element is smaller than the second element, 0 if they are equal, and a positive number if the second element is smaller than the first element.

This is pretty complicated, but ctypes comes through. You already know how to define pointers and arrays, so the only unknown is how to define a callback function. ctypes provides the CFUNCTYPE factory function. You pass in as first argument the result type and then the types of the arguments in order. Here is the definition of the comparison function type:

CmpFuncType = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_char_p), ctypes.POINTER(ctypes.c_char_p))

Note that I specified as arguments pointers to ctypes.c_char_p because I intend to compare strings. The signature would be different if I wanted to compare other data types. The next step is to define a Python comparison function that corresponds to this signature. This is very easy. The only semi-tricky part is that the s1 and s2 are passed in as pointers so I have to dereference them using the s1[0] and s2[0] indexing. Then I just use Python’s built-in cmp function that follows the same convention of returning a negative, zero, or positive integer based on the result of the comparison.

def string_compare(s1, s2):
    return cmp(s1[0], s2[0])

All the pieces are in place and I can finally sort the sentence. The initial setup code is identical to the arrays sample code from earlier. I set the result type of qsort to None (because qsort is a void C function) and then invoke it using the strings array and the string_compare comparison function. Note that although garbled_song is an immutable Python tuple, the array created from it is very mutable (otherwise qsort wouldn’t work).

garbled_song = ('mo', 'mini', 'ini', 'miny')
StringArrayType = ctypes.c_char_p * len(garbled_song)
strings = StringArrayType(*garbled_song)
print ' '.join(strings)
libc.qsort.restype = None
libc.qsort(strings,
           len(strings),
           ctypes.sizeof(ctypes.c_char_p),
           CmpFuncType(string_compare))
print ' '.join(strings)

Output:

mo mini ini miny
ini mini miny mo
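For readers on a newer interpreter, here is a sketch of the same qsort call that also runs on Python 3. Two things differ from the article’s code and are my own adaptations: the strings are bytes literals, and because the built-in cmp function no longer exists there, the comparison is spelled out by hand. Declaring argtypes for qsort is also an addition; a POSIX system is assumed.

```python
import ctypes
from ctypes.util import find_library

# Load the C library (POSIX assumed); fall back to the current process.
path = find_library('c')
libc = ctypes.CDLL(path) if path else ctypes.CDLL(None)

# Callback type: int (*compare)(char **, char **)
CmpFuncType = ctypes.CFUNCTYPE(ctypes.c_int,
                               ctypes.POINTER(ctypes.c_char_p),
                               ctypes.POINTER(ctypes.c_char_p))


def string_compare(s1, s2):
    # Dereference both pointers, then return negative, zero or positive.
    a, b = s1[0], s2[0]
    return (a > b) - (a < b)


libc.qsort.restype = None  # qsort is a void function
libc.qsort.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                       ctypes.c_size_t, CmpFuncType]

garbled_song = (b'mo', b'mini', b'ini', b'miny')
strings = (ctypes.c_char_p * len(garbled_song))(*garbled_song)

# Keep a reference to the callback object for as long as C may call it.
compare = CmpFuncType(string_compare)
libc.qsort(strings, len(strings), ctypes.sizeof(ctypes.c_char_p), compare)

sorted_song = list(strings)
print(b' '.join(sorted_song))
```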

Module No. 2: sqlite3
Python takes data persistence very seriously. If you want to see some evidence just check: http://docs.python.org/lib/persistence.html. You will find 12 modules besides sqlite3. These modules can be classified into two groups: object serialization and dictionary-like databases based on the ndbm interface. I have seen the object serialization modules (especially pickle) used a lot in Python projects and libraries but I have never seen any of the database modules used in any real code. This doesn’t say much because I am just one person and maybe my interests don’t intersect with the people who use these interfaces. However, there is another path to data persistence that I believe is often traveled. In addition to its plethora of data persistence modules Python also defines a standard DB interface called DB API 2.0.

DB API 2.0 is defined at http://www.python.org/dev/peps/pep-0249/. This API provides a different conceptual model based on connection and cursor objects and is tailored for accessing relational databases. Now, this is an API I have seen used extensively, especially in the web programming world (TurboGears, SQLObject, SQLAlchemy).

There are DB-API 2.0 bindings for all the major relational databases out there and it is used by many libraries and middleware object-relational mappers. The full list is here: http://www.python.org/doc/topics/database/modules/. All these bindings aren’t part of the Python standard library and of course the databases themselves are third-party software. This means that if you develop an application that accesses a relational database you have to do some installation; that can be as complicated as it gets for enterprise-scale databases. sqlite3 provides a great option for developers. You can develop your application using the built-in sqlite3 module without even thinking about it, and if you ever need to upgrade to a stronger database your code should be compliant in most cases. I will talk later about porting sqlite code to other databases.

Using sqlite for rapid development is not a new trick. People have used it prior to Python 2.5, but it could get irritating in some cases. Modern Linux distros come with sqlite, but which version? The two ubiquitous versions are sqlite 2.8.x and 3.x. On Windows you have to download the sqlite3 DLL and put it on your path. Once you had the right version of sqlite installed, you had to download and install the pysqlite bindings, which had confusing nomenclature (pysqlite 1.x for sqlite 2.8.x and pysqlite 2.x for sqlite 3.x). I wasted a whole day trying to install pysqlite 2.1.2 on Gentoo due to conflicts with pysqlite 1.x, which was already installed. Now that it’s part of the standard library it is guaranteed to be available and there is no need for special installation on any platform.

sqlite3 itself is a fantastic embedded database. I had the pleasure to work with it directly in C in a couple of C++ projects and through an object-relational mapper (SQLObject) in a Python/TurboGears project. It is not fully ANSI-SQL 92 compliant, but comes close enough. It requires zero administration, provides an easy command line interface for browsing DB files and there are a bunch of graphical front-ends you can use.

Enough talking; let’s see some code. The goal of the next example is to develop an inventory system for a role-playing game. The player is a barbarian battle mage that can equip itself with weapons, armor, and rings. The database will contain four tables: item_types, equipment, inventory, and items. The equipment table will contain all the items the hero is currently wearing and wielding. The inventory table will contain all the non-equipment items in its possession. The items table will contain all the items in the game. The item_types table will contain the possible types of items (‘armor’, ‘weapon’, ‘ring’). This is very simplistic of course and you may come up with a different DB design, but it will do for showcasing sqlite3.

First thing is to create the database. With sqlite3 you can just open a file and if it doesn’t exist a fresh empty database is created. How convenient is that? To create a schema and populate it with values you can use a simple text file that contains DDL (data definition language) and SQL commands. Listing 1 shows the game.sql file for the game.db.

You can do it programmatically too, which may be appropriate for insertions. You will see how later. For the time being I want to create a DB file and populate it from game.sql. I invoke the sqlite3 interactive console with game.db as a parameter. It creates this DB file automatically (subsequent calls will load the existing file). The sqlite prompt comes up. The version I use (3.1.3) is pretty outdated and is just what’s installed on my Mac by default. It will do just fine for playing interactively with databases processed programmatically by other 3.x versions of sqlite. Interactive commands start with a dot (‘.’). (For example, type .help to get some help.) I used the .read command to load the game.sql file and then .tables to see that all the tables were created. Finally, I select all the items:

[Gigi] > sqlite3 game.db
SQLite version 3.1.3
Enter ".help" for instructions
sqlite> .read game.sql
sqlite> .tables
equipment        item_types       sqlite_sequence
inventory        items
sqlite> select * from items;
1|1|iron armor
2|1|gold armor
3|1|50% invisibility armor
4|2|battle axe
5|2|morning star
6|2|spear
7|2|broad sword of the damned
8|3|ring of protection (20% shield)
9|3|ring of might (+5 strength)
10|3|ring of swiftness (+4 speed)
sqlite>

Now I have a database with some items, but the hero is unarmed, bankrupt, and naked: no weapon, no inventory, and no armor. If you don’t trust me just check the inventory and equipment tables and verify that they are empty.

The sqlite3 module implements the DB-API 2.0 interface. This interface is built on the notions of connections and cursors. Connections allow you to connect to a particular instance of your DB. And cursors allow you to execute DDL and SQL commands, which extract data from the DB and populate the cursor, allowing you to iterate over it.

I’ll begin by accessing my items table programmatically. First I have to create a connection by calling the connect function and pass a DB filename. You can set the isolation_level to None to get auto-commit behavior (every statement is committed immediately to the DB). Then you should call the connection’s cursor() method to get a cursor you will use to execute commands and browse the results.

import sqlite3

conn = sqlite3.connect('game.db')
conn.isolation_level = None
c = conn.cursor()
c.execute('select * from items')
for x in c:
    print x

Output:

(1, 1, u'iron armor')
(2, 1, u'gold armor')
(3, 1, u'50% invisibility armor')
(4, 2, u'battle axe')
(5, 2, u'morning star')
(6, 2, u'spear')
(7, 2, u'broad sword of the damned')
(8, 3, u'ring of protection (20% shield)')
(9, 3, u'ring of might (+5 strength)')
(10, 3, u'ring of swiftness (+4 speed)')

Each row in the result set is represented by a tuple. The order of the columns corresponds to the order of the fields in the query. In the case of ‘select * from …’ the order is determined by the schema. Note that there is no need to terminate the command with the otherwise mandatory SQL semicolon.

The whole model is very simple and intuitive: You create a connection and get a cursor, then you execute commands on the cursor, and the cursor returns the results.
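The connect, cursor, execute, iterate cycle can be tried without any file on disk by using an in-memory database. In this sketch the three-column items table is an assumption that mirrors the rows printed above (id, type, description); the real schema lives in game.sql. The code is written so that it also runs on Python 3.

```python
import sqlite3

# ':memory:' creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(':memory:')
conn.isolation_level = None  # auto-commit, as in the article
c = conn.cursor()

# Assumed schema, modeled on the rows shown in the article.
c.execute('create table items ('
          'id integer primary key, type integer, description text)')
c.executemany('insert into items (type, description) values (?, ?)',
              [(1, 'iron armor'),
               (2, 'battle axe'),
               (3, 'ring of might (+5 strength)')])

# Executing a select populates the cursor; iterating yields one tuple per row.
c.execute('select * from items order by id')
rows = [row for row in c]

for row in rows:
    print(row)
```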

To make it a little more interesting I’ll define an Item class and a Hero class, which manage equipment and inventory persistently by interacting with the DB. The implementation is very simplistic and should not be used as the basis of industrial-strength code. Both classes rely on a global connected cursor named ‘c’. Again, don’t do this at home. This is for demonstration purposes only.

The Item class accepts an item_id, selects it from the items table using the ‘c’ cursor, and provides some accessors to its attributes. The only interesting part is that I have to use the fetchone() method of the cursor to get to the selected item. You can use next() too. Cursors are iterables: Even if the result set contains just one row, you can use either fetchone() or next() to get to it. When you’re using the for loop Python calls next on your behalf.

class Item(object):
    def __init__(self, item_id):
        self.row = c.execute("select * from items where id==%d" % item_id).fetchone()
        self.id = item_id
        self.type = self.row[1]
        self.description = self.row[2]
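The class above builds its SQL with string formatting, which is fine for a demo with integer ids. The DB-API also supports parameter substitution, where the driver fills in the values; sqlite3 uses ‘?’ placeholders. Here is a sketch of the same class in that style. Passing the cursor in explicitly and raising KeyError for a missing row are my own choices for illustration, and the table layout is assumed as before.

```python
import sqlite3


class Item(object):
    """Variant of the article's Item that uses '?' parameter substitution."""

    def __init__(self, cursor, item_id):
        # The driver substitutes the value, so no string formatting is needed.
        row = cursor.execute(
            'select id, type, description from items where id = ?',
            (item_id,)).fetchone()
        if row is None:
            raise KeyError(item_id)
        self.id, self.type, self.description = row


# Tiny in-memory stand-in for game.db (assumed schema).
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('create table items ('
          'id integer primary key, type integer, description text)')
c.execute("insert into items values (1, 1, 'iron armor')")
c.execute("insert into items values (2, 2, 'battle axe')")

axe = Item(c, 2)
print(axe.description)
```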

The Hero class (see Listing 2) is more exciting. It provides pickItem and dropItem methods that insert items into and delete items from the inventory. It also provides a semi-private _show method that simply dumps the content of a table to the standard output using print. In the __init__ method I’m using the new Python 2.5 partial function application feature to concisely define two new methods, showInventory() and showEquipment(), which are just partial applications of the _show() method. This is much cleaner than defining actual functions that call _show with a parameter (see Listing 2).

Now, I can create some items in the game world and let the hero pick them up:

h = Hero()
i1 = Item(1)
i2 = Item(2)
h.pickItem(i1)
h.pickItem(i2)
h.showInventory()

Output:

----- inventory -----
(25, 1, 1) iron armor
(26, 1, 2) gold armor
---------------------

Dropping an item is as easy as calling the dropItem() method and passing in the id of the item to drop (if there are multiple items with the same id, all of them will be dropped).

h.dropItem(i1.id)
h.showInventory()

Output:

----- inventory -----
(26, 1, 2) gold armor
---------------------

You know how to pick up items and store them in the protagonist’s inventory. The hero is still freezing though. To equip him with some outerwear you’ll need some more code. The equipment table represents all the items the hero wears or can use. I’ll add an equip() method to the Hero class. This method will delete an item from the inventory and insert it into the equipment table. This is a non-atomic operation; if only one of the two steps succeeds you will either lose an item (if delete succeeds but insert fails) or end up with a duplicate (if delete fails but insert succeeds). Relational databases thrive on such conundrums and sqlite is no different. The way to handle it properly is using transactions.

Remember that the connection operates in auto-commit mode. That means that it commits after each command. However, by wrapping the two operations in begin and commit/rollback commands you get transactional semantics without having to modify the isolation_level. The try-except block guarantees that if something goes wrong the transaction will be rolled back. If everything works, the transaction is committed at the end of the try block.

    def equip(self, id):
        try:
            c.execute('begin')
            item = Item(id)
            c.execute('delete from inventory where item_id=%d' % id)
            c.execute('insert into equipment (type, item_id) values(%d, %d)' % (item.type, item.id))
            c.execute('commit')
        except Exception, e:
            print e
            c.execute('rollback')
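To see the rollback actually protect the data, here is a self-contained sketch with a deliberately failing insert. The simplified two-table schema, the unique constraint that provokes the failure, and the stand-alone equip function are all assumptions made for the demonstration; they are not the schema from game.sql. The code also runs on Python 3.

```python
import sqlite3


def equip(conn, item_id):
    """Move an item from inventory to equipment as one transaction."""
    c = conn.cursor()
    try:
        c.execute('begin')
        c.execute('delete from inventory where item_id = ?', (item_id,))
        c.execute('insert into equipment (item_id) values (?)', (item_id,))
        c.execute('commit')
        return True
    except sqlite3.Error:
        # Undo the delete so the item is not lost.
        c.execute('rollback')
        return False


conn = sqlite3.connect(':memory:')
conn.isolation_level = None  # auto-commit; transactions are explicit
conn.execute('create table inventory (item_id integer)')
conn.execute('create table equipment (item_id integer unique)')
conn.execute('insert into inventory values (1)')
conn.execute('insert into inventory values (2)')
conn.execute('insert into equipment values (2)')  # item 2 is already equipped

first = equip(conn, 1)   # both statements succeed
second = equip(conn, 2)  # the insert violates the unique constraint

inventory = [r[0] for r in conn.execute('select item_id from inventory')]
equipment = [r[0] for r in
             conn.execute('select item_id from equipment order by item_id')]
print(inventory)
print(equipment)
```

After the failed second call, item 2 is still in the inventory: the delete that had already run inside the transaction was undone by the rollback.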

The Item class seems like a kludge. It’s convenient but if you have lots of objects writing such a class for each one gets old fast. There are various solutions to this problem, everything from ad-hoc DAO (data access object) code generators to full-fledged ORM (object-relational mappers). The sqlite3 module features a modest and lightweight solution of its own. Enter the row factory. The connection object has a row_factory attribute that you can set to any callable to control the object that’s returned for each row (instead of the default tuple). sqlite3 provides a useful row_factory called Row that returns an object that allows accessing columns by index like a tuple but also by name (case-insensitively).

conn.row_factory = sqlite3.Row
c = conn.cursor()
c.execute('select * from items')
item = c.fetchone()
print item['id'], item['tYPe'], item['DesCrIpTion']

Output:

1 1 iron armor
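A self-contained version of the row factory example, using an in-memory database so that it runs without game.db. The table layout is again assumed from the rows shown earlier.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Every cursor created after this returns sqlite3.Row objects, not tuples.
conn.row_factory = sqlite3.Row
c = conn.cursor()

# Assumed schema, modeled on the rows shown in the article.
c.execute('create table items ('
          'id integer primary key, type integer, description text)')
c.execute("insert into items values (1, 1, 'iron armor')")

c.execute('select * from items')
item = c.fetchone()

# Access by position, by name, and by name in any letter case.
by_index = item[2]
by_name = item['description']
by_odd_case = item['DesCrIpTion']
columns = item.keys()

print(by_index)
print(columns)
```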

sqlite3 has a few other interesting features, like an authorizer that can intercept access to any column, converters and adapters to translate between SQL types and Python types, and of course the in-memory database that allows for lightning-quick tests. I encourage you to explore this wonderful module further.

Module No. 3: xml.etree.ElementTree
This module contains pythonic XML processing tools for parsing and constructing XML documents. Python boasts several standard XML modules that support the DOM and SAX APIs. However, the DOM API (xml.dom.minidom) is modeled after the W3C DOM API and is quite cumbersome. ElementTree is the brainchild of Fredrik Lundh (http://www.effbot.org). It is a highly pythonic and high-performance XML package. Lundh also contributed cElementTree, which is a C extension that exposes the same API as the Python package. The performance of cElementTree is amazing (both speed and memory footprint).

Many pythonistas reject XML as a data exchange format altogether and prefer to simply use direct Python data structures for data exchange. This can be done either as plain text (to be evaluated on the other side using the eval() function) or pickled. However, no one can escape XML these days. It is especially dominant in the important web services domain. To discuss ElementTree, I will continue with the role-playing game example.

ElementTree is based on the Element data type. An element has a tag and may also have children (sub-elements), attributes (key-value pairs), content (text string), and a tail (text string that follows the element until the next sibling element). ElementTree is optimized for non-mixed data models (where text never contains elements), which will be the focus of this article.

The Forbidden Forest
Our well-equipped hero from the sqlite3 section is about to enter an ominous forest. Game areas are created and exchanged using XML in the game because XML is better suited for dealing with hierarchical data structures. The forest contains enemies, treasures and other items. ElementTree lets you express it very concisely:

import os
from xml.etree.ElementTree import (ElementTree,
                                   Element,
                                   SubElement,
                                   dump,
                                   XML)

root = Element('forest')
SubElement(root, 'treasures')
SubElement(root, 'enemies')
SubElement(root, 'items')
print 'The root has', len(root), 'sub-elements'
for e in root:
  print e.tag
print
dump(root)

Output:

The root has 3 sub-elements
treasures
enemies
items

I created a root Element giving it a tag. You can also provide attributes as a dictionary and even more attributes via named parameters. After creating the root element I created a few sub-elements of the root (treasures, enemies, items). You can iterate over sub-elements using a simple for loop over an element. The len() of an element returns the number of its sub-elements. This is the essence of the “pythonicity” of ElementTree. It uses Python idioms to expose its data model.
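The two ways of supplying attributes (a dictionary and named parameters) can be combined in one call. In this sketch the attribute names ‘name’ and ‘danger’ are invented purely for illustration; the code also runs on Python 3.

```python
from xml.etree.ElementTree import Element, SubElement

# Attributes may come from a dictionary, from keyword arguments, or both.
root = Element('forest', {'name': 'Forbidden Forest'}, danger='high')

for tag in ('treasures', 'enemies', 'items'):
    SubElement(root, tag)

# An element behaves like a sequence of its sub-elements.
tags = [child.tag for child in root]

print(len(root))
print(tags)
print(root.get('danger'))
```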

The dump() function takes an Element and dumps its contents recursively to the screen. It is very handy for interactive development. You create some elements, hook them up together, and dump the root to the screen to make sure you got it right.

It’s time to inhabit the forest with some fearsome creatures. The creatures in this game have a name, numeric life and strength attributes, and an optional special attack or power. When life reaches 0 the creature (or the hero) is dead. The strength determines the damage the creature dishes out in each attack and the special attack is, well, special. It can affect many aspects of a battle. Let’s start with your garden-variety crazy ogre. In order to add a crazy ogre to the enemies in the forest I use the find() method of the root element to find the enemies sub-element and then append() an ‘enemy’ element with the various attributes. Note that I passed some of the attributes as a dictionary, but the ‘special’ attribute as a named parameter. This is fine: all the attributes are equivalent.

enemies = root.find('enemies')
enemies.append(Element('enemy',
                       {'name':'Crazy Ogre',
                        'life':'85',
                        'strength':'18'},
                       special='bone crusher'))
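Once the ogre is in the tree, it can be located again with a path expression and its attributes read back. Attribute values are always strings, so the numeric ones need converting before any arithmetic. In this self-contained sketch the damage value of 20 is made up for the example.

```python
from xml.etree.ElementTree import Element, SubElement

root = Element('forest')
SubElement(root, 'enemies')

enemies = root.find('enemies')
enemies.append(Element('enemy',
                       {'name': 'Crazy Ogre',
                        'life': '85',
                        'strength': '18'},
                       special='bone crusher'))

# find() accepts a simple path, so nested elements can be reached directly.
ogre = root.find('enemies/enemy')

# Attribute values are strings; convert before doing arithmetic.
remaining_life = int(ogre.get('life')) - 20

print(ogre.get('name'))
print(remaining_life)
```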

Let’s dump the forest and see what it looks like:

dump(root)

Output:

That code is not very readable and I’ve only dumped a single enemy. What would it look like with a few more enemies?

Pretty Dumping
The problem with the dump() function is that when you build your element tree using ElementTree’s Element and SubElement classes no indentation or new lines are added to the XML. You end up with a single long line of verbose XML.

I created a little recursive function called pretty_dump() that takes an Element and returns an XML string with a nice layout of its content, including all sub-elements. Nested elements are indented. Elements with no children appear on a single line. Every element starts on a new line. The code processes every element recursively, increasing the indentation level by two spaces and building the XML string incrementally. The function doesn’t print anything to the screen; it just returns the final XML string. I used os.linesep, the line terminator (one or more characters) that Python defines for every platform, to make sure the output works nicely everywhere.

import os

def pretty_dump(e, ind=''):
    # start with indentation
    s = ind
    # put tag (don't close it just yet)
    s += '<' + e.tag
    # add all attributes
    for (name, value) in e.items():
        s += ' ' + name + '=' + "'%s'" % value
    # if there is text close start tag, add the text and add an end tag
    if e.text and e.text.strip():
        s += '>' + e.text + '</' + e.tag + '>'
    else:
        # if there are children...
        if len(e) > 0:
            # close start tag
            s += '>'
            # add every child in its own line indented
            for child in e:
                s += os.linesep + pretty_dump(child, ind + '  ')
            # add closing tag in a new line
            s += os.linesep + ind + '</' + e.tag + '>'
        else:
            # no text and no children, just close the starting tag
            s += ' />'
    return s

Here is pretty_dump() in action:

print pretty_dump(root)

Output:

            

Parsing XML
ElementTree is not just an XML builder. It is a parser too. It can take an XML file or string and create an ElementTree out of it. This area of ElementTree is probably the most commonly used one, and yet the interface is pretty clunky IMHO. To parse files you call the parse() function with a filename; to parse an XML string you can use one of two identical functions: XML() or fromstring(). Nothing is consistent about this choice of functions. It feels wrong to have more than one way to parse XML strings.
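Here is a small sketch of the three entry points side by side. The XML snippet and the temporary file are stand-ins I made up for the example. Note that parse() returns an ElementTree wrapper, so you need getroot() to reach the root Element, while XML() and fromstring() return the root Element directly.

```python
import os
import tempfile
from xml.etree.ElementTree import XML, fromstring, parse

# A made-up snippet, just large enough to exercise the parsers.
forest_xml = "<forest><enemies><enemy name='Crazy Ogre' /></enemies></forest>"

# XML() and fromstring() both turn a string into a root Element.
root_a = XML(forest_xml)
root_b = fromstring(forest_xml)

# parse() reads from a file, so write the snippet to a temporary file first.
fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'w') as f:
        f.write(forest_xml)
    root_c = parse(path).getroot()
finally:
    os.remove(path)

names = [r.find('enemies/enemy').get('name') for r in (root_a, root_b, root_c)]
print(names)  # ['Crazy Ogre', 'Crazy Ogre', 'Crazy Ogre']
```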

The forest contains only a single crazy ogre. It’s not much of a challenge. Let’s add a bunch of other enemies. How about Godzilla, Micro Godzilla, King Kong, Prince Kong, a Fearsome Dragon, a Hot Dragoness, a Drunk Dragon, a Killer Bunny, and a Wolf Pack for good measure. Simply passing the XML string that describes each enemy to the XML() function is enough to create an Element that contains sub-elements for each enemy. The wolf pack is a nested enemy that contains individual wolves.

# initializing from a string
enemies_xml = """
"""
# Creating an Element from a string
enemies = XML(enemies_xml)

The output looks just like the input in this case, which is a good validation of the pretty_dump function.

Finding Your Way in the Forest
So, you have a nicely populated forest with lots of enemies and the hero is ready to bravely enter it. The hero is powerful, courageous, and can dance like a ballerina. Unfortunately he is also a stompophobe. Stompophobes, as you very well know, are deathly afraid of being stomped. This is a very rational attitude in a world where the likes of Godzilla and King Kong walk the earth.

The hero naturally has access to our forest XML file, and he wishes to know about all the stompers in the area. ElementTree sports several flavors of finding stompers such as find(), findall(), and findtext(). All these functions accept a parameter that can be either a tag name or a limited XPath expression. ElementTree supports a very basic subset of XPath. You can search for a specific tag in your direct children or on an entire tree or you can start from a specific branch. For example, to find all the compound enemies in the forest the following expression will do:

compound = enemies.findall('./enemy/enemy')
for e in compound:
    print e.get('name')

Output:

Wolf 1
Wolf 2
Wolf 3

The XPath expression finds all the wolves in the wolf pack. Note that there is no way to get the parent of a node, so if you want to find the compound itself, you are out of luck. The XPath support in Python 2.5's ElementTree doesn’t include attributes. This means that there is no way to perform searches based on attribute names or values (or the text content of nodes). This poses a serious problem for the hero, since the ‘stomping’ special attack is stored in attributes of the forest creatures. A lesser man or woman would probably buckle and go around the forest. The hero, however, is both heroic and well versed in the art of XML and ElementTree. He decides to transform the forest so that attributes become elements.
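Elements hold no reference to their parent, but a common workaround is to walk the tree once and record each child's parent in a dictionary. The sketch below is not part of the hero's plan; it only shows that idiom on a small made-up sample. It uses iter(), the current spelling of what the ElementTree shipped with Python 2.5 calls getiterator().

```python
from xml.etree.ElementTree import XML

# A made-up sample with one compound enemy.
enemies = XML("""
<enemies>
  <enemy name='Killer Bunny' />
  <enemy name='Wolf Pack'>
    <enemy name='Wolf 1' />
    <enemy name='Wolf 2' />
  </enemy>
</enemies>
""")

# Map every child element to its parent by walking the whole tree once.
parent_of = dict((child, parent)
                 for parent in enemies.iter()
                 for child in parent)

# Climb from each nested enemy back up to the compound that contains it.
packs = []
for wolf in enemies.findall('./enemy/enemy'):
    pack = parent_of[wolf]
    if pack not in packs:
        packs.append(pack)

pack_names = [p.get('name') for p in packs]
print(pack_names)  # ['Wolf Pack']
```

The map is a snapshot: if you add or move elements afterwards, you have to rebuild it.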

Here is his plan: Recursively scan the element tree. For each element with attributes, create an ‘attributes’ element, insert into it a sub-element for each attribute (the tag is the attribute name, the text is the value), and clear the original attributes. Note that I created the ‘attributes’ element using Element and not SubElement. This creates a standalone element that I later insert() explicitly as a sub-element. The reason I didn’t use SubElement is two-fold: I wanted to show you another way to add sub-elements, and I also wanted to make sure the ‘attributes’ sub-element would be the first sub-element. The SubElement() function always appends the new sub-element.

def attributes2elements(e):
    for child in e.getchildren():
        attributes2elements(child)
    # items() must be called; the bare method object is always true
    if e.items():
        # make sure that the attributes element is the first one
        attributes = Element('attributes')
        e.insert(0, attributes)
        for (name, value) in e.items():
            a = SubElement(attributes, name)
            a.text = value
        e.attrib = {}
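The ordering difference between SubElement() and insert() is easy to see in isolation. In this sketch the ‘appended’ tag is a placeholder name I chose only to make the resulting order visible:

```python
from xml.etree.ElementTree import Element, SubElement

enemy = Element('enemy')
SubElement(enemy, 'enemy')      # an existing child, such as a wolf in a pack

# SubElement() always appends, so this child ends up last.
SubElement(enemy, 'appended')   # placeholder tag for illustration

# A standalone Element can be placed at any index with insert().
enemy.insert(0, Element('attributes'))

order = [child.tag for child in enemy]
print(order)  # ['attributes', 'enemy', 'appended']
```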

In Listing 3, the hero wastes no time and invokes attributes2elements on the original forest XML string. The verbosity tripled instantly, but at least the information is preserved and the XML contains no attributes.

Detecting Stompers
At this point I can invoke one of the find() functions to locate stompers. However, it is not very simple. Here’s why.

stompers = [e for e in enemies.findall('.//special') if e.text == 'stomping']
for s in stompers:
    print pretty_dump(s)

Output:

<special>stomping</special>
<special>stomping</special>
<special>stomping</special>
<special>stomping</special>

This code indeed locates all the stompers, but only their special element. There is no way to climb back up and find the ‘enemy’ element. (Take note of the XPath expression to locate all the sub-elements under the current node in any level.) In order to find the stomping enemy elements some ingenuity is required. Listing 4 checks every enemy (working at the enemy element level) to see if it has a special attribute with a value of stomping.
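One way to work at the enemy level can be sketched as follows. This is my own minimal version, not necessarily the code in the listing, and the sample data (names and special attacks) is invented. It assumes the forest has already been transformed so that each enemy carries an ‘attributes’ sub-element.

```python
from xml.etree.ElementTree import XML

# Hypothetical sample in the post-transformation layout.
enemies = XML("""
<enemies>
  <enemy>
    <attributes><name>Godzilla</name><special>stomping</special></attributes>
  </enemy>
  <enemy>
    <attributes><name>Killer Bunny</name><special>biting</special></attributes>
  </enemy>
  <enemy>
    <attributes><name>Wolf Pack</name></attributes>
    <enemy>
      <attributes><name>Wolf 1</name></attributes>
    </enemy>
  </enemy>
</enemies>
""")

def find_stompers(root):
    """Return every 'enemy' element whose own special attack is stomping."""
    # findtext() returns None when the path does not exist, so enemies
    # without a special attack are simply skipped.
    return [e for e in root.findall('.//enemy')
            if e.findtext('attributes/special') == 'stomping']

stomper_names = [e.findtext('attributes/name') for e in find_stompers(enemies)]
print(stomper_names)  # ['Godzilla']
```

Because the filter starts from the enemy elements and looks down into each one, there is no need to climb back up the tree.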

Elementary ElementTree
ElementTree is a fine piece of software that proves that a friendly API can also be performant. It offers much more than I covered here, including decent namespace support, fine-grained XML tree building, reading and writing to files, etc. For performance buffs, cElementTree is a real boon. The official documentation is here: http://docs.python.org/dev/lib/module-xml.etree.ElementTree.html, but it is very weak. I recommend going to the source: http://effbot.org/zone/element-index.htm. And be sure to keep your eye out for many fine tutorials and articles by third-party developers.
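As a parting sketch, here is a common way to pick up the C implementation when it is available and fall back to the pure Python module otherwise, followed by a round trip through a file. The ET alias and the temporary file are my own choices for the example.

```python
import os
import tempfile

# Prefer the C accelerator where it exists; otherwise use the Python module.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

root = ET.Element('forest')
ET.SubElement(root, 'enemies')

# Write the tree to a temporary file and read it back.
fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
try:
    ET.ElementTree(root).write(path)
    reloaded = ET.parse(path).getroot()
finally:
    os.remove(path)

reloaded_tags = [child.tag for child in reloaded]
print(reloaded.tag, reloaded_tags)
```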
