Dig Deep into Python Internals, Part 2 : Page 2

Advanced techniques such as metaclasses, code injection, and call-stack walking harden Python for the enterprise. One novel use of Python's dynamic nature allows you to add private code access checking. Follow along to learn how.

Hardening Python
Python, being the free spirit that it is, has no real access-checking mechanism (private, public, protected, package, etc). Any variable and any function can be accessed from any piece of code, so long as you qualify it properly. Python does provide a sort of name-hiding feature for class attributes. Attributes that start with two underscores (e.g. __blabla) and end with at most one underscore (e.g. __blabla_ is ok, __blabla__ is not ok) are implicitly prefixed by an underscore and the class name. So, __blabla becomes _classname__blabla (assuming classname is really the name of the class). Code inside the class can access the attribute using the short name (__blabla), but external code will have to use the full name (_classname__blabla).

The Puritan class in the example code below declares two "private" variables and one "non-private". Note that the dump() method can access all variables with their regular name, while the external code must qualify the attribute name with '_Puritan'.

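class Puritan(object):
    __classPrivate = 3      # mangled to _Puritan__classPrivate
    __notPrivate__ = 5      # two trailing underscores, so not mangled (value assumed)

    def __init__(self):
        self.__instancePrivate = 4      # mangled to _Puritan__instancePrivate

    def dump(self):
        # Inside the class the short names work.
        print(Puritan.__classPrivate)
        print(self.__instancePrivate)

if __name__ == '__main__':
    p = Puritan()
    # The short names fail outside the class.
    try:
        print(Puritan.__classPrivate)
    except AttributeError as e:
        print(e)
    try:
        print(p.__instancePrivate)
    except AttributeError as e:
        print(e)
    print(Puritan.__notPrivate__)
    # The fully qualified (mangled) names work from anywhere.
    print(Puritan._Puritan__classPrivate)
    print(p._Puritan__instancePrivate)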


The two attempts to use the short names outside the class fail with:

type object 'Puritan' has no attribute '__classPrivate'
'Puritan' object has no attribute '__instancePrivate'

Another way to get to "private" attributes is through the __dict__. This name-mangling technique hasn't been popular in the Python community. The most common practice is to prefix private attributes with a single leading underscore, as in _private. A single leading underscore also has one concrete effect: 'from m import *' will not import names (classes, functions, variables, etc.) that have a single leading underscore. In any case, none of these semi-formal schemes really enforces code access verification, and all of them are easy to circumvent. The question is, how important is real access verification? The answer is that it's getting more and more important for large systems.
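
To make the __dict__ route concrete, here is a minimal sketch. The Vault class and its attribute names are hypothetical and exist only for this illustration. It shows that the mangled names are ordinary dictionary keys, while the short names are simply absent outside the class body.

```python
class Vault(object):
    """Hypothetical class used only for this illustration."""
    __classSecret = 3                  # stored as _Vault__classSecret

    def __init__(self):
        self.__instanceSecret = 4      # stored as _Vault__instanceSecret


v = Vault()

# The mangled names are plain keys in the instance and class dictionaries.
instance_names = sorted(vars(v))
class_value = vars(Vault)['_Vault__classSecret']
instance_value = v.__dict__['_Vault__instanceSecret']

# Outside the class body the short name is not an attribute at all.
short_name_visible = hasattr(v, '__instanceSecret')

print(instance_names)
print(class_value)
print(instance_value)
print(short_name_visible)
```

Nothing here required special privileges, which is the point: name mangling hides names from accidents, not from anyone who looks.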

Python, as opposed to most other dynamic languages, is being used to develop enterprise-grade systems. Bugs in enterprise-grade systems are notoriously expensive (especially if they are discovered late in the development cycle). Anything that helps reduce the number of bugs is welcome. In a large team of developers there will inevitably be someone who likes shortcuts and will call a private method "just temporarily", potentially ruining the integrity of the system. Another scenario where access verification may be important is when your system exposes a Python API and loads plugins written by some third party. In this case, you are potentially exposed to both clumsy and malevolent individuals. This trend of scaling up Python to ever larger systems is also evident in the quest for optional static typing for Python by Guido van Rossum, Python's creator, and others.

Let's assume I convinced you Python needs code access verification. What can you do about it? It turns out there is plenty you can do. You can start by renaming all the private attributes in your code and the libraries you use to the double-underscore style. Then you can review your code and make sure nobody is accessing something private. When you get tired, you can write a little program that will do it for you. Finally, you can run this program periodically to scan your code for violations. This approach may turn out to be too tedious and error-prone. Also, static analysis can't tell you much if your code contains eval(), exec(), and friends, or uses various dynamic code modification tricks.
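
The "little program" mentioned above could look roughly like the following sketch, which uses the standard ast module. The function name find_private_access and its heuristic are assumptions made for illustration: it flags any attribute whose name starts with an underscore (excluding __dunder__ names) when the object is not literally called self or cls.

```python
import ast


def find_private_access(source):
    """Return sorted (line, attribute) pairs for suspicious private accesses.

    Heuristic (an assumption of this sketch): an access is suspicious when
    the attribute name starts with an underscore, is not a __dunder__ name,
    and the object is not literally named self or cls.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Attribute):
            continue
        name = node.attr
        is_private = name.startswith('_')
        is_dunder = name.startswith('__') and name.endswith('__')
        owner = node.value
        via_self = isinstance(owner, ast.Name) and owner.id in ('self', 'cls')
        if is_private and not is_dunder and not via_self:
            violations.append((node.lineno, name))
    return sorted(violations)


sample = '''
class Account(object):
    def __init__(self):
        self._balance = 0

def cheat(account):
    account._balance = 1000000
    return account.__class__
'''

found = find_private_access(sample)
print(found)
```

The sketch also shows the limitation described above: an access hidden inside a string passed to eval() never appears as an attribute node, so the scanner cannot see it.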

The solution I'll present is based on access verification checks at runtime (of class attributes). Whenever a private attribute is accessed, by some mysterious magic the caller will be checked, and if it doesn't belong to the same class an exception will be raised.
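
To give a feel for what such a runtime check involves, here is a deliberately naive sketch. It is not the solution developed in this article. The names Guarded, PrivateAccessError, and Account are hypothetical, and the test for "belongs to the same class" is a crude assumption: the calling frame must have a local variable named self that is an instance of the same class.

```python
import sys


class PrivateAccessError(Exception):
    """Hypothetical exception type for this sketch."""


class Guarded(object):
    """Hypothetical base class: names with one leading underscore are private."""

    def __getattribute__(self, name):
        if name.startswith('_') and not name.startswith('__'):
            caller = sys._getframe(1)          # frame of the code doing the access
            caller_self = caller.f_locals.get('self')
            # Crude assumption: a method of the same class has a matching 'self'.
            if not isinstance(caller_self, type(self)):
                raise PrivateAccessError(name)
        return object.__getattribute__(self, name)


class Account(Guarded):
    def __init__(self):
        self._balance = 10

    def balance(self):
        return self._balance       # allowed: the caller is a method of Account


def outsider(account):
    return account._balance        # rejected: no suitable 'self' in this frame


acct = Account()
inside_value = acct.balance()
try:
    outsider(acct)
    blocked = False
except PrivateAccessError:
    blocked = True

print(inside_value)
print(blocked)
```

A check like this is trivially fooled by any function that names a local variable self, and it only guards reads, not writes, so treat it as a sketch of the idea and nothing more.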

Functions, Code Objects and Frames
Before you can immerse yourself in peeking and poking at the call stack, let's dust off some basic concepts. When you write a function 'foo' you type the arguments and the code that operates on these arguments (and possibly on the environment), and you decide whether or not foo() returns a result. When the module that contains your 'foo' is loaded (or compiled to .pyc), Python takes your function and compiles it to a code object that contains a bunch of metadata as well as bytecode that can be executed by the Python virtual machine. In addition, Python creates a function object that contains a different set of metadata and a reference to the code object, and finally puts it in the global dictionary of the module. At runtime, when function 'foo' executes, a frame object is created and put at the top of the call stack. This frame object has yet another set of metadata and a reference to the same code object referenced by 'foo'.
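
The relationship between the three kinds of objects can be checked directly. The sketch below uses the __code__ and __defaults__ spellings; the Python 2 of this article also exposes them under the func_ prefix (func_code, func_defaults). The function foo and its parameters are just placeholders.

```python
import sys


def foo(x, y=2):
    """Return the frame object created for this particular call."""
    return sys._getframe()


# Function object: its own metadata plus a reference to the code object.
code = foo.__code__              # spelled foo.func_code in Python 2
defaults = foo.__defaults__      # spelled foo.func_defaults in Python 2

# Code object: compile-time metadata and the bytecode itself.
summary = (code.co_name, code.co_argcount, code.co_varnames)
has_bytecode = len(code.co_code) > 0

# Frame objects: one per call, each pointing back at the same code object.
first = foo(1)
second = foo(1)
same_code = first.f_code is code and second.f_code is code
distinct_frames = first is not second

print(summary)
print(defaults)
print(same_code)
print(distinct_frames)
```

One function object and one code object exist for the lifetime of the module, while every call gets a fresh frame that shares that code object.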

It turns out that Python can provide a lot of information about the entire call stack and particularly the direct caller.
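
As a small taste of that information, the following sketch (with hypothetical function names inner, outer, and stack_names) asks sys._getframe() for the direct caller and then walks the f_back links to list the whole stack.

```python
import sys


def stack_names():
    """Walk f_back links from the current frame outward, collecting names."""
    names = []
    frame = sys._getframe()
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


def inner():
    caller = sys._getframe(1)          # frame of whoever called inner()
    return caller.f_code.co_name, stack_names()


def outer():
    return inner()


direct_caller, names = outer()
print(direct_caller)
print(names[:3])
```

The innermost frame comes first in the list; what lies beyond outer depends on how the script itself was started.
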
Listing 1 is a miniature tour de force of this confusing compile-time/run-time code management. The 'dumpObject' function is a helper function that accepts an object and a regular expression filter. It traverses the object's attributes and prints the name and value (by eval()uating it) of each attribute that matches the filter. This is convenient for exploring the relevant attributes of function, code and frame objects since their attributes have a distinctive prefix (func_, co_, and f_). The 'a' function gets the current frame object using sys._getframe() and then it calls dumpObject three times