Python Boosts JAR File Auditor Functionality

nderstanding the Java classpath and Java’s classloading mechanism are essential for any proficient Java developer. In a previous DevX article (Put an End to Jar File and Class Name Conflicts), I discussed how duplicate jar files and classes can cause hard-to-detect naming conflicts, which produce errors that are difficult to debug. A simple jar file audit utility I wrote in Java can make identifying these problematic duplications much easier.

However, this Java auditor was my first pass at the tool. I’ve since discovered that Python enables a port of the utility that is cleaner in its implementation and more complete in its auditing capabilities than my original. In this article, I present the Python port (which I hereafter refer to as the auditor), discuss its advantages, and highlight some of the great Python features that allow you to produce robust functionality with minimal code.

Author’s Note: To follow this article, you need a basic understanding of Python syntax and data structures. A basic knowledge of object-oriented design also is a prerequisite for understanding the code, particularly the use of Python XML binding. The audit returns results as an XML document and uses an XSL stylesheet for formatting. So some knowledge of XML and XSLT technology also would be advantageous if you wish to modify or enhance the stylesheet that’s included with the source code for the utility.

Python’s Advantages Over Java
The original Java version of the auditor walks a specified directory tree and audits it for duplicate jar files and duplicate class files within them. The problem with its algorithm is that the design relies on the existence of duplicate jar files in order to identify duplicate class files. For example, if class A were in JarFile1 and another instance of class A was in JarFile2, the Java auditor wouldn’t catch the duplication because the jar file names are different.

This gap in the program logic was more a sin of commission than omission. I wanted the auditor to be a simple, lightweight utility that didn’t require an embedded object data store or hooks to an RDBMS. So I decided to consciously scale down the utility, but I struggled to find an intuitive method to track the progress of the audit. I wanted to identify jar-file and class duplications and provide the necessary location data in the audit report by means of a multi-keyed data structure.

Unfortunately, no such structure exists in Java. I could’ve used some combination of objects, hash tables, and linked lists, but this seemed to be overkill for such a simple concept. So I opted for a less-than-optimal first attempt and hoped that the insight I gained from coding it would lead me to the kind of solution I desired.

While I was struggling with the audit-tracking problem, I started learning Python and discovered Python’s dictionary data type. A dictionary is a key-value pair somewhat akin to a hashtable. The crucial difference between the two structures is that a dictionary supports the notion of multi-keyed data whereas a hashtable uses a single object as a key. Some workarounds could create a unique object for the needed keys in Java, but Python’s dictionary makes the storage and access of the audit data far simpler and more intuitive.

The Python Auditor
The auditor’s base programmatic operation is to walk a directory tree and scan for jar files. I didn’t have it look for plain old class files because they’re easy to find with a brute-force search on Windows or Unix. The command line arguments specify the directory tree and the file type that is being audited. I typically look only for jar files because including zip files casts too big a net on a Windows file system. So the command line would look like this so far:

python C:Auditorauditor.py c:AuditTestDir *.jar

The last piece of the command line is a redirection symbol to tell the auditor where it should print the XML results:

	python C:Auditorauditor.py c:AuditTestDir *.jar  >
C:AuditorResultsmyauditresults.txt

The first thing the main( ) method of the script does is invoke the listFiles( ) method passing the root directory and the file extension pattern. The listFiles( ) method (much of which is directly borrowed from page 144 of O’Reilly’s Python Cookbook) creates an instance of the Bunch class called args that collects all of the command line arguments into a list of name value pairs. The Bunch class’s constructor does this using a slick Pythonic construct shown in the following snippet:

class Bunch:	def __init__(self, **kwds): self.__dict__.update(kwds)	arg = Bunch(recurse=recurse, pattern_list=pattern_list,
return_folders=return_folders, results=[])

The **kwds argument collects all of the arguments into name/value pair maps. So you’ll see, for example, arg.pattern_list in the code as an attribute of the Bunch class object referenced by the variable arg.

To walk the directory, I invoked the command os.path.walk and used the Python standard os library. This is an amazing piece of built-in functionality, and I’m somewhat surprised that no comparable method exists in the vast collection of standard Java libraries. The Pythonic walk( ) does much of the heavy lifting for me. The Python documentation succinctly describes the operation of this method:

[os.path.walk() ] Calls the function visit with arguments (arg, dirname, names) for each directory in the directory tree rooted at path (including path itself, if it is a directory). The argument dirname specifies the visited directory, the argument names lists the files in the directory (gotten from os.listdir (dirname)).

The following is the auditor implementation of the visit( ) method:

def visit(arg, dirname, files):	"""Called by walk to visit each file and assess its suitability
for inclusion in the audit""" #Append to arg.results all relevant files (and perhaps folders) for name in files: fullname = os.path.normpath(os.path.join(dirname, name)) if arg.return_folders or os.path.isfile(fullname): for pattern in arg.pattern_list:
#*.jar command line argument if fnmatch.fnmatch(name, pattern): arg.results.append(fullname) addToAudit(fullname)
#added for audit break # if recursion is blocked then set result list if not arg.recurse: files[:]=[]

Methods are first-class objects in Python, so they can be passed as arguments just like any other object or primitive value in other languages.

The critical auditing functionality I added was the invocation of the addToAudit() method. This method accepts the full name of the jar file, which it uses to perform auditing of that jar file based on its name. This is where the flexibility of Python’s dictionary data type comes into play.

Python’s Dictionary Data Type
To capture and interrelate all of the data elements that I determined, the four dictionaries shown in Table 1 were required.

Table 1. Auditor Dictionaries Used to Track Audit Progress and Report Results
VariableName: Key: Values: Purpose:
master_jar_list Jar file name Tuple(Full jar file name, time modified) Master inventory of all jar file names audited
dup_jar_list Tuple(Full jar file path name, Jar file name) master_jar_list[jarfile] – lookup to the master_jar_list to retrieve this value. Listing of all duplicated jar file names
master_class_name_list Class name Tuple(Full filename, package_name, date modified) Master inventory of all class names audited
dup_class_name_list Tuple(Class name, package name) Tuple(full jar name, date class modified, [List of all jar file locations that contain this class] Listing of all duplicate class names, whether or not the packaging matches

These four data structures suffice to keep a running inventory of the audit process. In the addToAudit( ) method, the following code performs the necessary checks to add the jar file to either the master jar file inventory or the list of duplicate jars:

if(master_jar_list.has_key(jarfile)):#duplicate jar dictionary uses the full file name as the key#the value is a tuple of the   *****ADD time modified  *********dup_jar_list[(full_jarname, jarfile)] = master_jar_list[jarfile]else:#master list can use just the X.jar name as the key whereas 
the duplicate list must distinguish between#potentially multiple copies of the same file. master_jar_list[jarfile] = (full_jarname, time_modified)#check class files archived in the .jar filereadZip(full_jarname, jarfile)

Once the jar file name is evaluated, the last line passes the jar file along to the readZip( ) method, which interrogates the contents of the jar and uses the class-name-specific dictionaries for appropriate auditing:

for aFile in z.filelist:	            #Master class name list is keyed by class name only and Dup's show 
#packaging#this way classes of the same name but with different packaging will be identified
#as duplicates. The file_locations list is an aggregate listing of all the
#places that a class of this exact name and packaging occur. If just the class
#name is the same but the packaging is different then the class will be listed as
#a separate duplicate entry.file_locations = []if(master_class_list.has_key(class_name)): full_jarname, mod_date = master_jar_list[jar_name_only] if(dup_class_list.has_key(lookup_key)): file_locations = dup_class_list[lookup_key][2] file_locations.append(full_filename)dup_class_list[lookup_key] = (full_jarname, formatted_modified_date,
file_locations) else: #the master_class_list call below returns a tuple of
which the first element is neededfile_locations.append(master_class_list[class_name][0])dup_class_list[lookup_key] = (full_jarname, formatted_modified_date,
file_locations) else: #add to the master class list for the first occurrence master_class_list[class_name]=(full_filename, package_name,
formatted_modified_date)

After assembling all the necessary audit results into four dictionaries, the buildXML( ) method loads the dictionaries into an instance of the DictionaryHolder class. This simple manipulation allows an XML document to be generated from the DictionaryHolder object by means of the Gnosis Python XML binding libraries. The following line of code is all that is required to create a valid XML document from the contents of the dictionaries:

#variable o is the reference to the DictionaryHolder objectxml_string = xml_pickle.XML_Pickler(o).dumps(deepcopy=1)

XSL Stylesheets Dress Up the Output
As I mentioned previously, using Python’s print statement in conjunction with simple command-line redirection makes it easy to send the output XML document to wherever you’d like. Unfortunately, the following generic XML output is rather tough to read because the tags that the XML binding library implemented aren’t specifically named to correspond to the individual data elements in the dictionaries:

                         

Fear not, this can be overcome. An XSL stylesheet can format the raw output into tables. The tables (see Figure 1) may not be works of art, but they are easier to read than the raw XML. I’ll leave it to you to enhance the look and feel as you see fit.

Figure 1: XSL Stylesheet Formats Raw Output into Tables

Powerful Apps, Minimal Code
Python’s powerful libraries and minimalist syntax allow you to write readable, high- octane applications in fewer lines of code than with Java. In this case, the Python port of the auditor takes full advantage of the following features to deliver more robust functionality in far fewer lines of code than the Java auditor did:

  • Python’s flexible data structures make it easy to create complex data relationships.
  • Python is an object-oriented language, so you can make use of cutting edge XML binding technology to reduce the coding time and effort it takes to produce XML data tailored to meet your needs.
  • A wealth of open-source code libraries are available to help bolster Python’s out-of-the-box offerings.
  • Share the Post:
    Share on facebook
    Share on twitter
    Share on linkedin

    Overview

    Recent Articles: