Build an XML-Based Scheduling Utility

Complex applications such as data warehousing need a strong operational support infrastructure to manage the daily tasks that keep these systems running. Typically, such applications rely on multiple programs that must execute in sequence and on a specific schedule. Scheduling utilities manage that process and are an important part of the infrastructure. A basic scheduling utility runs a task, checks the return code and then (depending on the return code) either runs the next task or exits. Ideally, a scheduling utility also captures operational metadata, such as execution time, CPU and I/O usage, task output and error codes, making continual improvement and process control possible.

You can use almost any data structure, even ASCII flat files, to capture the process flow metadata required for such an application, but XML is a better choice, because it captures hierarchical information naturally.

In this article, you'll see how to use XML to structure the process flow information of a sample data warehouse application. The application has several jobs that must run in a certain order. A simple Apache Xerces Java Parser implementation of an XML-based scheduling utility uses an XML file to control the job flow. The article also discusses some possible ways to enhance the scheduling utility using XML tools and techniques.

Define a Sample Data Warehousing Application
Imagine a data warehousing application with the following tasks, called units, which must run in a specified order. Table 1 shows the task descriptions and names.

Table 1. Unit descriptions and names.

   Unit Description                                          Unit Name
   Initialize the system                                     INITIALIZE_SYSTEM
   Archive data from the previous run                        ARCHIVE_DATA
   Run analytics on the data (call made to a core engine)    RUN_ANALYTICS
   Process the information obtained from the previous
      Run Analytics job and store it in a database           PROCESS_ANALYTICS
   Free the system                                           FREE_SYSTEM

Each unit contains subunits, which ensure the success of the parent unit. For example, the PROCESS_ANALYTICS unit consists of programs that perform the following activities, each of which is a subunit.

Table 2. Subunit descriptions and names.

   Subunit Description                              Subunit Name
   Housekeeping                                     Set_Ctrl
   Drop Type 1 table indexes                        Drop_Index1
   Drop Type 2 table indexes                        Drop_Index2
   Drop Type 3 table indexes                        Drop_Index3
   Load tables with data from ASCII files           Load_Tables
   Aggregation program                              Aggr_Prgm
   Build indexes                                    Build_Index

Further, the subunits may have complex run dependencies. For example, the Load_Tables job should run only if the subunits Drop_Index1 and Drop_Index2 succeed. The aggregation program (Aggr_Prgm) should run only if the Load_Tables and Drop_Index3 jobs succeed.

The above example illustrates the typical hierarchical nature of process flow metadata. XML is a good candidate for storing hierarchical information of this type and can be used to model the process flow visually.

Implement the XML-Based Scheduling Utility
You can model the process flow information shown in Tables 1 and 2 as an XML file conforming to the DTD contained in the job_stream.dtd file (see Listing 1).

The job_stream.xml file (see Listing 2) contains the XML representation of the process flow information for the sample data warehouse application. The run_condition elements contain the subunit names (job_names) that must complete successfully for the job to run. For example, the Load_Tables job_box element's run_condition specifies (success(Drop_Index1) AND success(Drop_Index2)) or simply ((Drop_Index1) AND (Drop_Index2)). If a job has no dependencies on a previous job, specify none in the run_condition. Both AND and none are keywords and cannot be used as part of the name of a job_box.
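Because Listing 2 is not reproduced here, the fragment below sketches what one unit might look like. It is hypothetical: the exact nesting, the job_name attribute, and the command element are assumptions made for illustration; only the run_condition element, the none/AND keywords, and the job names come from the article, and the authoritative structure is whatever job_stream.dtd defines.

```xml
<!-- Hypothetical sketch of a unit in job_stream.xml; the real layout is
     defined by job_stream.dtd (Listing 1). The job_name attribute and the
     command element are assumed names, shown only for illustration. -->
<job_sum_box job_name="PROCESS_ANALYTICS">
  <run_condition>(RUN_ANALYTICS)</run_condition>
  <job_box job_name="Set_Ctrl">
    <run_condition>none</run_condition>
    <command>set_ctrl.exe</command>
  </job_box>
  <job_box job_name="Load_Tables">
    <run_condition>((Drop_Index1) AND (Drop_Index2))</run_condition>
    <command>load_tables.exe</command>
  </job_box>
</job_sum_box>
```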

The DOMScheduler.java file contains the source code for a simple XML-based scheduling utility implemented using the Apache Xerces Java Parser 1.4.3 release. It accepts an XML file as input, which must validate against job_stream.dtd. You can use the current version of the scheduling utility to run a single job stream of programs. The programs themselves can be compiled executables of any kind: Java applications, C/C++ executables, DOS batch programs, and so on. The scheduler calls the Runtime.exec() method (which in turn calls system-level APIs such as the Win32 CreateProcess() function), which may not work well for special processes on certain native platforms, such as Win16/DOS processes on Win32, or for shell scripts. The user must populate the XML file with all the unit and subunit information.

XML Scheduling Utility Program Outline
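To make the run-a-task, check-the-return-code cycle concrete, the sketch below launches one command and reports its exit code. It is a minimal sketch, not the article's DOMScheduler: it uses ProcessBuilder rather than Runtime.exec() directly (both end up in the same system-level process APIs), and the sh commands in main assume a POSIX shell is available.

```java
import java.util.List;

// Minimal sketch of the scheduler's core step: run a task, capture its
// return code, and let the caller decide whether to continue or exit.
public class TaskRunner {

    // Launches the command and blocks until it finishes.
    // Returns the process exit code, or -1 if the task could not start.
    public static int runTask(List<String> command) {
        try {
            Process p = new ProcessBuilder(command)
                    .inheritIO()          // task output goes to our console
                    .start();
            return p.waitFor();
        } catch (Exception e) {
            return -1;                    // the process could not be launched
        }
    }

    public static void main(String[] args) {
        // Assumes a POSIX shell; on Windows you would run cmd.exe /c instead.
        int rc = runTask(List.of("sh", "-c", "exit 0"));
        System.out.println(rc == 0 ? "task succeeded, run next task"
                                   : "task failed, exit job stream");
    }
}
```

A real scheduler would also capture standard output and error streams (the std_out_file and std_err_file elements) instead of inheriting the console.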
The program schedules jobs in two steps. First, it sorts the units (job_sum_boxes) and stores them in the appropriate order in a string array. The sort places dependencies in their proper order by searching the run_condition of each unit for the name(s) of the previously run unit(s) and then placing them in the job stream.

Next, starting with the first unit, the program sorts the subunits (job_boxes) corresponding to each unit and executes the command associated with each subunit. Again, the sort searches the run_condition of each subunit for the name(s) of the previously run subunit(s).
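Both sorting passes amount to a simple dependency ordering: repeatedly admit any job whose run_condition names only jobs already placed. The sketch below illustrates that idea with an assumed name-to-run_condition map and a deliberately crude tokenizer for the (A) AND (B) syntax; it is not the article's string-array implementation.

```java
import java.util.*;

// Sketch of the ordering idea behind the two sorting passes: repeatedly
// admit any job whose run_condition names only already-ordered jobs.
// Assumes jobs arrive as name -> run_condition pairs; "none" marks a job
// with no dependencies. The tokenizer simply strips parentheses and the
// AND keyword, so job names must not contain "AND" (the article reserves
// AND and none as keywords anyway).
public class RunOrder {

    public static List<String> order(Map<String, String> jobs) {
        List<String> ordered = new ArrayList<>();
        for (Map.Entry<String, String> e : jobs.entrySet())
            if (e.getValue().equals("none"))
                ordered.add(e.getKey());            // roots of the stream
        boolean changed = true;
        while (changed) {                           // fixed-point iteration
            changed = false;
            for (Map.Entry<String, String> e : jobs.entrySet()) {
                if (ordered.contains(e.getKey())) continue;
                String tokens = e.getValue()
                        .replace("AND", " ").replace("(", " ").replace(")", " ");
                boolean ready = true;
                for (String dep : tokens.trim().split("\\s+"))
                    if (!dep.isEmpty() && !ordered.contains(dep))
                        ready = false;              // dependency not yet placed
                if (ready) { ordered.add(e.getKey()); changed = true; }
            }
        }
        return ordered;                             // jobs in runnable order
    }
}
```

For the sample stream, Set_Ctrl (run_condition none) comes out first, followed by the index drops, and Load_Tables only after both Drop_Index1 and Drop_Index2 are placed.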

The code below shows the overall structure of the DOMScheduler class:

   public class DOMScheduler {

      // Declare class variables
      /* Code Snipped */
      ...

      // Constructor
      public DOMScheduler(String xmlFile) {
         DOMParser parser = new DOMParser();
         try {
            parser.parse(xmlFile);
            Document document = parser.getDocument();

            // Initialize the class variables
            /* Code Snipped */
            ...
            for (int loopCounter = 0; loopCounter < ...) {
               ...
               outerTraverse(document);
            }
            /* Code Snipped */
            ...
            for (int outerLoop = 0; ...) {
               /* Code Snipped */
               ...
               traverse(document);
               traverseTwo(document);
            }
         } catch (Exception e) {
            ...
         }
      }

      public void outerTraverse(Node node) {}
      public void traverse(Node node) {}
      public void traverseTwo(Node node) {}
      public void innerTraverse(Node node, ...) {}

      public static void main(String[] args) {
         ...
         DOMScheduler domScheduler =
            new DOMScheduler(args[0]);
      }
   }

The DOMScheduler constructor calls the class methods outerTraverse(), traverse() and traverseTwo(). The outerTraverse() method sorts the units. After the units have been sorted, the traverse() and traverseTwo() methods sort the subunits in each of the parent units. The traverseTwo() method calls the innerTraverse() method.

The string arrays jobSumBoxNameinRCondition[] and jobBoxNameinRCondition[] (declared as class variables) store the names of the unit (job_sum_box) or subunit (job_box) currently being processed. The next loop uses those names while searching the run_condition elements of the units and subunits. The tempJobSumBoxNameinRCondition[] and tempJobBoxNameinRCondition[] string arrays are temporary holders for jobSumBoxNameinRCondition[] and jobBoxNameinRCondition[] within the outerTraverse() and traverseTwo() methods.

All the traverse class methods share the same recursive traversal logic, an outline of which is shown below:

   public void traverse(Node node) {
      int type = node.getNodeType();
      if (type == Node.ELEMENT_NODE) {
         // Additional logic
         ...
         Attr attrs[] =
            sortAttributes(node.getAttributes());
         // Process the attributes using attrs[]
      }
      NodeList children = node.getChildNodes();
      if (children != null) {
         for (int i = 0; i < children.getLength(); i++) {
            traverse(children.item(i));
         }
      }
   }

The comment //Additional logic in the preceding code represents different code for each traverse method. Here's an explanation of the underlying logic for each type.
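The traversal skeleton runs unchanged against any implementation of the org.w3c.dom interfaces. The sketch below is a runnable version that uses the JDK's bundled parser (javax.xml.parsers) rather than a standalone Xerces 1.4.3 jar, and simply collects element names to show the recursion at work.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Runnable version of the traversal skeleton, using the JDK's built-in
// DOM parser (same org.w3c.dom API the article's Xerces code works with).
public class TraverseDemo {

    // Depth-first walk: record each element's name, then recurse on children.
    static void traverse(Node node, List<String> visited) {
        if (node.getNodeType() == Node.ELEMENT_NODE)
            visited.add(node.getNodeName());   // "additional logic" goes here
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++)
            traverse(children.item(i), visited);
    }

    // Parses an XML string and returns its element names in document order.
    public static List<String> elementNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
        List<String> names = new ArrayList<>();
        traverse(doc, names);
        return names;
    }
}
```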

The outerTraverse() "Additional logic" Code Outline

   if (node.getNodeName().equals("job_sum_box")) {
      NodeList nodeList = node.getChildNodes();
      /* Code Snipped */
      ...
      if (nodeList.item(i).getNodeName()
            .equals("run_condition")) {
         searchString = new String(
            nodeList.item(i).getChildNodes()
               .item(0).getNodeValue().trim());
         /* Code Snipped */
         ...
         if (searchString.lastIndexOf(
               jobSumBoxNameinRCondition[k]) != -1) {
            /* Code Snipped */
            ...
         }
      }
   }

The preceding code identifies the units and checks the text value (the searchString variable in the preceding code fragment) of the run_condition of each unit. The code identifies the unit with the run_condition none as the first unit to be run. The code makes additional checks on the 'process_code' (not shown above) of each unit to verify whether the unit has already been processed.

The code then searches the run_condition of all units for the name of this unit (stored in jobSumBoxNameinRCondition[k]) and processes all matching units in the subsequent loop. The tempJobSumBoxNameinRCondition[] string array stores the unit names for the search in the next loop.

The traverseTwo() "Additional logic" Code Outline

   if (node.getNodeName().equals("job_box")) {
      NodeList nodeList = node.getChildNodes();
      if (nodeList.item(i).getNodeName()
            .equals("run_condition")) {
         searchString = new String(
            nodeList.item(i).getChildNodes()
               .item(0).getNodeValue().trim());
         ...
         if (searchString.lastIndexOf(
               jobBoxNameinRCondition[k]) != -1) {
            if (searchString.lastIndexOf("AND") != -1) {
               // Logic to handle the AND condition,
               // involving a call to the
               // innerTraverse() method
            } else {
               ...
            }
         }
      }
   }

A quick glance at the preceding structure reveals the similarity between the additional logic for outerTraverse() and traverseTwo(). The two major differences are that traverseTwo() searches for the subunits (job_boxes) rather than the units, and that it contains additional logic, involving a call to the innerTraverse() method, to handle multiple AND conditions in the run_condition.

The traverse() method is a special case of the traverseTwo() method used for the processing of the first subunit in each unit (identified by the none value in its run_condition).

The simple innerTraverse() method verifies that all the subunits contained in the jobNumberStr[] array have already been processed.

Extending the DOMScheduler Program
The DOMScheduler program hardcodes the upper bounds of the loop counters loopCounter (30) and k (10). These values are set arbitrarily to account for a job stream of considerable complexity. You can increase them as necessary to work with more complex job streams.

You can avoid the hand-written DOM tree traversal in the sample code by using a higher-level API such as JDOM, or by using the DOM Traversal module, which is part of the DOM Level 2 specification (see Resources).

You could enhance the scheduling utility by including support for additional keywords such as OR and for complex run conditions involving combinations of AND and OR conditions. You could also add support for features such as Job Restart and Job Abort.

While the scheduling utility sample code does not use the success_code element, you should add a success_code check. You can store the output of each executable run in the files specified by the std_out_file and std_err_file elements. The sample project supports only a single job stream, but you can modify the project to support multiple job streams.

You must populate the XML file as a one-time setup operation. Standard data warehouse applications usually store process flow information in a relational database. It's possible to create an XML file from database tables directly using standard object-relational mapping techniques and tools, thereby automating the process of populating the XML file. Refer to the resources section for a good link to XML-DBMS transformation tools and techniques.

Alternatively, you may be able to populate the XML file programmatically from an external source using data binding tools such as Java Architecture for XML Binding (JAXB). To store the XML, you can serialize the output DOM Tree and write the information to disk for further analysis.
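Serializing a DOM tree back to text needs no extra library: the JDK's javax.xml.transform identity transform will do. A minimal sketch (write the returned String to a file to persist it):

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Minimal sketch: write a DOM Document out as XML text using the JDK's
// identity Transformer. Not taken from the article's source code.
public class DomSerializer {

    public static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Omit the <?xml ...?> prolog to keep the fragment compact.
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```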

While the sample code runs as a stand-alone executable, you could implement it as a Windows service. This would let you include additional scheduling information in the XML file, for example, "run this stream every 24 hours."
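Run as a resident service, the utility could drive itself from a ScheduledExecutorService, with the repeat interval read from the XML file. The sketch below assumes such a hypothetical interval setting (the current DTD defines no scheduling element) and hardcodes 24 hours in its demo main.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: re-run the whole job stream on a fixed interval, as a resident
// service might. In practice the interval would come from the XML file.
public class StreamService {

    // Runs jobStream immediately, then repeats it every `period` units.
    public static ScheduledExecutorService start(Runnable jobStream,
                                                 long period, TimeUnit unit) {
        ScheduledExecutorService ses =
                Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(jobStream, 0, period, unit);
        return ses;                     // caller shuts it down on exit
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses =
                start(() -> System.out.println("running job stream"),
                      24, TimeUnit.HOURS);
        Thread.sleep(100);              // demo only; a real service blocks
        ses.shutdownNow();
    }
}
```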

You can easily integrate an XML based scheduling utility such as the one described in this article into any operational infrastructure that you may have. Such utilities can be a firm step toward simplifying your daily operations.
