The DataRush Libraries and Tool Support
DataRush (the libraries, tools, APIs, and the engine itself) is available as a set of JAR files. In fact, this is how DataRush supports application embedding. It also allows you to easily package and deploy DataRush applications to servers in your production environment. Supported environments include Windows, Solaris (x86 and SPARC), Linux, HP-UX, and IBM AIX. (Author Note: Although it's not officially supported, DataRush installed and ran perfectly well on my Intel-based Mac as well.)
DataRush includes a library of precompiled operators that you can use when creating your own operators and assemblies. Operators exist to perform data reads and writes on flat files, XML, and relational databases, along with generic logic processing. These operators serve as building blocks for you to reuse, and they reduce the need to implement these common tasks.
In terms of application development, DataRush comes with an Eclipse plug-in that works with the Eclipse IDE and the Eclipse Graphical Editing Framework (GEF). The result is support for DataRush-specific projects, with visual modeling and editing of parallel processing tasks (see Figure 9).
You can build, run, and test outside of Eclipse as well, as DataRush integrates with command-line tools such as Ant and unit-test frameworks such as JUnit. DataRush-specific Ant tasks to build and test custom DataRush applications are included in a JAR file. This allows you to automate all aspects of your application builds, including the execution of DataRush test suites.
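As a sketch of what such build automation might look like, the following minimal Ant script compiles the sample and then invokes the DataRush Assembler. It deliberately shells out to the dfa command line via Ant's standard exec task rather than guessing at the names of the DataRush-specific tasks in the bundled JAR; the paths mirror the sample layout and are illustrative assumptions:

```xml
<project name="newfields" default="assemble">
  <!-- Path to the DataRush engine API JAR; adjust for your install. -->
  <property name="dfre.api" value="../dfre/lib/dfreapi.jar"/>

  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes" classpath="${dfre.api}"/>
  </target>

  <!-- Shelling out to dfa; the DataRush-specific Ant tasks could
       replace this exec once declared with a taskdef. -->
  <target name="assemble" depends="compile">
    <exec executable="dfa" failonerror="true">
      <arg line="-d build/classes -sp src src/example/newfields/NewFieldsTextFile.df.xml"/>
    </exec>
  </target>
</project>
```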
A Sample DataRush Application
DataRush comes with multiple sample applications to get you started. This section discusses the "New Fields" sample application, as it contains both XML and Java operators. The application loads data from an input file (specified as a property) and reads three fields that make up a simulated sale: a date, a dollar amount, and a product ID. This portion of the processing is completely specified in the DFXML assemblies. The Java portion is an operator that computes the day of the week based upon a date as input. The output of this operator is combined with the sales data, along with a new record identifier generated at processing time, and then written to an output file.
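To make the Java portion concrete, here is the core day-of-week computation in isolation. This is a sketch of the calculation only: the class and method names are my own, and the DataRush operator wrapper (the port declarations and type bindings that hook it into the dataflow) is omitted.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Illustrative stand-in for the logic inside the sample's dayOfWeek
// operator: take a date in, produce an integer day of the week out.
public class DayOfWeekExample {

    // Returns 1 (Monday) through 7 (Sunday), following ISO-8601.
    static int dayOfWeek(String isoDate) {
        return LocalDate.parse(isoDate, DateTimeFormatter.ISO_LOCAL_DATE)
                        .getDayOfWeek()
                        .getValue();
    }

    public static void main(String[] args) {
        // 2024-01-01 fell on a Monday.
        System.out.println(dayOfWeek("2024-01-01"));
    }
}
```

In the real application, this computation runs once per input record as data streams through the operator's ports, rather than being called directly.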
Listing 3 contains the complete assembly definition DFXML file for this application. At the top of the file, properties are defined that control overall processing. Some of the important properties are:
- inputFileName: the name and path to read the incomplete sales data
- outputFileName: the name and path to write the completed sales data (with fields added)
- startRowID: a starting identifier for new rows written to the output file
- fieldSeparator: the delimiter character or string used as a separator in the input file
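Pulled together, a properties file along these lines might look as follows. The input path, starting row ID, and separator values here are illustrative placeholders, not the values shipped with the sample:

```
inputFileName=data/sales-input.txt
outputFileName=data/NewFieldsSampleOutput.txt
startRowID=1000
fieldSeparator=,
```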
The next section in the assembly specification describes the individual assembly operators and one process for the Java object. Some of these are:
- read (operator): uses the ReadDelimitedText operator in the DataRush operator library to read the input file
- genRowID (operator): uses the GeneratedArithmeticSequence operator to generate unique output row identifiers
- dayOfWeek (process and operator): defines a custom operator, implemented as a Java class, that takes a DATE as input and produces an integer as output; its input is linked to the read operator's output
The remaining sections of the specification link operator output and input ports, thereby defining a complete application dataflow.
Build the Sample Application
Building the application is a two-step process, but you can combine both steps via an Ant script or an Eclipse project. First, from within the DataRush samples directory, compile the Java code with the following command:
> javac -d build/classes -classpath ../dfre/lib/dfreapi.jar src/example/newfields/*.java
Next, run the DataRush Assembler on the assembly specification:
> dfa -d build/classes -sp src src/example/newfields/NewFieldsTextFile.df.xml
Although the specification is split across two .df.xml files, both are assembled into binaries because one references the other.
Run the Sample Application
After a successful application build, you can run the application with the DataRush Engine. You must include a properties file that specifies the path to the input and output files (included with the sample). This is done with the following command:
> dfe -cp build/classes -pf newfields.properties
When executed, the sample application will write its data to the output file, NewFieldsSampleOutput.txt (see Listing 4).
To experience firsthand the benefits of using DataRush and the pipeline parallelism it employs, I ran this sample on two machines: a dual-core 1.83 GHz Core Duo system and a single-core (but faster-clocked) 3 GHz Pentium 4 system. The application completed in 2.011 seconds on the dual-core machine versus 5.1 seconds on the Pentium 4, roughly a 2.5x speedup. I was impressed that even a simple application gained this much throughput on a multi-core machine.
Parallel Computing Comes to the Development Process
Just as multi-core computing has brought affordable parallel-processing computers to the fore, Pervasive DataRush has brought parallel computing to the Java community in a comprehensive, easy-to-use package. DataRush solves the complex problems associated with developing applications for multi-CPU, symmetric multiprocessing (SMP) systems, and it comes in a reusable form that you can leverage across your applications without rewriting code for each one.