Operators are components (written in Java or XML) that implement business logic and process data. Assemblies, which define and link operators, are written in Dataflow XML (DFXML); a DFXML file is also called an assembly specification. The combination of assembly definitions, properties, constraints, and operators linked through ports forms the basis of a DataRush parallel-processing application. The DFXML includes sections with XML tags such as <Assembly>, <Operator>, <Process>, <Property>, and others (see Listing 1).
To create a complete DataRush application, you compile the appropriate operator Java files and assemble the DFXML files with the DataRush Assembler (dfa). The application is launched with the DataRush Engine (dfe), which takes the dataflow binaries and Java class files and begins the data processing execution. Figure 8 summarizes the entire development, build, and execution process.
Figure 8. The Development Cycle for a DataRush Application
An assembly specification (a df.xml file) becomes an assembly when it's compiled with the DataRush Assembler tool, dfa. The resulting artifact is a binary .df file (analogous to a Java .class file) that defines an executable assembly and its operators. You can set the properties defined in the assembly specification either with default values at compile time or through a properties file (a text file with name-value pairs) at runtime. You specify the properties file to use with the optional dfe command-line argument, -pf <filename>.
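As a rough illustration of the name-value format, the sketch below parses a small properties text with the standard java.util.Properties class. It assumes the DataRush properties file follows ordinary Java .properties syntax, which the text above does not confirm, and the property names inputFile and threshold are hypothetical; real names come from the <Property> elements of your own assembly specification.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesFileSketch {
    // Hypothetical property names, for illustration only.
    // ASSUMPTION: the file uses standard Java .properties syntax.
    static final String SAMPLE =
        "inputFile=/data/input.csv\n" +
        "threshold=100\n";

    /** Parses name-value pairs from the given text. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        // Print each pair so the values can be checked before a run.
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + " = " + props.getProperty(name));
        }
    }
}
```

A check like this can be handy for validating a properties file before handing it to dfe with -pf.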
Just as properties allow you to adjust the runtime characteristics of an operator, DataRush customizers allow you to modify and refine an assembly, even after the dfa tool has compiled its specification. For instance, you can alter an application's behavior based on its environment, such as the number of fields or columns in the input data (vertical partitioning), dynamic Java typing based on port data type, and stricter typing based on output port type at runtime. The last item is similar to Java Generics, where an operator port type is defined to be generic, but at runtime a more specific type is declared and enforced.
A customizer class implements the com.pervasive.dataflow.dev.DataflowCustomizer interface, which contains a single method named customize. Since customizers are specific to the assembly they customize, they tend not to be reusable. The sample customizer class in Listing 2, SelectRowsCompositionCustomizer, is named after the assembly it customizes.
In this case, the input port specification is changed from "generic" to "scalar" or "composite" based upon the type of the output port to which it is linked. The DataRush Assembler calls the customize method to modify the assembly specification before it's compiled.
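The port-narrowing idea can be sketched independently of the real API. The code below does not use com.pervasive.dataflow.dev.DataflowCustomizer; the Port, AssemblySpec, and Customizer types are hypothetical stand-ins invented for this illustration, and the actual signature of customize may differ. It only shows the logic described above: an input port declared as generic takes on the scalar or composite kind of the output port linked to it.

```java
import java.util.ArrayList;
import java.util.List;

final class CustomizerSketch {
    enum PortKind { GENERIC, SCALAR, COMPOSITE }

    /** Hypothetical stand-in for a port in an assembly specification. */
    static final class Port {
        final String name;
        PortKind kind;
        Port linkedOutput; // upstream output port feeding this input, or null

        Port(String name, PortKind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    /** Hypothetical stand-in for an assembly specification: only its input ports. */
    static final class AssemblySpec {
        final List<Port> inputs = new ArrayList<>();
    }

    /** Stand-in for a single-method customizer contract (NOT the real interface). */
    interface Customizer {
        void customize(AssemblySpec spec);
    }

    /** Narrows each generic input port to the kind of its linked output port. */
    static final class SelectRowsSketchCustomizer implements Customizer {
        @Override
        public void customize(AssemblySpec spec) {
            for (Port in : spec.inputs) {
                boolean canNarrow = in.kind == PortKind.GENERIC
                        && in.linkedOutput != null
                        && in.linkedOutput.kind != PortKind.GENERIC;
                if (canNarrow) {
                    in.kind = in.linkedOutput.kind;
                }
            }
        }
    }

    /** Links one generic input to an upstream output and returns the resulting kind. */
    static PortKind demo(PortKind upstream) {
        Port out = new Port("source.out", upstream);
        Port in = new Port("selectRows.in", PortKind.GENERIC);
        in.linkedOutput = out;
        AssemblySpec spec = new AssemblySpec();
        spec.inputs.add(in);
        new SelectRowsSketchCustomizer().customize(spec);
        return in.kind;
    }

    public static void main(String[] args) {
        System.out.println(demo(PortKind.COMPOSITE));
    }
}
```

In this sketch an input stays generic when its upstream port is itself generic, since there is nothing more specific to enforce.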
If you want to migrate to DataRush from existing code, rewriting the entire application to conform to the DataRush specifications may not be feasible. For this reason, DataRush offers two ways to integrate it into existing applications:
- Define and build a separate DataRush application, and then invoke the DataRush Engine command line tool from an external application.
- Embed the actual DataRush APIs into an external application, and then write Java code that calls them to invoke assemblies directly.
Although the first method is less intrusive and keeps DataRush dependencies out of your Java code, it's also less robust. The second method, by contrast, allows you to write all your code in Java, gives you better control over dataflows and control flow, and provides access to the reporting, logging, and error handling of the DataRush assembly execution.
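The first method can be sketched with the standard ProcessBuilder class. The only dfe option taken from this article is -pf <filename>; passing the assembly as a trailing positional argument is an assumption made for illustration, as are the class and method names, so check the dfe usage text for the actual syntax.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class DfeLauncherSketch {
    /** Builds the dfe command line; propertiesFile may be null to omit -pf. */
    static List<String> buildCommand(String propertiesFile, String assembly) {
        List<String> cmd = new ArrayList<>();
        cmd.add("dfe"); // assumes dfe is on the PATH
        if (propertiesFile != null) {
            cmd.add("-pf");
            cmd.add(propertiesFile);
        }
        // ASSUMPTION: the assembly to run is a trailing positional argument.
        cmd.add(assembly);
        return cmd;
    }

    /** Runs the command, forwarding its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: DfeLauncherSketch <assembly> [propertiesFile]");
            return;
        }
        String propertiesFile = args.length > 1 ? args[1] : null;
        int exitCode = run(buildCommand(propertiesFile, args[0]));
        System.out.println("dfe exited with code " + exitCode);
    }
}
```

The exit code and forwarded console output are all the calling application sees here, which illustrates why this route gives less control than embedding the APIs.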
You can also integrate DataRush applications with other enterprise systems such as messaging servers or enterprise service bus (ESB) applications. For instance, service components running in an ESB can offload the processing of large volumes of data to a DataRush application on a separate, multi-core server to achieve the parallel processing required. The DataRush application can then signal the ESB service (via a message) when complete, and provide it with the results. As a result, ESB message traffic is reduced (the data to be processed is not propagated across it), DataRush performs the parallel computation of the data for maximum efficiency, and the asynchronous nature of the ESB allows other work to continue in the system, thereby creating further parallelism.
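The offload-and-notify pattern described above can be approximated in a single process with CompletableFuture. This is a sketch of the message flow only: the executor stands in for the separate DataRush server, the Consumer callback stands in for the completion message sent to the ESB service, and processLargeDataSet is a placeholder computation. No real ESB or DataRush API is used, and all names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

final class EsbOffloadSketch {
    /** Placeholder for the heavy job a DataRush application would run. */
    static long processLargeDataSet(String dataLocation) {
        // Only the location of the data is passed in, not the data itself.
        return dataLocation.length();
    }

    /**
     * Submits the job to the worker and delivers a completion message when done.
     * The caller is free to continue with other work in the meantime.
     */
    static CompletableFuture<Void> offload(String dataLocation,
                                           ExecutorService worker,
                                           Consumer<String> completionMessage) {
        return CompletableFuture
            .supplyAsync(() -> processLargeDataSet(dataLocation), worker)
            .thenAccept(result -> completionMessage.accept("done:" + result));
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Void> job =
                offload("/data/big.csv", worker, System.out::println);
            System.out.println("service continues with other work");
            job.join(); // wait here only so the demo exits cleanly
        } finally {
            worker.shutdown();
        }
    }
}
```

Passing a data location instead of the data mirrors the point about reduced ESB traffic: the bus carries a small request and a small completion message, while the bulk data stays with the processing server.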