Introducing Pervasive DataRush
Pervasive DataRush is a Java framework with tool support that lets you build parallel applications quickly. How? Think of a dataflow in which data-processing operators (small units of business logic) are linked together through ports to form an application. Operators read data from input ports, process that data, and write the results to output ports. DataRush links the operator ports with queues so that the operators can work in parallel. Pipeline parallelism, as implemented in DataRush, is the ability to stitch independent operators together into an overall dataflow graph in which each component runs in its own thread (or set of threads). Because the graph is pipelined, it keeps multiple cores busy as data flows through the queues between operators. The pipelined design also allows complex applications to be built flexibly from simple parts. In this way, DataRush addresses the challenges listed in the previous section.
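To make the idea of pipeline parallelism concrete, here is a minimal sketch in plain Java. It does not use the DataRush API; the class name PipelineSketch, the Stage interface, and the three stages (source, square, sink) are hypothetical names chosen for illustration. Each stage runs in its own thread, and bounded queues from java.util.concurrent stand in for the queues that link operator ports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: plain java.util.concurrent, not the DataRush API.
public class PipelineSketch {

    // A stage body that may block on a queue (hypothetical helper type).
    interface Stage {
        void run() throws InterruptedException;
    }

    // Starts a stage on its own thread; an interrupt simply ends the stage.
    private static Thread start(Stage stage) {
        Thread t = new Thread(() -> {
            try {
                stage.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Squares every input value using three stages linked by two queues.
    // An empty Optional marks the end of the stream.
    public static List<Integer> run(List<Integer> input) {
        BlockingQueue<Optional<Integer>> toSquare = new ArrayBlockingQueue<>(16);
        BlockingQueue<Optional<Integer>> toSink = new ArrayBlockingQueue<>(16);
        List<Integer> results = new ArrayList<>();

        Thread source = start(() -> {
            for (Integer v : input) {
                toSquare.put(Optional.of(v));
            }
            toSquare.put(Optional.empty());
        });
        Thread square = start(() -> {
            Optional<Integer> v;
            while ((v = toSquare.take()).isPresent()) {
                toSink.put(Optional.of(v.get() * v.get()));
            }
            toSink.put(Optional.empty());
        });
        Thread sink = start(() -> {
            Optional<Integer> v;
            while ((v = toSink.take()).isPresent()) {
                results.add(v.get());
            }
        });

        try {
            source.join();
            square.join();
            sink.join(); // join makes the sink's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4))); // prints [1, 4, 9, 16]
    }
}
```

Because each queue has a single writer and a single reader, values arrive at the sink in the order the source emitted them, while all three stages can be active at the same time.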
The dataflow-based architecture on which DataRush is built combines three techniques: horizontal partitioning, vertical partitioning, and pipeline parallelism. As a result, DataRush supports both implicit parallelism (using the best algorithms for the problem being solved) and explicit parallelism (controlling which aspects of a task can be made parallel).
DataRush applications contain three main component types (see the component diagram in Figure 4):
- Process: this basic unit of dataflow programming in DataRush is a single-threaded scalar operator that implements some business logic. A process is analogous to a subtask as described in the previous pipeline example.
- Assembly: itself an operator, an assembly is a multi-threaded composite of other operators and customizers.
- Customizer: this configuration property or subcomponent dynamically controls or extends an assembly.
The relationship between these components in a DataRush application forms a tree where a node is an assembly and a leaf (end-node) is either a process or a customizer (see Figure 5).
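The tree relationship described above can be sketched as a small composite structure. This is only an illustration in plain Java, not the DataRush class hierarchy; the names Component, AssemblyNode, ProcessLeaf, CustomizerLeaf, and the sample component names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical model of the component tree,
// not the real DataRush types.
public class ComponentTreeSketch {

    public interface Component {
        String name();
    }

    // Leaf standing in for a single-threaded process.
    public static final class ProcessLeaf implements Component {
        private final String name;
        public ProcessLeaf(String name) { this.name = name; }
        public String name() { return name; }
    }

    // Leaf standing in for a customizer that configures an assembly.
    public static final class CustomizerLeaf implements Component {
        private final String name;
        public CustomizerLeaf(String name) { this.name = name; }
        public String name() { return name; }
    }

    // Inner node: a composite of other operators and customizers.
    public static final class AssemblyNode implements Component {
        private final String name;
        private final List<Component> children = new ArrayList<>();
        public AssemblyNode(String name) { this.name = name; }
        public String name() { return name; }
        public AssemblyNode add(Component child) {
            children.add(child);
            return this;
        }
        public List<Component> children() { return children; }
    }

    // Counts processes and customizers below (or at) the given component.
    public static int countLeaves(Component c) {
        if (!(c instanceof AssemblyNode)) {
            return 1;
        }
        int total = 0;
        for (Component child : ((AssemblyNode) c).children()) {
            total += countLeaves(child);
        }
        return total;
    }

    // Number of levels in the tree; a lone leaf has depth 1.
    public static int depth(Component c) {
        if (!(c instanceof AssemblyNode)) {
            return 1;
        }
        int deepest = 0;
        for (Component child : ((AssemblyNode) c).children()) {
            deepest = Math.max(deepest, depth(child));
        }
        return 1 + deepest;
    }

    // A made-up application: read, transform (nested assembly), write.
    public static AssemblyNode sampleTree() {
        AssemblyNode transform = new AssemblyNode("transform")
                .add(new ProcessLeaf("parse"))
                .add(new CustomizerLeaf("parallelism"));
        return new AssemblyNode("app")
                .add(new ProcessLeaf("reader"))
                .add(transform)
                .add(new ProcessLeaf("writer"));
    }

    public static void main(String[] args) {
        AssemblyNode app = sampleTree();
        System.out.println(countLeaves(app)); // prints 4
        System.out.println(depth(app));       // prints 3
    }
}
```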
Operators define ports for both data input and output, and communicate with one another through links between ports. For instance, the output port of Operator A can be linked to the input port of Operator B (see Figure 6). There are two types of ports. The first, a scalar port, is equivalent to a column of data where all the data is of one type. The second, a composite port, is an aggregate of scalar ports (subports), analogous to a row containing columns of data.
For efficiency, and to maintain data ordering, DataRush creates one queue per output port, with only one writer and zero to many readers. An operator port can also be linked to an external data source or target, such as a relational database or a file system. The linking of two operators through their ports defines a simple dataflow. Dataflows get more complex as more operators are linked to one another. The resulting dataflow forms a directed acyclic graph (see Figure 7).
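One way to picture a queue with a single writer and any number of readers is an append-only buffer in which every reader keeps its own position. The sketch below is an assumption about how such a structure could look in plain Java, not a description of DataRush internals; OutputPort, Reader, and their methods are hypothetical names, and the buffer is unbounded for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical single-writer, multi-reader port,
// not the DataRush implementation.
public class OutputPortSketch {

    public static final class OutputPort<T> {
        private final List<T> buffer = new ArrayList<>();
        private boolean closed = false;

        // Called by the single writer; values must be non-null because
        // null is reserved as the end-of-stream signal for readers.
        public synchronized void write(T value) {
            if (closed) {
                throw new IllegalStateException("port is closed");
            }
            buffer.add(value);
            notifyAll();
        }

        // Signals end of stream to all readers.
        public synchronized void close() {
            closed = true;
            notifyAll();
        }

        // Each reader starts at the beginning and sees every value in order.
        public Reader newReader() {
            return new Reader();
        }

        public final class Reader {
            private int next = 0;

            // Blocks until a value is available; returns null at end of stream.
            public T read() {
                synchronized (OutputPort.this) {
                    try {
                        while (next >= buffer.size() && !closed) {
                            OutputPort.this.wait();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while reading", e);
                    }
                    return next < buffer.size() ? buffer.get(next++) : null;
                }
            }
        }
    }

    public static void main(String[] args) {
        OutputPort<String> port = new OutputPort<>();
        OutputPort<String>.Reader first = port.newReader();
        OutputPort<String>.Reader second = port.newReader();

        // The single writer runs on its own thread.
        new Thread(() -> {
            port.write("a");
            port.write("b");
            port.close();
        }).start();

        for (String v = first.read(); v != null; v = first.read()) {
            System.out.println("first: " + v);
        }
        for (String v = second.read(); v != null; v = second.read()) {
            System.out.println("second: " + v);
        }
    }
}
```

With only one writer appending to the buffer, every reader observes the same sequence in the same order, which is the ordering property the single-writer rule is meant to protect.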
The proper use of ports helps DataRush to achieve further parallelism because it allows columns of data to be delivered independently, even if the data is retrieved from different sources. Contrast this with a database that returns entire rows of data, or an object that must populate all its member variables before you can use it. In these examples, your code must wait until all the data is retrieved, either as a row or an object, before it can process any of the data. DataRush, however, allows ports to deliver their data in parallel to other ports so that your code can process data as it becomes available, without waiting or blocking.
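The contrast with row-at-a-time delivery can be sketched with CompletableFuture from the Java standard library. This is not DataRush code: the class ColumnPortsSketch, the method sumWhenAvailable, and the quantity and price columns are hypothetical. Each column is modeled as a future that completes when its data arrives, and the processing attached to one column runs without waiting for the other.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: columns modeled as independent futures,
// not the DataRush port API.
public class ColumnPortsSketch {

    static long sum(List<Integer> column) {
        long total = 0;
        for (int v : column) {
            total += v;
        }
        return total;
    }

    // Attaches processing to a column; it runs as soon as that column
    // completes, regardless of the state of any other column.
    public static CompletableFuture<Long> sumWhenAvailable(
            CompletableFuture<List<Integer>> column) {
        return column.thenApply(ColumnPortsSketch::sum);
    }

    public static void main(String[] args) {
        CompletableFuture<List<Integer>> quantity = new CompletableFuture<>();
        CompletableFuture<List<Integer>> price = new CompletableFuture<>();

        CompletableFuture<Long> quantitySum = sumWhenAvailable(quantity);
        CompletableFuture<Long> priceSum = sumWhenAvailable(price);

        // The quantity column arrives first (for example, from a faster source).
        quantity.complete(Arrays.asList(1, 2, 3));
        System.out.println("quantity sum: " + quantitySum.join());       // prints quantity sum: 6
        System.out.println("price sum ready? " + priceSum.isDone());     // prints price sum ready? false

        // The price column arrives later and is processed at that point.
        price.complete(Arrays.asList(10, 20));
        System.out.println("price sum: " + priceSum.join());             // prints price sum: 30
    }
}
```

A row-oriented version would need both columns before doing any work; here the quantity total is available while the price column is still outstanding.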