Implement Parallel Processing in Your Java Applications : Page 3

How do you, as a Java developer, adapt your applications to the multi-core and parallel computing trends? A new Java framework can help you build parallel applications quickly.

Introducing Pervasive DataRush
Pervasive DataRush is a Java framework with tool support for building parallel applications quickly. It is based on the idea of a dataflow, in which data-processing operators (small units of business logic) are linked together through ports to form an application. Operators read data from input ports, process that data, and write the results to output ports. DataRush links the operator ports with queues so that the operators can work in parallel. Pipeline parallelism, as implemented in DataRush, is the ability to stitch independent operators together into an overall dataflow graph in which each operator runs in its own thread (or set of threads). Because the graph is pipelined, it keeps multiple cores busy as data flows through the queues between operators. Pipelining also makes it possible to build complex applications flexibly from simple parts. Through this implementation, DataRush handles all the challenges listed in the previous section.
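
To make the idea of operators linked by queues concrete, here is a minimal sketch that uses only the JDK's java.util.concurrent classes. It is not the DataRush API: the class name PipelineSketch, the run method, and the EOS end-of-stream marker are all assumptions made for illustration. Three operators (generate, square, collect) run on separate threads and exchange data through bounded queues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: plain java.util.concurrent, not the DataRush API.
final class PipelineSketch {
    // Hypothetical end-of-stream marker; assumes no real value equals Integer.MIN_VALUE.
    private static final int EOS = Integer.MIN_VALUE;

    static List<Integer> run(int count) {
        // One bounded queue per link between operators.
        BlockingQueue<Integer> generated = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> squared = new ArrayBlockingQueue<>(16);

        // Operator 1: writes the values 1..count to its output queue.
        Thread generator = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) generated.put(i);
                generated.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Operator 2: reads each value, squares it, and writes the result.
        Thread squarer = new Thread(() -> {
            try {
                for (int v = generated.take(); v != EOS; v = generated.take()) {
                    squared.put(v * v);
                }
                squared.put(EOS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        generator.start();
        squarer.start();

        // Operator 3 (on the calling thread): collects results as they arrive.
        List<Integer> results = new ArrayList<>();
        try {
            for (int v = squared.take(); v != EOS; v = squared.take()) {
                results.add(v);
            }
            generator.join();
            squarer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(5)); // prints [1, 4, 9, 16, 25]
    }
}
```

Because each queue has a single writer and a single reader, the values arrive in the order they were produced, and the squarer can work on early values while the generator is still producing later ones.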

The dataflow-based architecture on which DataRush is built combines three techniques: horizontal partitioning, vertical partitioning, and pipeline parallelism. As a result, DataRush supports both implicit parallelism (using the best algorithms for the problem being solved) and explicit parallelism (controlling which aspects of a task can be made parallel).

Figure 4. DataRush Component Diagram
DataRush applications contain three main component types (see the component diagram in Figure 4):
  1. Process: this basic unit of dataflow programming in DataRush is a single-threaded scalar operator that implements some business logic. A process is analogous to a subtask as described in the previous pipeline example.
  2. Assembly: itself an operator, an assembly is a multi-threaded composite of other operators and customizers.
  3. Customizer: this configuration property or subcomponent dynamically controls or extends an assembly.

The relationship between these components in a DataRush application forms a tree where a node is an assembly and a leaf (end-node) is either a process or a customizer (see Figure 5).
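
The tree described above can be modeled as a simple composite structure. The following sketch is only an illustration of that shape: the types Component, Process, Customizer, and Assembly are hypothetical stand-ins, not DataRush classes, and the component names in example() are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the component tree; none of these types are DataRush classes.
final class ComponentTreeSketch {
    interface Component {
        String name();
    }

    // Leaf: a single-threaded operator holding business logic.
    static final class Process implements Component {
        private final String name;
        Process(String name) { this.name = name; }
        public String name() { return name; }
    }

    // Leaf: a configuration property or subcomponent of an assembly.
    static final class Customizer implements Component {
        private final String name;
        Customizer(String name) { this.name = name; }
        public String name() { return name; }
    }

    // Node: a composite of other operators and customizers.
    static final class Assembly implements Component {
        private final String name;
        private final List<Component> children;
        Assembly(String name, Component... children) {
            this.name = name;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
        public String name() { return name; }
        List<Component> children() { return children; }
    }

    // Counts the leaves (processes and customizers) below a component.
    static int countLeaves(Component component) {
        if (!(component instanceof Assembly)) return 1;
        int total = 0;
        for (Component child : ((Assembly) component).children()) {
            total += countLeaves(child);
        }
        return total;
    }

    // An invented application: one customizer, two processes, and a nested assembly.
    static Component example() {
        return new Assembly("application",
            new Customizer("threadCount"),
            new Process("reader"),
            new Assembly("transform",
                new Process("parse"),
                new Process("filter")),
            new Process("writer"));
    }

    public static void main(String[] args) {
        System.out.println(countLeaves(example())); // prints 5
    }
}
```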

Figure 5. DataRush Component Tree
Figure 6. Ports for DataRush Operators

Operators define ports for both data input and output, and communicate with one another through links between ports. For instance, the output port of Operator A can be linked to the input port of Operator B (see Figure 6). There are two types of ports. The first, a scalar port, is equivalent to a column of data where all the data is of one type. The second, a composite port, is an aggregate of scalar ports (subports), analogous to a row containing columns of data.
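
The column-and-row analogy can be sketched in a few lines of Java. The ScalarPort and CompositePort types below are hypothetical and exist only to illustrate the relationship between the two kinds of port; they are not the DataRush port API.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical types for illustration; not the DataRush port API.
final class PortSketch {
    // Scalar port: one column of values, all of the same type.
    static final class ScalarPort<T> {
        private final Queue<T> values = new ArrayDeque<>();
        void write(T value) { values.add(value); }
        T read() { return values.poll(); } // returns null when no value is waiting
    }

    // Composite port: an aggregate of named scalar subports, like a row of columns.
    static final class CompositePort {
        private final Map<String, ScalarPort<?>> subports = new LinkedHashMap<>();

        <T> ScalarPort<T> addSubport(String name, ScalarPort<T> port) {
            subports.put(name, port);
            return port;
        }

        int subportCount() { return subports.size(); }
    }

    static String demo() {
        CompositePort customer = new CompositePort();
        ScalarPort<String> name = customer.addSubport("name", new ScalarPort<String>());
        ScalarPort<Integer> age = customer.addSubport("age", new ScalarPort<Integer>());
        // Each subport carries its own column and can be written or read independently.
        name.write("Ada");
        age.write(36);
        return name.read() + ":" + age.read() + ":" + customer.subportCount();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints Ada:36:2
    }
}
```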

Figure 7. A Directed Acyclic Graph
For efficiency, and to maintain data ordering, DataRush creates one queue per output port, with exactly one writer and zero or more readers. An operator port can also be linked to an external data source or target, such as a relational database or a file system. Linking two operators through their ports defines a simple dataflow. Dataflows become more complex as more operators are linked to one another. The resulting dataflow forms a directed acyclic graph (see Figure 7).
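
One way to picture a single-writer queue with several readers is a shared buffer on which every reader keeps its own position, so each reader sees every value in the order it was written. The sketch below follows that idea; it is an assumption about how such a queue could look, not a description of how DataRush implements its queues. The names OutputPortSketch and Reader are hypothetical, the buffer is unbounded for brevity, and null is reserved to signal the end of the stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not how DataRush implements its queues.
final class OutputPortSketch<T> {
    private final List<T> buffer = new ArrayList<>(); // unbounded here for brevity
    private boolean closed;

    // Called only by the single operator that owns this output port.
    synchronized void write(T value) {
        buffer.add(value);
        notifyAll();
    }

    // Marks the end of the stream and wakes any waiting readers.
    synchronized void close() {
        closed = true;
        notifyAll();
    }

    // Each reader gets its own cursor, so readers never steal values from one another.
    Reader newReader() { return new Reader(); }

    final class Reader {
        private int cursor;

        // Blocks until a value is available; returns null at the end of the stream.
        T read() {
            synchronized (OutputPortSketch.this) {
                try {
                    while (cursor == buffer.size() && !closed) {
                        OutputPortSketch.this.wait();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while reading", e);
                }
                return cursor < buffer.size() ? buffer.get(cursor++) : null;
            }
        }

        // Reads every remaining value up to the end of the stream.
        List<T> readAll() {
            List<T> out = new ArrayList<>();
            for (T v = read(); v != null; v = read()) out.add(v);
            return out;
        }
    }

    static String demo() {
        OutputPortSketch<Integer> port = new OutputPortSketch<>();
        OutputPortSketch<Integer>.Reader first = port.newReader();
        OutputPortSketch<Integer>.Reader second = port.newReader();
        for (int i = 1; i <= 3; i++) port.write(i);
        port.close();
        return first.readAll() + " " + second.readAll();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1, 2, 3] [1, 2, 3]
    }
}
```

With only one writer there is a single, well-defined order of values, which is why every reader observes the same sequence.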

The proper use of ports helps DataRush to achieve further parallelism because it allows columns of data to be delivered independently, even if the data is retrieved from different sources. Contrast this with a database that returns entire rows of data, or an object that must populate all its member variables before you can use it. In these examples, your code must wait until all the data is retrieved, either as a row or an object, before it can process any of the data. DataRush, however, allows ports to deliver their data in parallel to other ports so that your code can process data as it becomes available, without waiting or blocking.
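
The difference between waiting for a whole row and processing columns as they arrive can be sketched with the JDK's CompletableFuture. This is not the DataRush API: loadColumn, its simulated delays, and the column names are assumptions made for illustration. Each column is summed as soon as its own source delivers it, so the fast column is not held up by the slow one.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch using the JDK's CompletableFuture; not the DataRush API.
final class ColumnDeliverySketch {
    // Simulates a column arriving from its own source after some latency.
    static CompletableFuture<int[]> loadColumn(int[] values, long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return values;
        });
    }

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    static int process() {
        // Each column is processed when it arrives, independently of the other.
        CompletableFuture<Integer> quantities =
            loadColumn(new int[] {1, 2, 3}, 10).thenApply(ColumnDeliverySketch::sum);
        CompletableFuture<Integer> prices =
            loadColumn(new int[] {10, 20, 30}, 50).thenApply(ColumnDeliverySketch::sum);
        // Only the final combination needs both results.
        return quantities.join() + prices.join();
    }

    public static void main(String[] args) {
        System.out.println(process()); // prints 66
    }
}
```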
