Pipeline Parallelism and Partitioning
A pipeline in computing is a series of related steps, where each step depends on the output of the previous one. In many cases, however, some of the steps in the pipeline can execute in parallel. An implementation that exploits this is called pipeline parallelism. With this approach, even when one step in the pipeline depends on data from another, both can execute in parallel, at least partially, if the data is streamed between them while the first step is still generating it.
For instance, consider a process that reads data from a database, formats it, and then displays it. In a classic implementation, an application might sequentially submit a query to a database, format the results, and then make appropriate library calls to display the data. Unfortunately, the data can't be formatted until all of it is returned from the database, and it can't be displayed until formatting is complete (see Figure 2). (The data dependency between subtasks is often called a dataflow, and a dataflow graph represents the interaction of all subtasks within a task.) As a result, the processor remains idle while database I/O takes place, even on a single-processor system. This inefficiency is magnified on a multi-core system, where multiple processor cores remain idle.
|Figure 2. Inefficient Sequential Task Processing|
However, as mentioned previously, pipeline parallelism enables the processing of later steps in a pipeline to occur before previous steps have completely finished. Applying this concept to the previous example, suppose pipeline Step A performs a query that returns many rows from a database, Step B requires this data to do some formatting, and Step C displays the results. Using parallelism, you can stream data to the formatting code as it is read from the database so that formatting can begin before all the data is received (see Figure 3). Further, the code that displays the formatted data can begin its work even before all the data is processed by streaming the formatted data to the display code as it is formatted.
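A minimal sketch of this three-stage pipeline in Python illustrates the idea. The row data and formatting logic here are hypothetical stand-ins for the database query, formatter, and display code described above; bounded queues stream data between stages so each stage starts working as soon as the first item arrives:

```python
import threading
import queue

SENTINEL = object()  # marks the end of the stream

def query(out_q):
    # Step A: simulate rows streaming in from a database query
    for row in [("alice", 30), ("bob", 25), ("carol", 41)]:
        out_q.put(row)
    out_q.put(SENTINEL)

def fmt(in_q, out_q):
    # Step B: format each row as soon as it arrives
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(f"{row[0]:<10}{row[1]:>4}")
    out_q.put(SENTINEL)

def display(in_q, results):
    # Step C: consume formatted lines as they are produced
    while (line := in_q.get()) is not SENTINEL:
        results.append(line)

rows_q, lines_q, results = queue.Queue(), queue.Queue(), []
stages = [threading.Thread(target=query, args=(rows_q,)),
          threading.Thread(target=fmt, args=(rows_q, lines_q)),
          threading.Thread(target=display, args=(lines_q, results))]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

Because all three stages run concurrently, formatting and display overlap with the (simulated) database read rather than waiting for it to finish.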
This approach keeps all processors in a multi-core machine as busy as possible. While one step in the pipeline is still working (or waiting for I/O), other steps are actively processing the data received so far. As you increase the parallelism in the pipeline and the number of processor cores in the system, overall system performance and throughput increase.
|Figure 3. Pipeline Parallelism Breaks a Single Task into Multiple Work Units|
Combining pipeline parallelism with data partitioning allows data in multiple tables (and in columns within the rows of those tables) to be read from or written to simultaneously. This combination further increases system efficiency and performance, and its benefits aren't limited to database processing. Data from file processing, web service calls, and calls to legacy systems can be processed in parallel using the same patterns.
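Horizontal partitioning can be sketched with a few lines of Python. The row data and the per-partition computation here are hypothetical; the point is that the table is split into independent slices that worker threads process simultaneously:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical row data standing in for a database table
rows = list(range(100))

def process_partition(partition):
    # Each worker handles one horizontal slice of the table
    return sum(r * 2 for r in partition)

# Horizontal partitioning: split the rows into 4 interleaved slices
partitions = [rows[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(process_partition, partitions))
```

Vertical partitioning follows the same pattern, except that each worker would receive a subset of the columns rather than a subset of the rows.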
The performance gains from parallel computing don't come without challenges, however. Designing and developing software that takes advantage of pipeline parallelism and of horizontal and vertical partitioning requires new algorithms and tools to overcome those challenges. The remainder of this article examines some of the challenges and introduces a tool that addresses them, DataRush from Pervasive Software.
Parallel Processing Challenges
Parallel programming is the discipline of correctly developing parallel-processing applications. Common algorithms and design patterns take into account parallel overhead, task management, resource synchronization, system speedup, scalability, and overall efficiency. The challenges associated with them include:
- Choosing among the algorithms to implement (core load balancing, pipeline parallelism, vertical and horizontal partitioning, and so on)
- Balancing the factors involved in system efficiency
- Understanding the overhead of parallel computing
- Mastering multi-threaded programming, an advanced and error-prone development chore (For instance, you need to manually synchronize thread access to shared resources, while guarding against thread deadlocks.)
- Dynamically scaling the amount of parallelism employed on different target machines (based upon the number of cores available, and other resources)
- Dynamically managing streaming dataflow between tasks based on thread and processor workloads (This usually entails managing inter-task work queues.)
- Applying Amdahl's Law to estimate the amount of speedup from parallel processing, using design-time and runtime factors (such as algorithms used and the number of cores)
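The last point, Amdahl's Law, is simple enough to compute directly. It states that if a fraction P of a program can be parallelized across N cores, the maximum speedup is 1 / ((1 - P) + P/N). A short sketch:

```python
def amdahl_speedup(parallel_fraction, cores):
    # Amdahl's Law: speedup = 1 / ((1 - P) + P / N),
    # where P is the parallelizable fraction and N is the core count
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# For example, if 90% of the work parallelizes across 8 cores,
# the best possible speedup is only about 4.7x, not 8x:
speedup = amdahl_speedup(0.9, 8)
```

Note how the serial 10% dominates: even with an unlimited number of cores, the speedup for this workload can never exceed 1 / 0.1 = 10x.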
Additionally, you need to wrestle with these challenges for each parallel-programming problem you face. Pervasive DataRush is a framework that can help.