With the processing power of single-core chips running out of steam, hardware manufacturers have turned to multicore processors to make up for the shortfall. Unfortunately, developing multicore-capable applications is not a transparent or plug-and-play proposition. Such applications require developers to use parallel programming techniques, which are not simple and require program design changes. On the plus side, increased funding for parallel processing research may lead to improved parallel processing design, programming, and testing techniques. This article explores an alternative approach that can automatically perform efficient and effective parallel processing of hierarchical data structures.
Before the advent of relational technology, hierarchical structures and their processing were used heavily in applications and databases. At that time, however, hierarchical structures were rigid, so they lost ground to the more flexible relational technology. Thanks to XML's ubiquity and flexibility, hierarchical structures are making a comeback. They are well-behaved and offer advanced processing capabilities, including one unexplored capability examined here: hierarchical structures and their processing are fully and naturally parallelizable.
Because they’re fully and naturally parallelizable, hierarchical structures can be processed in parallel automatically, which in turn enables automatic, efficient, and optimized use of multicore processors at a level not achieved before. This article shows how to implement a complete and automatic parallel processing solution for hierarchical query processing. Making the process automatic is critical, because hardware manufacturers aren’t likely to stop at two, four, or eight cores in chips; future chips will have 64 or even 128 cores. The only way to use all these cores effectively is to make the process automatic.
|If you’re not familiar with the basics of querying hierarchical structures, I suggest you read these two earlier DevX articles, Navigationless Database XML: Hierarchical Data Processing, and Creating Hierarchical Data Structure Mashups.|
Automatically Parallelizable Hierarchical Query Processing
Notice the naturally parallelizable nature of hierarchical data structures in Figure 1. In this example, the nodes are named A through G, and their data occurrences are numbered. The nodes going down represent pathways. Also note that the pathways naturally spread out, creating more and more pathways—and that those pathways are independent. In other words, each pathway can be processed independently and at its own speed. In this example, each node in every path has two separate data occurrences, creating multiple data occurrences of each pathway.
|Figure 1. Hierarchical Data Structure: The figure shows that hierarchical pathways are independent, and thus parallelizable.|
|Figure 2. Parallel Processing From Start to Finish: The dashed lines in the figure show how the pathways through a hierarchical structure are naturally parallelizable from data access through all processing stages.|
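As a minimal sketch (the node names and the per-path work are hypothetical, not part of any real engine), independent pathways like those in Figure 1 map directly onto a pool of workers:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pathways through the A..G structure of Figure 1;
# each node sequence is independent, so each can go to its own worker.
pathways = [
    ("A", "B", "D"),
    ("A", "B", "E"),
    ("A", "C", "F"),
    ("A", "C", "G"),
]

def process_pathway(path):
    # Stand-in for the real query work done along one pathway.
    return "/".join(path)

# Each pathway is processed concurrently, with no coordination required
# between workers because the pathways never overlap.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_pathway, pathways))

print(results)  # ['A/B/D', 'A/B/E', 'A/C/F', 'A/C/G']
```

Because the pathways share no state, no locking is needed; the worker count can grow with the number of pathways and data occurrences.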
These same hierarchical structure pathways are used from data input, through processing, through data memory, to output, as shown in Figure 2.
Each pathway can contain multiple data occurrences, as shown, which are also naturally parallelizable, giving an extra boost to parallel processing's advantages.
With the hierarchical data structure known at processing time, the parallel processing strategy can be determined automatically with no user intervention. In addition, this parallel processing strategy can be adjusted dynamically during execution as resources, inefficiencies, and better opportunities are determined. The automatic operation knows the exact parallel processing status and the status of the multiple cores at all times.
Because of the many separate hierarchical pathways, each with a possibility of a significant number of data occurrences, a large number of cores can be used independently and simultaneously, allowing maximum multicore utilization and efficiency. In addition, such automatic parallel processing is both provable and accurate because of hierarchical structures' well-known behavior and characteristics.
SQL Hierarchical Parallel Processing—A Good Target
Because SQL can automatically process hierarchical structures, it can naturally support a high level of parallel processing, controlled by a parallel processor controller that automatically assigns tasks to multiple cores. Is it worth the effort to build such a controller for SQL hierarchical processing?
SQL has a huge audience; it is by far the most popular database query language. So if SQL can be made to exploit multicore parallelism automatically, letting developers keep using a familiar language, the approach deserves serious consideration.
Hierarchical structures are useful because they organize data in a way that increases in value geometrically as additional nodes are added. Most applications can be written to use hierarchical structures. In addition, the natural creation of hierarchical structures automatically acts as the parallel processing design phase for multicore applications.
Hierarchical XML integration with SQL hierarchical processing is transparent and allows nontechnical users to build powerful queries. Hierarchical optimization enables the use of global views so users don’t need to be concerned with the structure of the data being processed, and also adds significant efficiency and reusability—which should also attract more users.
The combination of hierarchical and parallel processing can also support full multipath hierarchical processing. For even greater benefit, SQL operations that were previously limited to linear processing, such as aggregation, ORDER BY, and GROUP BY, can be extended to operate hierarchically across the multiple pathways: each path can carry out its own aggregation, ordering, or grouping operation, working independently on its own pathway of the multipath hierarchical structure.
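A hedged sketch of this per-path idea in Python, with made-up path names, data, and operations: each path runs its own ordering, aggregation, or counting operation, and the paths run concurrently because they are independent.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data occurrences along three pathways, each path
# carrying its own operation: ordering, summing, or counting.
path_data = {
    "A/B/D": [3, 1, 2],
    "A/C/F": [10, 20],
    "A/C/G": [5, 5, 5],
}
path_ops = {
    "A/B/D": sorted,                  # this path is ordered (ORDER BY)
    "A/C/F": lambda xs: [sum(xs)],    # this path is aggregated (SUM)
    "A/C/G": lambda xs: [len(xs)],    # this path is counted (COUNT)
}

def run_path(path):
    # Each path applies its own operation, independent of the others.
    return path, path_ops[path](path_data[path])

with ThreadPoolExecutor() as pool:
    results = dict(pool.map(run_path, path_data))

print(results)  # {'A/B/D': [1, 2, 3], 'A/C/F': [30], 'A/C/G': [3]}
```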
Parallel Processing Review
Current high-level programs capable of automatic parallel processing can only identify a well-known group of parallelizable processes involving sets of data. The remaining program logic is difficult to identify automatically, so it requires a parallel programming and design stage or it is left unparallelized.
There are three basic methods to parallelize a program and make it perform faster: identifying independent tasks, partitioning data, and pipelining. If your application involves large sequential tasks, they can be broken into smaller tasks that run in parallel. If your program is data oriented, the data can be organized into separate horizontal or vertical chunks so that different data ranges can be processed in parallel. Both of these approaches speed up execution. Pipelining, in contrast, overlaps multiple executions of the entire application, not to make any single execution faster but to greatly increase throughput. Applications that loop back on themselves, however, can use this same overlapping technique to reduce execution time as well. All three methods can be used in the hierarchical processing application described in this article.
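The data-partitioning approach can be sketched in a few lines of Python; the chunk size and the summing workload here are illustrative, not part of the article's design:

```python
from concurrent.futures import ThreadPoolExecutor

# Horizontal data partitioning: split one large data set into chunks
# and let separate workers process the ranges in parallel.
data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, chunks))  # one partial sum per chunk

# Combine the per-partition results into the final answer.
total = sum(partials)
print(total)  # 5050
```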
The more independent tasks that can run in parallel, the greater the benefit. However, simply assigning more cores to a given task yields less and less return for each additional core (Amdahl's law). For example, suppose you have two different sort processes that can run at the same time. Both have an equivalent amount of data to sort, so the two sorts are well balanced. You could perform the sort tasks in parallel by partitioning the data. Assume that you have four available cores. Applying Amdahl's law, allocating three cores to the first sort and only one to the second is not as efficient as assigning two cores to each sort.
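A rough calculation, using an idealized linear speedup model and a hypothetical 60-second sort, shows why the balanced split wins:

```python
# Idealized finishing time for two equal sorts under different core
# allocations; the job ends only when the slower sort ends.
T = 60.0  # hypothetical seconds for one sort on a single core

def finish_time(cores_a, cores_b):
    # Each sort's data is partitioned across its assigned cores,
    # so its time shrinks (ideally) in proportion to its core count.
    return max(T / cores_a, T / cores_b)

print(finish_time(3, 1))  # 60.0 -- the single-core sort dominates
print(finish_time(2, 2))  # 30.0 -- the balanced split finishes twice as fast
```

In practice the speedup per core is sublinear, which only strengthens the case for spreading cores evenly across independent tasks.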
Spreading cores over independent tasks fits extremely well into hierarchical structure processing. The required tasks can be determined automatically based on the user’s query, and assigned to many processors, each working independently because of the independent pathways.
SQL Hierarchical Engine for Increased Parallel Processing
SQL databases can automatically execute multiple instances of the query processor in parallel on large servers. On personal computers, SQL processors can automatically execute certain set operations in parallel, as already described. To achieve full and complete parallel processing, one more improvement is necessary: when SQL is operating hierarchically, the relational engine can be replaced transparently with a true hierarchical engine. The replacement is transparent because developers still use ANSI SQL to write queries, and the results are the same. However, hierarchical engines can be faster because they eliminate data duplication overhead and, most importantly, they can process the different hierarchical pathways in parallel much more efficiently and thoroughly.
Figure 2 demonstrates how the structure being processed is retrieved and modeled in the Data Storage Access step, then stored in memory in the Data Memory step, which models the data structure.
The Parallel Processing step then processes the data in memory, retrieving and storing the data as needed. Maximum parallel processing can be achieved because the pathways of the modeled structure are independent and can be processed concurrently. The data can even be retrieved from storage in parallel.
SQL Query Input Preprocessing
This section covers the preprocessor portion of the SQL Driven Hierarchical Engine; the following section covers its query processing. Basically, the preprocessor creates separate processing tasks that can be performed in parallel, while the query processor dynamically determines the best way to perform those tasks in parallel based on the currently available resources.
|Figure 3. SQL Query Preprocessing: In this figure, the dashed boxes represent operational data, the solid boxes represent software processes and the arrows represent data movement.|
Figure 3 shows the SQL Query preprocessor. The software processes take in operational data and produce new operational data; overall, the process generates the parallel processing tasks and the runtime control blocks.
Walking through the SQL Query Preprocessing in Figure 3, the SQL query statement is parsed first. Parsing not only produces the required processing steps in an internal form for easy access, but also externalizes the hierarchical data structure being processed. Together, these are used to build tasks that can be executed in parallel to perform the query. The information describing how these tasks combine to process the query, and how to fully exploit their parallel processing capability, is stored in the Control Blocks generated by the Control Block Builders: the data Input, data Memory, and Processing Control Block Builders. How these dynamically produced elements are brought together and controlled is covered in the following Query Driver section.
The SQL Query Preprocessing shown in Figure 3 can itself be performed in parallel. Internalizing the SQL request and determining the data structure being processed can proceed concurrently. Determining the runtime data structure also naturally performs access optimization by omitting unnecessary pathways from its control block, as if they never existed. Further down the parallel paths, the separate Control Block Builders can run concurrently. On the other side, the parallel tasks builder is reentrant and can be invoked in parallel to build multiple tasks concurrently. These tasks consist of pseudo code that allows their operation to adapt to the current processing environment. The mechanism for dynamically controlling the concurrency level of parallel processing, described in the next section, may also be usable for the preprocessor's own concurrent processing.
There are three levels of concurrency for hierarchical processing. The first level, used only for very large queries, distributes pieces of the query to other processors or servers. Hierarchical processing makes this easier: because the query operates on a hierarchically modeled data structure, breaking it apart and later reassembling it is straightforward, and the distributed query pieces automatically operate hierarchically at their new locations. The second level is task parallelization along the different pathways, together with data partitioning. The third level is parallelization along a single pathway to handle multiple data occurrences concurrently. The first and third levels are optional, depending on the query and the available multicore resources.
The three levels of concurrency above each use their own core processor(s), but another technique, overlapping I/O with read-ahead, delivers a tremendous payback without using an additional dedicated core. It reduces very expensive wait states on a core by starting an I/O call and then performing processing while the much slower I/O call completes. This can itself be considered a concurrent process, because the I/O call is actually performing peripheral processing on the I/O device.
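The read-ahead technique can be sketched with a bounded buffer and a background reader thread; the sleep stands in for real I/O latency, and all names here are illustrative:

```python
import queue
import threading
import time

# Read-ahead sketch: a reader thread fetches the next block while the
# consumer processes the current one, hiding the (simulated) I/O wait.
def reader(blocks, buf):
    for block in blocks:
        time.sleep(0.01)   # stand-in for a slow I/O call
        buf.put(block)
    buf.put(None)          # end-of-data marker

buf = queue.Queue(maxsize=2)  # bounded read-ahead buffer
threading.Thread(target=reader, args=(range(5), buf)).start()

processed = []
while (block := buf.get()) is not None:
    processed.append(block * 2)  # CPU work overlaps the next read

print(processed)  # [0, 2, 4, 6, 8]
```

The bounded queue keeps the reader only slightly ahead of the consumer, so memory use stays constant while I/O waits are hidden behind CPU work.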
In memory, the hierarchical data storage has many parallel processing advantages. First, no duplicate data is required, as is the case with relational processing engines. Figure 1 also shows how the structured data memory is laid out hierarchically. It demonstrates how hierarchical data memory naturally supports the independent pathways used by their specific tasks without conflicting with the other tasks shown in Figure 2. In this way, different concurrent tasks can update the data memory at the same time. The spread of the hierarchical structure should also help limit caching conflicts between different cores by keeping the memory areas of concurrent tasks at a sufficient distance from each other.
Hierarchical Query Driver and Parallel Processing Controller
The input query is preprocessed by generating the query tasks and runtime control blocks necessary to perform the query, as described in the previous section. The preprocessed query can be saved and executed later, using the SQL SELECT list to control the operation. This section covers the software design needed to execute the preprocessed query in parallel.
|Figure 4. Query Parallel Processor: At the heart of the processor, the Parallel Processor Controller controls dynamic parallelization.|
Figure 4 depicts the design of the Query Parallel Processor. It controls the input data access, processing tasks, memory use, and query result output—all of which can be run in parallel and controlled dynamically using a Parallel Processor Controller (PP CNTL in Figure 4).
The Query Driver coordinates and drives the entire query processing in Figure 4. It knows what must be done from the information supplied in the Query Processing Control Block. It controls the Input Controller, Output Processor, Memory Manager, and the Task Driver. The Input Controller controls the data access routines. The Output Processor outputs the query result data in structured XML. The Memory Manager manages the hierarchical memory linkages and optimizes memory usage. The Task Driver controls the Query Tasks and their concurrent execution by the Task Interpreter, which processes the pseudo code. The Parallel Process Controller controls dynamic parallelization.
The Input Controller and the XML and relational data access routines in Figure 4 access the required data efficiently when needed, retrieving only data along the pathways actually being used (as discussed earlier). The XML access routine avoids the standard DOM and SAX parsers in favor of its own routine, optimized especially for the database hierarchical access that SQL uses. Multipath data access can be performed in parallel. Other possible data sources include hierarchically modeled flat files and other physically hierarchical sources such as structured VSAM files and IBM's still-popular IMS.
The Memory Manager also manages the hierarchical data linkages shown in Figure 1, and optimizes memory use by reusing memory as soon as possible. The hierarchical data structure consists of multiple data occurrences at each node. Sometimes all of a node's data occurrences need to be in memory; at other times only the current occurrence is needed, and its memory space can then be reused to hold the next data occurrence. If data will be reused multiple times, it is kept until no longer needed, which helps storage efficiency for large, complex queries. The basic idea is to be memory efficient while never having to retrieve the same data more than once. Proper data management can save a great deal of memory.
The Output Processor formats and outputs the query result in structured XML. This is possible because the processor is aware of the structure being processed, so it knows the output structure as well. Unlike the other tasks shown in Figure 4, output processing normally must wait until the query completes. However, it can be overlapped in pipeline fashion, controlled at each root node data occurrence, achieving concurrent parallel processing: the Output Processor can emit XML for each data occurrence of the root node, overlapping query processing with output. This works except when a specific ordering at the root level is required and the data is not already ordered or indexed in order. Overlapping the output is particularly useful when the result is displayed interactively, because it reduces display lag and allows the first screen of data to be presented quickly, without waiting for the complete query to finish.
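A minimal sketch of the per-root pipelining idea, with invented result data: each root occurrence's XML becomes available as soon as its subtree completes, rather than after the whole query finishes.

```python
# Pipelined output sketch: emit XML for each root data occurrence as
# soon as its subtree is complete (hypothetical result data below).
roots = [("r1", ["a", "b"]), ("r2", ["c"])]

def completed_roots():
    # A generator models the pipeline: each root is yielded the moment
    # its per-root query processing would finish.
    for name, children in roots:
        yield name, children

emitted = []
for name, children in completed_roots():
    body = "".join("<c>%s</c>" % c for c in children)
    emitted.append("<%s>%s</%s>" % (name, body, name))
    # The first root is ready for display before the query ends.

print(emitted)
```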
The Query Driver in Figure 4 controls the execution order of the query tasks to ensure a correct result. It also controls the level of parallelization by running some tasks concurrently as determined by the Parallel Processor (PP) Controller. The Query Driver has a thread assigned for system tasks that it may loan out for small non-system tasks.
The Parallel Processor Controller in Figure 4 does more than determine which tasks can run concurrently; it also performs all the required concurrency operations. Other parallel processing considerations not already covered include:
- Keeping track of available resources and deciding where to allocate scarce resources
- Allocating and assigning threads in pools to avoid the overhead of closing and launching threads
- Performing dynamic load balancing through thread stealing, which also avoids the overhead of closing and launching threads
- Giving longer paths priority, so the processor can perform more work for the same amount of overhead
- Assigning threads first to tasks that do not already have a thread (Amdahl's law of diminishing returns)
- Limiting the threads for a given task to three (also because of Amdahl's law)
- Waiting on events rather than time periods, which makes more efficient use of resources
- Synchronizing resources at active node fork points, which are hierarchical processing coordination points
- Considering aggregation operations for partitioning (multiple aggregations are already parallelizable when on different paths, and each can still be partitioned for parallel processing if worthwhile)
- Operating efficiently even on single-core systems by taking advantage of read-ahead and I/O overlapping
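Two of the points above, reusable thread pools with a capped worker count and waiting on events rather than timers, map directly onto standard-library primitives. A small illustrative sketch (the tasks themselves are hypothetical):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# An event lets a waiter block until signaled, instead of polling
# or sleeping for a fixed time period.
done = threading.Event()

def task(n):
    if n == 9:
        done.set()   # signal completion of the last task by event
    return n * n

# One pool serves all ten tasks, avoiding per-task thread
# launch/teardown overhead; max_workers caps the thread count.
with ThreadPoolExecutor(max_workers=3) as pool:
    squares = list(pool.map(task, range(10)))

done.wait()          # returns immediately once the event is set
print(squares[:4])  # [0, 1, 4, 9]
```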
Many applications can easily be written as hierarchical applications to take advantage of their extra capabilities and processing advantages. Parallelizing applications by hand is difficult and requires extra design and testing; automatic generation of parallel processing relieves programmers of this difficult and error-prone task. Developing an advanced and dynamic parallel processing controller is not justifiable for a single parallel processing application, but it is worth the effort when amortized across many different applications or unlimited new queries. Such a processor is generic, allows experimental tuning, and holds out the possibility of absorbing future technology improvements as they occur.
Hierarchical multi-core query processing opens the door to extremely powerful and internally complex queries that can be specified without prior knowledge of the queried structure. With the ability to easily specify more complex queries, and utilize more cores, even more powerful parallel processor controllers will arise that can fully and efficiently use the ever-increasing available power of multiple cores.