ML data in standard database processing is not being used fully or correctly in business applications today. Current XML hierarchical database query processing is basically limited to single path linear hierarchical processing. This limitation could be the influence of relational join processing or might occur because the navigation of multiple hierarchical paths is too difficult and error prone to be practical with XML’s manual navigation. XQuery has the same limitation; it also requires XML’s manual navigation. Eliminating manual navigation from database XML processing removes these XML limitations, allowing navigationless structure processing to be unrestricted and performed automatically.
This article describes how business applications can take advantage of powerful new capabilities stemming from properly processed database XML data, using advanced hierarchical processing. It also delves into the underlying hierarchical structure principles and processing. While this level of hierarchical processing is not generally available today, it was very common three decades ago—before the advent of relational databases. However, hierarchical XML data is ubiquitous; therefore, it’s time to apply the full power of such semantically rich structures.
Full hierarchical processing in databases is based on principles that have been proven to produce correct hierarchical processing data results, and can be used to replace the lack of any W3C specification or best-practices reference for performing correct hierarchical processing.
Hierarchical Data Structure: Basics
XML data resides in a hierarchical structure that contains multiple pathways connected by named nodes as in Figure 1, where the nodes have been named alphabetically—”A” through “G.” These names are associated with a node type or definition. Don’t confuse the node type with the node data occurrence; they’re different, even though the difference in meaning is not significant most of the time. The term node is usually sufficient.
|Figure 1. Database Hierarchical Node Type Structure: Each node’s name denotes its type.|
Hierarchical structures require navigation through various node types to access the data they contain. For example, in Figure 1, accessing node D requires first navigating to node B from node A and then to node D. Note that the hierarchical structure represented in Figure 1 can be either a physical hierarchical structure like XML or a logical relational hierarchical structure modeled in SQL. Logical and physical hierarchical structures have exactly the same hierarchical processing principles. Just like physical structures, logical hierarchical structures usually require navigation through intervening nodes—in other words, to get to logical node D, you’d have to navigate through logical nodes A and then B. This is necessary to preserve the semantics of the hierarchical structure. Accessing node B directly could introduce B data values that should have been filtered out by first going through node A. Physical structures are usually inherently protected from this type of invalid hierarchical processing operation.
Figure 1 serves to introduce a little more hierarchical data processing terminology. Nodes C and D are children or siblings of node B, their parent node. All nodes under a particular node such as B, E C, D, F, and G are descendents of node A, making node A their ancestor. Nodes A, B, and C make a path or pathway, indicated as A/B/C. Historically, in hierarchical databases, these hierarchical paths were known as legs. Sibling paths such as A/B/C and A/E/G must begin at a common originating node, node A in this case.
Hierarchical data structures have basic principles, and hierarchical data processing has operational principles. The basic hierarchical data structure principles are that a parent node can contain data without containing children, but no child node can exist without its parent node containing data. Operationally, this principle means that a given data pathway through the structure in Figure 1, such as A/B/C can terminate with node B when node C has no data—in other words, the data occurrence path terminates at the point where data stops. This is allowed because of hierarchical data preservation. In most other systems, this is not the default operation; for example, standard inner join relational processing would slice out the existing partial path data occurrences. Hierarchical data preservation allows the data occurrences along the same hierarchical pathways to vary in number of nodes reached.
Hierarchical Data Structure Processing: Basics
There are two basic types of hierarchical data processing structures: single node types and multiple node types. Both XML and hierarchical databases require multiple node types. Single node type hierarchical structures are fairly simple one-dimensional structures. They are used, for example, in organizational charts where each node represents a person. Usually, these node types can only contain a single data occurrence. Additional data occurrences are handled in single node type structures by creating another of the same node type to contain the child data.
|Figure 2. Hierarchical Structure with Data: This node type structure uses multiple node types suitable for both hierarchical databases and XML.|
Figure 2 shows a multiple node type data structure with its data that’s based on the multiple hierarchical node type structure shown above in Figure 1, but uses multiple node types to define the more complex hierarchical structure required for databases. Each node type can also contain the multiple data occurrences required for complex hierarchical structures in XML. For example, node type B has child data occurrences of C1, C2, C3, C4, D1, D2, D3, and D4. The B1 data occurrence of node type B has C1, C2, D1, and D2 child data occurrences. If the B1 data occurrence is removed, its children data occurrences will also be removed (a cascading delete), along with their related data occurrence descendents. However, node A1’s parent data occurrence is preserved regardless of whether it has child data occurrences.
Node data occurrences D1 and D2 are known as twins. Node data occurrences D3 and D4 are a different set of twins (they are related by a different parent data occurrence). There is no implied order or meaning across sibling twin data occurrences such as C1, C2 and D1, D2. This means that C1, D1 has no implied relationship over C1, D2. These twin relationships are very important in multi-path hierarchical processing. Sibling paths are independent, but their processing still needs to be coordinated.
Hierarchical Database Processing: External Operations
Filtering operations affect the entire hierarchical structure in a hierarchical manner. Filtering multi-path hierarchical queries is a much more complex and powerful operation than filtering single path queries. Standard linear hierarchical queries are limited because they filter data occurrences only upwardly and downwardly. In Figure 3, filtering directly on node E can affect F and G descendent nodes going down the structure and the A ancestor node going up the structure. In this multi-path hierarchical structure, there are other sibling paths connected by node A (A/B/C and A/B/D). When a data occurrence of node A is filtered out, the basic natural hierarchical data preservation rules of a hierarchical structure cause the underlying descendent data occurrences of nodes B, C, and D to be filtered out also. This highlights an important and powerful fact: Every node type in the hierarchical structure is related to every other node type in the structure, which gives full hierarchical processing its unlimited processing power.
|Figure 3. Hierarchical Data Filtering Flow: In a hierarchical database, filtering on node E can affect both descendent and ancestor nodes.|
The following examples use SQL to demonstrate performing hierarchical processing on hierarchical structures. SQL’s relational structures map naturally to hierarchical structures. Relational tables hierarchically linked or joined together form a natural logical hierarchical structure: tables represent the node types and rows represent the data occurrences. SQL’s generic SELECT, FROM, and WHERE commands operate on hierarchical structures naturally. The SELECT operation selects XML and relational data items from a hierarchical structure; the FROM operation models and specifies the modeled hierarchical structure view in SQL; and the WHERE operation specifies hierarchical data filtering.
ANSI SQL supports full hierarchically processing inherently and transparently by using the hierarchically oriented LEFT OUTER JOIN operation, which preserves data on the left side and not the right side. The code below defines the SQL multi-path hierarchical structure view from Figure 1 and is used in all the SQL examples:
CREATE View StructureView AS SELECT A FROM LEFT JOIN B ON A.PKey=B.FKey LEFT JOIN C ON B.PKey=C.FKey LEFT JOIN D ON B.PKey=D.FKey LEFT JOIN E ON A.PKey=E.FKey LEFT JOIN F ON E.PKey=F.FKey LEFT JOIN G ON E.PKey=G.FKey
SQL also supports the required variable length pathway data occurrences required for hierarchical processing. Trailing partial missing pathways are padded with SQL null values. The SELECT operation transfers selected nodes to the output structure, which preserves the basic structure but slices out the unselected nodes. All these hierarchical operations are reflected naturally in the relational rowset result—demonstrating that the SQL-to-hierarchical mapping is seamless, correct, and intuitive.
Hierarchical data processing operations process all requested pathways even if they have been terminated prematurely because of the hierarchical data preservation operation described previously. A data-filtering operation on a node data occurrence removes the node data occurrence and all its attached descendent node data occurrences. In many ways, it is easier to consider the WHERE clause as a data qualifier because WHERE E.e=’E1′ as used above in Figure 3, is actually qualifying the E1 data occurrence and filtering out all other unrelated data occurrence values.
|Author’s Note: This article shows the output of the SQL examples as hierarchical structure results rather than structured XML to make it easier to visualize the hierarchical operations taking effect.|
|Figure 4. Sample Node Structure: This structure results from SQL WHERE filtering using the query “SELECT ALL FROM StructureView WHERE D.d=’D3′ AND G.g=’G2′.”|
SELECT ALL FROM StructureView WHERE D.d=’D3’ AND G.g=’G2’
Compare the result in Figure 4 with the full data view from Figure 2. The query outputs both sides of the structure because it uses the SELECT ALL command. Similarly, both sides were filtered because of the WHERE condition. Node data occurrences B1 and E2, and their descendent occurrences were not output because neither of them met the WHERE clause conditions. Nodes D3 and G2 were qualified directly in the WHERE clause so their related data occurrences C3, C4, and F1, F2 qualified for inclusion as the hierarchical data filtering flow in Figure 3 above dictates. D4 and G1 were not qualified because their data occurrences could never be qualified together with the AND condition. The result is semantically correct.
Here's a different query that excludes specific nodes from the StructureView structure in Figure 2:
SELECT A.a, C.c, D.d, G.g FROM StructureView WHERE F.f='F2'
The output (see Figure 5) excludes all fields in nodes B and E because they aren't in the SELECT list. It slices those node types from the output structure while preserving their lower-level node types and any of their node data descendents (C, D, and G).
|Figure 5. Excluded Node Example: Excluding all fields in nodes B and C with the query "SELECT A.a, C.c, D.d, G.g FROM StructureView WHERE F.f='F2'" causes node promotion and node collection.|
In the output, the preserved node data descendents are attached to the excluded node type's parent node type (see Figure 5), in a process called node promotion. Notice in this example that node A gains an additional sibling pathway. This operation is known as node collection. These hierarchical operations are standard for hierarchical processing and also occur naturally in hierarchical SQL queries.
Hierarchical Database Processing: Internal Operations
XML documents are hierarchically structured, composed of multiple pathways that contain a goldmine of hierarchical semantic information between the structure's pathways. Supporting multi-path queries with their additional semantics provides many advantages, and enables many advanced capabilities. Processing between multiple hierarchical pathways requires a complex coordination logic known as Lowest Common Ancestor (LCA) processing. This enables the correct semantics to be associated with multiple paths of hierarchical queries to produce meaningful hierarchical results. Here's an example:
SELECT ALL FROM StructureView WHERE F.f='F3' AND G.g='G4'
|Figure 6. Increased Data Value Example: The figure shows the structure that results from running the query "SELECT ALL FROM StructureView WHERE F.f='F3' AND G.g='G4'."|
Running the preceding query produces the structure shown in Figure 6.
A query such as this requires processing across paths. When WHERE clause conditions occur on two separate paths, the Lowest Common Ancestor node controls the coordination. Earlier, this article mentioned that sibling segments were independent. In practice, this means that the multiple data occurrences of nodes F and G are not tested together in any specific order as described earlier for twin node occurrences. Instead, they are tested in all combinations under the LCA node (the E node in this case). This limits the range combination to produce a meaningful result.
A second use of LCA logic is when the SELECT and WHERE clause references cross paths, such as when a data field on one path is selected based on data in another path. This is the case in Figure 6, where the LCA sits at the top of all the SELECT list items and the WHERE clause condition—node A in Figure 6. When the LCA data occurrence qualifies, all data occurrences under it qualify. This again follows the hierarchical data filtering flow specified in Figure 3. SQL automatically carries out LCA processing in its relational Cartesian product processing.
XML Markup Vs XML Database Data
Coders and users navigate XML hierarchically today, but that's not always necessary for hierarchical database structures. In the past, database hierarchical structures could be queried without the user specifying any navigation, because every defined node type in the structure was unique in the structure (as we have seen so far) and had only a single path to it as in Figure 6. Those two qualities make querying the structure unambiguous. In such cases, XML navigation can be performed internally automatically.
|Figure 7. Markup Hierarchical Node Structure: In markup-type hierarchical node structures such as XML, you can locate any specific node through navigation.|
In the markup hierarchical node structure shown in Figure 7, the D node type occurs in multiple locations. In other words, locating a specific D node requires navigation, such as the XPath query A/B/D. Alternatively, you could use the XPath //D search operation to locate the closest existing data occurrence of a D node—which could be either A/B/D or A/E/D. Such queries allow fuzzy operations such as finding the next closest matching keyword. Fuzzy operations are OK for markup uses, but are not OK for database operations, which need exact precise processing semantics. Imagine summing the "price" for books and inadvertently adding in the price for magazines when there is no data occurrence of "price" for a book. Database hierarchical processing should not have duplicate node types in a structure and should not have to use XPath search operations in database processing (except when it's isolated in separate search functions). User navigation is not necessary for database processing.
XML's use in databases was an afterthought; XML was designed to process markup data, not to store critical database data. Markup data and database data have very different uses and should not be processed identically. Markup data is more unpredictable and hierarchically forgiving than database data should be because the same node data types can occur in many locations in the hierarchical structure. Markup tag names are used as the node names they represent in their XML hierarchical structure to classify the text (i.e. which person said what) or to specify processing directives to be applied to the text (i.e. bold, underline). For example, in Figure 7, the D node type occurs in multiple locations, which makes the structure ambiguous for database querying and requires navigation to locate a specific D node type. You can temporarily correct this situation for database use by renaming nodes when necessary. SQL supports this.
The ideal solution for correct database XML hierarchical processing is separating database data processing from markup data processing. As you've seen, each has its own particular logic and processing rules. Separating the two achieves correct database results and frees the database to take full advantage of hierarchical data's unique capabilities, such as navigationless hierarchical processing.
Navigationless Hierarchical Processing
The hierarchical database's unambiguous structure enables full nonprocedural navigationless processing, using natural hierarchical processing rules and principles. These hierarchical processing rules and principles can be applied automatically to the hierarchical structure being processed to totally eliminate hierarchical structure manual navigation. You've already seen this in action in the query examples in this article (see the SQL queries in Figure 4, Figure 5, and Figure 6). Navigationless processing enables many XML capabilities not available today that can take XML database processing into the next generation.
Transparent Multi-Path Query Processing
The complex LCA coordination processing required for multi-path processing is too complicated and error prone to be performed by procedural processing and manual navigational specification. Utilizing nonprocedural processing and navigationless access, transparent multi-path hierarchical query processing has the ability to automatically process the entire multiple pathway hierarchical structure (or whatever portion of it is required for nonprocedural navigationless queries). This automatic hierarchical navigationless processing can handle both single and multi-path queries. Users do not need to be concerned with whether query processing requires single or multiple pathways. For example, the SQL query that results in Figure 6 has four different paths: A/B/C, A/B/D, A/E/F, and A/E/G.
Enables XML Query by Non-Technical Users
Currently, only trained XML knowledgeable users can specify queries that access native XML. This is because they require manual navigation and are limited to single path linear pathways. Nonprocedural navigationless operation means that the query user does not require XML training or knowledge of the hierarchical structures to specify the query. Users are not limited to specifying data from only a single path of the structure. Query access and processing limitations disappear.
Supports Unlimited Internal Complexity
The more paths of the hierarchical structure that a query accesses, the more complex the processing logic will be. For current manual navigation and processing of multiple paths, the required processing logic is the responsibility of the query user. Correct hierarchical logic for handling multiple paths must be specified correctly by the user. This complexity typically limits manual navigation of queries to single paths of the structure. In contrast, nonprocedural navigationless queries automatically and transparently perform the navigating and hierarchical processing—regardless of the required complexity.
Correct Hierarchical Results
With manual navigation, query correctness also relies on the user. With nonprocedural navigationless processing, the results of the hierarchical query remain hierarchically correct regardless of the number of different query pathways referenced or the internal processing complexity required. Natural hierarchical rules and principles of hierarchical processing are applied internally, consistently, and automatically. This capability is missing in XML business processing, which requires precise results. It's worth noting that queries such as the one in Figure 6 also increase accuracy, because they are processed automatically.
Dynamically Increases Database's Data Value
The capabilities described in this article mentioned that multi-path queries automatically use the additional semantics that exist between the different legs of the structure being accessed. This dynamically increases the value of the data being processed. For example, any query that accesses data in one path of the structure based on data in another path of the structure takes advantage of the additional semantics that exist between the two pathways to resolve the query. In Figure 2, a condition on node data occurrence E2 qualifies a path on node data occurrence B1. That dynamically establishes a more complex meaning to the result that would not be possible without multi-path processing. You can see the data value increase in Figure 6, which involves very complex multi-path LCA processing.
Increases Number of Different Queries Possible
The same multi-path processing capabilities that increase data value as described above also means that the number of new meaningful queries possible for a given hierarchical structure becomes practically unlimited, because there are so many different combinations of paths possible. Each of these different multi-path queries dynamically increases the value of the data used in the query. As an example of this, the wildly varying SQL query examples in Figure 4, Figure 5, and Figure 6 demonstrate the unlimited number of queries possible for the same structure or view.
Powerful Hierarchical Optimization
The next three navigationless hierarchical processing capabilities require powerful optimization techniques—which itself requires the flexibility of navigationless processing.
A nonprocedural navigationless interface can easily analyze an entire query for optimization. For example, queries requiring localized query access to different portions of the full structure can use a SAX interface instead of a DOM interface. (A DOM interface usually reads the entire structure into memory before processing while a SAX interface is instructed to access only certain node types.) Additionally, optimization can use the knowledge of the entire structure and a knowledge of exactly what is required to create an intelligent access strategy that automatically determines when to save input data for possible reuse and when to reuse data space.
Global Hierarchical Views and Reuse
Global views that describe entire structures have previously been discouraged because of the overhead involved. But query flexibility combined with hierarchical optimization and navigationless processing means that global views can have optimum reuse. That allows users to define an unlimited number of possible queries with no overhead. These users do not have to be aware of the structure in the view or concerned with keeping track of specific smaller views; they need only specify the data they want returned, which can reside anywhere in the global structure view. The hierarchical optimization discussed earlier in this article eliminates the need to access unneeded pathways, which also eliminates any global view overhead. As an example of the multitude of different queries possible, the SQL queries in Figure 4, Figure 5, and Figure 6 differ in data accessed, filtering applied, internal processing, and the resulting output structure—all accomplished using the same global view, with no overhead.
Global Hierarchical Queries
Today, programmers rarely attempt to apply query filtering to an entire XML document except in critical applications. But with nonprocedural navigational processing, filtering does not present a problem. The SQL example accompanying Figure 4 performs a complex filtering on the entire hierarchical structure and then outputs the entire structure. The global optimization possible which was described previously can optimize the query for advanced dynamic memory management which is very important to the efficient processing of this global query. This global query is performed easily in SQL using the SELECT ALL operation as shown in the SQL examples in Figure 4 and Figure 6.
Focused Retrieval with Result Aggregation
Information Retrieval (IR) operational needs are currently missing the capability to correctly identify meaningful XML documents and then to return only the desired data in a meaningful way. The hierarchical data filtering capability enables the location of only meaningful documents, and selects only the desired aggregated results. And finally, the global view, with its powerful hierarchical processing, allows any query to be specified. These capabilities are demonstrated in Figure 5, which locates documents qualified by the filtering condition WHERE F.f ='F2', and then aggregates and structures only the specified data output, based on the semantics of the hierarchical structure processed.
This article argued that structured XML processing in databases today is lacking, because it requires different processing than the default markup processing currently used. Such capabilities already exist, and were thoroughly vetted over three decades ago in hierarchical databases. Using a combination of nonprocedural navigationless processing, dynamic data modeling, dynamic XML-formatted output, and unlimited query capability, the door is open for unlimited new hierarchical processing capabilities that take full advantage of the capabilities inherent in hierarchical structures.