WL ontologies allow you to describe data and relationships between data items. Common examples of this include complex knowledge domains such as pharmacology, but you can use OWL ontologies with simple, straightforward data that most companies already have stored in a relational database package. When you add metadata to existing data, and then use that metadata to query the data collection, you get more value out of that data. Data is the most important asset of many organizations, so the use of standards-based technology to query data collections is becoming more and more attractive.
The primary goal of this article is to put together a demonstration of how you can use OWL to integrate two relational databases, and then perform queries against the aggregate collection to answer realistic questions that you could not answer without the addition of an OWL ontology.
Secondary goals include:
- Using free, portable, open-source software
- Making all the relevant files available so that others could reproduce the results (contained in the
owlrdbms.zip file—see its readme.txt file for instructions on how to use its files to create and load the sample data)
- Using address book data, which has relevance to everyone
- Doing it all with no specialized programming (I did write a short XSLT 1.0 stylesheet utility, which is included in the zip file, but managed to avoid any coding that required compilation and deployment)
For sample data to load into the
MySQL relational database package, I created two collections of address book data, one based on the fields used by the Eudora email program and another based on the Microsoft Outlook 2003 address book structure.
Despite the different field names used by the two packages (for example, businessState
for the work address part of an Outlook address book entry and workState for the corresponding field in Eudora), I wanted to issue a single query against the aggregate collection of data to answer each of the following questions:
- Who works in New York state?
- List any phone numbers (home, mobile, business, and so forth) for Alfred Adams.
- Find all information about Bobby Fischer at 2304 Eighth Lane, even if the other database lists him as Robert L. Fischer of 2304 8th Ln.
After the data was loaded into MySQL, the following steps made these queries possible:
- I used the D2RQ interface (release 0.4) to extract an RDF representation of the relational data.
- Using the open-source ontology editor SWOOP (release 2.3 beta 4), I automatically generated the mundane, repetitive parts of the ontology.
- Continuing with SWOOP, I did some pointing and clicking to identify relationships between fields in the two databases that enable the execution of the use case queries.
- I pulled the RDF/OWL syntax generated by SWOOP into its own file separate from the instance data to let me use it with future versions of the instance data as the address book databases get updated.
- Using the OWL reasoner Pellet (the most recent release is release
1.5.2, but that has problems with the third use case above, so I recommend 1.5.0 if you want to reproduce what is
shown here), I issued queries in the W3C standard SPARQL query language against the data/metadata combination, getting more complete answers than would have been possible just querying against one of the databases without the metadata that identifies the connections between the two.
I did all this under Windows. With the exception of Eudora and Outlook, which I used to create the templates for the sample data, all the software described here is available for Linux as well.
I didn't want to use real address book data, so with a little scripting I generated a good amount of some fake, yet realistic-looking address book data:
- In each of the two email programs, I created a new address book entry and filled out every field.
I saved the address book as a comma separated value (CSV) file and looked over the results.
- Using a Python script that included a list of the most popular given names, surnames, city names, and so forth in the U.S. (the zip file's readme.txt file lists the source of each list), I generated sample data with a structure that corresponded to the exported CSV file.
- I imported my outlook.csv file into Outlook to make sure everything ended up in the right place, and it did (Eudora offered no "import CSV" option).
- Because the SPARQL queries for the second and third use cases look for specific data among the randomly generated data, I added that data to the CSV files (the readme.txt included with the zip file has these lines and describes where to add them).
- I created an SQL script included with owlrdbms.zip to create the two MySQL databases (MySQL 5.0, under Windows) and loaded each CSV file into one of the databases.