Relational Database Integration with RDF/OWL

WL ontologies allow you to describe data and relationships between data items. Common examples of this include complex knowledge domains such as pharmacology, but you can use OWL ontologies with simple, straightforward data that most companies already have stored in a relational database package. When you add metadata to existing data, and then use that metadata to query the data collection, you get more value out of that data. Data is the most important asset of many organizations, so the use of standards-based technology to query data collections is becoming more and more attractive.

The primary goal of this article is to put together a demonstration of how you can use OWL to integrate two relational databases, and then perform queries against the aggregate collection to answer realistic questions that you could not answer without the addition of an OWL ontology.

Secondary goals include:

  • Using free, portable, open-source software
  • Making all the relevant files available so that others could reproduce the results (contained in the owlrdbms.zip file—see its readme.txt file for instructions on how to use its files to create and load the sample data)
  • Using address book data, which has relevance to everyone
  • Doing it all with no specialized programming (I did write a short XSLT 1.0 stylesheet utility, which is included in the zip file, but managed to avoid any coding that required compilation and deployment)

Use Cases
For sample data to load into the MySQL relational database package, I created two collections of address book data, one based on the fields used by the Eudora email program and another based on the Microsoft Outlook 2003 address book structure.

Despite the different field names used by the two packages (for example, businessState for the work address part of an Outlook address book entry and workState for the corresponding field in Eudora), I wanted to issue a single query against the aggregate collection of data to answer each of the following questions:

  • Who works in New York state?
  • List any phone numbers (home, mobile, business, and so forth) for Alfred Adams.
  • Find all information about Bobby Fischer at 2304 Eighth Lane, even if the other database lists him as Robert L. Fischer of 2304 8th Ln.

After the data was loaded into MySQL, the following steps made these queries possible:

  1. I used the D2RQ interface (release 0.4) to extract an RDF representation of the relational data.
  2. Using the open-source ontology editor SWOOP (release 2.3 beta 4), I automatically generated the mundane, repetitive parts of the ontology.
  3. Continuing with SWOOP, I did some pointing and clicking to identify relationships between fields in the two databases that enable the execution of the use case queries.
  4. I pulled the RDF/OWL syntax generated by SWOOP into its own file separate from the instance data to let me use it with future versions of the instance data as the address book databases get updated.
  5. Using the OWL reasoner Pellet (the most recent release is release 1.5.2, but that has problems with the third use case above, so I recommend 1.5.0 if you want to reproduce what is shown here), I issued queries in the W3C standard SPARQL query language against the data/metadata combination, getting more complete answers than would have been possible just querying against one of the databases without the metadata that identifies the connections between the two.

I did all this under Windows. With the exception of Eudora and Outlook, which I used to create the templates for the sample data, all the software described here is available for Linux as well.

Sample Data
I didn’t want to use real address book data, so with a little scripting I generated a good amount of some fake, yet realistic-looking address book data:

  1. In each of the two email programs, I created a new address book entry and filled out every field.
  2. I saved the address book as a comma separated value (CSV) file and looked over the results.

  3. Using a Python script that included a list of the most popular given names, surnames, city names, and so forth in the U.S. (the zip file’s readme.txt file lists the source of each list), I generated sample data with a structure that corresponded to the exported CSV file.
  4. I imported my outlook.csv file into Outlook to make sure everything ended up in the right place, and it did (Eudora offered no “import CSV” option).
  5. Because the SPARQL queries for the second and third use cases look for specific data among the randomly generated data, I added that data to the CSV files (the readme.txt included with the zip file has these lines and describes where to add them).
  6. I created an SQL script included with owlrdbms.zip to create the two MySQL databases (MySQL 5.0, under Windows) and loaded each CSV file into one of the databases.

Using the D2RQ Interface
D2RQ includes several useful utility programs. You send your SPARQL requests to the main one, an HTTP server called d2r-server, which passes along the corresponding SQL statements to MySQL, Oracle, PostgreSQL, or another relational database manager. D2RQ needs a mapping file to convert your SPARQL request into the appropriate SQL query, and another D2RQ utility (generate-mapping) generates this mapping file for you in n3 format. You can use the generated mapping file as-is, but as you will see, various tweaks can improve it.

The following two commands tell generate-mapping to log in to MySQL, look at the data dictionary of the indicated databases, and generate the mapping files that let you issue SPARQL queries against the Eudora and Outlook databases (substitute your own MySQL username and password; also note that a carriage return was added after -p in each command for readability here, but each generate-mapping command is one line):

generate-mapping -o eudoraMapping.ttl -u myID -p myPassword jdbc:mysql://localhost/eudora generate-mapping -o outlookMapping.ttl -u myID -p myPassword jdbc:mysql://localhost/outlook 

After doing this, you can start up the d2r-server program with one of these mapping files to experience what D2RQ can do. For example, enter the following in a command line window:

d2r-server eudoraMapping.ttl

I generated the mapping files and started up the server in the directory where the D2RQ programs were stored, so I didn’t have to include any path names in that command. At this point, if you installed D2RQ using the default configuration, send a browser to the URL http://localhost:2020/ to start interactively using the D2RQserver. You will see a link for “entries,” the single table in the database, and can click on it to start exploring the data. You can also enter your first SPARQL query by going to the SNORQL interface at http://localhost:2020/snorql/ and pasting the following query in, which says “show me all the information about the address book subject that has (328) 618-8442 as a mobile phone number:”

PREFIX eud:   SELECT ?p ?o WHERE {    ?s eud:entries_mobile "(328) 618-8442" .    ?s ?p ?o .  }

Combining the Two Databases
The next step is to go back to the command line window where you started up the D2RQ server with the d2r-server command. You’ll see status messages that scrolled by as it executed your commands. Press Ctrl+C to abort the server. The next time you start it up, you’ll have a new mapping file combined from eudoraMapping.ttl and outlookMapping.ttl that lets you query the two databases at once.

Each of the two mapping files you created assigns the namespace prefix map to a URL for the mapping file. To keep them straight in the combination file, change map: to emap: throughout eudoraMapping.ttl, and to omap: wherever it said map: in outlook.ttl. You don’t have to do this to see the demo in action, because the resulting files are all included in the zip file, but you’ll want to know about the steps I took to get there in case you do something similar with different databases. Some of the files may need to be rebuilt if one of the relevant software packages gets upgraded, because I’ve found that the generated URIs may evolve from one software release to another. You’ll also need to change the d2rq:username and d2rq:password values in the mapping files to the username and password that you’re using to log in to MySQL.

Both ttl files also assigned the prefix vocab to a namespace used for vocabulary terms, so to distinguish them in the combined file, change this prefix to eud throughout eudoraMapping.ttl and to out throughout outlookMapping.ttl. It’s also a good idea to change the namespace URL—which wasn’t a proper one to begin with—from vocab to eudora in one and from vocab to outlook in the other.

Next, combine the two files into one file named comboMapping.ttl. Move all the namespace declarations to the top and remove the redundant ones.

Below is one of the mapping entries in comboMapping.ttl that you need to edit:

emap:entries_email1 a d2rq:PropertyBridge;   d2rq:belongsToClassMap emap:entries;   d2rq:property eud:entries_email1;   d2rq:column "entries.email1";     .		 

The mapping entry assigns various predicate/object pairs to the subject emap:entries_email1, which is a special object created for D2RQ: a d2rq:PropertyBridge. As the D2RQ User Manual and Language Specification tells us, property bridges “relate database table columns to RDF properties.” As you can see in the entry above, the default generated property bridge is d2rq:column, which is for properties with literal values. A selection of alternatives is available, including d2rq:uriPattern for properties with URI values, which replaces the d2rq.column property in the revised version of the emap:entries_email1 entry below. I made the replacement because the RDF data model lets you represent values as either literal strings or URIs, and representing them as URIs is better because it makes it easier to make connections between different bits of data.

Telling D2RQ to treat an email address such as [email protected] as a URI isn’t enough, because software that treats it as a URI won’t like its format—it doesn’t look like a URI. The value of the new d2rq:uriPattern line that replaces the commented-out d2rq:column line below inserts a mailto: prefix so that the value really is a URI.

The final line of the revised version of the mapping entry tells D2RQ a condition for generating one of these mappings: that the data is not an empty string. If any address book entry is missing an email1 value, we don’t want the string mailto: by itself being generated as the address book entry’s entries_email1 value.

emap:entries_email1 a d2rq:PropertyBridge;   d2rq:belongsToClassMap emap:entries;   d2rq:property eud:entries_email1;#  d2rq:column "entries.email1";   d2rq:uriPattern "mailto:@@[email protected]@";   d2rq:condition "entries.email1 <> ''";   .	 

After making these changes to the emap:entries_email1 property bridge, make similar ones to the omap:email2Address one. For a more thorough database integration, you need to do this to all email addresses throughout combomapping.ttl.

At this point, the combomapping.ttl mapping file is ready, and you could start up the D2RQ server as shown below, and then do queries against either database:

d2r-server comboMapping.ttl 

A better alternative is to create an ontology that allows a query tool to see the relationship between the two databases first, so you can do the queries described in the use cases above. SWOOP does the boring part of the ontology generation for you, so you can get to the interesting part. To create a version of this data to load into SWOOP, use another D2RQ utility to dump the data using the new mapping file:

dump-rdf -m comboMapping.rdf -o datadump.rdf 

Creating an Ontology From the Data
If you look at datadump.rdf with a text editor, you’ll see namespace declarations like this:

xmlns:j.0="file:outlook/"xmlns:j.1="file:eudora/" 

You’ll also see rdf:type elements with a value of file:outlook/entries or file:eudora/entries as their rdf:resource attribute value. A URL beginning with file: needs at least two slashes after it, so use your text editor to do a global replacement that adds two slashes after the string file: throughout datadump.rdf.

Now, SWOOP can create a simple ontology from your data dump. Start it, and pick Load/Ontology from the File menu. You’re not really loading an ontology, but instead the RDF file that you pulled from D2RQ: datadump.rdf. It might take a few minutes, so to speed this up you could edit down the datadump.rdf file to only include the data from one record of each database. SWOOP builds an ontology out of the properties that it sees, so it only needs to see one example of each property.

After you’ve loaded the data file, immediately click Save As from the File menu and save the file as postSwoop.rdf.

Look at both datadump.rdf and postSwoop.rdf with a text editor. Although you didn’t do anything to the datadump.rdf “ontology” that you loaded, SWOOP added OWL declarations to postSwoop.rdf for all the properties it found, like the following:

 

The Entity references &eudora; and &outlook; are standing in for the strings file://eudora/ and file://outlook/ to make the rdf:about values proper URLs, which RDF requires for identifiers. Now you can use SWOOP to enhance the ontology so that SPARQL queries against the combined databases can do things that SQL couldn’t do to the MySQL versions of the databases. We’ll start by adding something to the ontology that says that the workState column of the eudora database’s entries table is equivalent to the businessState column of the outlook database’s entries table.

Select the Property Tree tab in the lower left of SWOOP and click on “entries_workState.” If you click the RDF/XML tab near the top, you’ll see an RDF/XML representation of this property’s declaration, which looks like the DatatypeProperty declarations above. Among other things, this shows that this property is part of the file://eudora namespace.

By using SWOOP, you don’t have to deal with RDF/XML, so click the Concise Format tab to return to the display that is easier to read and interact with. Click on Add next to “Equivalent to:,” select “entries_businessState” on the Specify Property dialog box that appears, and click the Add button, and then the Cancel button to show that you’re done with the Specify Property dialog box (see Figure 1). You’ll see “entries_businessState” appear under “Equivalent to:” on the main pane, but it’s not official until you click the Apply Changes button at the bottom. Save the edited ontology by picking Save from the File menu or by pressing Ctrl+S.

 
Figure 1. Using Swoop: After you indicate that the entries_businessState property is equivalent to entries_workState, the EquivalentProperty section of the information for entries_workState reflects your change.

If you search for “equivalent” in postSwoop.rdf, you’ll see how SWOOP saved this; it changed the following:

 

into this:

   

RDF/XML representations of OWL statements are not difficult to read, but they are verbose, which is why a tool like SWOOP makes editing them much easier.

For a more thorough job of integrating the two databases, you also define eudora:entries_firstName as equal to outlook:entries_firstName, eudora:entries_lastName as equal to outlook:entries_lastName, and many other equivalencies. The more relationships you can identify between the two databases, the more tightly they’ll be integrated. The Bobby Fisher use case works better if you define at least these two.

To make it possible to find all of Alfred Adams’ phone numbers, regardless of which ones we have stored for him, we want to indicate that all the phone properties have some semantics in common. To do this, create a new phone property and then make all other phone properties subproperties of that. Click the Add P button with a yellow “P” at the left of the SWOOP workspace to display the New Entity dialog box. Set the Property Type to OWL Datatype Property, because phone numbers are simple strings. Set subProperty-of to None, which is the first choice, because our new property isn’t a subproperty of any other. Based on the URIs that D2RQ generated from my address book, I replaced the default Logical URI on the dialog box with http://localhost/entries/phone. I left the other fields blank. Click the dialog box’s Add & Close button, and “phone” appears at the bottom of the Property Tree. If it doesn’t appear, you can click Remove and start again.

After you add this new property, make sure it’s selected, and then click Add next to “Superproperty of.” OWL has no “superproperty” property—when you tell SWOOP that entries:phone is a superproperty of outlook:entries_homePhone, it stores outlook:entries_homePhone as a subproperty of entries:phone. It’s faster in the SWOOP interface to indicate that entries:phone is a superproperty of 15 other properties than it is to go to each of those 15 and mark it as a subproperty of entries:phone. In the Specify Property dialog box, Ctrl-click lets you select multiple properties as subproperties of the new phone property. Select the following, and then click Add to include them in the main window and then click Cancel when you’re finished with the Specify Property dialog box. You’ll see “entries_otherPhone” twice on the list, once from each of the Eudora and Outlook namespaces. Select both.

entries_businessFaxentries_businessPhoneentries_businessPhone2entries_carPhoneentries_homeFaxentries_homePhoneentries_homePhone2entries_mobileentries_mobilePhoneentries_otherPhoneentries_phoneentries_primaryPhoneentries_workMobileentries_workPhone  

Click Apply Changes and save your work.

To let Pellet discover that Bobby Fischer and Robert L. Fischer are the same person, you’ve already done half the work by adding new information about the emap:entries_email1 and omap:entries_email2Address property bridges in the mapping file. We need to have the ontology specify that the properties used for personal email addresses are inverse functional properties, which means that only one instance of a class can have a particular email1 value. This way, an email1 value of “[email protected]” for both Bobby and Robert means that they’re the same person.

When using OWL DL, a property that is an inverse functional property must be an object property, not a datatype property (that is, a URL and not a string, which is why we tweaked the mapping file for this entry the way we did). Select “entries_email1” in the Property Tree tab and then click Add next to Attributes at the bottom of the main pane. Select Inverse Functional and the Yes button on the Specify Property Attribute dialog box, and then click Apply Changes at the bottom of the main pane and save your work. Follow these same steps to designate “entries_email2Address” as an inverse functional property, and then define “entries_email2Address” as an entries_email1 to be equivalent following the same steps that you used to define the equivalent property pairs earlier.

Querying the Rules and Data
Now that your postSwoop.pdf file has a combination of data and OWL rules, you could issue Pellet SPARQL queries against it, but we want to use this ontology with future versions of the database as the content evolves, so use a text editor to delete all the parts of the postSwoop.pdf that SWOOP didn’t generate. This includes the parts with address book data about specific people; in the dump from D2RQ, there should be an XML comment that says “Instances” to make this easier. After making this deletion, save the file as properties.owl.

Let’s imagine that the data in MySQL was updated, you created a more up-to-date version of datadump.rdf, and you want to run some queries against the combination of the data in datadump.rdf data and the metadata in properties.owl. For the purposes of the demo, you can use the datadump.rdf file left over from before.

To make it easy to combine the two files, I created a short XSLT 1.0 stylesheet (included in the zip file) named rdfcat.xsl. When run against a file like the following, this stylesheet combines the output into a single RDF file:

       

The stylesheet assumes that you’re using an XSLT processor such as libxslt or Saxon that also implements the W3C XInclude specification. The stylesheet also adds an OWL declaration for each resource it finds, calling it a member of the owl:Thing class so that the OWL reasoner doesn’t complain about resources that haven’t been declared as a member of a class.To pull RDF from D2RQ and save it in a file, I created the following rdfcat.xsl file to show the files I wanted concatenated together: The following command line uses libxslt’s xsltproc program and rdfcat.xml to combine the latest version of datadump.rdf with the properties.owl file into a file called combo.rdf:

xsltproc --xinclude rdfcat.xsl rdfcat.xml > combo.rdf 

The combo.rdf file is the combination of ontology metadata and formerly relational data that we’ve been working toward, and we can now run SPARQL queries that implement our use cases on it. The first query asks for all data for all subjects that have a file://eudora/entries_workState value of “NY.”

PREFIX e:   SELECT * WHERE {    ?s e:entries_workState "NY"  }	 

There are many SPARQL engines to choose from out there, but not all implement OWL. Pellet is a free one, so I used that:

pellet -if combo.rdf -ifmt RDF/XML -qf nyworkers.spq > nyworkers.out 

Because of the equivalence relationship that you defined, Pellet should list the subjects for both the e:entries_workState triples and the out:entries_businessState triples that have a value of “NY.” You won’t know if it’s correct unless you check how many of those are in each of your two databases, but I found that the random data generation script put four or five in each database, so if this query retrieves more than six or seven, it’s good news.

Pellet also outputs a few suggestions for OWL statements to make newdatadump.rdf a little more OWL DL compliant. The rdfcat.xsl stylesheet adds rules to account for a few of these, and it could use a few more.

The following shows AAphone.spq, which asks for all phone numbers for Alfred Adams, whether the database has his home phone, work phone, mobile phone, or any other phone numbers. The URL assigned to the e: prefix is the one we used when defining the new phone property, the superproperty of the various phone number properties.

PREFIX e: PREFIX eud:   SELECT ?phoneType ?phone WHERE {        ?s ?phoneType ?phone.        ?s e:phone ?phone.        ?s eud:entries_lastName "Adams".        ?s eud:entries_firstName "Alfred".  }	 

The last use case, stored in FisherData.spq in the zip file, asks for all data about the Bobby Fisher entry, which is from the eudora database. Because we defined eudora:entries_email1 as an inverse functional property, Pellet knows that only one entry can have an email1 value of mailto:[email protected]. Because we defined eudora:entries_email1 as equivalent to out:entries_email2Address, Pellet pulls Robert L. Fisher data from the outlook database for any other properties defined as equivalent to eudora properties.

PREFIX e: PREFIX o:   SELECT * WHERE {     ?p ?o  }	 

More Queries to Try
Metadata for metadata’s sake does not justify the trouble of adding it. The goal with each bit of metadata was to let a user answer real address book questions more easily. Another classic OWL tweak to the ontology is to have the company home pages and other web addresses listed in the data represented as object properties instead of as datatype properties, but I couldn’t think of a query that would then demonstrate how this metadata made the database more useful.

Another nice bit of database integration metadata that OWL can enable is the indication that a field in one database is not equivalent to a certain field in another database, but a subset of it. For example, if an international address book was incorporated into this data, we’d want to show that American zip codes are postal codes, but that all postal codes are not necessarily zip codes. This is done by making the zip code property a subset of the postal code property, much as you did with the phone properties. Then, a query against postal codes would also check zip codes, but not vice versa.

Demonstrating additional property attributes besides the inverse functional property would also be valuable. In a demo similar to this one, I added data indicating which entry represented the spouse of which other entry. With Jane Smith’s entry pointing to Joe Smith as her spouse, “spouse” being defined as a symmetric property, and no home phone number or spouse listed for Joe, I could still query for his home phone number and get Jane’s because I had defined a rule saying that if someone didn’t have a home phone number but their spouse did, then a query for that person’s home phone number should return the spouse’s number. The definition of such rules are not a standard part of OWL, but Pellet supports them, and work toward a standard definition of such a rule language is underway .

To enhance data by adding metadata to it, a full dump of the data is not very practical. Instead of dumping all of the relational data to an RDF representation each time the database is changed in order to allow Pellet to query an up-to-date data/metadata combination, it would be nice to translate SPARQL queries on the fly to SQL queries, letting us issue SPARQL queries directly against the relational data. This, in fact, is what D2RQ does, but D2RQ currently offers no way to load ontology triples into the knowledge base along with the relational data, unless you want to try storing the OWL statements in a relational table, which could be very interesting. D2RQ isn’t the only project making such technology available, but it is free, and as more software supports SPARQL and OWL, the combination will provide us with some great new possibilities in getting more out of our relational databases.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: