Storing and Using RDF in Mulgara

If you’ve been following semantic web technologies, then by now you’re no doubt aware of semantic web data languages like the Resource Description Framework (RDF) and the Web Ontology Language (OWL). When you’re talking about storing semantic web data, usually you’re talking about RDF triples, which are facts about URI-addressable resources.

Why RDF triple stores? To efficiently store and query RDF, you are most likely not going to want to use relational databases directly. It isn’t that there is something inherently wrong with relational databases; it’s just that the more general RDF model doesn’t fit efficiently within a table structure, and the web moves faster than your schemas can keep up.

A relational database works because you tell it what relationships exist in your data. With these relationships fixed in time, you are able to efficiently store and retrieve instance information that fits into that structure. If you ever change your mind about how to structure that data, you need to modify how you describe it to the relational database management system (RDBMS) and massage the old structure into the new. This process can be an expensive and time-consuming effort even if you understand how to do it.

In a single organization this process is difficult, but a sense of common purpose (or a strong-willed and empowered few) can usually bring about a consensus of what the structure should be. As the world changes around you, your needs will change. Even if you could slow down the influence of external events, you see the past and the future through the lens of the present. You are constantly assessing and reassessing your understanding of the world, and you’ll certainly change your understanding of your domains, their terms, their relationships, and so on.

Now imagine trying to bridge the chasms of language, culture, competing values, different business priorities, and downright obstinacy. These will be among your challenges when trying to share data across the web. Attempting to get people to agree to most things is a difficult task. One of the most successful and widely used RDF vocabularies comes from the Dublin Core Metadata Initiative. This effort defined terms and relationships for publication metadata, a reasonably well-understood domain, and was overseen by a reasonably well-functioning standards body. Still, it took approximately seven years to agree on a set of around fifteen terms!

The point is simply that sharing data on the web requires an extensible, open-world model that allows people to agree to disagree. This characteristic is what RDF was designed to provide. After you commit to using the RDF graph model, you need someplace to put it efficiently, and triple stores give you this ability. Several systems do this well, but the focus here will be on one particular solution, called Mulgara.

Mulgara
The Mulgara Semantic Store is an open source RDF quad store written entirely in Java. It is scalable, supports transactions, and has a pluggable resolver architecture that lets you interface with non-RDF data sources from within the RDF model. Mulgara currently supports hundreds of millions of triples within a single database instance, while maintaining decent query and storage performance.

Although Mulgara presently supports only Remote Method Invocation (RMI) and SOAP-based access, the developers are adding a RESTful interface to make it easier to use with other languages, tools, and platforms. The current Mulgara version 1.1 download was released recently and should be the basis of any investigation because it adds several features and bug fixes over earlier releases. After unpacking the release, use this command to run Mulgara:

java -jar mulgara-1.1.0.jar 

You’ll see a bunch of text scroll (see Figure 1). Take note of the last statement about the aliases for the server:

2007-08-14 22:22:52,313 INFO  Database - Host name aliases for this server are: [HarryHood.local, localhost, 127.0.0.1, 10.0.0.20]

If you’re going to start Mulgara on a machine that has a name that might change (for example, a notebook that hops on different networks), you’ll want to bind the server instance to a predictable name because of the presence of the machine name in the model definition. This issue will be resolved in the next version but remains a difficulty for the time being. To bind the server to a particular hostname, enter:

java -jar mulgara-1.1.0.jar -k localhost -o localhost

This command binds the server name instance to the name localhost and establishes the server name to which HTTP requests will respond. If you want people on other machines to access your Mulgara instance, do not use localhost; choose either a valid name or the IP address (assuming it is not DHCP-assigned). Whatever name you use, substitute it for localhost in all of the examples here.

You can test your configuration by browsing to http://localhost:8080/webui, which takes you to a convenient test web application.


Figure 1. Unpack Mulgara: If you start Mulgara on a machine that has a name that’s likely to change, you need to bind the server instance to a predictable name.

In the Mulgara test web application, insert triples into a model. Models provide ways of organizing RDF statements. (You may be wondering why you’re putting triples into Mulgara if it is a quad store. The fourth element of the quad is the model name.) To create a model you can do one of two things.

  1. Type in the Query Text field:
     create <rmi://localhost/server1#devx>;

     and then select the Submit Query button. You should see a result message that says something like:

     Successfully created model rmi://localhost/server1#devx
  2. Modify the Model URI text field to be rmi://localhost/server1#devx, and select “Step 1. Create a model” from the Example Queries menu to populate the query text, after which you can select the Submit Query button.

After creating the model, insert some RDF triples into it:

insert   '1' into ;insert   'blue' into ;insert   '7' into ;

You should see some indications of successful insertions into the model. You’ve defined three statements referring to two URI subjects and two URI predicates. In general, you’ll probably use terms from an existing vocabulary like Dublin Core rather than fake ones, which might look like this:

insert    'My Cool Webpage' into ;

If you find that you have a lot of redundancy in your terms, you might benefit from using the alias command:

alias  as dc;insert     'My Other Cool Webpage' into ;
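As a rough illustration of the shape of these statements, here is a minimal Java sketch that assembles create, alias, and insert statements as strings. The class and method names are hypothetical helpers, and the subject URI (http://example.org/pages/cool) is invented for the example. The model URI is the one created above, and the alias uses the Dublin Core element namespace with the `<dc:title>` short form. The sketch sidesteps literal escaping by rejecting literals that contain a single quote.

```java
public final class ItqlStatements {
    // Model URI taken from the "create" example in the text.
    static final String MODEL = "rmi://localhost/server1#devx";
    // Dublin Core element namespace, aliased as "dc" below.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Builds a "create" statement for the given model URI. */
    static String create(String modelUri) {
        return "create <" + modelUri + ">;";
    }

    /** Builds an "alias" statement that binds a namespace URI to a short prefix. */
    static String alias(String namespaceUri, String prefix) {
        return "alias <" + namespaceUri + "> as " + prefix + ";";
    }

    /**
     * Builds an "insert" statement for one triple whose object is a plain literal.
     * The predicate may be a full URI or an aliased form such as "dc:title".
     */
    static String insertLiteral(String subject, String predicate, String literal, String modelUri) {
        if (literal.indexOf('\'') >= 0) {
            // Escaping rules are out of scope for this sketch, so refuse such input.
            throw new IllegalArgumentException("literal must not contain a single quote");
        }
        return "insert <" + subject + "> <" + predicate + "> '" + literal + "' into <" + modelUri + ">;";
    }

    public static void main(String[] args) {
        // The subject URI below is a made-up example, not one from the article.
        System.out.println(create(MODEL));
        System.out.println(alias(DC_NS, "dc"));
        System.out.println(insertLiteral("http://example.org/pages/cool", "dc:title", "My Cool Webpage", MODEL));
    }
}
```

Each printed line is a statement of the kind you would paste into the Query Text field, provided the alias statement is submitted before any statement that relies on the dc prefix.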

Get Out the Triples
Now that you have some triples in your model, you’ll need to know how to get them out. Mulgara supports an RDF query language called iTQL. Soon, it will support the W3C SPARQL query language as well. If you are comfortable with iTQL, you’ll be comfortable with SPARQL when the support is added.

To find all the triples in the model, you can enter this command in the Query Text field and submit the query (if the Model URI text field specifies the correct model, you can simply select “Step 3. List everything in the model”):

select $subject $predicate $object from <rmi://localhost/server1#devx> where $subject $predicate $object;

Your result should be similar to the result shown in Figure 2.

The names that begin with “$” are variables that you are selecting from the model. While the names can be anything you like, the references in the select clause must match how they are used in the rest of the query in order for the relationships to mean anything. To constrain the query to a particular relationship:

alias <http://purl.org/dc/elements/1.1/> as dc;
select $subject $object from <rmi://localhost/server1#devx> where $subject <dc:title> $object;

This example says that you want to know which subjects are related to which objects through the Dublin Core title predicate. You could also read this as, “Show me all the things that have titles as well as what those titles are” (see Figure 3).

 
Figure 2. Querying Triples in Your Model: This result reflects an unconstrained query of all of the triples that have been captured in your model.

Figure 3. Querying for Specific Predicates: When querying RDF data, usually you want to ask specific questions about the data: What has this value? What things are connected by this relationship?

To query for elements with particular values, try something like this:

select $subject from <rmi://localhost/server1#devx> where $subject $predicate 'blue';

As you see, the selection variables drive the results that show up based on the pattern matches. There isn’t room to go through a full iTQL tutorial here. However, if you are running Mulgara, you can access the iTQL documentation page at http://localhost:8080/itqlcommands/index.html (don’t forget to replace localhost with whatever name you bound your server to).
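To make the pattern-matching idea concrete, the following Java sketch mimics it over a small in-memory list of triples. It is only an illustration of how variables and fixed values constrain the result, not how Mulgara evaluates queries. The Triple class, the match method, and the example.org URIs are all invented for the example, and a null argument plays the role of an iTQL variable.

```java
import java.util.ArrayList;
import java.util.List;

public final class TriplePatternDemo {
    /** One RDF statement; all three parts are kept as plain strings for simplicity. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Returns the triples that match the pattern; a null position matches anything. */
    static List<Triple> match(List<Triple> model, String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : model) {
            boolean subjectOk = s == null || s.equals(t.subject);
            boolean predicateOk = p == null || p.equals(t.predicate);
            boolean objectOk = o == null || o.equals(t.object);
            if (subjectOk && predicateOk && objectOk) {
                out.add(t);
            }
        }
        return out;
    }

    /** Three made-up statements about two subjects, using two predicates. */
    static List<Triple> sampleModel() {
        List<Triple> model = new ArrayList<>();
        model.add(new Triple("http://example.org/thing1", "http://example.org/count", "1"));
        model.add(new Triple("http://example.org/thing1", "http://example.org/color", "blue"));
        model.add(new Triple("http://example.org/thing2", "http://example.org/count", "7"));
        return model;
    }

    public static void main(String[] args) {
        // Analogous to: select $subject from <model> where $subject $predicate 'blue';
        for (Triple t : match(sampleModel(), null, null, "blue")) {
            System.out.println(t.subject);
        }
    }
}
```

Leaving all three positions as null corresponds to the unconstrained query that lists everything in the model, while fixing the predicate corresponds to the Dublin Core title query.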

Using Mulgara from Java
Even though you’ll still be using iTQL indirectly, you are probably going to want to use Mulgara by going through its Java API. While this API isn’t as rich as, say, the Elmo API from the Sesame project (ahem), it is a useful abstraction when you’re working from within a Java application.

The sample application here will spider Friend-of-a-Friend (FOAF) files on the Web. The FOAF project is another RDF vocabulary for describing social networks, professional relationships, and so on. While it doesn’t have the same user base as some of its closed-model brethren (LinkedIn and MySpace), the community has been growing and continues to do so.

From Java, you communicate with a Mulgara instance through the ItqlInterpreterBean. You can find this class in the driver-1.1.0.jar file in the unpacked distribution. Through this interface you can create models, run queries, and update data on the same machine or on different machines, as long as the name bound to the server instance is visible from those other machines (that is, not localhost). The application starts by loading a FOAF file into the Mulgara instance and then querying it for all new FOAF file references it has discovered. It then repeats the process until there are no new entries. Depending on whom you know, this process might go on for hours!

There are plenty of optimizations that you can make to the code shown in Listing 1, but in the interest of keeping it simple they were left out.
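The load, query, and repeat loop described above can be sketched roughly as follows. This is not the article’s Listing 1. The FoafStore interface is a hypothetical abstraction that a real application might implement on top of ItqlInterpreterBean, and the in-memory implementation exists only so the loop can run without a server. The maxDocuments cap is an added safeguard, since an unbounded crawl could run for a very long time.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class FoafSpiderSketch {
    /** Hypothetical view of the store; a real version would talk to Mulgara. */
    interface FoafStore {
        /** Loads the FOAF document at the given URL into the store. */
        void load(String foafUrl);

        /** Returns the URLs of other FOAF documents referenced by that document. */
        List<String> referencedFoafUrls(String foafUrl);
    }

    /** Crawls outward from the seed until nothing new turns up or the cap is reached. */
    static Set<String> crawl(FoafStore store, String seedUrl, int maxDocuments) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(seedUrl);
        while (!pending.isEmpty() && visited.size() < maxDocuments) {
            String url = pending.poll();
            if (!visited.add(url)) {
                continue; // already loaded this document
            }
            store.load(url);
            for (String next : store.referencedFoafUrls(url)) {
                if (!visited.contains(next)) {
                    pending.add(next);
                }
            }
        }
        return visited;
    }

    /** In-memory stand-in for a store, driven by a fixed map of document links. */
    static FoafStore inMemoryStore(Map<String, List<String>> links) {
        return new FoafStore() {
            @Override
            public void load(String foafUrl) {
                // Nothing to fetch in the stand-in.
            }

            @Override
            public List<String> referencedFoafUrls(String foafUrl) {
                return links.getOrDefault(foafUrl, List.of());
            }
        };
    }

    public static void main(String[] args) {
        // Made-up link structure between three FOAF documents.
        Map<String, List<String>> links = Map.of(
                "http://example.org/a.rdf", List.of("http://example.org/b.rdf", "http://example.org/c.rdf"),
                "http://example.org/b.rdf", List.of("http://example.org/a.rdf"));
        System.out.println(crawl(inMemoryStore(links), "http://example.org/a.rdf", 100));
    }
}
```

The visited set is what prevents the crawl from looping forever when two people list each other, which is the normal case in a social network.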

After you have your network graph in place, there is a tremendous opportunity to ask interesting questions of what you’ve found. The benefit of the RDF graph model is that you don’t need to know what you might find beforehand. You can simply start to query the results, and see what is there. No need for schemas here!

After poking around a bit, you might see that some people put into their profiles the name of the school they went to. If you want to see a list of where everyone who shares this information went to school, you would use a query like this:

select $school from  where $subject 
$school;

If you want to track alumni from your own school, you can constrain the results to subjects that have a particular URI like this:

select $alumnus from  where $alumnus 
;

If you want to find out what people were interested in, you might try this:

select $who $interest from  where $who 
$interest;
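For illustration, the three queries above might be assembled in Java along these lines. This sketch assumes the data uses the standard FOAF namespace (http://xmlns.com/foaf/0.1/) with its schoolHomepage and interest properties. The model URI, the school URI, and the helper names are hypothetical.

```java
public final class FoafQueries {
    // Standard FOAF namespace.
    static final String FOAF = "http://xmlns.com/foaf/0.1/";
    // Hypothetical model holding the spidered FOAF data.
    static final String MODEL = "rmi://localhost/server1#foaf";

    /** Lists every school homepage mentioned in the model. */
    static String schools(String modelUri) {
        return "select $school from <" + modelUri + "> where $subject <"
                + FOAF + "schoolHomepage> $school;";
    }

    /** Lists everyone whose profile names the given school homepage. */
    static String alumniOf(String modelUri, String schoolUri) {
        return "select $alumnus from <" + modelUri + "> where $alumnus <"
                + FOAF + "schoolHomepage> <" + schoolUri + ">;";
    }

    /** Lists people together with their declared interests. */
    static String interests(String modelUri) {
        return "select $who $interest from <" + modelUri + "> where $who <"
                + FOAF + "interest> $interest;";
    }

    public static void main(String[] args) {
        System.out.println(schools(MODEL));
        // The school URI is a made-up example.
        System.out.println(alumniOf(MODEL, "http://www.example.edu/"));
        System.out.println(interests(MODEL));
    }
}
```

The only difference between the first two queries is whether the object position holds a variable or a fixed URI, which is the same constraint mechanism used with the literal 'blue' earlier.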

Hopefully, you see the power of the directed graph model as a way of supporting the open-world assumption. You can always add new facts about known subjects without having to migrate your existing data. It becomes very easy to query these complicated datasets with powerful questions, even as you learn what facts are represented in them. The PURL system mentioned in “What’s in a URI?” (DevX, July 19, 2007) is a useful way to name subjects that you want to accumulate facts about.

While the Mulgara project needs to make progress on lowering the bar to adoption, improving its documentation and tutorials, and increasing the number of environments in which it can be used (many of these issues will be addressed in the next release), it is a powerful and scalable data store with many of the features you would expect from a commercial enterprise storage system.

Yet, it is by no means the only solution out there for storing and querying RDF data. Oracle has become a powerful player in the semantic technologies space, beginning with 10g Release 2. Other tools such as Redland, Jena, Sesame, and the Talis Platform are all established solutions that have their own advantages, and you are encouraged to play around with all of them.

RDF is becoming an important data model on the web and in the enterprise. Understanding how you can accumulate, store, and query information in this format is going to become an important part of working with the information systems of the twenty-first century. Mulgara is a great tool for beginning to learn how to do that.
