Java/JRuby Developers, Say Open ‘Sesame’ to the Semantic Web

The core concept of the semantic web is integrating and using data from different sources. Using semantic web technologies such as RDF/RDFS and the SPARQL query language to integrate and use data from disparate sources has some advantages over using a standard relational database. The Resource Description Framework (RDF) uses predicates to define relationships between data objects, and RDF Schema (RDFS), which is written in RDF, offers a modeling language for knowledge representation and ontology development. (See Sidebar 1. Why RDF/RDFS for the Semantic Web?) Used together, these technologies enable you to use information from disparate sources in different formats/schemas without having to convert the data to a “standard format,” as you would with a relational database.

This article introduces Java developers to semantic web application development using Java and JRuby. It demonstrates how to employ the semantic web’s functionality through an application example that processes news articles to identify and store (in an RDF repository) industry terms and the names of people and places. The example uses the Sesame libraries for RDF storage, RDFS inferencing, and running SPARQL queries, and the downloadable source code provides a simple wrapper API for Sesame and some examples of queries against sample RDF data.

Getting Started
You can find many libraries and frameworks in several programming languages for using semantic web technologies. For this short article, I bypassed many good alternatives and chose some favorite tools called the Sesame libraries. At some point, take the time to study the complete Sesame APIs, system configuration, and complete documentation. However, for the purposes of this article, all you need is the downloadable source code, which is a wrapper API for Sesame that includes Sesame and all the libraries that you will need to work through the examples. Specifically, the source code contains:

  • One large JAR file with everything you need for both the Java and JRuby program examples
  • Raw text files from a few Reuters news articles
  • The RDF data files generated by the utility ruby_utilities/raw_data_to_rdf.rb (I wrote the utility raw_data_to_rdf.rb to extract semantic information from the sample news articles and write RDF triples to a data file used in the example programs.)
  • An example of RDF in the more readable N3 format
  • Some JRuby example programs
  • Some RDF data for experimentation
Author’s Note: The program examples are dual licensed. You can use the downloadable source code under either the LGPL or Apache 2 licenses. Sesame itself and the libraries that it requires are licensed under BSD-style and LGPL licenses.

The example uses two data types for object values: URIs and string literals. RDF originally was expressed as XML data files and while the XML format is still widely used for automated processing, the example uses two alternative formats, N-Triples and Notation3 (N3), because they are much easier to read and understand. Sesame can be used to convert between all RDF formats, so why not use more readable and understandable formats?

RDF data consists of a set of triple values:

  • subject
  • predicate
  • object

In the context of this article, a triple might look like this:

  • subject: A URL (or URI) of a news article
  • predicate: A relation like “containsCity”
  • object: A value like “Burlington”
Figure 1. Conceptual Overview of the News-Processing System: When the Ruby script ruby_utilities/raw_data_to_rdf.rb and the file rdf_files/news.nt are created, you can then use only Sesame with the wrapper API for Java and JRuby.

Figure 1 shows a conceptual overview of the application example. It is conceptual because it does not include code for web scraping. Instead, it uses manually copied text from a few news articles (human names, place names, and key terms) for input to the entity extraction utility raw_data_to_rdf.rb. As Figure 1 shows, when the Ruby script ruby_utilities/raw_data_to_rdf.rb and the file rdf_files/news.nt are created, you can then use only Sesame with the wrapper API for Java and JRuby. This wrapper library can write N-Triple data to the more convenient N3 format. (Later, you also will see several N3 examples.)

To build a full production system based on the examples in this article, you will need to write Ruby scripts that web scrape a few select news web sites. These scripts are not difficult to write, but a general web scraper that ignores things like advertisements and navigation HTML is very difficult to write and is beyond the scope of this article.

To simplify this system and concentrate only on using RDF/RDFS, I assume that news articles exist in the directory raw_data in the Rails application directory, and I do not provide any web site-specific web scraping code. This directory contains the text of four Reuters news articles for testing. You can replace these files with data from other information sources (e.g., word-processing documents, PDF files, databases, etc.). The utility ruby_utilities/raw_data_to_rdf.rb reads the data in the directory raw_data, uses the Reuters OpenCalais web service to find entities in each article, and then writes RDF triple data to the file rdf_files/news.nt. The OpenCalais web services can be freely used (up to 20K web service calls a day); for my work I use both OpenCalais and my own system to extract information from text.

Modeling with RDF, RDFS
RDFS supports the definition of classes and properties based on set inclusion. (Be aware that classes and properties in RDFS are orthogonal.) The example in this article does not simply use properties to define data attributes for classes, which is different from object modeling and the procedure used by object-oriented programming languages such as Java, Ruby, and Smalltalk. In addition to facilitating the combination of different data sources, you can use RDFS inferencing to effectively generate new RDF; that is, inferencing asserts new RDF triples.

Let’s get started with example RDF data from news articles and then look at example programs that show some basic techniques for building semantic web applications.

Person, Place, and Industry Terms Stored in RDF
Using the Reuters OpenCalais system, I wrote a simple Ruby script ruby_utilities/raw_data_to_rdf.rb that reads text files containing news stories and generates RDF data in N-Triples. This format is composed of a subject, a predicate, an object, and a period.
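To make the "subject, predicate, object, period" layout concrete, here is a minimal Java sketch that formats one triple as an N-Triples line. It does not use Sesame or the wrapper API. The class name NTripleSketch, the kb namespace URI, and the article URL are hypothetical placeholders chosen for illustration, and the escaping covers only backslashes and double quotes.

```java
// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class NTripleSketch {
    // Assumed placeholder namespace, not the one used in the downloadable code.
    static final String KB = "http://example.org/kb#";

    /** Formats a triple whose object is a string literal as one N-Triples line. */
    static String literalTriple(String subjectUri, String predicateUri, String literal) {
        // Escape backslashes first, then double quotes. Other characters that
        // N-Triples requires escaping (such as newlines) are not handled here.
        String escaped = literal.replace("\\", "\\\\").replace("\"", "\\\"");
        return "<" + subjectUri + "> <" + predicateUri + "> \"" + escaped + "\" .";
    }

    /** Formats a triple whose object is a URI as one N-Triples line. */
    static String uriTriple(String subjectUri, String predicateUri, String objectUri) {
        return "<" + subjectUri + "> <" + predicateUri + "> <" + objectUri + "> .";
    }

    public static void main(String[] args) {
        String article = "http://example.org/news/article1"; // hypothetical article URL
        System.out.println(literalTriple(article, KB + "containsCity", "Burlington"));
    }
}
```

Each printed line carries exactly one triple, which is why N-Triples files are easy to generate and easy to process line by line.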

The downloadable code for this article contains a sample output RDF N-Triples file called rdf_files/news.nt. Listing 1 contains a few lines from that file. You can see triple elements being defined either in specific name spaces or as string literals. The predicate containsIndustryTerm is defined in the namespace of the knowledgebooks.com domain: . Namespaces can be abbreviated using a prefix notation that you will use later when you switch to the N3 RDF format.

So where is RDF data actually stored? Sesame supports an in-memory RDF store, which the Sesame wrapper in the downloadable code uses, as well as several different back-end data store mechanisms. Although these alternative storage back ends can be selected with just a few lines of code (see the Sesame web site for documentation), configuring them is time consuming. I suggest learning the basics of RDF/RDFS modeling and effective SPARQL use and not worrying too much about deployment until you have an interesting application to deploy.

Querying N-Triple RDF Data Using SPARQL
This section shows complete JRuby and Java examples that query N-Triple RDF data using SPARQL. The sections to follow will use just code snippets. Java and/or Ruby programmers should easily make sense of the code in the examples and have few problems using derivative code in their own programs. All the examples use my Sesame wrapper library, which is much simpler than calling the Sesame APIs directly. You eventually may want to use the full Sesame APIs.

Listing 2 is a complete listing of the JRuby example file jruby_sesame_example2.rb. The class TripleStoreSesameManager in Listing 2 is defined in the wrapper library. The method doSparqlQuery requires two arguments: a string containing a valid SPARQL query and an instance of any Ruby class that defines the method processResult. If you have a syntax error in your SPARQL query, the Sesame library will print useful error messages.

You will find some similarity between SPARQL and SQL. The SELECT statement specifies one, two, or three of the triple terms that should be returned with each query result. Here, I wanted to see only the subject and object because the predicate triple term is defined in the WHERE clause to match exactly.

JRuby is a good language for working with Sesame because it is dynamically typed and is very terse. Also, the ability to work interactively in an irb console is a big win. Overall, coding experiments are simpler with JRuby than with Java. That said, once I use a dynamic language like JRuby for code experiments, I usually use Java for production work. Listing 3 shows a similar example to Listing 2 using Java and my Sesame wrapper library.

Because Java is statically typed, the second argument to the method doSparqlQuery in Listing 3 is declared using the interface ISparqlResultHandler, which defines the method signature for processResult.

The remainder of this article concentrates on N3, a more readable RDF format, and on using data from different sources that use different schemas. It also offers more advanced SPARQL examples.

A Better RDF Format: N3 Syntax
Frameworks like Sesame that provide RDF storage and querying functionality do not care which RDF format you use for input. However, it is wise to use the format that is easiest to read and understand. For that reason, I use N3 whenever I can. Many years ago, I started working with RDF using its XML serialization format, which I found confusing and generally counterintuitive. N-Triple, Turtle (a simpler form of N3, which this article doesn’t discuss), and N3 are all vastly superior to the RDF XML format.

My wrapper library has an API for calling the Sesame utility code for converting whatever RDF data is in its RDF store to N3. Here is a Java code snippet that shows you how to do this:

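import java.io.IOException;
import org.openrdf.repository.RepositoryException;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParseException;

public class ConvertTriplesToN3 {
    public static void main(String[] args)
            throws RepositoryException, IOException, RDFParseException, RDFHandlerException {
        // The wrapper loads every file into the same in-memory repository.
        TripleStoreSesameManager ts = new TripleStoreSesameManager();
        ts.loadRDF("rdf_files/rdfs.nt");
        ts.loadRDF("rdf_files/news.nt");
        // Write everything currently in the repository to a single N3 file.
        ts.saveRepositoryAsN3("sample_N3.n3");
    }
}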

Unfortunately, Sesame writes out N3 data without using namespace abbreviations. Listing 4 shows a few lines produced by the above code snippet. You also can use utility programs such as CWM to convert different RDF formats.

N3 allows you to collapse many N-Triple RDF statements (again: subject, predicate, object, “.”) that share the same subject into a single N3 statement. Let’s look at an N3 fragment in more detail (assume that the kb: and rdfs: namespace prefixes are defined):

 kb:containsCity "Burlington" , "Denver" ;
    kb:containsRegion "U.S. Midwest" , "Midwest" ;
    kb:containsCountry "United States" , "Japan" .

Here, the subject is the complete URL for the news article on the web. This article has two objects, “Burlington” and “Denver,” for the predicate kb:containsCity. Multiple objects for the same predicate are separated by commas, and the last object is followed by a semicolon, which indicates that the next term will start a new predicate that is followed by one or more objects. Notice that the last line is terminated with a period; that also terminates this N3 statement.
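The comma and semicolon rules can be sketched as a small grouping routine. The Java below is only an illustration of the syntax, not how Sesame writes N3. It assumes each triple is already written in N3 term syntax, and the class name N3GroupingSketch and the subject URL are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class N3GroupingSketch {
    /** Each triple is {subject, predicate, object}, already in N3 term syntax. */
    static String toN3(List<String[]> triples) {
        // subject -> predicate -> objects, preserving first-seen order
        Map<String, Map<String, List<String>>> bySubject = new LinkedHashMap<>();
        for (String[] t : triples) {
            bySubject.computeIfAbsent(t[0], k -> new LinkedHashMap<>())
                     .computeIfAbsent(t[1], k -> new ArrayList<>())
                     .add(t[2]);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, List<String>>> s : bySubject.entrySet()) {
            out.append(s.getKey());
            String separator = " ";
            for (Map.Entry<String, List<String>> p : s.getValue().entrySet()) {
                // Commas separate objects that share a predicate.
                out.append(separator).append(p.getKey()).append(' ')
                   .append(String.join(" , ", p.getValue()));
                // A semicolon announces the next predicate for the same subject.
                separator = " ;\n    ";
            }
            // A period ends the statement for this subject.
            out.append(" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String subject = "<http://example.org/news/1>"; // hypothetical article URL
        List<String[]> triples = Arrays.asList(
            new String[] {subject, "kb:containsCity", "\"Burlington\""},
            new String[] {subject, "kb:containsCity", "\"Denver\""},
            new String[] {subject, "kb:containsCountry", "\"United States\""});
        System.out.print(toN3(triples));
    }
}
```

Three input triples become one N3 statement because they all share the same subject.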

I hand-edited the file rdf_files/news.n3 (very easily with regular expression search and replace) to add namespace abbreviations to the automatically converted N3 file. Listing 5 shows the first few lines of the file news.n3. The first two lines define namespace abbreviations (or “prefixes”). As an example, the abbreviation “rdfs” for RDF Schema and “kb” for my own knowledgebooks.com namespace are used to define a new RDFS property kb:containsPlace, which is a super property of kb:containsCity, kb:containsCountry, and kb:containsState. Note that in this example I did not make kb:containsCity a sub-property of kb:containsPlace. Using namespace abbreviations makes it a lot easier to read RDF.

So how can you use the new super property kb:containsPlace? No triples in the original triple store had a predicate equal to kb:containsPlace; this property is used to assert new triples using RDFS inferencing. Some RDF triple stores pre-calculate asserted triples, while others calculate them as needed during SPARQL query processing. For a semantic web developer, how the triple store works internally makes no conceptual difference, but you likely will face memory-use versus querying-performance tradeoffs.

As an example of RDFS inferencing, suppose that you have one application that runs fine-grained queries for a news article containing a specific state and another application that searches for all news stories that contain any references to physical locations. The first application could query for triples whose predicate is kb:containsState and whose object matches a string literal for the state name (or you might use 50 URIs to represent states). The second application can use the super property in a SPARQL query like this:

sparql_query = "PREFIX kb:   
SELECT ?subject ?object WHERE { ?subject kb:containsPlace ?object . }";

This query matches all articles with a predicate equal to containsRegion, containsCountry, or containsState. Notice the SPARQL syntax for using namespace abbreviations (or prefixes) using the PREFIX keyword.
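To show what "asserting new triples" means for a super property, here is a minimal Java sketch of rdfs:subPropertyOf inference over an in-memory list. It is not how Sesame implements inferencing. The class name SubPropertySketch, the article identifiers, and the sample triples are made up for illustration, and each property is assumed to have at most one super property (RDFS allows several).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class SubPropertySketch {
    /** Returns the original triples plus one inferred triple per super property. */
    static List<String[]> infer(List<String[]> triples, Map<String, String> subPropertyOf) {
        List<String[]> result = new ArrayList<>(triples);
        for (String[] t : triples) {
            Set<String> seen = new HashSet<>(); // guards against cyclic declarations
            String parent = subPropertyOf.get(t[1]);
            while (parent != null && seen.add(parent)) {
                // If (s p o) holds and p is a sub-property of q, then (s q o) holds.
                result.add(new String[] {t[0], parent, t[2]});
                parent = subPropertyOf.get(parent);
            }
        }
        return result;
    }

    /** Returns the triples whose predicate equals the given one. */
    static List<String[]> match(List<String[]> triples, String predicate) {
        List<String[]> hits = new ArrayList<>();
        for (String[] t : triples) {
            if (t[1].equals(predicate)) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> subPropertyOf = new HashMap<>();
        subPropertyOf.put("kb:containsState", "kb:containsPlace");
        subPropertyOf.put("kb:containsCountry", "kb:containsPlace");
        List<String[]> triples = Arrays.asList( // made-up sample data
            new String[] {"article1", "kb:containsState", "Iowa"},
            new String[] {"article2", "kb:containsCountry", "Japan"},
            new String[] {"article1", "kb:containsIndustryTerm", "oil"});
        for (String[] t : match(infer(triples, subPropertyOf), "kb:containsPlace")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

Before inference, no triple has the predicate kb:containsPlace; after inference, the kb:containsState and kb:containsCountry triples each gain a kb:containsPlace counterpart, while the kb:containsIndustryTerm triple is unaffected.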

Author’s Note: There is no difference in performing SPARQL queries against different RDF formats. An RDF storage repository like Sesame stores RDF in an efficient internal format. Developers may have a tendency to think of formats as XML RDF or N3 RDF, but once data has been read into a repository, it does not matter which original RDF format was used. It is also important to remember that a single N3 statement generally will define many RDF triples (all with the same subject).

By using RDFS (in this case, defining the super property containsPlace), you can change the way you access RDF data without converting it. In a relational database application, you would need to use either special queries (that would have to change if you wanted to add a new sub property to containsPlace) or new tables or database views. Yes, a relational database solves this reuse problem also, but with much less flexibility than RDFS.

When you have multiple data sources using different schemas/formats, then RDF with RDFS provides even more flexibility, as you will soon see.

Matching Data Using Regular Expressions
With some loss of efficiency, you can use regular expression matching in SPARQL queries. As an example of this technique, Listing 6 uses the RDF file rdf_files/oil_example.n3. The Listing 6 SPARQL query seeks to find all RDF triples that contain the word “oil” in the object field, where the predicate field of the triple is equal to kb:containsIndustryTerm. To apply regular expression matching, use one of the previous JRuby example programs (see Listing 2) and change the name of the N3 file loaded and the SPARQL query string to this:

tsm.loadRDF("rdf_files/oil_example.n3")
sparql_query =
  "PREFIX kb:
   SELECT ?subject ?object
   WHERE { ?subject kb:containsIndustryTerm ?object FILTER regex(?object, \"oil\") . }"

Here, I added a filter term after ?object that restricts ?object values to strings containing “oil.” Two RDF triples match, so two lines (each with the article URL and the object value) get printed out:

  [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_flooding_dc_16/, oil]
  [http://news.yahoo.com/s/nm/20080616/ts_nm/usa_politics_dc_2/, oil prices]
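The effect of the FILTER can be approximated in plain Java. In the sketch below, which does not use Sesame, regex(?object, "oil") is modeled with Matcher.find(), since the SPARQL function matches anywhere in the string and is case sensitive unless flags are given. The class name RegexFilterSketch is hypothetical, the third triple is invented as a non-matching case, and Java's regex dialect is only an approximation of the one SPARQL specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class RegexFilterSketch {
    /** Returns {subject, object} pairs for triples with the given predicate whose object matches. */
    static List<String[]> filter(List<String[]> triples, String predicate, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String[]> hits = new ArrayList<>();
        for (String[] t : triples) {
            // find() succeeds on a partial match, like SPARQL's regex().
            if (t[1].equals(predicate) && pattern.matcher(t[2]).find()) {
                hits.add(new String[] {t[0], t[2]});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String base = "http://news.yahoo.com/s/nm/20080616/ts_nm/";
        List<String[]> triples = Arrays.asList(
            new String[] {base + "usa_flooding_dc_16/", "kb:containsIndustryTerm", "oil"},
            new String[] {base + "usa_politics_dc_2/", "kb:containsIndustryTerm", "oil prices"},
            // Invented triple that should not match the filter:
            new String[] {"http://example.org/news/3", "kb:containsIndustryTerm", "movies"});
        for (String[] hit : filter(triples, "kb:containsIndustryTerm", "oil")) {
            System.out.println(Arrays.toString(hit));
        }
    }
}
```

A triple store typically has to test every candidate object against the pattern, which is one reason regular expression filters cost more than exact matches.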

Now, consider a similar but more interesting example: augmenting the regular expression example to find all triples for matched articles. Given the article URLs that were found in the previous example, you can collect a set of all RDF triples with subjects equal to any of the matched article URLs by changing the SPARQL query string to this:

sparql_query =
  "PREFIX kb:
   SELECT ?subject ?predicate ?object2
   WHERE {
     ?subject kb:containsIndustryTerm ?object FILTER regex(?object, \"oil\") .
     ?subject ?predicate ?object2 .
   }"

The WHERE clause of this query has two triple patterns: The first matches all triples with a predicate term equal to kb:containsIndustryTerm whose object contains “oil,” and the second matches all triples whose subject was matched by the first pattern. Each result contains three values: subject, predicate, and object.
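The way a shared ?subject variable ties the two patterns together can be sketched in plain Java as a two-pass join. This is an illustration only, with a hypothetical class name (JoinSketch) and made-up sample triples. One simplification to keep in mind: a real SPARQL engine returns one row per combination of bindings, so a subject with two matching industry terms would have its triples listed twice, whereas this sketch returns each triple once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class JoinSketch {
    /** Returns every triple whose subject has a matching object for the given predicate. */
    static List<String[]> allTriplesForMatches(List<String[]> triples, String predicate, String regex) {
        Pattern pattern = Pattern.compile(regex);
        // First pattern: collect the subjects bound by the filtered match.
        Set<String> subjects = new LinkedHashSet<>();
        for (String[] t : triples) {
            if (t[1].equals(predicate) && pattern.matcher(t[2]).find()) {
                subjects.add(t[0]);
            }
        }
        // Second pattern: any predicate and object, but the same subject.
        List<String[]> result = new ArrayList<>();
        for (String[] t : triples) {
            if (subjects.contains(t[0])) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList( // made-up sample data
            new String[] {"article1", "kb:containsIndustryTerm", "oil prices"},
            new String[] {"article1", "kb:containsCity", "Chicago"},
            new String[] {"article2", "kb:containsIndustryTerm", "movies"},
            new String[] {"article2", "kb:containsCity", "Los Angeles"});
        for (String[] t : allTriplesForMatches(triples, "kb:containsIndustryTerm", "oil")) {
            System.out.println(String.join(" ", t));
        }
    }
}
```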

Merging Data from Different Sources That Use Different RDF Schemas
Your semantic web application will need to use data from different sources, and the following example shows you how to implement that functionality. In addition to using the rdf_files/news.n3 file from the previous examples, this example will also use rdf_files/news_2.n3, which uses a very different schema:

@prefix ex:   .
@prefix rdfs:   .

 kb:about "Academy Award red carpet got wet in the rain" ;
    ex:author "Joy Smith" ;
    ex:location "United States" , "Los Angeles" ;
    kb:keyword "entertainment" , "movies" .

 kb:about "Oil prices rise" ;
    ex:author "Sam Suvy" ;
    ex:location "United States" , "Chicago" ;
    kb:keyword "cars" , "fuel" , "oil" .

Looking at this new RDF file, you will see some similarities with the previous example’s RDF file news.n3:

  • The news_2.n3 file uses a property location that is similar to the properties in news.n3: containsCity, containsCountry, and containsState. These properties are defined in different namespaces, but that is not a problem (more on this shortly).
  • The news_2.n3 file uses a property keyword that is similar to the property containsIndustryTerm in news.n3. It might make sense to perform fuzzy matches between keyword object values and containsIndustryTerm object values.

The issue of handling locations can be solved by simply adding another property statement:

ex:location rdfs:subPropertyOf kb:containsPlace .

Now any SPARQL query that uses kb:containsPlace also matches ex:location triples, without your having to modify any data. For the second similarity, in which both information sources have lists of keywords or industry-standard terms, you can add another statement:

ex:keyword rdfs:subPropertyOf kb:containsIndustryTerm .

I prefer using my own knowledgebooks.com namespace in SPARQL queries, but if I wanted to use the ex:keyword property, I could have just reversed the subject and object in this RDF statement.

Using Classes in RDFS Modeling
You may be surprised that all of the examples so far have dealt with RDFS properties and not RDFS classes. As previously mentioned, RDFS properties and RDFS classes are orthogonal in the sense that properties are not used to define attributes (or class variables) for RDFS classes. You can add and use properties with classes in an ad hoc way, extending classes and the use of properties at any time. The following example for using classes in RDFS modeling uses the N3 file rdf_files/class_example.n3:

@prefix kb:   .
@prefix rdf:  .
@prefix rdfs:   .
@prefix foaf:  .

foaf:Person rdfs:subClassOf foaf:Agent .
kb:KnowledgeEngineer rdfs:subClassOf foaf:Person .
 a kb:KnowledgeEngineer .

Notice that the last statement uses the predicate abbreviation a (N3 shorthand for rdf:type), which means that the subject URI is a member of the class kb:KnowledgeEngineer. The following SPARQL query will print out all the subjects and predicates for triples whose object is equal to foaf:Agent:

require "java"
require "sesame_wrapper.jar"
require 'pp'

include_class "TripleStoreSesameManager"
include_class "DefaultSparqlResultHandler"

tsm = TripleStoreSesameManager.new
tsm.loadRDF("rdf_files/class_example.n3")

sparql_query =
  "PREFIX foaf:
   SELECT ?subject ?predicate
   WHERE { ?subject ?predicate foaf:Agent . }";

tsm.doSparqlQuery(sparql_query, DefaultSparqlResultHandler.new)

Listing 7 shows the output from running this example. Notice a couple of interesting things:

  • The subject URI is of type foaf:Agent. By logical inference, my URI is of type kb:KnowledgeEngineer, which is of type foaf:Person, which is of type foaf:Agent.
  • Both foaf:Person and are of type foaf:Agent.

It is often interesting and useful to make “broad” SPARQL queries like this example to see the triples that Sesame (or any other RDF triple store) asserts through inference.
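The chain of class inferences described above can be sketched in a few lines of Java. This is an illustration of the rdfs:subClassOf reasoning, not Sesame's implementation. The class name SubClassSketch is hypothetical, and each class is assumed to have at most one direct super class (RDFS allows several).

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not part of Sesame or the article's wrapper library.
class SubClassSketch {
    /** Returns the direct type plus every class reachable through rdfs:subClassOf. */
    static Set<String> typesOf(String directType, Map<String, String> subClassOf) {
        Set<String> types = new LinkedHashSet<>();
        String current = directType;
        // add() returns false on a repeat, which stops a cyclic hierarchy.
        while (current != null && types.add(current)) {
            current = subClassOf.get(current);
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, String> subClassOf = new HashMap<>();
        subClassOf.put("foaf:Person", "foaf:Agent");
        subClassOf.put("kb:KnowledgeEngineer", "foaf:Person");
        // A resource declared with "a kb:KnowledgeEngineer" is also a member
        // of every class printed here.
        System.out.println(typesOf("kb:KnowledgeEngineer", subClassOf));
    }
}
```

A membership in kb:KnowledgeEngineer therefore implies membership in foaf:Person and foaf:Agent, which is why a query for foaf:Agent returns the subject URI even though no such triple was stated explicitly.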

Where to Go from Here
Now that you have seen how to employ the semantic web’s functionality using Java and JRuby, you can write derivative code from the examples in this article to build your own semantic web programs.

I suggest that you look in two directions for starting your own semantic web projects:

  • Publish your own data sources as RDF, and then provide consumers of your data with RDFS and example SPARQL queries to help them get started.
  • Identify sources of RDF data that can enhance your own web applications. Use SPARQL queries to collect data for your own use.