The core concept of the semantic web is integrating and using data from different sources. Using semantic web technologies such as RDF/RDFS and the SPARQL query language to integrate and use data from disparate sources has some advantages over using a standard relational database. The Resource Description Framework (RDF) uses predicates to define relationships between data objects, and RDF Schema (RDFS), which is written in RDF, offers a modeling language for knowledge representation and ontology development. (See Sidebar 1. Why RDF/RDFS for the Semantic Web?) Used together, these technologies enable you to use information from disparate sources in different formats/schemas without having to convert the data to a "standard format" as you would with a relational database.
This article introduces Java developers to semantic web application development using Java and JRuby. It demonstrates how to employ the semantic web's functionality through an application example that processes news articles to identify and store (in an RDF repository) industry terms and the names of people and places. The example uses the Sesame libraries for RDF storage, RDFS inferencing, and running SPARQL queries, and the downloadable source code provides a simple wrapper API for Sesame and some examples of queries against sample RDF data.
You can find many libraries and frameworks in several programming languages for using semantic web technologies. For this short article, I bypassed many good alternatives and chose some favorite tools called the Sesame libraries. At some point, take the time to study the complete Sesame APIs, system configuration, and complete documentation. However, for the purposes of this article, all you need is the downloadable source code, which is a wrapper API for Sesame that includes Sesame and all the libraries that you will need to work through the examples. Specifically, the source code contains:
- One large JAR file with everything you need for both the Java and JRuby program examples
- Raw text files from a few Reuters news articles
- The RDF data files generated by the utility ruby_utilities/raw_data_to_rdf.rb (I wrote the utility raw_data_to_rdf.rb to extract semantic information from the sample news articles and write RDF triples to a data file used in the example programs.)
- An example of RDF in the more readable N3 format
- Some JRuby example programs
- Some RDF data for experimentation
|Author's Note: The program examples are dual licensed. You can use the downloadable source code under either the LGPL or Apache 2 licenses. Sesame itself and the libraries that it requires are licensed under BSD-style and LGPL licenses.|
The example uses two data types for object values: URIs and string literals. RDF originally was expressed as XML data files and while the XML format is still widely used for automated processing, the example uses two alternative formats, N-Triples and Notation3 (N3), because they are much easier to read and understand. Sesame can be used to convert between all RDF formats, so why not use more readable and understandable formats?
RDF data consists of a set of triple values, each made up of a subject, a predicate, and an object.
In the context of this article, a triple might look like this:
- subject: A URL (or URI) of a news article
- predicate: A relation like "containsCity"
- object: A value like "Burlington"
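To make the triple above concrete, here is a minimal Ruby sketch of how such a triple could be written as a single N-Triples line. The URIs (http://example.com/...) and the helper names uri, literal, and ntriple are assumptions made for illustration; they are not part of the article's wrapper API or sample data.

```ruby
# Minimal sketch: serialize one RDF triple as an N-Triples line.
# All URIs and helper names here are hypothetical, chosen for illustration.

# Escape characters that may not appear raw inside an N-Triples string literal.
# Backslashes are handled first so later escapes are not doubled.
def escape_literal(text)
  text.gsub("\\") { "\\\\" }
      .gsub('"') { '\\"' }
      .gsub("\n") { "\\n" }
      .gsub("\r") { "\\r" }
end

# A URI term is wrapped in angle brackets.
def uri(value)
  "<#{value}>"
end

# A string literal term is wrapped in double quotes.
def literal(value)
  "\"#{escape_literal(value)}\""
end

# Subject and predicate are always URIs; the object is an already-formatted
# term (URI or string literal, the two object types the example uses).
def ntriple(subject_uri, predicate_uri, object_term)
  "#{uri(subject_uri)} #{uri(predicate_uri)} #{object_term} ."
end

line = ntriple(
  "http://example.com/news/article1",          # hypothetical article URI
  "http://example.com/ontology#containsCity",  # hypothetical predicate URI
  literal("Burlington")
)
puts line
```

Each line of an N-Triples file holds exactly one triple terminated by a period, which is why the format is easy to read and to generate from a script.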
|Figure 1. Conceptual Overview of the News-Processing System: Once the Ruby script ruby_utilities/raw_data_to_rdf.rb has been run and the file rdf_files/news.nt has been created, you need only Sesame with the wrapper API for Java and JRuby.|
Figure 1 shows a conceptual overview of the application example. It is conceptual because it does not include code for web scraping. Instead, it uses manually copied text from a few news articles as input to the entity extraction utility raw_data_to_rdf.rb, which identifies human names, place names, and key terms. As Figure 1 shows, once the Ruby script ruby_utilities/raw_data_to_rdf.rb has been run and the file rdf_files/news.nt has been created, you need only Sesame with the wrapper API for Java and JRuby. This wrapper library can also write N-Triples data in the more convenient N3 format. (Later, you also will see several N3 examples.)
To build a full production system based on the examples in this article, you will need to write Ruby scripts that scrape a few select news web sites. Site-specific scripts like these are not difficult to write, but a general web scraper that ignores things like advertisements and navigation HTML is very difficult to write and is beyond the scope of this article.
To simplify this system and concentrate only on using RDF/RDFS, I assume that news articles exist in the directory raw_data in the Rails application directory, and I do not provide any web site-specific web scraping code. This directory contains the text of four Reuters news articles for testing. You can replace these files with data from other information sources (e.g., word-processing documents, PDF files, or databases). The utility ruby_utilities/raw_data_to_rdf.rb reads the data in the directory raw_data, uses the Reuters OpenCalais web service to find entities in each article, and then writes RDF triple data to the file rdf_files/news.nt. The OpenCalais web services can be used for free (up to 20K web service calls a day); for my work I use both OpenCalais and my own system to extract information from text.
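The read, extract, and write flow described above can be sketched in a few lines of Ruby. This is not the code of raw_data_to_rdf.rb: the OpenCalais call is replaced by a hypothetical offline stand-in (extract_cities with a fixed KNOWN_CITIES list), and the .txt extension, the example.com URIs, and the function name articles_to_ntriples are all assumptions made so the sketch runs on its own.

```ruby
require "fileutils"

# Hypothetical stand-in for entity extraction. The real utility calls the
# OpenCalais web service; this version only spots a fixed list of city names.
KNOWN_CITIES = ["Burlington", "London"].freeze

def extract_cities(text)
  KNOWN_CITIES.select { |city| text.include?(city) }
end

# Read every article in input_dir and write one N-Triples line per city found.
# Assumes the articles are stored as .txt files; the URIs are illustrative only.
def articles_to_ntriples(input_dir, output_path)
  FileUtils.mkdir_p(File.dirname(output_path))
  File.open(output_path, "w") do |out|
    Dir.glob(File.join(input_dir, "*.txt")).sort.each do |path|
      # Derive a hypothetical subject URI from the file name.
      subject = "http://example.com/news/#{File.basename(path, '.txt')}"
      extract_cities(File.read(path)).each do |city|
        # City names in KNOWN_CITIES contain no quotes or backslashes,
        # so no literal escaping is needed in this sketch.
        out.puts "<#{subject}> <http://example.com/ontology#containsCity> \"#{city}\" ."
      end
    end
  end
end
```

Calling articles_to_ntriples("raw_data", "rdf_files/news.nt") would mirror the directory layout described above. Keeping the extraction step behind a single method makes it straightforward to swap the stand-in for a web service call or another extraction system.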