

Extracting Meaning from Text with OpenCalais R3 : Page 2

Published text that has been formally structured can be summarized and combined with other text to provide new insights.

To use the public web service, post the URL-encoded license, content, and parameters to http://api.opencalais.com/enlighten/rest/. If the request succeeds, the response is an RDF/XML file. You can parse the file directly or import it into an RDF store. Sesame, a leading RDF framework, provides parsers and storage for RDF content. The following Java code, which you can find in Crawler.java in the downloadable code, posts the content and imports the results.

// Posts the text to the OpenCalais REST endpoint and returns a reader over the RDF/XML response.
private Reader post(CharSequence text) throws IOException {
    StringBuilder sb = new StringBuilder(text.length() + 1024);
    sb.append("licenseID=").append(encode(licenseID));
    sb.append("&content=").append(encode(text));
    sb.append("&paramsXML=").append(encode(getParamsXML()));
    URLConnection connection = new URL(API_URL).openConnection();
    connection.addRequestProperty("Content-Type", "application/x-www-form-urlencoded");
    connection.addRequestProperty("Content-Length", valueOf(sb.length()));
    connection.setDoOutput(true);
    OutputStream out = connection.getOutputStream();
    OutputStreamWriter writer = new OutputStreamWriter(out);
    writer.write(sb.toString());
    writer.flush();
    return new InputStreamReader(connection.getInputStream());
}

// Opens (or creates) a Sesame native store in the local "data" directory.
private Repository createRepository() throws RepositoryException {
    File dataDir = new File("data");
    Sail store = new NativeStore(dataDir);
    Repository repository = new SailRepository(store);
    repository.initialize();
    return repository;
}

// Parses the RDF/XML response and adds its statements to the repository.
private void add(Reader reader) throws RepositoryException, IOException, RDFParseException {
    RepositoryConnection con = repository.getConnection();
    try {
        con.add(reader, "", RDFFormat.RDFXML);
    } finally {
        con.close();
    }
}
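
The post method relies on an encode helper that is not shown in the listing. As a rough sketch of what such a helper might look like, the following standalone class URL-encodes each field as UTF-8 and joins the fields into a form body. The class name FormBody, the build method, and the sample values are assumptions made for illustration; in particular, "<params/>" is only a placeholder and not the real OpenCalais parameter document.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the article's download.
class FormBody {

    /** URL-encodes one value as UTF-8 for an application/x-www-form-urlencoded body. */
    static String encode(CharSequence value) {
        try {
            return URLEncoder.encode(value.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Joins the fields as name=value pairs separated by '&', in insertion order. */
    static String build(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("licenseID", "my-key");                 // placeholder license key
        fields.put("content", "Food security & climate");  // text to analyze
        fields.put("paramsXML", "<params/>");              // placeholder, not the real format
        System.out.println(build(fields));
    }
}
```

Because the encoded body contains only ASCII characters, its length in characters equals its length in bytes, which is why passing sb.length() as the Content-Length works in the listing above.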

Visualizing Relationships
After you import a collection of document metadata into an RDF store, you can synthesize it to derive new information assets from the extracted data. Aduna's Cluster Map technology can visualize the relationships between documents (through named entities) and between named entities (through facts and events).

Figure 2, a Document Cluster Map, shows the highlighted document from un.org, which contains references to the industry terms "greenhouse gas emissions," "food crisis," and "food security." Figure 3, a Named Entity Cluster Map, shows that the named entity "George W. Bush" holds the position of President of the "United States." It also shows that 107 countries and people have or hold the position of President. Using the Named Entity Cluster Map, you can see that the foreign minister of France is Bernard Kouchner and that the President is Nicolas Sarkozy. Although these two facts did not originate from the same document, extracting the meaning and relationships of the named entities lets you create new information assets that combine the entity information.

Figure 2. Document Cluster Map: Shows the references to the document.
Figure 3. Named Entity Cluster Map: Shows the relationships of different entities.
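
The idea of combining facts that came from different documents can be sketched in plain Java, independent of Sesame. In this simplified illustration, each extracted fact records the document it came from, and grouping by entity brings together statements that no single document contained. The Fact class, its field names, and the example.org URLs are assumptions for illustration only; they do not reflect the actual OpenCalais RDF vocabulary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of extracted facts, not the OpenCalais schema.
class EntityFacts {

    /** One extracted statement plus the URL of the document it came from. */
    static final class Fact {
        final String entity;
        final String relation;
        final String value;
        final String sourceUrl;

        Fact(String entity, String relation, String value, String sourceUrl) {
            this.entity = entity;
            this.relation = relation;
            this.value = value;
            this.sourceUrl = sourceUrl;
        }
    }

    /** Groups facts by entity so statements from different documents sit side by side. */
    static Map<String, List<Fact>> groupByEntity(List<Fact> facts) {
        Map<String, List<Fact>> byEntity = new LinkedHashMap<>();
        for (Fact fact : facts) {
            byEntity.computeIfAbsent(fact.entity, key -> new ArrayList<>()).add(fact);
        }
        return byEntity;
    }

    public static void main(String[] args) {
        List<Fact> facts = Arrays.asList(
            new Fact("France", "foreign minister", "Bernard Kouchner", "http://example.org/a"),
            new Fact("France", "president", "Nicolas Sarkozy", "http://example.org/b"));

        // Print every fact known about each entity, with its source document.
        for (Map.Entry<String, List<Fact>> entry : groupByEntity(facts).entrySet()) {
            for (Fact fact : entry.getValue()) {
                System.out.println(entry.getKey() + ": " + fact.relation + " = "
                    + fact.value + " (from " + fact.sourceUrl + ")");
            }
        }
    }
}
```

In an RDF store the same grouping falls out of querying by subject, since statements about one entity share its identifier regardless of which document produced them.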

The download archive includes a simple web crawler and two interactive visualization tools that you can use to explore these relationships. Executing the Main class with a list of URLs imports those pages into the local RDF store and opens two windows: the Document Cluster Map and the Named Entity Cluster Map. The relationships appear in the side pane, while the selected relationships are shown graphically using Aduna's Cluster Map technology, which displays whether and how sets overlap (similar to Venn diagrams and Euler diagrams). On the command line, you can prefix each URL with '1' to indicate that embedded links should be followed once, or with '0' to include only the explicit URL.
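
To make the notion of overlapping sets concrete, here is a minimal sketch of the computation behind such a diagram: given the set of documents that mention each term, the overlap is simply their intersection. This is not Aduna's implementation; the class name, method name, and document URLs are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of set overlap, unrelated to Aduna's code.
class SetOverlap {

    /** Returns the documents that appear in both sets, in sorted order. */
    static Set<String> overlap(Set<String> first, Set<String> second) {
        Set<String> shared = new TreeSet<>(first);
        shared.retainAll(second); // keep only elements also present in the second set
        return shared;
    }

    public static void main(String[] args) {
        // Documents that mention each term (made-up URLs).
        Set<String> foodCrisis = new TreeSet<>(Arrays.asList(
            "http://example.org/doc1", "http://example.org/doc2"));
        Set<String> foodSecurity = new TreeSet<>(Arrays.asList(
            "http://example.org/doc2", "http://example.org/doc3"));

        // Prints [http://example.org/doc2], the only document in both sets.
        System.out.println(overlap(foodCrisis, foodSecurity));
    }
}
```

A cluster map draws one region per set and places each document in the region that matches the sets it belongs to, so documents in the intersection appear where the regions overlap.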


OpenCalais makes it easy to extract meaningful structured information that would otherwise be out of reach of automated processes and aggregation tools. By embedding OpenCalais within other applications, such as a Cluster Map, you can create new information assets that expose information and link it back to the relevant documents. With such tools available, distinct silos of previously unprocessed data can be combined in new ways to create derived data, niche content sets, related links, and headline summaries. For more information, visit OpenCalais and Aduna's Cluster Map.

James Leigh is an independent software consultant based in Toronto. He has experience modeling business problems and concepts in software, and he specializes in performance and technology integration. James has a background in semantic web technologies and decentralized networks. He is an active member of the OpenRDF community and a developer of Sesame and Elmo.