
Extracting Meaning from Text with OpenCalais R3


A big challenge companies face today is that most information, both online and archived, is available only as published text and has no formal structure suitable for synthesis. Once information is in a formal structure, it can be summarized, used to help locate meaningful text, and combined with other text to provide new insights. This article shows how to convert unstructured written text into structured data using OpenCalais, a public general-purpose text-extraction service that uses a combination of statistical and grammatical analysis to extract meaning. OpenCalais is not the only solution available for extracting meaning from text, but it is the only publicly available web service.

Information Extraction
The simplest way to categorize a document or paragraph is to use word associations. For example, if the words “earnings” and “acquired” are used in a document, it is likely a document about business finances. Furthermore, if the word “Reuters” appears mostly in business finance documents, then other documents containing this word are also likely to be about business finances. This technique is called statistical analysis and is commonly used for document categorization. OpenCalais uses statistical analysis to categorize documents and to identify what the text refers to.
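The word-association idea can be illustrated with a minimal sketch. It is not how OpenCalais works internally; the categories, the keyword lists, and the `KeywordCategorizer` class are assumptions made up for this example, and a real system would learn word weights from a large corpus rather than count hand-picked keywords.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy word-association categorizer. Categories and keywords are illustrative assumptions.
class KeywordCategorizer {
    private static final Map<String, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("business finance", List.of("earnings", "acquired", "reuters"));
        KEYWORDS.put("sports", List.of("goal", "match", "league"));
    }

    // Returns the category whose keywords occur most often, or "unknown" if none occur.
    static String categorize(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            int score = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(categorize(
                "Reuters reported that earnings rose after the firm was acquired."));
    }
}
```

Counting keywords treats every word as equally informative, which is why purely statistical categorization struggles with words that carry several meanings.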

 
Figure 1. Phrase Tree: A phrase tree identifies the different parts of a sentence.

Statistical analysis alone, although useful for finding related documents, does not provide any new insights, because the information is still buried within the documents. To uncover the meaning within written text, you must parse it into a more formal structure, commonly referred to as a “phrase tree.” A phrase tree is created using phrase structure rules, which are based on the grammar of the language and break a sentence (or phrase) into noun phrases and verb phrases, then into nouns, verbs, adjectives, and adverbs. These rules may be of the form S->NP VP, which means that a sentence is made up of a noun phrase followed by a verb phrase. The rules also use common verbs to identify structure or try to fit the sentence into a preconfigured structure such as (NP(ADJ N) VP(V) AP(ADV)). As an example, the sentence “Michael W. studies linguistics at McGill University” contains three nouns and one verb. Figure 1 shows the phrase tree. With the nouns and verbs identified, further statistical analysis classifies the type of named entity.
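As a rough illustration of what a phrase tree looks like as a data structure, the sketch below builds one by hand for the example sentence and collects its nouns and verbs. The `PhraseNode` class is hypothetical, and the exact tree shape (including the PP node for “at McGill University”) is an assumption rather than a copy of Figure 1; a real parser would derive the tree from phrase structure rules instead of hard-coding it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical phrase-tree node: a leaf carries a word, a phrase carries children.
class PhraseNode {
    final String label;              // S, NP, VP, PP, N, V, P
    final String word;               // non-null only for leaves
    final List<PhraseNode> children;

    private PhraseNode(String label, String word, List<PhraseNode> children) {
        this.label = label;
        this.word = word;
        this.children = children;
    }

    static PhraseNode leaf(String label, String word) {
        return new PhraseNode(label, word, List.of());
    }

    static PhraseNode phrase(String label, PhraseNode... children) {
        return new PhraseNode(label, null, Arrays.asList(children));
    }

    // Collects, in sentence order, the words of all leaves with the given label.
    List<String> wordsWithLabel(String wanted) {
        List<String> found = new ArrayList<>();
        collect(wanted, found);
        return found;
    }

    private void collect(String wanted, List<String> found) {
        if (word != null && label.equals(wanted)) {
            found.add(word);
        }
        for (PhraseNode child : children) {
            child.collect(wanted, found);
        }
    }

    // Renders the tree in bracketed form, e.g. (NP (N linguistics)).
    @Override
    public String toString() {
        if (word != null) {
            return "(" + label + " " + word + ")";
        }
        StringBuilder sb = new StringBuilder("(" + label);
        for (PhraseNode child : children) {
            sb.append(' ').append(child);
        }
        return sb.append(')').toString();
    }

    // Hand-built tree for the example sentence, following S -> NP VP.
    static PhraseNode example() {
        return phrase("S",
                phrase("NP", leaf("N", "Michael W.")),
                phrase("VP",
                        leaf("V", "studies"),
                        phrase("NP", leaf("N", "linguistics")),
                        phrase("PP",
                                leaf("P", "at"),
                                phrase("NP", leaf("N", "McGill University")))));
    }

    public static void main(String[] args) {
        PhraseNode tree = example();
        System.out.println(tree);
        System.out.println("Nouns: " + tree.wordsWithLabel("N"));
        System.out.println("Verbs: " + tree.wordsWithLabel("V"));
    }
}
```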

The monster in the closet of statistical analysis is that multiple meanings for the same words cloud the analysis, causing misinterpretations that are difficult to correct. OpenCalais addresses this by combining statistical analysis with complex heuristic rules. These heuristic rules combine lexicon and pattern matching to influence or control the result. For example, you can identify the term “IBM” as a company, or the pattern “Oct 31st” as a date. Heuristic rules are used to disambiguate commonly used terms that are potentially confusing to analyze. For example, the sentence “I deposited $100 in the bank” should not be associated with “The river deposited sediment along the bank,” despite both sentences containing the words “deposited” and “bank.” Heuristic rules are also used to better identify similarly named entities. For example, if an acronym matches a company name in a document and is not otherwise ambiguous, then both refer to the same named entity. (“Hewlett-Packard is the leading consumer notebook PC brand” and “HP had a market share of 35 per cent in lap top space last year” both refer to the same company.) These types of rules, although overly simplified, can help to better parse and categorize the sentences.
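A lexicon rule and a pattern rule of the kind described above might be sketched as follows. The `HeuristicTagger` class, its lexicon entries, and the tag format are assumptions for illustration only; the actual OpenCalais rules are far more elaborate and are not public in this form.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative heuristic tagger that combines a lexicon rule with a pattern rule.
class HeuristicTagger {
    // Lexicon rule: surface forms mapped to a canonical company name (sample entries).
    private static final Map<String, String> COMPANIES = Map.of(
            "IBM", "IBM",
            "Hewlett-Packard", "Hewlett-Packard",
            "HP", "Hewlett-Packard");

    // Pattern rule: abbreviated month followed by an ordinal day, such as "Oct 31st".
    private static final Pattern DATE = Pattern.compile(
            "\\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \\d{1,2}(?:st|nd|rd|th)\\b");

    // Returns a sorted, de-duplicated set of tags found in the text.
    static Set<String> tag(String text) {
        Set<String> tags = new TreeSet<>();
        for (Map.Entry<String, String> entry : COMPANIES.entrySet()) {
            Pattern term = Pattern.compile("\\b" + Pattern.quote(entry.getKey()) + "\\b");
            if (term.matcher(text).find()) {
                tags.add("Company:" + entry.getValue());
            }
        }
        Matcher date = DATE.matcher(text);
        while (date.find()) {
            tags.add("Date:" + date.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag("IBM reported results on Oct 31st"));
        System.out.println(tag("HP had a market share of 35 per cent"));
    }
}
```

Mapping both “HP” and “Hewlett-Packard” to one canonical name is the simplest form of the acronym rule; it deliberately skips the ambiguity check that the text mentions.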

Heuristic rules in OpenCalais are further used not just to identify associations, but to extract meaning from the text as well. OpenCalais uses heuristic rules to identify facts and events to create new information derived from multiple documents. OpenCalais does this by identifying commonly used verbs to describe facts or events. The pattern “X was acquired by Y” indicates an acquisition event between the X and Y companies. However, these rules can also match more complicated expressions. For example, “EMI said in September it had opened formal talks to buy Warner Music” can also be recognized as a past acquisition in September. For acquisitions, OpenCalais recognizes variations, including: “announced,” “planned,” “cancelled,” “postponed,” and “rumored.” Each of these is triggered by a variety of English verbs and tenses.
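The simplest pattern mentioned above, “X was acquired by Y,” can be approximated with a regular expression. This is only a sketch under strong assumptions: the `AcquisitionRule` class is hypothetical, company names are taken to be runs of capitalized words, and none of the verb and tense variations that OpenCalais recognizes are handled.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative event rule for the pattern "X was acquired by Y".
class AcquisitionRule {
    // A name is one or more capitalized words, e.g. "Warner Music" (a simplification).
    private static final String NAME = "[A-Z][\\w&-]*(?: [A-Z][\\w&-]*)*";

    private static final Pattern ACQUIRED_BY =
            Pattern.compile("(" + NAME + ") was acquired by (" + NAME + ")");

    // Returns {acquirer, target} when the sentence matches, otherwise empty.
    static Optional<String[]> extract(String sentence) {
        Matcher m = ACQUIRED_BY.matcher(sentence);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(2), m.group(1) });
    }

    public static void main(String[] args) {
        extract("Warner Music was acquired by EMI.")
                .ifPresent(e -> System.out.println("acquirer=" + e[0] + ", target=" + e[1]));
    }
}
```

A capitalized word that directly precedes the company name (for example, a sentence-initial “Yesterday”) would be swallowed into the name, which shows why real rules lean on named-entity recognition rather than on capitalization alone.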

Many other facts and events are extracted from the text; “‘This is not a victimless crime,’ said Jim Kendall, president of the Washington Association of Internet Service Providers” extracts both a quote and professional position information. “Mahathir was to be accompanied by his wife Siti Hasmah Mohamad Ali” extracts the relationship of wife between the two named entities. “Internet age bellwether Cisco Systems Inc. (CSCO) also released disappointing news. The company said third-quarter revenue would slide 30% to $4.69 billion from $6.7 billion in the second quarter” extracts lower revenues for a named entity in Q3.

Gathering Information
Thomson Reuters offers a public OpenCalais web service with a no-cost license; applications can connect and use the service free of charge to extract meaning from any text. The web service is geared towards general-purpose use, and works well for commonly understood documents. Thomson Reuters also offers subscription licenses, for customization to particular vocabularies. These web services allow any text to be uploaded via an HTTP POST and respond with an RDF/XML file that describes the document. The response contains the original document (called DocInfo) with a category (called DocCat), instance information of referenced named entities with relevance score, and events and facts that are found in the document. OpenCalais R3 brings improvements to named entity extraction and categorization. You can find detailed entity and event types on the Calais web site.

To use the public web service, post the URL-encoded license, content, and parameters to http://api.opencalais.com/enlighten/rest/. If successful, the response is an RDF/XML file. You can parse the file directly or import it into an RDF store. Sesame, a leading RDF framework, provides parsers and storage for RDF content. The following Java code, which you can find in Crawler.java in the downloadable code, posts the text and imports the results.

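	import static java.lang.String.valueOf;

	import java.io.File;
	import java.io.IOException;
	import java.io.InputStreamReader;
	import java.io.OutputStream;
	import java.io.OutputStreamWriter;
	import java.io.Reader;
	import java.net.URL;
	import java.net.URLConnection;

	import org.openrdf.repository.Repository;
	import org.openrdf.repository.RepositoryConnection;
	import org.openrdf.repository.RepositoryException;
	import org.openrdf.repository.sail.SailRepository;
	import org.openrdf.rio.RDFFormat;
	import org.openrdf.rio.RDFParseException;
	import org.openrdf.sail.Sail;
	import org.openrdf.sail.nativerdf.NativeStore;

	// Posts the text to the OpenCalais REST endpoint and returns the RDF/XML response.
	// encode(...) and getParamsXML() are helpers defined elsewhere in Crawler.java.
	private Reader post(CharSequence text) throws IOException {
		StringBuilder sb = new StringBuilder(text.length() + 1024);
		sb.append("licenseID=").append(encode(licenseID));
		sb.append("&content=").append(encode(text));
		sb.append("&paramsXML=").append(encode(getParamsXML()));
		URLConnection connection = new URL(API_URL).openConnection();
		connection.addRequestProperty("Content-Type",
				"application/x-www-form-urlencoded");
		connection.addRequestProperty("Content-Length", valueOf(sb.length()));
		connection.setDoOutput(true);
		OutputStream out = connection.getOutputStream();
		OutputStreamWriter writer = new OutputStreamWriter(out);
		writer.write(sb.toString());
		writer.flush();
		return new InputStreamReader(connection.getInputStream());
	}

	// Creates a Sesame repository backed by a native store in the "data" directory.
	private Repository createRepository() throws RepositoryException {
		File dataDir = new File("data");
		Sail store = new NativeStore(dataDir);
		Repository repository = new SailRepository(store);
		repository.initialize();
		return repository;
	}

	// Parses the RDF/XML response and adds its statements to the repository.
	private void add(Reader reader)
			throws RepositoryException, IOException, RDFParseException {
		RepositoryConnection con = repository.getConnection();
		try {
			con.add(reader, "", RDFFormat.RDFXML);
		} finally {
			con.close();
		}
	}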
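The `encode` helper that `post` calls is not shown in the listing. A minimal sketch of what such a helper could look like, assuming UTF-8 form encoding through the standard `java.net.URLEncoder`, follows. The `FormBody` class and its `build` method are hypothetical names for this example and are not taken from Crawler.java; only the three parameter names come from the listing.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds an application/x-www-form-urlencoded request body.
class FormBody {
    // URL-encodes a single value as UTF-8 (requires Java 10 or later for this overload).
    static String encode(CharSequence value) {
        return URLEncoder.encode(value.toString(), StandardCharsets.UTF_8);
    }

    // Joins the three parameters the service expects into one form body.
    static String build(String licenseID, CharSequence content, String paramsXML) {
        return "licenseID=" + encode(licenseID)
                + "&content=" + encode(content)
                + "&paramsXML=" + encode(paramsXML);
    }

    public static void main(String[] args) {
        // "my-license-key" and "<params/>" are placeholder values, not real credentials.
        System.out.println(build("my-license-key", "IBM & HP", "<params/>"));
    }
}
```

Because every encoded value is plain ASCII, the character count of the body equals its byte count, which is why using `sb.length()` for the Content-Length header works in the listing.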

Visualizing Relationships
After you import a collection of document metadata into an RDF store, you can synthesize it to derive new information assets based on the extracted data. Aduna’s Cluster Map technology can visualize the relationships between documents (through named entities) and between named entities (through facts and events).

Figure 2, a Document Cluster Map, shows the highlighted document from un.org, which contains references to the industry terms “greenhouse gas emissions,” “food crisis,” and “food security.” Figure 3, a Named Entity Cluster Map, shows the named entity “George W. Bush” holds the position of President of the “United States.” It also shows 107 countries and people have or hold the position of President. Using the Named Entity Cluster Map, the foreign minister of France is seen as Bernard Kouchner and the President as Nicolas Sarkozy. Although this information did not originate from the same document, by extracting the meaning and relationships of the named entities, you can create new information assets that combine the entity information.


Figure 2. Document Cluster Map: Shows the references to the document.
 
Figure 3. Named Entity Cluster Map: Shows the relationships of different entities.

The download archive includes a simplistic web crawler and two interactive visualization tools that you can use to explore these relationships. Executing the Main class with a list of URLs imports them into the local RDF store and opens two windows: a Document Cluster Map and a Named Entity Cluster Map. The relationships appear in the side pane, while the selected relationships are shown graphically using Aduna’s Cluster Map technology, which displays whether and how sets overlap (similar to Venn diagrams and Euler diagrams). On the command line, you can prefix each URL with ‘1’ to indicate that embedded links should be followed once, or ‘0’ to include only the explicit URL.

Conclusion
