A big challenge companies face today is that most information, both online and archived, is available only as published text and lacks the formal structure needed to synthesize it. Once information is in a formal structure, it can be summarized, used to help locate meaningful text, and combined with other text to provide new insights. This article shows how to convert unstructured written text into structured data using OpenCalais, a public general-purpose text-extraction service that uses a combination of statistical and grammatical analysis to extract meaning. OpenCalais is not the only solution available for extracting meaning from text, but it is the only publicly available web service.
The simplest way to categorize a document or paragraph is to use word associations. For example, if the words "earnings" and "acquired" are used in a document, it is likely a document about business finances. Furthermore, if the word "Reuters" appears mostly in business finance documents, then other documents containing this word are also likely to be about business finances. This technique is called statistical analysis and is commonly used for document categorization. Statistical analysis is one of the techniques OpenCalais uses to categorize documents and identify what the text is referring to.
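The word-association idea can be sketched in a few lines of Python. This is only an illustration, not how OpenCalais is implemented: the `ASSOCIATIONS` table, its categories, and the `categorize` function are hypothetical, and a real system would learn weighted associations from a large set of documents rather than use a hand-written table.

```python
import re
from collections import Counter

# Hypothetical word-to-category associations, written by hand for illustration.
# A real system would derive these (with weights) from many labeled documents.
ASSOCIATIONS = {
    "earnings": "business finance",
    "acquired": "business finance",
    "reuters": "business finance",
    "touchdown": "sports",
    "quarterback": "sports",
}


def categorize(text):
    """Return the category whose associated words occur most often, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    votes = Counter(ASSOCIATIONS[w] for w in words if w in ASSOCIATIONS)
    return votes.most_common(1)[0][0] if votes else None
```

Calling `categorize` on a sentence that mentions "earnings" and "acquired" yields "business finance", while text containing none of the listed words yields `None`.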
Figure 1. Phrase Tree: A phrase tree identifies the different parts of a sentence.
Statistical analysis alone, although useful for finding related
documents, does not provide any new insights,
as the information is still buried within documents. To uncover the meaning within written text, you must parse it
into a more formal structure. This formal structure is commonly referred to as a "phrase tree." A phrase tree is
created using phrase structure rules, which are based on the grammar of the language and break a sentence (or phrase)
into noun phrases and verb phrases, then into nouns, verbs, adjectives, and adverbs. These rules may be of the form
S → NP VP, which means that a sentence is made up of a noun phrase followed by a verb phrase. The rules also use common
verbs to identify structure or try to fit the sentence into a preconfigured structure such as (NP(ADJ N) VP(V) AP(ADV)).
As an example, the sentence "Michael W. studies linguistics at McGill University" contains three nouns and one verb.
Figure 1 shows the phrase tree. With the nouns and verbs identified, further statistical analysis classifies the type of named entity.
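To make the phrase tree concrete, the following Python sketch stores the example sentence as nested tuples and collects the words under a given part-of-speech tag. The tree is built by hand rather than by a parser, and its exact shape (including the `PP` and `P` labels for the prepositional phrase) is an assumption that may differ from the tree in Figure 1; the `leaves` helper is likewise hypothetical.

```python
# A phrase tree as nested tuples: (label, child, child, ...).
# A leaf has the form (tag, word). The shape is hand-built for illustration.
TREE = (
    "S",
    ("NP", ("N", "Michael W.")),
    ("VP",
        ("V", "studies"),
        ("NP", ("N", "linguistics")),
        ("PP", ("P", "at"), ("NP", ("N", "McGill University")))),
)


def leaves(node, tag):
    """Collect the words at leaves carrying the given tag, left to right."""
    label, *children = node
    # A leaf holds exactly one child, and that child is the word itself.
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]] if label == tag else []
    found = []
    for child in children:
        found.extend(leaves(child, tag))
    return found
```

With this tree, `leaves(TREE, "N")` returns the three nouns and `leaves(TREE, "V")` returns the single verb, which are the candidates that later analysis would classify as named entities.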
The monster in the closet of statistical analysis is that multiple meanings for the same words cloud the analysis,
causing misinterpretations that are difficult to correct. OpenCalais addresses this by combining statistical analysis
with complex heuristic rules. These heuristic rules combine lexicon and pattern matching to influence or control the
result. For example, you can identify the term "IBM" as a company, or the pattern "Oct 31st" as a date. Heuristic
rules are used to disambiguate commonly used terms that are potentially confusing to analyze. For example, the
sentence, "I deposited $100 in the bank" should not be associated with "The river deposited sediment along the bank"
despite both sentences containing the words "deposited" and "bank." Heuristic rules are also used to better identify similarly named entities. For example, if an acronym matches a company name in a document and is not otherwise ambiguous, then both refer to the same named entity. ("Hewlett-Packard is the leading consumer notebook PC brand" and "HP had a market share of 35 per cent in lap top space last year" both refer to the same company.) Rules of this kind, although simplified here, can help to better parse and categorize the sentences.
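A lexicon lookup, a date pattern, and an acronym rule like those described above might be sketched as follows. The `LEXICON` contents, the `DATE_PATTERN` expression, and the function names are all assumptions made for illustration. As a further simplification, the acronym is resolved against the lexicon directly instead of checking that the full company name appears in the same document.

```python
import re

# Hypothetical lexicon: known terms mapped to an entity type.
LEXICON = {"IBM": "Company", "Hewlett-Packard": "Company"}

# Hypothetical pattern rule: abbreviated month followed by an ordinal day.
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s+\d{1,2}(?:st|nd|rd|th)\b"
)


def acronym(name):
    """Join the initials of the capitalized parts, e.g. 'Hewlett-Packard' -> 'HP'."""
    parts = re.split(r"[\s-]+", name)
    return "".join(part[0] for part in parts if part[:1].isupper())


def tag_entities(text):
    """Return a sorted list of (surface text, type, canonical name) tuples."""
    found = set()
    for term, kind in LEXICON.items():
        if re.search(rf"\b{re.escape(term)}\b", text):
            found.add((term, kind, term))
        short = acronym(term)
        # Skip one-letter acronyms, which would be far too ambiguous.
        if len(short) > 1 and re.search(rf"\b{re.escape(short)}\b", text):
            found.add((short, kind, term))
    for match in DATE_PATTERN.finditer(text):
        found.add((match.group(), "Date", match.group()))
    return sorted(found)
```

Here "HP" in a sentence is tagged as a company and linked back to "Hewlett-Packard", while "Oct 31st" is tagged as a date purely from its shape.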
Heuristic rules in OpenCalais are used not just to identify associations, but also to extract meaning from the text. OpenCalais uses heuristic rules to identify facts and events, creating new information derived from multiple documents. It does this by identifying verbs that are commonly used to describe facts or events. The pattern "X was acquired by Y" indicates an acquisition event between companies X and Y. However, these rules can also match more complicated expressions. For example, "EMI said in September it had opened formal talks to buy Warner Music" can also be recognized as a past acquisition in September. For acquisitions, OpenCalais recognizes variations including "announced," "planned," "cancelled," "postponed," and "rumored." Each of these is triggered by a variety of English verbs and tenses.
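The simplest of these rules, "X was acquired by Y," can be approximated with a regular expression. The sketch below is a stand-in for the idea, not the rule OpenCalais actually uses: the `NAME` expression treats any run of capitalized words as a company name, which is a strong simplification, and the output field names are hypothetical.

```python
import re

# Assumption: a company name is a run of capitalized words.
NAME = r"[A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*"

# Hypothetical rule for the pattern "X was acquired by Y".
ACQUIRED_BY = re.compile(
    rf"(?P<target>{NAME})\s+was\s+acquired\s+by\s+(?P<acquirer>{NAME})"
)


def find_acquisitions(text):
    """Return one event record per 'X was acquired by Y' match."""
    return [
        {
            "event": "Acquisition",
            "acquirer": m.group("acquirer"),
            "target": m.group("target"),
        }
        for m in ACQUIRED_BY.finditer(text)
    ]
```

Covering the variations mentioned above ("announced," "planned," "rumored," and so on) would take many more patterns across verbs and tenses, which is why a single expression like this one only hints at the approach.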
Many other facts and events are extracted from the text. "'This is not a victimless crime,' said Jim Kendall, president of the Washington Association of Internet Service Providers" yields both a quote and professional position information. "Mahathir was to be accompanied by his wife Siti Hasmah Mohamad Ali" yields the relationship of wife between the two named entities. "Internet age bellwether Cisco Systems Inc. (CSCO) also released disappointing news. The company said third-quarter revenue would slide 30% to $4.69 billion from $6.7 billion in the second quarter" yields lower revenues for a named entity in Q3.
Thomson Reuters offers a public OpenCalais web service with a no-cost license; applications can connect
and use the service free of charge to extract meaning from any text. The web service is geared towards general-purpose use and works well for commonly understood documents. Thomson Reuters also offers subscription licenses for
customization to particular vocabularies. These web services allow any text to be uploaded via an HTTP POST and
respond with an RDF/XML file that describes the document. The response contains the original document
(called DocInfo) with a category (called DocCat), instance information for each referenced named entity with its relevance
score, and the events and facts that are found in the document. OpenCalais R3 brings improvements to named entity
extraction and categorization. You can find detailed entity and event types on the
Calais web site.
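The upload step can be sketched with Python's standard library. Everything specific here is an assumption: the endpoint URL is a placeholder, the `X-License-Key` header name is hypothetical, and the real service's address and required parameters must be taken from the Calais documentation. The function only builds the request and does not send it.

```python
import urllib.request

# Placeholder endpoint, NOT the real service address (assumption).
ENDPOINT = "https://calais.example.invalid/extract"


def build_request(text, license_key):
    """Build (but do not send) an HTTP POST request carrying the document text."""
    return urllib.request.Request(
        ENDPOINT,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/rdf+xml",  # ask for an RDF/XML response
            "X-License-Key": license_key,  # hypothetical header name
        },
        method="POST",
    )
```

Once the endpoint and headers are replaced with the documented values, passing the request to `urllib.request.urlopen` and reading the response body would return the RDF/XML description, which could then be parsed with an XML or RDF library.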