Extracting Meaning from Text with OpenCalais R3

A big challenge companies face today is that most information, both online and archived, is available only as published text with no formal structure suitable for synthesis. Once information is in a formal structure, it can be summarized, used to help locate meaningful text, and combined with other text to provide new insights. This article shows how to convert unstructured written text into structured data using OpenCalais, a public, general-purpose text-extraction service that uses a combination of statistical and grammatical analysis to extract meaning. OpenCalais is not the only solution available for extracting meaning from text, but it is the only publicly available web service.

Information Extraction
The simplest way to categorize a document or paragraph is to use word associations. For example, if the words “earnings” and “acquired” are used in a document, it is likely a document about business finances. Furthermore, if the word “Reuters” appears mostly in business finance documents, then other documents containing this word are likely to be about business finances as well. This technique is called statistical analysis and is commonly used for document categorization. It is one of the techniques OpenCalais uses to categorize documents and identify what the text is referring to.
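To make the word-association idea concrete, here is a minimal Java sketch. It is a toy illustration only, not how OpenCalais works internally: the class name, the category names, and the hand-written vocabularies are all assumptions, and a real statistical categorizer would learn its word associations from a large set of labeled documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy word-association categorizer (illustrative; not the OpenCalais model).
class WordAssociationCategorizer {
    // Hypothetical vocabularies; a real system would learn these from labeled documents.
    private final Map<String, Set<String>> vocabularies = new LinkedHashMap<>();

    WordAssociationCategorizer() {
        vocabularies.put("Business Finance",
                new HashSet<>(Arrays.asList("earnings", "acquired", "reuters", "revenue")));
        vocabularies.put("Sports",
                new HashSet<>(Arrays.asList("match", "goal", "league", "coach")));
    }

    // Returns the category whose vocabulary matches the most words,
    // or "Unknown" when no vocabulary word occurs in the text.
    String categorize(String text) {
        String[] words = text.toLowerCase().split("[^a-z]+");
        String best = "Unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : vocabularies.entrySet()) {
            int score = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Counting raw matches is the crudest possible score. It is enough to show why a word such as “Reuters” can pull a document toward one category, and also why words with several meanings cause trouble for this approach.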

 
Figure 1. Phrase Tree: A phrase tree identifies the different parts of a sentence.

Statistical analysis alone, although useful for finding related documents, does not provide any new insights, as the information is still buried within documents. To uncover the meaning within written text, you must parse it into a more formal structure. This formal structure is commonly referred to as a “phrase tree.” A phrase tree is created using phrase structure rules, which are based on the grammar of the language and break a sentence (or phrase) into noun phrases and verb phrases, then into nouns, verbs, adjectives, and adverbs. These rules may be of the form S->NP VP, which means that a sentence is made up of a noun phrase followed by a verb phrase. The rules also use common verbs to identify structure or try to fit the sentence into a preconfigured structure such as (NP(ADJ N) VP(V) AP(ADV)). As an example, the sentence, “Michael W. studies linguistics at McGill University” contains three nouns and one verb. Figure 1 shows the phrase tree. With the nouns and verbs identified, further statistical analysis classifies the type of named entity.
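A phrase tree is easy to represent as a small recursive data structure. The following Java sketch builds one plausible tree for the example sentence by hand. The class name and the exact bracketing (including the PP node for “at McGill University”) are assumptions made for illustration; a real parser would derive the tree from its phrase structure rules, and the result may differ in detail from Figure 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal phrase-tree node: a leaf carries a word, an inner node carries children.
class PhraseNode {
    final String label;              // e.g. S, NP, VP, N, V
    final String word;               // null for inner nodes
    final List<PhraseNode> children;

    private PhraseNode(String label, String word, List<PhraseNode> children) {
        this.label = label;
        this.word = word;
        this.children = children;
    }

    static PhraseNode leaf(String label, String word) {
        return new PhraseNode(label, word, new ArrayList<PhraseNode>());
    }

    static PhraseNode phrase(String label, PhraseNode... children) {
        return new PhraseNode(label, null, Arrays.asList(children));
    }

    // Renders the tree in bracket notation, e.g. (NP (N linguistics)).
    String toBracketString() {
        if (word != null) {
            return "(" + label + " " + word + ")";
        }
        StringBuilder sb = new StringBuilder("(").append(label);
        for (PhraseNode child : children) {
            sb.append(' ').append(child.toBracketString());
        }
        return sb.append(')').toString();
    }

    // Collects, in sentence order, every leaf word tagged with the given label.
    List<String> wordsWithLabel(String target) {
        List<String> result = new ArrayList<>();
        collect(target, result);
        return result;
    }

    private void collect(String target, List<String> result) {
        if (word != null && label.equals(target)) {
            result.add(word);
        }
        for (PhraseNode child : children) {
            child.collect(target, result);
        }
    }

    // Hand-built tree for the example sentence, following S -> NP VP.
    static PhraseNode example() {
        return phrase("S",
                phrase("NP", leaf("N", "Michael W.")),
                phrase("VP",
                        leaf("V", "studies"),
                        phrase("NP", leaf("N", "linguistics")),
                        phrase("PP",
                                leaf("P", "at"),
                                phrase("NP", leaf("N", "McGill University")))));
    }
}
```

Calling `PhraseNode.example().wordsWithLabel("N")` returns the three nouns of the sentence, which are the candidates that a later classification step would label as named entities.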

The monster in the closet of statistical analysis is that multiple meanings for the same words cloud the analysis, causing misinterpretations that are difficult to correct. OpenCalais addresses this by combining statistical analysis with complex heuristic rules. These heuristic rules combine lexicon and pattern matching to influence or control the result. For example, you can identify the term “IBM” as a company, or the pattern “Oct 31st” as a date. Heuristic rules are used to disambiguate commonly used terms that are potentially confusing to analyze. For example, the sentence, “I deposited $100 in the bank” should not be associated with “The river deposited sediment along the bank” despite both sentences containing the words “deposited” and “bank.” Heuristic rules are also used to better identify similarly named entities. For example, if an acronym matches a company name in a document and is not otherwise ambiguous, then both refer to the same named entity. (“Hewlett-Packard is the leading consumer notebook PC brand” and “HP had a market share of 35 per cent in lap top space last year” both refer to the same company.) These types of rules, although overly simplified, can help to better parse and categorize the sentences.
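The lexicon, pattern, and acronym rules described above can be sketched in a few lines of Java. This is a deliberately simplified illustration under stated assumptions: the class name, the one-entry lexicon, the date pattern, and the acronym rule are all hypothetical and are not taken from OpenCalais, whose actual rules are far more elaborate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Simplified heuristic rules: a lexicon lookup, a date pattern, and an acronym check.
class HeuristicRules {
    // Hypothetical lexicon mapping known terms to entity types.
    private static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("IBM", "Company");
        LEXICON.put("Hewlett-Packard", "Company");
    }

    // Matches short dates such as "Oct 31st" or "Jan 2".
    private static final Pattern SHORT_DATE = Pattern.compile(
            "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+\\d{1,2}(st|nd|rd|th)?");

    // Lexicon first, then pattern rules; anything else is left unclassified.
    static String classify(String term) {
        String type = LEXICON.get(term);
        if (type != null) {
            return type;
        }
        if (SHORT_DATE.matcher(term).matches()) {
            return "Date";
        }
        return "Unknown";
    }

    // Builds an acronym from the first letter of each space- or hyphen-separated part.
    static String acronymOf(String name) {
        StringBuilder sb = new StringBuilder();
        for (String part : name.split("[\\s-]+")) {
            if (!part.isEmpty()) {
                sb.append(Character.toUpperCase(part.charAt(0)));
            }
        }
        return sb.toString();
    }

    // True when the candidate could be an acronym of the full entity name.
    static boolean sameEntity(String fullName, String candidate) {
        return acronymOf(fullName).equals(candidate);
    }
}
```

With these rules, `HeuristicRules.sameEntity("Hewlett-Packard", "HP")` is true, so the two example sentences would be linked to one company. A production rule would also need the ambiguity check mentioned above before merging the two mentions.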

Heuristic rules in OpenCalais are further used not just to identify associations, but to extract meaning from the text as well. OpenCalais uses heuristic rules to identify facts and events to create new information derived from multiple documents. OpenCalais does this by identifying commonly used verbs to describe facts or events. The pattern “X was acquired by Y” indicates an acquisition event between the X and Y companies. However, these rules can also match more complicated expressions. For example, “EMI said in September it had opened formal talks to buy Warner Music” can also be recognized as a past acquisition in September. For acquisitions, OpenCalais recognizes variations, including: “announced,” “planned,” “cancelled,” “postponed,” and “rumored.” Each of these is triggered by a variety of English verbs and tenses.
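As a rough illustration of a verb-pattern rule, the following Java sketch recognizes only the literal form “X was acquired by Y” with a regular expression. The class name, the regular expression, and the company names in the usage note are hypothetical. OpenCalais itself handles many verbs, tenses, and sentence shapes that a single pattern like this cannot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts a simple acquisition event from text of the form "X was acquired by Y".
class AcquisitionRule {
    // A name is one or more capitalized words, e.g. "Acme Widgets".
    private static final String NAME = "[A-Z][\\w&-]*(?:\\s+[A-Z][\\w&-]*)*";
    private static final Pattern ACQUIRED_BY = Pattern.compile(
            "\\b(" + NAME + ")\\s+was\\s+acquired\\s+by\\s+(" + NAME + ")");

    // Simple value holder for the two companies involved in the event.
    static final class Acquisition {
        final String target;    // the company that was acquired (X)
        final String acquirer;  // the company that acquired it (Y)

        Acquisition(String target, String acquirer) {
            this.target = target;
            this.acquirer = acquirer;
        }
    }

    // Returns the first acquisition found in the text, or null if there is none.
    static Acquisition extract(String text) {
        Matcher m = ACQUIRED_BY.matcher(text);
        if (m.find()) {
            return new Acquisition(m.group(1), m.group(2));
        }
        return null;
    }
}
```

For the made-up sentence “Last year Acme Widgets was acquired by Globex.”, `extract` returns a target of “Acme Widgets” and an acquirer of “Globex”. Variations such as “announced” or “rumored” would each need further patterns, or a grammar-based approach built on the phrase tree.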

Many other facts and events are extracted from the text; “‘This is not a victimless crime,’ said Jim Kendall, president of the Washington Association of Internet Service Providers” extracts both a quote and professional position information. “Mahathir was to be accompanied by his wife Siti Hasmah Mohamad Ali” extracts the relationship of wife between the two named entities. “Internet age bellwether Cisco Systems Inc. (CSCO) also released disappointing news. The company said third-quarter revenue would slide 30% to $4.69 billion from $6.7 billion in the second quarter” extracts lower revenues for a named entity in Q3.

Gathering Information
Thomson Reuters offers a public OpenCalais web service with a no-cost license; applications can connect and use the service free of charge to extract meaning from any text. The web service is geared towards general-purpose use, and works well for commonly understood documents. Thomson Reuters also offers subscription licenses, for customization to particular vocabularies. These web services allow any text to be uploaded via an HTTP POST and respond with an RDF/XML file that describes the document. The response contains the original document (called DocInfo) with a category (called DocCat), instance information of referenced named entities with relevance score, and events and facts that are found in the document. OpenCalais R3 brings improvements to named entity extraction and categorization. You can find detailed entity and event types on the Calais web site.

To use the public web service, post the URL-encoded license, content, and parameters to http://api.opencalais.com/enlighten/rest/. If successful, the response is an RDF/XML file. You can parse the file directly or import it into an RDF store. Sesame, a leading RDF framework, provides parsers and storage for RDF content. The following Java code, which you can find in Crawler.java in the downloadable code, posts the text and imports the results.

	// Posts the text to the OpenCalais REST endpoint and returns a reader over the RDF/XML response.
	private Reader post(CharSequence text) throws IOException {
		StringBuilder sb = new StringBuilder(text.length() + 1024);
		sb.append("licenseID=").append(encode(licenseID));
		sb.append("&content=").append(encode(text));
		sb.append("&paramsXML=").append(encode(getParamsXML()));
		URLConnection connection = new URL(API_URL).openConnection();
		connection.addRequestProperty("Content-Type",
				"application/x-www-form-urlencoded");
		connection.addRequestProperty("Content-Length", valueOf(sb.length()));
		connection.setDoOutput(true);
		OutputStream out = connection.getOutputStream();
		OutputStreamWriter writer = new OutputStreamWriter(out);
		writer.write(sb.toString());
		writer.flush();
		return new InputStreamReader(connection.getInputStream());
	}

	// Creates a Sesame repository backed by a native store in the local "data" directory.
	private Repository createRepository() throws RepositoryException {
		File dataDir = new File("data");
		Sail store = new NativeStore(dataDir);
		Repository repository = new SailRepository(store);
		repository.initialize();
		return repository;
	}

	// Parses the RDF/XML response and adds its statements to the repository.
	private void add(Reader reader)
			throws RepositoryException, IOException, RDFParseException {
		RepositoryConnection con = repository.getConnection();
		try {
			con.add(reader, "", RDFFormat.RDFXML);
		} finally {
			con.close();
		}
	}
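The listing calls an encode helper that is not shown. A minimal sketch of what such a helper might look like follows, assuming it simply URL-encodes each value as UTF-8 with java.net.URLEncoder. The class name and the formBody method are hypothetical, and the actual helper in Crawler.java may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds the form-encoded request body for the REST call.
class FormBodySketch {
    // URL-encodes a single form value as UTF-8.
    static String encode(CharSequence value) {
        try {
            return URLEncoder.encode(value.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new AssertionError("UTF-8 is always supported", e);
        }
    }

    // Joins the three fields the service expects into one request body.
    static String formBody(String licenseID, CharSequence content, String paramsXML) {
        return "licenseID=" + encode(licenseID)
                + "&content=" + encode(content)
                + "&paramsXML=" + encode(paramsXML);
    }
}
```

Encoding each value separately matters because the content is arbitrary text: an unescaped “&” or “=” inside a document would otherwise be read as a field separator. After encoding, the body contains only ASCII characters, so its character count equals its byte count, which is why passing sb.length() as the Content-Length is safe in the listing.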

Visualizing Relationships
After you import a collection of document metadata into an RDF store, you can synthesize it to derive new information assets from the extracted data. Aduna’s Cluster Map technology can visualize the relationships between documents (through named entities) and between named entities (through facts and events).

Figure 2, a Document Cluster Map, shows the highlighted document from un.org, which contains references to the industry terms “greenhouse gas emissions,” “food crisis,” and “food security.” Figure 3, a Named Entity Cluster Map, shows the named entity “George W. Bush” holds the position of President of the “United States.” It also shows 107 countries and people have or hold the position of President. Using the Named Entity Cluster Map, the foreign minister of France is seen as Bernard Kouchner and the President as Nicolas Sarkozy. Although this information did not originate from the same document, by extracting the meaning and relationships of the named entities, you can create new information assets that combine the entity information.


Figure 2. Document Cluster Map: Shows the references to the document.
 
Figure 3. Named Entity Cluster Map: Shows the relationships of different entities.

The download archive includes a simple web crawler and two interactive visualization tools that you can use to explore these relationships. Running the Main class with a list of URLs imports those pages into the local RDF store and opens two windows: the Document Cluster Map and the Named Entity Cluster Map. The relationships appear in the side pane, while the selected relationships are shown graphically using Aduna’s Cluster Map technology, which displays whether and how sets overlap (similar to Venn diagrams and Euler diagrams). On the command line, you can prefix each URL with ‘1’ to indicate that embedded links should be followed once, or ‘0’ to include only the explicit URL.

Conclusion
