Automate Metadata Extraction for Corporate Search and Mashups

Here are some exciting developments in automated metadata extraction and their implications for better semantic search and corporate mashups. Advanced open source tools created by linguists to recognize the meaning of words in documents are now becoming an order of magnitude more cost effective to use. The arrival of the Apache Unstructured Information Management Architecture (UIMA—pronounced “you-ee-ma”) framework makes these tools accessible to non-programmers. The addition of semantically precise metadata to documents opens the door for new semantic web applications, including better document search and document mashups.

Web Search Drives Expectations

Many people wonder at the power and precision of web search engines like Google. But if you use the Google search engine to find a Microsoft Word document that is inside a corporate web site, you might have less than stellar results. There is a simple reason for this: most internal documents don’t have the rich web linking that public web sites have. Search engines such as Google use the number of links that point to a document to help rank the search results. Without those links, the documents are not likely to be found.

The Challenges with Keyword Search

Most internal corporate search engines use “keyword search” technology, which finds all the important words in a document and puts them in a central index. When you search for a set of keywords, the engine finds all documents that contain those exact words.

The problem is that when people use keywords, they frequently don’t use the exact same words that are in the document. For example, you might search for “Sue Smith” but the document was written by “Susan Smith.” Because “Sue” is not an exact match with “Susan” the keyword search fails.
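The mismatch can be sketched in a few lines of Python. This is a hypothetical illustration, not a real search engine: the nickname table and the documents are made up, but it shows why an exact-match index misses “Sue” when the document says “Susan,” and how a simple normalization step recovers the match.

```python
# Illustrative nickname table (entries are examples, not a real directory)
NICKNAMES = {"sue": "susan", "bob": "robert"}

def keyword_match(query, document):
    """Naive keyword search: every query word must appear verbatim."""
    words = set(document.lower().split())
    return all(w in words for w in query.lower().split())

def normalized_match(query, document):
    """Expand nicknames to canonical names on both sides before matching."""
    canon = lambda w: NICKNAMES.get(w, w)
    words = {canon(w) for w in document.lower().split()}
    return all(canon(w) in words for w in query.lower().split())

doc = "This proposal was written by Susan Smith"
print(keyword_match("Sue Smith", doc))     # False: "Sue" is not "Susan"
print(normalized_match("Sue Smith", doc))  # True
```

Real systems push this normalization into the index itself, so the lookup stays fast at query time.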

Enter Semantic Search

How do you program your search engine to understand that “Sue” and “Susan” are semantically close? There are three key steps to this problem. First, you need to look at the structure of a sentence and infer the part-of-speech (POS) that each word plays. This is called syntactic analysis. When you read the sentence:

Our client is going to sue your company.

You can infer from the syntax of the sentence that the word “sue” is being used as a verb. You would not infer it is a person’s name. On the other hand, if you read the following sentence, it is obvious that Sue Smith is a person’s name:

This proposal was written by Sue Smith for the Johnson Corporation.

The process of identifying the part-of-speech for text is called POS tagging. The output of a typical POS tagger for these sentences is shown below.

   Our client is going to sue your company.
   Our/pp$ client/nn is/bez going/vbg to/to sue/vb your/pp$ company/nn ./.

Note that the word “sue” is identified as a verb (POS=”vb”). (Here is a full list of parts-of-speech codes.)

Now let’s see how the second sentence comes through the POS tagger.

   This proposal was written by Sue Smith for the Johnson Corporation.
   This/dt proposal/nn was/bedz written/vbn by/in Sue/np Smith/np for/in the/at Johnson/np Corporation/np ./.

Note that the word “Sue” is identified as POS=”np” for “proper noun or part of a name phrase.” When you find two adjacent NPs in a text, you can use an algorithm to look them up in a table. Then you can use the company directory to find that “Sue” is a nickname for “Susan” and that “Sue Smith” is probably “Susan Smith,” or EmpID=”1234747.” You can then mark this as a person entity.
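The adjacent-NP step can be sketched as follows. This is a simplified illustration: the tags follow the Brown-style codes shown above, and the nickname table and directory are invented stand-ins for a real company directory (the EmpID value is the one used in the text).

```python
# Illustrative lookup tables (a real system would query the company directory)
NICKNAMES = {"Sue": "Susan"}
DIRECTORY = {"Susan Smith": "1234747"}  # name -> EmpID

def find_name_candidates(tagged):
    """tagged: list of (word, pos) pairs. Returns runs of 2+ adjacent NPs."""
    runs, current = [], []
    for word, pos in tagged + [("", "end")]:   # sentinel flushes the last run
        if pos == "np":
            current.append(word)
        else:
            if len(current) >= 2:
                runs.append(" ".join(current))
            current = []
    return runs

def resolve(name):
    """Expand a nickname in the first token, then look up the EmpID."""
    first, *rest = name.split()
    canonical = " ".join([NICKNAMES.get(first, first)] + rest)
    return canonical, DIRECTORY.get(canonical)

tagged = [("This", "dt"), ("proposal", "nn"), ("was", "bedz"), ("written", "vbn"),
          ("by", "in"), ("Sue", "np"), ("Smith", "np"), ("for", "in"),
          ("the", "at"), ("Johnson", "np"), ("Corporation", "np")]
for cand in find_name_candidates(tagged):
    print(cand, "->", resolve(cand))
# "Sue Smith" resolves to ("Susan Smith", "1234747")
```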

   This proposal was written by <person>Sue Smith</person> for the Johnson Corporation.

You can repeat this filter process to find all organizations such as the “Johnson Corporation,” so the result has the following additional markup:

   <organization>Johnson Corporation</organization>

With each step, you are using information from a prior analysis to make the next step of analysis easier. Such incremental enrichment of data streams is at the heart of almost all text-mining processes today. So you might be wondering how complex it is to identify all the parts.

The annotation results may be difficult to read, but by storing this additional data in a new class of databases (called native XML databases), you can search millions of documents in milliseconds.

Faceted Search

 
Figure 1. Faceted Search: Here is an example of faceted search from The Office of the Historian.

When you search for a document, and you get a document “hit,” that document is sometimes the exact one you were searching for (think of Google’s “I’m feeling lucky” feature). However, more often than not, the hit is not the document you were seeking. One nice feature of documents with enriched metadata is the ability to provide other search suggestions based on various facets of the document. This is called faceted search. For example, the document viewer might show a series of links that suggests “Other documents authored by Sue Smith,” or “Other documents about the Johnson Corporation,” or even the most general, “Find similar documents.”
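Once entities are stored as metadata, generating those facet links is just a grouping problem. The sketch below is a toy illustration with made-up document records; a real system would run these groupings as indexed queries rather than Python loops.

```python
# Toy metadata records (illustrative only)
docs = [
    {"id": 1, "author": "Susan Smith", "orgs": ["Johnson Corporation"]},
    {"id": 2, "author": "Susan Smith", "orgs": []},
    {"id": 3, "author": "Tom Jones",   "orgs": ["Johnson Corporation"]},
]

def facet_suggestions(doc_id):
    """For a viewed document, suggest other documents that share a facet value."""
    doc = next(d for d in docs if d["id"] == doc_id)
    same_author = [d["id"] for d in docs
                   if d["id"] != doc_id and d["author"] == doc["author"]]
    same_org = [d["id"] for d in docs
                if d["id"] != doc_id and set(d["orgs"]) & set(doc["orgs"])]
    return {"Other documents authored by " + doc["author"]: same_author,
            "Other documents about " + ", ".join(doc["orgs"]): same_org}

print(facet_suggestions(1))
# doc 2 shares the author; doc 3 shares the organization
```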

One example of faceted search is taken from the U.S. State Department History web site (see Figure 1). When you view a document, all the people and terms in that document are listed in the right-hand margin. If you click any of these items, you’ll see the official description of the person or the definition of the term.

Adding Document Classification Annotations

Document similarity or “semantic nearness” is a very complex concept. Classification requires you to compare all the entities and words in each document with those in every other document in the collection, and then compute a mathematical weight that sorts the rest of the collection by how closely it matches the content of that document. The weights might change every time a single document is added or removed. However, mathematical models based on standard algorithms can perform this calculation for you. One simple approach is to use the same annotation methods to classify a document into one or more mathematical groupings.
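One of the standard weights mentioned above is cosine similarity over term vectors. The sketch below is a bare-bones illustration using raw term frequencies; production systems typically add TF-IDF weighting and fold the extracted entities in as extra terms.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two documents' term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values())) *
            math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

d1 = "proposal for the Johnson Corporation"
d2 = "a second proposal for the Johnson Corporation"
d3 = "timeline of rock band albums"
print(cosine(d1, d2) > cosine(d1, d3))  # True: similar documents score higher
```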

Can You Afford Semantic Search?

Now that you have read about a few of the concepts behind entity extraction, you might be asking, “So what is this going to cost?” This is a very reasonable question, because many of these technologies have been out of reach for all but the most well-funded organizations. As it turns out, many researchers at universities and large corporations have been doing entity extraction for years, and much of their work is available for free or at very low cost. But to use their algorithms, you had to hire a half-dozen people with experience in search, information retrieval, Java, Perl, Python, and linguistics. Two years ago, this all started to change.

The Apache Foundation decided that unstructured analysis was a critical component of many emergent semantic web applications. With assistance from IBM, the Apache Foundation adopted the innovative Unstructured Information Management Architecture. UIMA is now spurring the growth of a new market of low-cost, high-quality interchangeable components, and UIMA tools have already started to lower the cost of high-quality entity extraction. (Find out the details of Apache UIMA. Be aware that the Apache version of UIMA is still in the “incubator” stage, and although completely functional, it may require tuning on non-UNIX systems; see Figure 2.)

Pipe and Enrichment Patterns

UIMA is based on a pipeline approach to performing unstructured analysis in a universal pattern. These pipelines are very similar in concept to UNIX pipes: small modular tools that read data in one representation, enrich the data, and then send the output to other tools. This is known as the enrichment pattern and is well documented as a named pattern in the book, “Enterprise Integration Patterns” by Gregor Hohpe et al. (see Figure 3).
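The enrichment pattern can be sketched in a few lines. This is not UIMA code; it is an illustrative pipeline where each stage reads the shared record, adds what it learned, and passes the record on. The “tagger” here is a deliberately crude stand-in (it just treats capitalized words as proper nouns).

```python
def pos_stage(record):
    """Stand-in for a real POS tagger: tag capitalized words as proper nouns."""
    record["pos"] = [(w, "np" if w[0].isupper() else "other")
                     for w in record["text"].split()]
    return record

def person_stage(record):
    """Uses the annotations left behind by the previous stage."""
    record["persons"] = [w for w, p in record["pos"] if p == "np"]
    return record

def run_pipeline(record, stages):
    for stage in stages:
        record = stage(record)   # each stage enriches and passes it on
    return record

result = run_pipeline({"text": "written by Sue Smith"}, [pos_stage, person_stage])
print(result["persons"])  # ['Sue', 'Smith']
```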


Figure 2. Apache UIMA: The Unstructured Information Management Architecture logo.
 
Figure 3. Integration Patterns: The Content Enrichment Pattern from the book, “Enterprise Integration Patterns” is shown here.

Computational linguists found that if everyone used a common, programming-language-independent, in-memory standard for storing document annotations, they could achieve both high performance and precision. They called this standard structure the “Common Analysis Structure,” or CAS for short. It turns out that by combining the in-memory CAS with a robust set of tools, computational linguists could all share a common platform. And because with the CAS the documents stay in the same memory location during analysis, the amount of input and output data transfer is cut down significantly.

UIMA Architectural Overview

As you can see from Figure 4, Apache UIMA is a complete architecture for analyzing a wide variety of data. It includes a large library of core tools as well as a large and growing library of applications in the UIMA sandbox.

At the core of UIMA is a set of supporting tools that take a large collection of documents and process them using a sequential set of operations. A typical set of these operations is shown in Figure 5.


Figure 4. Architecture: The overall UIMA structure.
 
Figure 5. Text Mining: A typical UIMA text mining Pipeline.

One point to keep in mind is that unlike water in a physical pipe, in a UIMA system the documents don’t physically move. They remain in a central area of computer RAM and a set of annotators simply adds new annotations to the RAM. Each subsequent component leverages the annotations left by prior components.
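This stand-off style of annotation can be sketched as follows. The class below is a toy illustration in the spirit of the CAS, not the real UIMA API: the text never moves, and annotators only record (start, end, type) spans against it.

```python
class Cas:
    """Toy stand-off annotation store: text stays put, spans accumulate."""
    def __init__(self, text):
        self.text = text          # remains in place for the whole pipeline
        self.annotations = []     # each annotator appends spans here

    def add(self, start, end, kind):
        self.annotations.append((start, end, kind))

    def covered(self, kind):
        """Return the text covered by all annotations of a given type."""
        return [self.text[s:e] for s, e, k in self.annotations if k == kind]

cas = Cas("This proposal was written by Sue Smith.")
cas.add(29, 38, "person")         # span covering "Sue Smith"
print(cas.covered("person"))      # ['Sue Smith']
```

Because every annotator shares the same immutable text and offsets, a downstream component can use an upstream component’s spans without re-parsing anything.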

In the long term, UIMA’s design supports a potentially large library of pipes that all fit together. Developers should be able to configure and customize these pipes to solve a wide variety of problems with very little programming. A typical project requires the modification of only a small set of XML configuration files. These configuration files have Eclipse forms front ends, so that the average person can set up and install a UIMA pipeline without ever knowing XML syntax or learning how to use an XML editor. But as of today, the number of components is somewhat small, and you might need to create some customized components to meet your organization’s needs. In the next section you’ll see how to use the Eclipse IDE to create customized components.

Eclipse/Maven Integration

 
Figure 6. UIMA Components: The Eclipse screen image with UIMA components.

UIMA is written with the latest and greatest open source standards. Figure 6 shows a screen image of a typical Eclipse development environment configured to support UIMA.

On the left side of the screen you see the standard Eclipse Project Explorer. In the middle, you see an Eclipse form for managing the UIMA component configuration file. On the right, you see a view of the Maven dependency graph. The next section discusses details of several of these screens. But first, the following discusses the lifecycle of a typical UIMA project.

To create a new UIMA component, first create a new UIMA project using a Maven archetype, which contains the correct sets of files and folders. After you create your project, you can start to add UIMA components. Figure 7 shows a sample of the possible UIMA components.

You will note from Figure 7 that UIMA has default configuration “Descriptor Files” for many UIMA components. These include components for reading in a large number of documents (known as a Collection Reader), for analyzing them (an Analysis Engine), and for sending the output to another component (called a Consumer).

After you create a Descriptor File, you are presented with one or more Eclipse forms that allow you to set up and change the descriptor (see Figure 8).

On the bottom of Figure 8 you can see several tabs. Each tab has a group of logically related information that you can change. The first tab (Overview) indicates that this is a Java primitive component, and shows other high-level properties of the analyzer, including its name, version, vendor, and description.
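The descriptor behind this form is plain XML. The fragment below is a hand-trimmed sketch of what a primitive Analysis Engine descriptor looks like; the class name `com.example.PersonAnnotator` and the metadata values are illustrative, and a real descriptor generated by the Eclipse forms also carries type-system and capability sections (see the Apache UIMA documentation for the full schema).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <!-- Hypothetical annotator class for illustration only -->
  <annotatorImplementationName>com.example.PersonAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>PersonAnnotator</name>
    <description>Marks person names found by nickname lookup.</description>
    <version>1.0</version>
    <vendor>Example Corp</vendor>
  </analysisEngineMetaData>
</analysisEngineDescription>
```

These are exactly the fields (name, version, vendor, description) that the Overview tab in Figure 8 edits for you.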


Figure 7. UIMA Screen Images: The Eclipse UIMA new items are shown here.
 
Figure 8. UIMA Screen Images: The Eclipse UIMA Descriptor form.

Annotation Probability

You need to remember that each word in free-form text might have one or more parts of speech. The word “still” might be used as a noun (the still made whisky), an adverb (the lake is still frozen), or an adjective (the still lake was like glass).

 
Figure 9. Entity Extraction: An example of the entity extraction using OpenCalais and RDF.

In the realm of entity extraction, some entities are much easier to identify correctly than others.

Take the following sentence for example:

If George Washington were alive today he could get on a plane in the morning and fly from Seattle, Washington to Washington D.C. and deliver a lecture at George Washington University in the evening.

In this sentence, the word “Washington” is used to identify four distinct entities: a person, two cities, and an organization. If you are given groups of nouns, you can quickly look them up in dictionaries to find their type. The free Thomson Reuters web service OpenCalais (see Figure 9) provides an example.
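The dictionary-lookup step can be sketched as a gazetteer check. This toy illustration hard-codes three tiny tables; a real service like OpenCalais uses far larger dictionaries plus statistical context, but the core idea of checking the more specific (longer) names first is the same.

```python
# Illustrative gazetteers (real systems use large curated dictionaries)
CITIES = {"Seattle", "Washington D.C."}
PEOPLE = {"George Washington"}
ORGS   = {"George Washington University"}

def classify(noun_group):
    """Check the more specific tables first so the longest name wins."""
    for table, kind in ((ORGS, "organization"),
                        (PEOPLE, "person"),
                        (CITIES, "city")):
        if noun_group in table:
            return kind
    return "unknown"

print(classify("George Washington University"))  # organization
print(classify("George Washington"))             # person
print(classify("Seattle"))                       # city
```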

Note that the entities are automatically identified in the body of the document and displayed in color-coded values on the left side. The analysis correctly shows two cities (green), one organization (pink), and one person (purple) correctly identified in the body of the text. After these entities are identified, you can create searches that find documents that are related to Seattle, Washington, but exclude Washington D.C. documents. This feature is very difficult to do with keyword-only search.

A UIMA component already exists to send your document text to OpenCalais (a free web service) for annotation. See the OpenCalais web site for service license details.

Extending Annotations to Images

The general concept of annotations is not restricted to text; annotations also apply to sound, images, and video. One example of this is the tagging of photos. UIMA components can be designed to recognize faces in a picture, annotating each face with a square region around it. After the faces have been detected, another pattern-matching process can compare each region to a known library of faces and suggest possible matches.

Storing Annotations and Metadata

Storing annotations can be a problem: if you have binary files such as images you may have to create a separate file or database record for each document to store the metadata for that document. Search tools then search the metadata records to help you find the documents you are looking for.

The author’s personal preference is to store as much of the documents and document metadata as possible in some type of XML store. Databases such as the MarkLogic native XML database or the open source eXist database are now widely used in enterprise-class solutions. These systems keep the documents in their original markup format and yet are designed to automatically create fast indexed search.

XML: Many Standards for Industry-Specific Tasks

If you are lucky enough to have pure XML documents as an input and an output format your job can be very easy. XML document encoding standards such as DITA, TEI, and DocBook already have well documented standards on how key entities such as people, places, terms, and dates should be encoded. If you work in specific areas such as the management of historical documents, your colleagues might already be using TEI documents, and have shared tools ready for you to use.

With these XML standards, the annotations can easily be added without disrupting the use of the documents by other systems. By storing these documents in a native XML database, or an RDBMS with an XML data type, a very simple XQuery or XSLT script can report on these entities. For example, the XPath expression //persName finds all the named people in a TEI file. Because native XML databases use indexes, you may find that extremely high-performance libraries for processing these standard formats already exist. Your work may be limited to drag-and-drop or copy operations to WebDAV folders.
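The //persName query above can be tried against a tiny TEI-style fragment with nothing more than Python’s standard library. The fragment below is illustrative (the TEI namespace is omitted to keep the example short; real TEI documents declare it, and the query would then need a namespace map).

```python
import xml.etree.ElementTree as ET

# A minimal TEI-style fragment (namespace omitted for brevity)
doc = """<TEI>
  <text><body>
    <p>This proposal was written by <persName>Sue Smith</persName>
       for the <orgName>Johnson Corporation</orgName>.</p>
  </body></text>
</TEI>"""

root = ET.fromstring(doc)
# ElementTree supports a subset of XPath, including descendant searches
people = [p.text for p in root.findall(".//persName")]
print(people)  # ['Sue Smith']
```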

RDF: With Complexity Comes Worldwide Semantics

You have several options if you are using RDF to encode your annotations. You can use vocabularies such as “Friend-of-a-Friend” (FOAF) to describe your annotations, or you can use microformats to annotate your HTML tags. One of the challenges of using RDF is that RDF tags can quickly become bulky and difficult for ordinary XML tools to report on and index. Luckily, RDF has its own query language, called SPARQL, that makes it easy to query not only documents in your own web site, but also other web sites that store RDF. An excellent example of this is DBPedia.org, which extracts RDF assertions from Wikipedia and other sources on a regular basis.

RDF also presents challenges for displaying extracted entities. A new client-side JavaScript tool called rdfQuery (based on jQuery) is being written to make this process easier. The primary author (Jeni Tennison) has posted this code on Google Code. RDF and the use of microformat analyzers, such as the Operator add-on for Firefox, make it easier for anyone to “repurpose” your documents in ways that you might not anticipate.

Leveraging Annotations

The last part of the process is to understand how to leverage these annotations to create true value to your organization. This business value goes far beyond helping your users find the right documents. Adding document metadata gives you new leverage to repurpose documents.

The Future: Corporate Document Mashups

Although this article has discussed automated entity extraction in the context of increasing the precision of corporate search, you gain the potential for much more than that. You now have a robust architecture for using low-cost libraries of interoperable tools to perform highly specific analysis on content. In the future, these will include tools that automatically suggest document taxonomies for classifying documents, or tools that perform statistical database profiling to suggest data element mappings to your data warehouse. UIMA is starting to open the door to automated metadata extraction for many types of entities in your organization, not just documents.

New generations of tools and skills allow software developers to quickly create new application mashups that would formerly have been extremely expensive.

One excellent example of this is the XQuery Wikibook, which is used to create innovative new mashups of data that in the past would have taken weeks worth of coding. For example, what if you would like to see a timeline view of the albums of your favorite rock band? What if you also wanted the cover art of the albums on the timeline view?

With RDF and DBPedia you can write this application with just a few lines of XQuery and SPARQL. Here’s an excellent example.

Now imagine creating a timeline view of a project just from a Microsoft Word document. Timelines help your viewers quickly get a feel for the time ranges discussed in a very long document. After you have data annotations in documents, you will find there are many new ways to mash up your documents that were not possible in the past.

Lower Barriers to Getting on the Semantic Bus

Automated metadata discovery and extraction technologies lower the barrier to getting on the (virtual) Semantic Bus: a place where you interchange information with precise meaning between systems. The concept of the Semantic Bus is not entirely new. People have been discussing the Enterprise Service Bus (ESB) for a long time. But the Semantic Bus is also similar to another bus: The Magic School Bus on PBS. Just like the Magic School Bus, the Semantic Bus takes you to new places that are limited more by your creativity than your IT budget. Have a great ride!
