Automate Metadata Extraction for Corporate Search and Mashups

Learn how to extract document semantics with Apache UIMA.


There are some exciting developments in automated metadata extraction, with implications for better semantic search and corporate mashups. Advanced open source tools created by linguists to recognize the meaning of words in documents are now becoming an order of magnitude more cost effective to use. The arrival of the Apache Unstructured Information Management Architecture (UIMA, pronounced "you-ee-ma") framework makes these tools accessible to non-programmers. Adding semantically precise metadata to documents opens the door to new semantic web applications, including better document search and document mashups.

Web Search Drives Expectations

Many people wonder at the power and precision of web search engines like Google. But if you use the Google search engine to find a Microsoft Word document that is inside a corporate web site, you might have less than stellar results. There is a simple reason for this: most internal documents don't have the rich web linking that public web sites have. Search engines such as Google use the number of links that point to a document to help rank the search results. Without those links, the documents are not likely to be found.

The Challenges with Keyword Search

Most internal corporate search engines use "keyword search" technology. It involves finding all the important words in a document and putting them in a central index. When you search for a set of keywords, the engine finds all documents that contain those exact words.



The problem is that when people search by keyword, they frequently don't use the same words that appear in the document. For example, you might search for "Sue Smith" when the document was written by "Susan Smith." Because "Sue" is not an exact match for "Susan," the keyword search fails.
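
The failure described above can be reproduced with a minimal sketch of a keyword index. This is an illustration only, not the implementation of any particular search engine; the function names and the two sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip('.,;:"')].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents that contain every query word exactly."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
docs = {
    "proposal.doc": "This proposal was written by Susan Smith for the Johnson Corporation.",
    "memo.doc": "Our client is going to sue your company.",
}
index = build_index(docs)

hits_sue = keyword_search(index, "Sue Smith")      # no document has both words
hits_susan = keyword_search(index, "Susan Smith")  # matches the proposal
```

Note that "sue" does appear in the index, but only through the memo that uses it as a verb, so the exact-match query for "Sue Smith" returns nothing.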

Enter Semantic Search

How do you program your search engine to understand that "Sue" and "Susan" are semantically close? There are three key steps to solving this problem. First, you need to look at the structure of a sentence and infer the part of speech (POS) of each word. This is called syntactic analysis. When you read the sentence:

Our client is going to sue your company.

You can infer from the syntax of the sentence that the word "sue" is being used as a verb. You would not infer it is a person's name. On the other hand, if you read the following sentence, it is obvious that Sue Smith is a person's name:

This proposal was written by Sue Smith for the Johnson Corporation.

The process of identifying the part of speech of each word in a text is called POS tagging. The output of a typical POS tagger for these sentences is shown below.

<AnnotationResult>
  <Sentence>Our client is going to sue your company.</Sentence>
  <token POS="pp$">Our</token>
  <token POS="nn">client</token>
  <token POS="bez">is</token>
  <token POS="vbg">going</token>
  <token POS="to">to</token>
  <token POS="vb">sue</token>
  <token POS="pp$">your</token>
  <token POS="nn">company</token>
  <token POS=".">.</token>
</AnnotationResult>

Note that the word "sue" is identified as a verb (POS="vb"). (Here is a full list of parts-of-speech codes.)

Now let's see how the second sentence comes through the POS tagger.

<AnnotationResult>
  <Sentence>This proposal was written by Sue Smith for the Johnson Corporation.</Sentence>
  <token POS="dt">This</token>
  <token POS="nn">proposal</token>
  <token POS="bedz">was</token>
  <token POS="vbn">written</token>
  <token POS="in">by</token>
  <token POS="np">Sue</token>
  <token POS="np">Smith</token>
  <token POS="in">for</token>
  <token POS="at">the</token>
  <token POS="np">Johnson</token>
  <token POS="nn">Corporation</token>
  <token POS=".">.</token>
</AnnotationResult>

Note that the word "Sue" is identified as POS="np" for "proper noun or part of name phrase." When you find two adjacent NP tokens in a text, you can use an algorithm to look them up in a table. Then, you can use the company directory to find that "Sue" is a nickname for "Susan" and that "Sue Smith" is probably "Susan Smith," the employee with EmpID="1234747". You can then mark this as a person entity:

<AnnotationResult>
  <Sentence>This proposal was written by Sue Smith for the Johnson Corporation.</Sentence>
  <token POS="dt">This</token>
  <token POS="nn">proposal</token>
  <token POS="bedz">was</token>
  <token POS="vbn">written</token>
  <token POS="in">by</token>
  <person EmpID="1234747">
    <token POS="np">Sue</token>
    <token POS="np">Smith</token>
  </person>
  <token POS="in">for</token>
  <token POS="at">the</token>
  <token POS="np">Johnson</token>
  <token POS="nn">Corporation</token>
  <token POS=".">.</token>
</AnnotationResult>
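
The lookup step that produces this markup can be sketched in a few lines of Python using the standard xml.etree.ElementTree module. This is a rough illustration under stated assumptions, not UIMA code: the NICKNAMES and DIRECTORY tables are hypothetical stand-ins for a real nickname list and company directory, and the input is a shortened version of the tagged sentence.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables; a real system would query the company directory.
NICKNAMES = {"Sue": "Susan"}
DIRECTORY = {("Susan", "Smith"): "1234747"}

# Shortened tagger output for the example sentence.
TAGGED = (
    '<AnnotationResult>'
    '<token POS="in">by</token>'
    '<token POS="np">Sue</token>'
    '<token POS="np">Smith</token>'
    '<token POS="in">for</token>'
    '</AnnotationResult>'
)

def is_proper_noun(element):
    return element.tag == "token" and element.get("POS") == "np"

def mark_people(root):
    """Wrap adjacent proper-noun pairs found in the directory in <person>."""
    children = list(root)  # snapshot, so edits to root do not disturb the scan
    i = 0
    while i < len(children) - 1:
        first, last = children[i], children[i + 1]
        if is_proper_noun(first) and is_proper_noun(last):
            given = NICKNAMES.get(first.text, first.text)  # Sue -> Susan
            emp_id = DIRECTORY.get((given, last.text))
            if emp_id is not None:
                person = ET.Element("person", {"EmpID": emp_id})
                position = list(root).index(first)
                root.remove(first)
                root.remove(last)
                person.extend([first, last])
                root.insert(position, person)
                i += 2  # skip both tokens of the matched name
                continue
        i += 1
    return root

root = mark_people(ET.fromstring(TAGGED))
marked = ET.tostring(root, encoding="unicode")
```

Pairs that are not found in the directory are left untouched, which keeps the step safe to run on text that mentions people outside the company.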

You can repeat this filter process to find all organizations such as the "Johnson Corporation," so the result has the following additional markup:

<org CompanyID="347474">
  <token POS="np">Johnson</token>
  <token POS="nn">Corporation</token>
</org>

With each step, you are using information from a prior analysis to make the next step of analysis easier. Such incremental enrichment of data streams is at the heart of almost all text-mining processes today. You might be wondering how complex it is to identify all these parts.
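
To make the idea of incremental enrichment concrete, here is a small sketch in which each stage reads what earlier stages produced and adds its own layer. It does not use the UIMA API; the stage names are invented for the example, and a crude capitalization rule stands in for a real POS tagger.

```python
def tokenize(doc):
    """Stage 1: split the raw text into tokens."""
    doc["tokens"] = doc["text"].rstrip(".").split()
    return doc

def tag_capitalized(doc):
    """Stage 2: stand-in for a POS tagger. Flags capitalized tokens that do
    not start the sentence as proper-noun candidates."""
    doc["candidates"] = [
        i for i, tok in enumerate(doc["tokens"]) if i > 0 and tok[0].isupper()
    ]
    return doc

def group_names(doc):
    """Stage 3: join runs of adjacent candidates into multi-word names."""
    names, run = [], []
    for i in doc["candidates"]:
        if run and i != run[-1] + 1:
            names.append(" ".join(doc["tokens"][j] for j in run))
            run = []
        run.append(i)
    if run:
        names.append(" ".join(doc["tokens"][j] for j in run))
    doc["names"] = names
    return doc

# Each stage relies only on the keys written by the stages before it.
PIPELINE = [tokenize, tag_capitalized, group_names]

doc = {"text": "This proposal was written by Sue Smith for the Johnson Corporation."}
for stage in PIPELINE:
    doc = stage(doc)
```

Because every stage only adds keys, a later stage (for example, a directory lookup) can be appended to the list without changing the earlier ones.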

The annotation results may be difficult to read, but by storing this additional data in a new class of databases (called native XML databases), you can quickly search millions of documents in milliseconds.
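
As a rough illustration of why the extra markup pays off, the sketch below asks which documents mention a given employee. It uses the limited XPath support in Python's xml.etree.ElementTree over an in-memory dictionary as a stand-in for a native XML database, which would index the annotations rather than parse every document per query. The file names and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated documents, shortened to the relevant elements.
ANNOTATED = {
    "proposal.xml": (
        '<AnnotationResult>'
        '<person EmpID="1234747">'
        '<token POS="np">Sue</token><token POS="np">Smith</token>'
        '</person>'
        '<org CompanyID="347474">'
        '<token POS="np">Johnson</token><token POS="nn">Corporation</token>'
        '</org>'
        '</AnnotationResult>'
    ),
    "memo.xml": (
        '<AnnotationResult>'
        '<token POS="vb">sue</token>'
        '</AnnotationResult>'
    ),
}

def docs_mentioning_employee(annotated, emp_id):
    """Return the names of documents that contain a <person> with this EmpID."""
    query = f".//person[@EmpID='{emp_id}']"
    return sorted(
        name
        for name, xml_text in annotated.items()
        if ET.fromstring(xml_text).find(query) is not None
    )

matches = docs_mentioning_employee(ANNOTATED, "1234747")
```

The query matches on the employee ID rather than on the text, so it finds the proposal whether the author is written as "Sue Smith" or "Susan Smith," and it ignores the memo in which "sue" is a verb.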

Faceted Search

Figure 1. Faceted Search: Here is an example of faceted search from The Office of the Historian.

When you search for a document and get a "hit," that document is sometimes the exact one you were looking for (think of Google's "I'm Feeling Lucky" feature). More often than not, however, the hit is not the document you were seeking. One nice feature of documents with enriched metadata is the ability to provide other search suggestions based on various facets of the document. This is called faceted search. For example, the document viewer might show a series of links that suggest "Other documents authored by Sue Smith," or "Other documents about the Johnson Corporation," or even the most general, "Find similar documents."

One example of faceted search is taken from the U.S. State Department History web site (see Figure 1). When you view a document, all the people and terms in that document are listed in the right-hand margin. If you click any of these items, you’ll see the official description of the person or the definition of the term.
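
The "other documents" links described above can be derived directly from entity annotations. The following sketch assumes the annotations have already been reduced to a dictionary of facet values per document; the document names, facet names, and values are all made up for the example.

```python
from collections import defaultdict

# Hypothetical entity annotations per document, keyed by facet name.
DOCS = {
    "proposal.doc": {"author": ["Susan Smith"], "org": ["Johnson Corporation"]},
    "invoice.doc": {"author": ["Susan Smith"], "org": ["Acme Ltd"]},
    "contract.doc": {"author": ["Raj Patel"], "org": ["Johnson Corporation"]},
}

def build_facets(docs):
    """Index documents by facet and value: facets[facet][value] -> doc ids."""
    facets = defaultdict(lambda: defaultdict(set))
    for doc_id, annotations in docs.items():
        for facet, values in annotations.items():
            for value in values:
                facets[facet][value].add(doc_id)
    return facets

def suggestions(docs, facets, doc_id):
    """For the document being viewed, list other documents per shared facet value."""
    result = {}
    for facet, values in docs[doc_id].items():
        for value in values:
            others = sorted(facets[facet][value] - {doc_id})
            if others:
                result[(facet, value)] = others
    return result

facets = build_facets(DOCS)
links = suggestions(DOCS, facets, "proposal.doc")
```

Each key in the result corresponds to one suggestion link, such as "Other documents authored by Susan Smith," and the value is the list of documents that link would show.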

Adding Document Classification Annotations

Document similarity or "semantic nearness" is a very complex concept. Classification requires you to compare all the entities and words in a document with those in every other document in the collection, and then compute a mathematical weight for each of those documents so that they can be sorted by how closely their content matches. The weights might change every time a single document is added or removed. However, mathematical models based on standard algorithms can perform this calculation for you. One simple approach is to use the same annotation methods to classify a document into one or more mathematical groupings.
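
The article does not name a specific algorithm, so the sketch below uses one common choice, cosine similarity over word counts, purely as an illustration of how such a weight can be computed and used to sort a collection. The sample texts and function names are assumptions; a production system would typically weight entities and rare terms more heavily.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(word.strip(".,").lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity of two count vectors: 0.0 (no overlap) to 1.0 (same direction)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank_similar(docs, doc_id):
    """Sort all other documents by descending similarity to doc_id."""
    target = vectorize(docs[doc_id])
    scores = [
        (cosine(target, vectorize(text)), other)
        for other, text in docs.items()
        if other != doc_id
    ]
    return [other for score, other in sorted(scores, key=lambda p: (-p[0], p[1]))]

# Hypothetical mini-collection.
docs = {
    "a": "Proposal for the Johnson Corporation by Susan Smith",
    "b": "Invoice for the Johnson Corporation",
    "c": "Our client is going to sue your company",
}
ranked = rank_similar(docs, "a")
```

Because the scores are recomputed from the current collection on each call, adding or removing a document changes the ranking automatically, at the cost of comparing against every document.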


