Automate Metadata Extraction for Corporate Search and Mashups (cont'd)

Extending Annotations to Images

The general concept of annotations is not restricted to text; annotations also apply to sound, images, and video. One example is the tagging of photos. UIMA components can be designed to recognize faces in a picture, and each detected face can be annotated with a square region around it. After the faces have been detected, another pattern-matching process can suggest possible "face matches" for each region by comparing it against a known library of faces.
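
To make the idea of a region annotation concrete, here is a minimal sketch of how such annotations might be represented. It is not UIMA's data model; the class names (FaceRegion, ImageAnnotations), the field names, and the sample file name are all hypothetical, and no real face detector or matcher is called.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class FaceRegion:
    """A square region of an image flagged as a face (hypothetical structure)."""
    x: int      # left edge of the square, in pixels
    y: int      # top edge of the square, in pixels
    size: int   # side length of the square, in pixels
    # Names suggested by a later matching step against a face library.
    candidates: List[str] = field(default_factory=list)


@dataclass
class ImageAnnotations:
    """All face annotations attached to one image file."""
    image_path: str
    faces: List[FaceRegion] = field(default_factory=list)


# Step 1: a detector (not shown) reports two square regions.
photo = ImageAnnotations("team-photo.jpg")
photo.faces.append(FaceRegion(x=40, y=32, size=96))
photo.faces.append(FaceRegion(x=210, y=28, size=88))

# Step 2: a matcher (not shown) suggests names for the first region.
photo.faces[0].candidates.extend(["Person A", "Person B"])

# The annotations are plain data, so they can be stored apart from the image.
record = asdict(photo)
```

Keeping the regions and the suggested matches as separate fields mirrors the two-stage process described above: detection first, matching afterward.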

Storing Annotations and Metadata

Storing annotations can be a problem: if you have binary files such as images, you may have to create a separate file or database record for each document to hold that document's metadata. Search tools then search the metadata records to help you find the documents you are looking for.

The author’s personal preference is to store as much of the documents and document metadata as possible in some type of XML store. Databases such as the MarkLogic native XML database or the open source eXist database are now widely used in enterprise-class solutions. These systems keep the documents in their original markup format, yet they are designed to build indexes automatically so that searches are fast.

XML: Many Standards for Industry-Specific Tasks

If you are lucky enough to have pure XML documents as both an input and an output format, your job can be very easy. XML document encoding standards such as DITA, TEI, and DocBook already have well-documented conventions for how key entities such as people, places, terms, and dates should be encoded. If you work in a specific area such as the management of historical documents, your colleagues might already be using TEI documents and have shared tools ready for you to use.

With these XML standards, annotations can easily be added without disrupting the use of the documents by other systems. If you store these documents in a native XML database, or in an RDBMS with an XML data type, a very simple XQuery or XSLT can report on these entities. For example, the XPath expression //persName finds all the named people in a TEI file. Because native XML databases index their content, such queries can run quickly, and you may find that high-performance libraries for processing these standard formats already exist. Your work may be limited to drag-and-drop or copy operations to WebDAV folders.
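
As an illustration of the //persName lookup, here is a small sketch that uses only Python's standard library rather than an XML database. The sample document and the function name person_names are made up for this example, and the sample assumes the TEI P5 namespace, so the path is written with a namespace prefix instead of the bare //persName.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents; older TEI files may have no namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny, invented TEI fragment with two people and one place.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p><persName>Ada Lovelace</persName> wrote to
         <persName>Charles Babbage</persName> from
         <placeName>London</placeName>.</p>
    </body>
  </text>
</TEI>"""


def person_names(tei_xml):
    """Return the text of every persName element, in document order."""
    root = ET.fromstring(tei_xml)
    # ".//tei:persName" is ElementTree's equivalent of the XPath //persName.
    found = root.findall(".//tei:persName", TEI_NS)
    return ["".join(el.itertext()).strip() for el in found]


names = person_names(SAMPLE_TEI)
print(names)
```

The same path with placeName in place of persName would report the places, which is the sense in which one simple query pattern can cover each kind of encoded entity.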

RDF: With Complexity Comes Worldwide Semantics

You have several options if you are using RDF to encode your annotations. You can use vocabularies such as Friend-of-a-Friend (FOAF) to describe your annotations, or you can use microformats to annotate your HTML tags. One of the challenges of using RDF is that RDF markup can quickly become bulky and difficult for ordinary XML tools to report on and index. Luckily, RDF has its own query language, SPARQL, that makes it easy to query not only the documents on your own web site but also other web sites that store RDF. An excellent example of this is DBPedia.org. DBPedia extracts RDF assertions from Wikipedia and other sources on a regular basis.

RDF also presents challenges for displaying extracted entities. A new client-side JavaScript tool called rdfQuery (based on jQuery) is being written to make this process easier. The primary author, Jeni Tennison, has posted the code on Google Code. RDF and microformat analyzers, such as the Operator add-on for Firefox, make it easier for anyone to "repurpose" your documents in ways that you might not anticipate.

Leveraging Annotations

The last part of the process is to understand how to leverage these annotations to create true value for your organization. This business value goes far beyond helping your users find the right documents. Adding document metadata gives you new leverage to repurpose documents.

The Future: Corporate Document Mashups

Although this article has discussed automated entity extraction in the context of increasing the precision of corporate search, you stand to gain much more than that. You now have a robust architecture for using low-cost libraries of interoperable tools to perform highly specific analysis on components. In the future, these will include tools that automatically suggest taxonomies for classifying documents and tools that perform statistical database profiling to suggest data element mappings to your data warehouse. UIMA is starting to open the door to automated metadata extraction for many types of entities in your organization, not just documents.

New generations of tools and skills allow software developers to quickly create new application mashups that would formerly have been extremely expensive.

One excellent example of this is the XQuery Wikibook, which shows how to create innovative mashups of data that in the past would have taken weeks' worth of coding. For example, what if you would like to see a timeline view of the albums of your favorite rock band? What if you also wanted the cover art of the albums on the timeline view?

With RDF and DBPedia you can write this application with just a few lines of XQuery and SPARQL. Here’s an excellent example.
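
The linked example is not reproduced here. As a separate, rough sketch of the kind of SPARQL query such a timeline might rest on, the Python below builds a query for a band's albums ordered by release date and encodes it as a request URL for DBPedia's public endpoint. Several things are assumptions rather than details from this article: the function names, the choice of band, the dbo:artist and dbo:releaseDate properties, and the format parameter. The code only constructs the URL; it does not contact the server, and cover art is left out.

```python
from urllib.parse import urlencode

# DBPedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def album_timeline_query(band_resource):
    """Build a SPARQL query listing a band's albums, oldest first.

    The property names are assumptions about DBPedia's ontology and may
    need adjusting against the live data.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?album ?title ?released WHERE {{
  ?album a dbo:Album ;
         dbo:artist <{band_resource}> ;
         dbo:releaseDate ?released ;
         rdfs:label ?title .
  FILTER (lang(?title) = "en")
}}
ORDER BY ?released
""".strip()


def query_url(sparql):
    """Encode the query as a GET request URL asking for JSON results."""
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = album_timeline_query("http://dbpedia.org/resource/The_Beatles")
url = query_url(query)
print(url)
```

Each result row would carry a title and a release date, which is the minimum a timeline widget needs to place an album on an axis.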

Now imagine creating a timeline view of a project just from a Microsoft Word document. Timelines can help your viewers quickly get a feel for the time ranges discussed in any very long document. After you have date annotations in your documents, you will find there are many new ways to mash up your documents that were not possible in the past.
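
A minimal sketch of that idea follows. It assumes the text has already been extracted from the Word document and that dates appear in ISO form (YYYY-MM-DD); a real annotator would recognize many more date formats. The function name extract_timeline and the sample text are invented for illustration.

```python
import re
from datetime import date

# Matches ISO-style dates such as 2009-03-15 (an assumed input format).
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_timeline(text):
    """Return (date, sentence) pairs sorted chronologically."""
    events = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in DATE_PATTERN.finditer(sentence):
            year, month, day = map(int, match.groups())
            try:
                when = date(year, month, day)
            except ValueError:
                continue  # skip impossible dates such as 2009-13-45
            events.append((when, sentence.strip()))
    return sorted(events)


document_text = (
    "Testing finishes on 2009-03-15. "
    "The project kicks off on 2009-01-05. "
    "Design review is set for 2009-02-10."
)

timeline = extract_timeline(document_text)
for when, sentence in timeline:
    print(when.isoformat(), "|", sentence)
```

Once the dates are annotated, sorting them is trivial, and the resulting list can be handed to whatever timeline display you prefer.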

Lower Barriers to Getting on the Semantic Bus

Automated metadata discovery and metadata extraction technologies lower the barrier to getting on the (virtual) Semantic Bus. A Semantic Bus is a place where systems interchange information that has precise meaning. The concept is not entirely new; people have been discussing the Enterprise Service Bus (ESB) for a long time. But the Semantic Bus is also similar to another bus: The Magic School Bus on PBS. Just like the Magic School Bus, the Semantic Bus takes you to new places that are limited more by your creativity than by your IT budget. Have a great ride!
Dan McCreary is a data strategy consultant living in Minneapolis, Minnesota. Dan helps organizations create enterprisewide metadata strategies. He is interested in XForms, semantic web technologies, and declarative systems. Contact Dan at dan@danmccreary.com.