Text analytics is a mechanism for extracting structured data from unstructured or semi-structured text. It works by defining rules, which extraction programs then apply to pull out the relevant information.
This article looks at Annotation Query Language, or AQL, which is used for text analytics with IBM InfoSphere BigInsights.
IBM InfoSphere BigInsights is a platform for drawing business insights from very large volumes of diverse data. Such data is often ignored because it is almost impossible to process with traditional DBMS or RDBMS tools. AQL is the query language used in BigInsights to build extractors that pull structured information from unstructured or semi-structured content.
Components of Text Analytics
- Input collection formats – An input collection is a document, or a set of documents, that serves as the input text from which information is extracted. An input collection must be in one of the following formats:
- UTF-8 encoded text file having any of the following extensions
- .htm or .html or .xhtml
- A directory containing UTF-8 encoded text files.
- An archive file with the following extensions that contains UTF-8 encoded text files
- UTF-8 encoded comma separated file.
- A plain JSON file.
- Regular expressions – Regular expressions are most commonly used as a text search mechanism. Regular expression builders help us construct regular expressions and subexpressions.
- Multilingual support – Text analytics components support the most common languages used for written communication. Text analytics is based on two major techniques: tokenization and part-of-speech analysis.
- Patterns – The pattern discovery feature groups input contexts that are similar or have a common pattern.
- Annotation Query Language or AQL – AQL is the primary language used for text analytics. It is used to build extractors, which then extract relevant information from unstructured text. Its syntax resembles SQL.
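To make the regular expression component more concrete, here is a minimal sketch in Python rather than AQL. It shows a pattern being used as a text search mechanism that returns one structured record per match. The phone number pattern, the sample sentence, and the function name `extract_matches` are all made up for illustration and are not part of BigInsights.

```python
import re

# Hypothetical pattern: phone numbers written as 555-123-4567.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def extract_matches(pattern, text):
    """Return one record per match: begin offset, end offset, matched text."""
    return [
        {"begin": m.start(), "end": m.end(), "match": m.group()}
        for m in pattern.finditer(text)
    ]

sample = "Call 555-123-4567 or 555-987-6543 after 5 pm."
records = extract_matches(PHONE, sample)
for record in records:
    print(record)
```

Each record carries the offsets of the match as well as its text, so later rules can reason about where in the document a match occurred.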
Aspects of Text Analytics
- Declarative language – A declarative language is used to identify and extract textual information from existing text content. AQL lets us define views, which are collections of records that match a specified rule. These views are the main output of any AQL extractor. They are used to display reports in IBM BigSheets, the reporting and dashboard component of the IBM InfoSphere BigInsights platform.
- User defined dictionaries – A dictionary identifies specific terms in the input text so that business insights can be extracted from them. In AQL we can define custom dictionaries, which help us get the desired result efficiently.
- User defined rules – With the help of patterns and regular expressions we can specify rules that separate the relevant data from a large data set.
Let's consider the following example. We can specify keywords that may or may not appear within a given range of one another, for example the three words "Apple", "Mac" and "Steve". If all three appear within the defined range, the text is very likely about Apple, the computer company co-founded by Steve Jobs, and its Mac computers. But if the word "Waugh" appears right after "Steve" and the other two keywords, "Apple" and "Mac", are absent, the text is clearly about the famous Australian cricketer Steve Waugh.
- Tracking – Text analysis is an iterative process. The rules and user defined dictionaries need to be modified based on the results that the existing rules produce.
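The "Apple", "Mac" and "Steve" rule described above can be sketched in a few lines. This is a Python illustration of the idea, not AQL. The function names, the 20-token window, the simple tokenizer, and the reading of "within a defined range" as "each within `window` tokens of Steve" are all assumptions made for this example.

```python
import re

def tokenize(text):
    # Simplified word tokenizer; real tokenization is language-aware.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def classify(text, window=20):
    """Apply two hand-written rules and return a label for the text."""
    tokens = tokenize(text)
    # A tiny "dictionary": where does each keyword occur?
    positions = {
        word: [i for i, t in enumerate(tokens) if t == word]
        for word in ("apple", "mac", "steve")
    }

    # Rule 1: "Apple" and "Mac" both occur within `window` tokens of "Steve".
    for s in positions["steve"]:
        near_apple = any(abs(s - a) <= window for a in positions["apple"])
        near_mac = any(abs(s - m) <= window for m in positions["mac"])
        if near_apple and near_mac:
            return "Apple (company)"

    # Rule 2: "Waugh" directly follows "Steve" and the other keywords are absent.
    if not positions["apple"] and not positions["mac"]:
        for s in positions["steve"]:
            if s + 1 < len(tokens) and tokens[s + 1] == "waugh":
                return "Steve Waugh (cricketer)"

    return "unknown"

print(classify("Steve Jobs unveiled the first Mac at Apple in 1984."))
print(classify("Steve Waugh captained Australia in Test cricket."))
```

Rules like these are rarely right the first time, which is why the tracking step matters: running them on sample text shows which window sizes and dictionary entries need adjusting.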
Text Analytics Process
The text analytics process is carried out in the following four steps:
- Step 1. Collect and prepare sample data – Any application based on text analytics is developed with the help of sample data. The sample is created by taking a subset of the larger data set that we have collected. Depending on the format of the input data, we need to prepare it in one or more of the formats supported by BigInsights. In the example mentioned above, we look for the input keywords "Apple", "Mac" and "Steve". These keywords help the application gather data from websites that mention them.
- Step 2. Develop and test the text extractor – BigInsights plugins are available for Eclipse, the most commonly used Java IDE. Using the Eclipse-based wizards we can easily develop text extractors and test them. The BigInsights information center lists the prerequisite software required to develop text extractors. Broadly, once the BigInsights plugin is installed, the following steps need to be carried out to create a text extractor in Eclipse:
- Create a new BigInsights project.
- Import the sample data required for testing. The sample data in our example is typically in a JSON array format. For testing purposes, let us use the BigSheets export facility to export some records (around 10,000) to a CSV file. Then we run a Jaql script that converts the CSV file into a delimited file format readable by BigInsights. This new file is then used as the input file for the Eclipse tooling.
- Create the artifacts that are required by the application, such as AQL modules, AQL scripts, user defined dictionaries, and so on.
- Test the code against the sample documents in the input collection. Built-in features such as the annotation explorer and the log pane are used to inspect the results. This testing should be carried out iteratively.
- Step 3. Publish and deploy – The application is ready to be published and deployed once we are satisfied with the results produced by the text extractor. Usually it is published to the application catalog of a cluster. To deploy the published application we use the BigInsights web console, logged in with an ID that has administrative privileges.
- Step 4. Run the text extractor – After the text extractor has been deployed successfully, it can be executed. BigInsights can invoke text extractors through the Java API, Jaql, or BigSheets. The advantage of BigSheets is that it requires no additional coding or scripting, so a business analyst can take up this task.
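The data preparation in Step 2 converts an exported CSV file into a delimited file. The article does this with a Jaql script, which is not shown. The sketch below is a Python stand-in for the same idea, not the actual script. The tab delimiter, the column names, and the sample rows are assumptions made for illustration.

```python
import csv
import io

def csv_to_delimited(csv_text, delimiter="\t"):
    """Re-encode CSV text using a different field delimiter."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Hypothetical export: an id column and a free-text column.
sample_csv = 'id,text\n1,"Steve Jobs, Apple and the Mac"\n2,Steve Waugh\n'
converted = csv_to_delimited(sample_csv)
print(converted)
```

Using the `csv` module rather than a plain string replace keeps commas that appear inside quoted text fields intact, which matters for free-form text such as web content.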
AQL views are similar to standard views in a relational database. Each AQL view has a name and consists of rows and columns. In AQL, views are always materialized, and all AQL statements operate on views. There is one special view called Document. At runtime, this view is mapped to one input document at a time from the collection. It is the starting point for extracting a relevant subset from a large set of data.
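The relationship between the Document view and a derived view can be modeled roughly as follows. This is a Python sketch of the concept, not AQL and not the BigInsights runtime. The column names `label` and `text`, the function names, and the capitalized-word rule are assumptions chosen for illustration.

```python
import re
from typing import Dict, List

Row = Dict[str, str]

def document_view(label: str, text: str) -> List[Row]:
    # The special view: a single row describing the current input document.
    return [{"label": label, "text": text}]

def capitalized_word_view(document: List[Row]) -> List[Row]:
    # A derived view: one row per capitalized word found in the document.
    rows = []
    for doc in document:
        for m in re.finditer(r"\b[A-Z][a-z]+\b", doc["text"]):
            rows.append({"label": doc["label"], "word": m.group()})
    return rows

def run(collection: Dict[str, str]) -> List[Row]:
    # Map the Document view to one input document at a time.
    output = []
    for label, text in collection.items():
        output.extend(capitalized_word_view(document_view(label, text)))
    return output

collection = {"a.txt": "Steve bought a Mac.", "b.txt": "no names here"}
rows = run(collection)
for row in rows:
    print(row)
```

Every derived view is a named table of rows and columns computed from other views, with the Document view at the root.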
Text analytics is at the heart of many analytics applications, so it is important to learn the tools and frameworks required to develop text analytics applications. IBM InfoSphere BigInsights is one of the tools well suited to this work.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture and design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.