Text Analytics with AQL and IBM InfoSphere BigInsights

Text analytics is a mechanism for extracting structured data from unstructured or semi-structured text. It works by defining rules, which extraction programs then apply to pull out the relevant information.

This article delves into Annotation Query Language (AQL), the language used for text analytics in IBM InfoSphere BigInsights.

IBM InfoSphere BigInsights is a platform for deriving business insights from large volumes of diverse data. Such data is often ignored because it is nearly impossible to process with traditional DBMS or RDBMS tools. AQL is the query language component of the platform, used to build extractors that pull structured information out of unstructured or semi-structured content.

Components of Text Analytics

  • Input collection formats: An input collection is a document, or a set of documents, that serves as the input text from which we extract information. An input collection must be in one of the following formats:
    • A UTF-8 encoded text file with one of the following extensions:
      • .txt
      • .htm or .html or .xhtml
      • .xml
    • A directory containing UTF-8 encoded text files.
    • An archive file with one of the following extensions, containing UTF-8 encoded text files:
      • .tar
      • .zip
      • .gz
    • A UTF-8 encoded comma-separated (CSV) file.
    • A plain JSON file.
  • Regular expressions: Regular expressions are most commonly used as a text search mechanism. Regular expression builders help construct regular expressions and their sub-expressions.
  • Multilingual support: The text analytics components support the most common languages used for written communication. This support rests on two major techniques: tokenization and part-of-speech analysis.
  • Patterns: The pattern discovery feature groups input contexts that are similar or share a common pattern.
  • Annotation Query Language (AQL): AQL is the primary language used for text analytics. It is used to build extractors, which in turn extract relevant information from unstructured text. Its syntax resembles SQL.
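To make the regular expression component more concrete, here is a minimal sketch in plain Python (not AQL) of what a regex-based extractor does: it scans a document and returns structured records that carry the matched text and its position. The `Span` type, the `extract_regex` function, and the phone-number pattern are all hypothetical names chosen for illustration and are not part of BigInsights.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """A matched region of the input text (hypothetical record type)."""
    begin: int
    end: int
    text: str


def extract_regex(pattern: str, document: str) -> List[Span]:
    """Return one Span per non-overlapping match of pattern in document."""
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, document)]


# Illustrative input: a short unstructured sentence.
doc = "Call 555-1234 or 555-9876 for details."
phones = extract_regex(r"\d{3}-\d{4}", doc)

for span in phones:
    print(span.begin, span.end, span.text)
```

Each returned record plays the role of a row in an extractor's output: the unstructured sentence has been turned into a small table of offsets and values.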

Aspects of Text Analytics

  • Declarative language: A declarative language is used to identify and extract textual information from existing text content. AQL lets us define our own collections of records, called views, that match a specified rule. These views are the main output of any AQL extractor. Views are used to display reports in IBM BigSheets, the reporting and dashboard component of the IBM InfoSphere BigInsights platform.
  • User-defined dictionaries: A dictionary identifies particular terms in the input text so that business insights can be extracted from them. In AQL we can define our own customized dictionaries, which help produce the desired result efficiently.
  • User-defined rules: With the help of patterns and regular expressions we can specify rules that separate the relevant data from a large set of data. Consider the following example. We can name certain keywords that may or may not appear within a given range of one another, for example the three words “Apple”, “Mac” and “Steve.” If all of these words appear within the defined range, it is reasonable to conclude that the text is about Apple, the computer company co-founded by Steve Jobs, and its Mac line. But if the word “Waugh” appears right after the word “Steve” and the other two keywords, “Apple” and “Mac”, are not present, then the text is clearly about the famous Australian cricketer Steve Waugh.
  • Tracking: Text analysis is an iterative process. The rules and user-defined dictionaries need to be modified based on the results that the existing rules produce.
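The “Apple”, “Mac” and “Steve” rule described above can be sketched in a few lines of plain Python. This is only an illustration of the proximity idea, not AQL and not how BigInsights implements it. The function names, the labels it returns, and the default window of 10 tokens are assumptions made for the example.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Very simple tokenizer: lowercase alphabetic words only."""
    return re.findall(r"[A-Za-z]+", text.lower())


def classify_steve(text: str, window: int = 10) -> str:
    """Apply the two illustrative rules from the article to one text."""
    tokens = tokenize(text)
    keywords = ("apple", "mac", "steve", "waugh")
    # Token positions of every keyword occurrence.
    pos: Dict[str, List[int]] = {
        k: [i for i, t in enumerate(tokens) if t == k] for k in keywords
    }

    # Rule 1: "apple" and "mac" both occur within `window` tokens of "steve".
    for s in pos["steve"]:
        near_apple = any(abs(s - a) <= window for a in pos["apple"])
        near_mac = any(abs(s - m) <= window for m in pos["mac"])
        if near_apple and near_mac:
            return "apple_company"

    # Rule 2: "waugh" directly follows "steve", and neither other keyword appears.
    if not pos["apple"] and not pos["mac"]:
        for s in pos["steve"]:
            if s + 1 in pos["waugh"]:
                return "cricketer"

    return "unknown"


print(classify_steve("Steve unveiled the first Mac at Apple in 1984."))
print(classify_steve("Steve Waugh captained Australia."))
```

Tuning the `window` value and adding or removing keywords is exactly the kind of iterative adjustment the tracking step refers to.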

Text Analytics Process

The text analytics process is carried out in the following four steps:

  • Step 1. Collecting and preparing sample data: Any text analytics application is developed with the help of sample data. This sample is a subset of the larger data set we have collected. Depending on the format of the input data, we need to prepare it in one or more of the formats supported by BigInsights. In the example above, we look for the input keywords “Apple”, “Mac” and “Steve.” These input parameters help the application gather data from the websites that mention these keywords.
  • Step 2. Developing and testing the text extractor: BigInsights plugins are available for Eclipse, a widely used Java IDE. With the Eclipse-based wizards we can develop text extractors and test them. The BigInsights information center lists the prerequisite software required to develop text extractors. Broadly, once the BigInsights plugin is installed, the following steps create a text extractor in Eclipse:
    • Create a new BigInsights project.
    • Import the sample data required for testing. The sample data in our example is typically in JSON array format. For testing purposes, we use the BigSheets export facility to export some records (around 10,000) to a CSV file, and then run a Jaql script that converts the CSV file into a delimited file format readable by BigInsights. This new file is then used as the input file for the Eclipse text analytics tooling.
    • Create the artifacts that are required by the application, such as AQL modules, AQL scripts, user defined dictionaries, and so on.
    • Test the code against the sample documents in the input collection. Built-in features such as the annotation explorer and the log pane are used to inspect the results. This testing should be carried out iteratively.
  • Step 3. Publish and deploy: Once we are satisfied with the results produced by the text extractor, the application is ready to be published and deployed. It is usually published to the application catalog of a cluster. To deploy the published application we use the BigInsights web console, logged in with an ID that has administrative privileges.
  • Step 4. Run the text extractor: After the text extractor has been deployed, it can be executed. BigInsights can invoke text extractors through the Java API, Jaql, or BigSheets. The advantage of BigSheets is that it requires no additional coding or scripting, so any business analyst can take up this task.
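The data preparation in step 2, where exported CSV records are converted into a delimited file, can be pictured with the short sketch below. The article performs this conversion with a Jaql script, which is not shown. The Python here is only a stand-in for the idea. The function name `csv_to_delimited`, the choice of a tab as the delimiter, and the sample rows are all assumptions, since the article does not specify the exact target format.

```python
import csv
import io


def csv_to_delimited(csv_text: str, delimiter: str = "\t") -> str:
    """Re-serialize CSV text using a different field delimiter."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        # Quoted CSV fields are parsed into plain values before rewriting.
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export: a header row and one record with an embedded comma.
sample = 'id,text\n1,"Steve, Apple and the Mac"\n'
converted = csv_to_delimited(sample)
print(converted)
```

Parsing with a real CSV reader, rather than splitting on commas, keeps fields that contain commas intact, which matters for free text such as posts or comments.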

AQL Views

AQL views are similar to standard views in a relational database. Each AQL view has a name and consists of rows and columns. In AQL, views are always materialized. All AQL statements operate on views. There is one special view called Document, which at runtime is mapped to one input document at a time from the collection. This view is the starting point for extracting a subset of information from a large set of data.
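As a rough mental model of this, the Python sketch below (not AQL) treats a view as a list of rows and processes the collection one document at a time, with a one-row `Document` table standing in for the special view. The column names `label` and `text`, the `mentions_view` helper, and the sample collection are assumptions made for illustration.

```python
from typing import Dict, Iterator, List

Row = Dict[str, str]


def document_view(collection: Dict[str, str]) -> Iterator[List[Row]]:
    """Yield a one-row 'Document' view for each input document in turn."""
    for label, text in collection.items():
        yield [{"label": label, "text": text}]


def mentions_view(document: List[Row], keyword: str) -> List[Row]:
    """Derive a view: one row per sentence that mentions the keyword."""
    rows: List[Row] = []
    for d in document:
        for sentence in d["text"].split("."):
            if keyword.lower() in sentence.lower():
                rows.append({"label": d["label"], "match": sentence.strip()})
    return rows


# Hypothetical two-document collection.
collection = {
    "a.txt": "Steve founded Apple. He liked design.",
    "b.txt": "Nothing relevant here.",
}
results = [mentions_view(doc, "apple") for doc in document_view(collection)]
print(results)
```

The derived view is computed from the `Document` rows in the same way a relational view is computed from a base table, which is the analogy the paragraph above draws.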

Summary

Text analytics is at the heart of many analytics applications, so it is important to learn the tools and frameworks required to develop text analytics applications. IBM InfoSphere BigInsights, together with AQL, is one of the strongest tools available for text analytics.

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture and design, Java/J2EE, open source, and big data technologies.
