Automate and Speed Up Web Searches with Bots

A bot is a computer program that surfs the web in much the same way as a human user; however, bots are automated. A bot can access a large number of sites much more quickly than a person. You might be surprised at how many uses there are for bots, both good and bad. Some positive uses for bots include:

  • Researching and gathering information
  • Keeping search engines up to date
  • Monitoring web sites looking for bad links and other problems

Unfortunately, bots are also commonly used for less positive purposes, such as:

  • Posting spam comments to forums and blogs
  • Harvesting e-mail addresses for spammers
  • Finding web sites that have known security flaws

HTML-based bots have direct access and close ties to the HTML that makes up web sites, and thus often break when changes occur to the web sites they were designed to access. Therefore, modern sites whose designers want to allow programmatic access typically offer web services that provide specific data on request, reducing or eliminating the need to parse the HTML to extract content. As such web services become more common, they’re also reducing the need for HTML-based bots.

However, web services don’t always expose exactly the data you need, so HTML bots remain useful, particularly as the data-seeking and extracting portion of hybrid bots. A hybrid bot uses both web services and traditional HTML techniques to find and extract data. Because bots often have similar needs, several companies have released generic APIs that encapsulate common bot services. For example, the Yahoo Search API is a great choice for hybrid bots. You can use the API to locate web pages that match your criteria, and then use a traditional HTML bot to scan those pages for whatever content you wish.

Introducing the “Year-Born” Bot
In this article you’ll see how to create a hybrid bot that scans pages for information about a famous person of your choice. Specifically, this bot attempts to obtain the person’s birth year. The bot works by first using the Yahoo Search API to call Yahoo’s web services to find web sites that contain the person’s name. The bot then loops through the page hit list, scanning the HTML of each page looking for the person’s birth year.

This program executes these tasks in three distinct phases. In phase one, the application submits the name of the famous person to Yahoo, obtaining the results by calling the search function of the YahooSearch class.

   search = new YahooSearch();

   if (YearBornBot.LOG) {
      System.out.println(
         "Getting search results from Yahoo.");
   }
   Collection<URL> c = search.search(name);
   int i = 0;

In phase two, it checks each URL in the collection returned from Yahoo. Because some of the returned URLs may no longer be valid, the application uses a try/catch block within the loop to catch errors. It skips any invalid URLs, and the loop continues to the next iteration.

The program passes each valid URL found to the checkURL method, which searches for birth years. I’ll cover the checkURL method in more detail later in this article.

   if (YearBornBot.LOG) {
      System.out.println("Scanning URLs from Yahoo.");
   }

   for (URL u : c) {
      try {
         i++;
         if (YearBornBot.LOG) {
            System.out.println(
               "Scanning URL: " + i + "/" + c.size()
               + ":" + u);
         }
         checkURL(u);
      } catch (IOException e) {
         // Skip URLs that cannot be read and continue with the next one.
      }
   }

After processing all the URLs, phase three of the application calls the getResult function to determine which birth year is the famous person’s actual birth year.

   int resultYear = getResult();
   if (resultYear == -1)
   {
     System.out.println(
       "Could not determine when " + name +
       " was born.");
   } else
   {
     System.out.println(name + " was born in "
         + resultYear);
   }

In some cases, the program may not be able to determine the person’s birth year. When that result occurs, the application displays a message to inform the user that it was unable to determine the birth year.
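
The skip-and-continue behavior of phase two can be tried in isolation. The following is a minimal sketch, not the article's actual code: the ScanLoopSketch class and the PageChecker interface are hypothetical stand-ins for the bot and its checkURL method, and the failure is simulated so that the example runs without network access.

```java
import java.io.IOException;
import java.util.List;

/** Sketch of the phase-two loop: scan every URL, skipping any that fail. */
class ScanLoopSketch {

    /** Hypothetical stand-in for the article's checkURL method. */
    interface PageChecker {
        void check(String url) throws IOException;
    }

    /** Runs the checker on each URL and returns how many were scanned without error. */
    static int scanAll(List<String> urls, PageChecker checker) {
        int scanned = 0;
        for (String url : urls) {
            try {
                checker.check(url);
                scanned++;
            } catch (IOException e) {
                // A dead or unreachable page is not fatal: report it and move on.
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.com/a",
            "http://example.com/dead",
            "http://example.com/b");
        // The checker below only simulates a failure for one URL.
        int ok = scanAll(urls, url -> {
            if (url.endsWith("/dead")) {
                throw new IOException("simulated failure");
            }
        });
        System.out.println(ok + " of " + urls.size() + " URLs scanned");
    }
}
```

Keeping the try/catch inside the loop body is what lets one bad URL fail without ending the whole scan.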

Getting the Code
The Year-Born bot uses two general-purpose classes introduced in my book “HTTP Programming Recipes for Java Bots”:

  • ParseHTML—An HTML parser.
  • URLUtility—Utilities for building URL responses.

You can download those classes from my web site. The download includes a JAR file that contains the two classes, as well as complete source code. You’ll also find numerous other bot programming examples in the download; however, the sample code for this article is not included in that download. You can download it here.

Searching Yahoo
Yahoo provides a web service API through which developers can access its search features. In the downloadable code, you’ll find a YahooSearch class that wraps that API to simplify searching Yahoo.

The search function of the YahooSearch class begins by building the correct URL to the web service:

   // build the URL
   ByteArrayOutputStream bos = new ByteArrayOutputStream();
   FormUtility form = new FormUtility(bos, null);
   form.add("appid", "YahooDemo");
   form.add("results", "100");
   form.add("query", searchFor);
   form.complete();

The query string contains three parameters: appid, results, and query. The appid specifies the id that you were given when you registered with Yahoo.

Author’s Note: For the sample program, you can use the “YahooDemo” id, but make sure you obtain a real Yahoo ID from http://developer.yahoo.com/ before using any of this code for production purposes.

The results parameter specifies how many results you want Yahoo to return. This value is an approximation; often Yahoo returns more results than requested. Finally, the query parameter carries the search term you are submitting, which the code holds in the searchFor variable:

   // submit the search
   URL url = new URL(
      "http://search.yahooapis.com/WebSearchService/V1/webSearch?"
      + bos.toString());
   bos.close();
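
If you don’t have the FormUtility class at hand, the same kind of query string can be assembled with the JDK alone. The following is a minimal sketch under that assumption: the QueryStringSketch class and its buildQuery method are hypothetical names, and java.net.URLEncoder (with the Charset overload, available since Java 10) replaces FormUtility for the encoding step.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: build a URL-encoded query string using only the JDK. */
class QueryStringSketch {

    /** Joins the parameters as name=value pairs separated by '&', encoding each part. */
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8));
            sb.append('=');
            sb.append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("appid", "YahooDemo");
        params.put("results", "100");
        params.put("query", "Albert Einstein");
        System.out.println(
            "http://search.yahooapis.com/WebSearchService/V1/webSearch?"
            + buildQuery(params));
    }
}
```

Encoding matters because search terms usually contain spaces: URLEncoder turns "Albert Einstein" into "Albert+Einstein", so the name survives as a single query value.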

After submitting the search, the sample application parses the results, which are in XML. The only data in the XML stream that’s relevant to this application are the URLs of the pages that Yahoo located. Each of these URLs appears between a beginning <url> tag and an ending </url> tag. The application uses an HTML parser to find the URLs, and the code simply records every URL it finds.

   InputStream is = url.openStream();
   ParseHTML parse = new ParseHTML(is);
   StringBuilder buffer = new StringBuilder();
   boolean capture = false;

Using the ParseHTML class, loop over every character in the response. The parser detects tags automatically: when it reaches one, the read method returns zero instead of a character. When you encounter a value of zero, handle the tag:

   // parse the results
   int ch;
   while ((ch = parse.read()) != -1) {
      if (ch == 0) {
         HTMLTag tag = parse.getTag();

When you find a beginning <url> tag, the characters that follow it are part of that URL, and you can collect them in a buffer:

         if (tag.getName().equalsIgnoreCase("url")) {
            buffer.setLength(0);
            capture = true;

When you find the ending </url> tag, the URL is complete, and you can add it to the URL list and continue processing other URLs:

         } else if (tag.getName().equalsIgnoreCase("/url")) {
            System.out.println(buffer.toString());
            result.add(new URL(buffer.toString()));
            buffer.setLength(0);
            capture = false;
         }

Otherwise, as long as you’re capturing URL data, when you find a regular character rather than a tag, append the character to the buffer.

      } else {
         if (capture) {
            buffer.append((char) ch);
         }
      }
   }
   return result;

At the end of the loop, return the results.
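
Because the response is XML, a standard XML parser can do the same extraction. The following is a minimal sketch of that alternative, using the JDK’s DOM parser rather than the ParseHTML approach the article uses. The UrlElementSketch class is a hypothetical name, and the sample XML is made up for illustration; it is not a real Yahoo response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Sketch: collect the text of every element with a given name from an XML string. */
class UrlElementSketch {

    static List<String> extractUrls(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Unlike the article's equalsIgnoreCase check, this lookup is case-sensitive.
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                urls.add(nodes.item(i).getTextContent().trim());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data, shaped only loosely like a search result list.
        String xml = "<ResultSet>"
            + "<Result><Title>One</Title><Url>http://example.com/one</Url></Result>"
            + "<Result><Title>Two</Title><Url>http://example.com/two</Url></Result>"
            + "</ResultSet>";
        System.out.println(extractUrls(xml, "Url"));
    }
}
```

The trade-off is strictness: a DOM parser rejects malformed input outright, while the character-by-character loop above tolerates sloppy markup, which is why the same ParseHTML technique can be reused later on ordinary web pages.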

Checking a Search Result
After capturing the list of URLs, the Year-Born bot must process them, downloading each one and parsing the contents into sentences. When a sentence contains both the word “born” and a number that looks like a year, the application assumes the number is a birth year. The application considers numbers between 1 and 3000 (non-inclusive) as possible birth years. People born before 1 AD are outside the program’s scope. The application does all this work in the checkURL() method, which begins by creating a StringBuilder that will hold each sentence as the page gets parsed.

   int ch;
   StringBuilder sentence = new StringBuilder();

The code opens a connection to the URL and constructs a ParseHTML object to parse the HTML found at that location. Remember that you’re not really interested in the HTML, you’re only interested in the text contents; the main purpose of the parser is to strip the HTML tags away from the text:

   URLConnection http = url.openConnection();
   http.setConnectTimeout(1000);
   http.setReadTimeout(1000);
   InputStream is = http.getInputStream();
   ParseHTML html = new ParseHTML(is);

To strip the HTML markup, you loop through all the characters in the HTML file. The html.read() method in the following code fragment returns a value of zero when it encounters an HTML tag. Any HTML tags are ignored. The program makes a special check for periods, which denote the end of a sentence. The method accumulates all characters in the StringBuilder:

   do
   {
     ch = html.read();
     if ((ch != -1) && (ch != 0))
     {
       if (ch == '.')
       {

After accumulating a complete sentence (a series of characters ending in a period), you can check the sentence to see if it contains a birth year. This is not the most accurate way to break up sentences, but it works well enough for this article. If a few sentences run on (such as those ending in punctuation other than periods) or are cut short, it really does not impact the final output of the program. This program is all about finding many birth years and then using a sort of “majority rules” approach to determining the correct one. If you lose a few in the noise, it does not hurt.

If the sentence contains a valid birth year, the program records the year and continues:

         String str = sentence.toString();
         int year = extractBirth(str);
         if ((year > 1) && (year < 3000))
         {
           System.out.println("URL supports year: " + year);
           increaseYear(year);
         }
         sentence.setLength(0);
       } else
         sentence.append((char) ch);
     }
   } while (ch != -1);

This process of reading and analyzing sentences continues until the loop reaches the end of the HTML document. The extractBirth() method determines whether a given sentence contains a number that might be a birth year.
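
To experiment with the sentence-splitting idea without the ParseHTML class, you can approximate it with a small state machine. The following is a minimal sketch, not the article’s parser: the SentenceSplitSketch class is a hypothetical name, and its tag handling is deliberately naive (anything between < and > is dropped, with no special treatment of comments, scripts, or entities).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: strip markup from an HTML string and split the text at periods. */
class SentenceSplitSketch {

    static List<String> sentences(String html) {
        List<String> out = new ArrayList<>();
        StringBuilder sentence = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char ch = html.charAt(i);
            if (inTag) {
                if (ch == '>') {
                    inTag = false;          // end of markup, resume collecting text
                }
            } else if (ch == '<') {
                inTag = true;               // start of markup, ignore until '>'
            } else if (ch == '.') {
                String s = sentence.toString().trim();
                if (!s.isEmpty()) {
                    out.add(s);             // a period closes the current sentence
                }
                sentence.setLength(0);
            } else {
                sentence.append(ch);
            }
        }
        // As in the article's loop, trailing text without a period is discarded.
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Albert Einstein was born in <b>1879</b>. He was a physicist.</p>";
        for (String s : sentences(html)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Note that the markup around 1879 disappears before the sentence is checked, so the year sits in the same sentence as the word “born”.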

Extracting a Birth Year
Each "sentence" that is found must be scanned for a birth year. To do this, the extractBirth() method breaks each sentence into "words", which are defined as sequences of characters separated by spaces:

   boolean foundBorn = false;
   int result = -1;

   StringTokenizer tok = new StringTokenizer(sentence);
   while (tok.hasMoreTokens())
   {

The following code checks each word to see if it is a number. If so, it records the number and moves on to the next word. If a sentence contains more than one number, the method returns only the last one:

     String word = tok.nextToken();

     try
     {
       result = Integer.parseInt(word);
     } catch (NumberFormatException e)
     {

If the word is not a number, the method checks to see if it's the word "born," and sets a Boolean flag when it finds that word:

       if (word.equalsIgnoreCase("born"))
         foundBorn = true;
     }
   }

   if (!foundBorn)
     result = -1;

   return result;

If the extractBirth() method finds both a number and the word "born," then the number is a potential birth year, so it returns the number. If it finds only one, or neither, it returns a -1, letting the calling code know whether to save the return value as a possible birth year.
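
Assembled into one piece, the fragments above amount to a method you can call directly. The following is a self-contained sketch that follows the same logic; the ExtractBirthSketch wrapper class and the sample sentences are assumptions added for illustration.

```java
import java.util.StringTokenizer;

/** Sketch: the extractBirth logic as a single, directly callable method. */
class ExtractBirthSketch {

    /** Returns the last number in the sentence if the word "born" also appears, else -1. */
    static int extractBirth(String sentence) {
        boolean foundBorn = false;
        int result = -1;

        StringTokenizer tok = new StringTokenizer(sentence);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            try {
                result = Integer.parseInt(word);    // remember the most recent number
            } catch (NumberFormatException e) {
                if (word.equalsIgnoreCase("born")) {
                    foundBorn = true;
                }
            }
        }
        return foundBorn ? result : -1;
    }

    public static void main(String[] args) {
        System.out.println(extractBirth("Albert Einstein was born in 1879"));
        System.out.println(extractBirth("He published four papers in 1905"));
        System.out.println(extractBirth("He was born in 1879, in Ulm"));
    }
}
```

The third sample shows a limitation of splitting on whitespace only: the token "1879," keeps its comma, so it does not parse as a number and the method returns -1. The voting step described next makes misses like this tolerable, but stripping punctuation from each word before parsing would catch more years.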

The process of retrieving the HTML, stripping the markup, breaking it into sentences, and checking the words in each sentence for birth dates and the word "born" continues until all the URLs retrieved from the Yahoo search have been processed.

Finding the Correct Birth Year
When the program finishes scanning the URLs identified with a famous person, you are left with a list of potential birth years. The program also tracks how many times each potential birth year occurs. It calls the getResult function to determine which year had the largest number of "votes."

The function begins by creating two variables. The result variable holds the year with the largest count. The second variable, named maxCount, holds the number of votes held by the current value of the result variable.

   int result = -1;
   int maxCount = 0;

Next, it retrieves the Set of recorded birth years and looks up the count for each one. At the end of the loop, the result variable holds the birth year with the largest count:

   Set<Integer> set = results.keySet();
   for (int year : set)
   {
     int count = results.get(year);
     if (count > maxCount)
     {
       result = year;
       maxCount = count;
     }
   }

   return result;

If no birth years were found, then the result variable remains set to its initial value of -1, which informs the calling method that no birth year was found.
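
The article does not show the increaseYear method or the declaration of results, so the following sketch fills them in under stated assumptions: results is taken to be a Map from year to vote count, and increaseYear is taken to add one vote for a year. The YearVoteSketch class name is hypothetical as well.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: tally candidate birth years and pick the one with the most votes. */
class YearVoteSketch {

    // Assumed structure: maps each candidate year to the number of times it was seen.
    private final Map<Integer, Integer> results = new HashMap<>();

    /** Adds one vote for the given year (assumed behavior of increaseYear). */
    void increaseYear(int year) {
        results.merge(year, 1, Integer::sum);
    }

    /** Returns the year with the highest count, or -1 if no year was recorded. */
    int getResult() {
        int result = -1;
        int maxCount = 0;
        for (Map.Entry<Integer, Integer> e : results.entrySet()) {
            if (e.getValue() > maxCount) {
                result = e.getKey();
                maxCount = e.getValue();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        YearVoteSketch votes = new YearVoteSketch();
        votes.increaseYear(1879);
        votes.increaseYear(1905);   // a stray year from an unrelated sentence
        votes.increaseYear(1879);
        votes.increaseYear(1879);
        System.out.println(votes.getResult());
    }
}
```

When two years tie for the highest count, the winner depends on the map’s iteration order, so a tie is best treated as an uncertain result.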

Going on From Here
This article showed you how to create a bot that makes use of the Yahoo web services API. This bot uses the Yahoo API to find likely pages to visit. Subsequently, it uses regular Java HTTP programming to access and analyze the data contained in those pages.

Bots are usually designed to access specific data. If you need to obtain data, and that data is available on the Internet, you can probably construct a bot to obtain it. Using the Java HTTP functions, a Java program can perform any task that a regular web user can. Creating the bot is simply a matter of reproducing the correct HTTP requests in your bot and writing the appropriate code for data recognition, extraction, and analysis.

Fortunately, as you have seen, much of the code (the initial search, URL gathering, HTML stripping, sentence collection, and word tokenizing) is boilerplate; you'd write the same type of code to search for any type of data. That also means it's reusable. The only part that's not reusable is the code that identifies and analyzes the specific data you're looking for. By replacing that code with your own custom code, you have all the basic tools you need to construct your own bots to search and extract data from the web.
