

Automate and Speed Up Web Searches with Bots : Page 3

Use web service APIs along with these tools and techniques to construct your own hybrid search bots and automate your web data-collection tasks.

Checking a Search Result
After capturing the list of URLs, the Year-Born bot must process them, downloading each one and parsing the contents into sentences. When a sentence contains both the word "born" and a number that looks like a year, the application assumes the number is a birth year. The application treats numbers strictly between 1 and 3000 as possible birth years, so people born in or before 1 AD are out of scope for the program. The application does all this work in the checkURL() method, which begins by creating a StringBuilder that will hold each sentence as the page gets parsed:

   int ch;
   StringBuilder sentence = new StringBuilder();
The code opens a connection to the URL and constructs a ParseHTML object to parse the HTML found at that location. Remember that you're not really interested in the HTML, only in the text contents; the main purpose of the parser is to strip the HTML tags away from the text:

   URLConnection http = url.openConnection();
   InputStream is = http.getInputStream();
   ParseHTML html = new ParseHTML(is);
To strip the HTML markup, you loop through all the characters in the HTML file. The html.read() method in the following code fragment returns zero when it encounters an HTML tag and -1 at the end of the document; the loop skips both values, so tags never reach the sentence buffer. The program makes a special check for periods, which denote the end of a sentence, and accumulates every other character in the StringBuilder:

   do {
     ch = html.read();
     if ((ch != -1) && (ch != 0)) {
       if (ch == '.') {
After accumulating a complete sentence (a series of characters ending in a period), you can check the sentence to see if it contains a birth year. This is not the most accurate way to break up sentences, but it works well enough for this article. If a few sentences run on (such as those ending in punctuation other than periods) or are cut short, the final output is not really affected. This program is all about finding many birth years and then using a sort of "majority rules" approach to determine the correct one. If you lose a few in the noise, it does not hurt.

If the sentence contains a valid birth year, the program records the year and continues:

         String str = sentence.toString();
         int year = extractBirth(str);
         if ((year > 1) && (year < 3000))
           System.out.println("URL supports year: " + year);
         sentence.setLength(0); // assumed reset so the next sentence starts empty
       } else
         sentence.append((char) ch);
     }
   } while (ch != -1);
This process of reading and analyzing sentences continues until the loop reaches the end of the HTML document. The extractBirth() method determines whether a given sentence contains a number that might be a birth year.
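
The read loop above can be tried without the ParseHTML class. The following is a minimal sketch, assuming a hypothetical TagStripper stand-in that imitates the behavior described for html.read() (zero for characters inside a tag, -1 at the end of input). It is not the article's parser, and it ignores entities, comments, and scripts. The class names and the sample text are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
    // Hypothetical stand-in for ParseHTML: read() returns 0 for every
    // character that belongs to a tag and -1 at the end of the input.
    static class TagStripper {
        private final String text;
        private int pos = 0;
        private boolean insideTag = false;

        TagStripper(String text) {
            this.text = text;
        }

        int read() {
            if (pos >= text.length()) {
                return -1;
            }
            char ch = text.charAt(pos++);
            if (ch == '<') {
                insideTag = true;
                return 0;
            }
            if (insideTag) {
                if (ch == '>') {
                    insideTag = false;
                }
                return 0;
            }
            return ch;
        }
    }

    // Collects period-terminated runs of text, skipping tag characters.
    // Text after the last period is dropped, as in the loop above.
    static List<String> split(String htmlText) {
        TagStripper html = new TagStripper(htmlText);
        List<String> sentences = new ArrayList<>();
        StringBuilder sentence = new StringBuilder();
        int ch;
        do {
            ch = html.read();
            if ((ch != -1) && (ch != 0)) {
                if (ch == '.') {
                    sentences.add(sentence.toString().trim());
                    sentence.setLength(0); // start the next sentence
                } else {
                    sentence.append((char) ch);
                }
            }
        } while (ch != -1);
        return sentences;
    }

    public static void main(String[] args) {
        String page = "<p>Lincoln was <b>born</b> in 1809. He died in 1865.</p>";
        for (String s : split(page)) {
            System.out.println(s);
        }
    }
}
```

With the sample input, the sketch yields two sentences, "Lincoln was born in 1809" and "He died in 1865", with the markup removed.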

Extracting a Birth Year
Each "sentence" that is found must be scanned for a birth year. To do this, the extractBirth() method breaks each sentence into "words", which are defined as sequences of characters separated by whitespace (the default delimiters of StringTokenizer):

   boolean foundBorn = false;
   int result = -1;
   StringTokenizer tok = new StringTokenizer(sentence);
   while (tok.hasMoreTokens()) {
The following code checks each word to see if it is a number. If so, it records the number and continues parsing the sentence. If a sentence contains more than one number, the method returns only the last one:

     String word = tok.nextToken();
     try {
       result = Integer.parseInt(word);
     } catch (NumberFormatException e) {
If the word is not a number, the method checks to see if it's the word "born," and sets a Boolean flag when it finds that word:

       if (word.equalsIgnoreCase("born"))
         foundBorn = true;
     }
   }
   if (!foundBorn)
     result = -1;
   return result;
If the extractBirth() method finds both a number and the word "born," then the number is a potential birth year, so it returns the number. If it finds only one, or neither, it returns -1, which tells the calling code not to save the value as a possible birth year.
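
Assembled into one piece, the fragments above might look like the following sketch. The method body follows the fragments; the wrapping class name BirthYearExtractor and the sample sentences in main() are assumptions added for illustration.

```java
import java.util.StringTokenizer;

class BirthYearExtractor {
    // Returns the last integer token in the sentence if the word "born"
    // also appears as its own token; otherwise returns -1.
    static int extractBirth(String sentence) {
        boolean foundBorn = false;
        int result = -1;
        StringTokenizer tok = new StringTokenizer(sentence);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            try {
                result = Integer.parseInt(word);
            } catch (NumberFormatException e) {
                // Not a number, so check whether it is the keyword.
                if (word.equalsIgnoreCase("born")) {
                    foundBorn = true;
                }
            }
        }
        return foundBorn ? result : -1;
    }

    public static void main(String[] args) {
        System.out.println(extractBirth("Abraham Lincoln was born in 1809")); // 1809
        System.out.println(extractBirth("He moved to Illinois in 1830"));     // -1
        System.out.println(extractBirth("He was born in 1809, in Kentucky")); // -1
    }
}
```

The last call shows a limitation of whitespace tokenizing: the token "1809," carries a trailing comma, so Integer.parseInt() rejects it and the year is missed. Given the "majority rules" design, an occasional miss like this is tolerable.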

The process of retrieving the HTML, stripping the markup, breaking it into sentences, and checking the words in each sentence for birth dates and the word "born" continues until all the URLs retrieved from the Yahoo search have been processed.
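
This page does not show how the collected years are combined under the "majority rules" approach mentioned earlier. One plausible way to do it is to count how often each candidate year appears and pick the most frequent. The sketch below assumes a hypothetical YearTally class; it is an illustration of that idea, not the article's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts candidate birth years and reports the
// one seen most often.
class YearTally {
    private final Map<Integer, Integer> counts = new HashMap<>();

    // Record one sighting of a candidate birth year.
    void record(int year) {
        counts.merge(year, 1, Integer::sum);
    }

    // Returns the most frequently recorded year, or -1 if nothing was
    // recorded. Ties are broken arbitrarily (map iteration order).
    int mostCommon() {
        int bestYear = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestYear = e.getKey();
                bestCount = e.getValue();
            }
        }
        return bestYear;
    }

    public static void main(String[] args) {
        YearTally tally = new YearTally();
        tally.record(1809);
        tally.record(1865); // a stray year picked up from an unrelated sentence
        tally.record(1809);
        System.out.println("Most likely birth year: " + tally.mostCommon());
    }
}
```

In checkURL(), a call such as tally.record(year) could take the place of the println() once the range check passes.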
