
Automate and Speed Up Web Searches with Bots

Use web service APIs along with these tools and techniques to construct your own hybrid search bots and automate your web data-collection tasks.


A bot is a computer program that surfs the web in much the same way as a human user, except that it is automated. A bot can access a large number of sites much more quickly than a person can. You might be surprised at how many uses there are for bots, both good and bad. Some positive uses for bots include:

  • Researching and gathering information
  • Keeping search engines up to date
  • Monitoring web sites looking for bad links and other problems
Unfortunately, bots are also commonly used for less positive purposes, such as:

  • Posting spam comments to forums and blogs
  • Harvesting e-mail addresses for spammers
  • Finding web sites that have known security flaws
HTML-based bots have direct access and close ties to the HTML that makes up web sites, and thus often break when changes occur to the web sites they were designed to access. Therefore, modern sites whose designers want to allow programmatic access typically offer web services that provide specific data on request, reducing or eliminating the need to parse the HTML to extract content. As such web services become more common, they're also reducing the need for HTML-based bots.

However, web services don't always expose exactly the data you need, so HTML bots remain useful, particularly as the data-seeking and extracting portion of hybrid bots. A hybrid bot uses both web services and traditional HTML techniques to find and extract data. Because bots often have similar needs, several companies have released generic APIs that encapsulate common bot services. For example, the Yahoo Search API is a great choice for hybrid bots. You can use the API to locate web pages that match your criteria, and then use a traditional HTML bot to scan those pages for whatever content you wish.

Introducing the "Year-Born" Bot
In this article you'll see how to create a hybrid bot that scans pages for information about a famous person of your choice. Specifically, this bot attempts to obtain the person's birth year. The bot works by first using the Yahoo Search API to call Yahoo's web services to find web sites that contain the person's name. The bot then loops through the page hit list, scanning the HTML of each page looking for the person's birth year.

This program executes these tasks in three distinct phases. In phase one, the application submits the name of the famous person to Yahoo, obtaining the results by calling the search function of the YahooSearch class.

// Phase one: ask Yahoo for pages that mention the person's name.
search = new YahooSearch();
if (YearBornBot.LOG) {
    System.out.println("Getting search results from Yahoo.");
}
Collection<URL> c = search.search(name);
int i = 0;

In phase two, it checks each URL in the collection returned from Yahoo. Because some of the returned URLs may no longer be valid, the application uses a try/catch block within the loop to catch errors. It skips any invalid URLs, and the loop continues to the next iteration.

The program passes each valid URL found to the checkURL method, which searches for birth years. I'll cover the checkURL method in more detail later in this article.

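// Phase two: scan each page; skip any URL that cannot be read.
if (YearBornBot.LOG) {
    System.out.println("Scanning URLs from Yahoo.");
}
for (URL u : c) {
    try {
        i++;
        if (YearBornBot.LOG) {
            System.out.println("Scanning URL: " + i + "/" + c.size() + ":" + u);
        }
        checkURL(u);
    } catch (IOException e) {
        // Invalid or unreachable URL: ignore it and continue with the next one.
    }
}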
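The checkURL method itself is covered later in the article, so the following is only a rough illustration of what "searching for birth years" could look like, not the article's implementation. It assumes the page text has already been downloaded into a String, and it uses a simple heuristic: a four-digit year that follows the word "born" within the same sentence is treated as a candidate. The class name BirthYearScanner, the method findCandidateYears, and the regular expression are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the article's YearBornBot class. */
final class BirthYearScanner {
    // Assumed heuristic: the word "born", then up to 60 characters that are
    // not a period, then a four-digit year between 1000 and 2099.
    private static final Pattern BORN_YEAR = Pattern.compile(
            "\\bborn\\b[^.]{0,60}?\\b(1\\d{3}|20\\d{2})\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns every candidate birth year found in the text, in order of appearance. */
    static List<Integer> findCandidateYears(String text) {
        List<Integer> years = new ArrayList<>();
        Matcher m = BORN_YEAR.matcher(text);
        while (m.find()) {
            years.add(Integer.parseInt(m.group(1)));
        }
        return years;
    }
}
```

Given the text "Abraham Lincoln was born on February 12, 1809, in Kentucky. He died in 1865.", this sketch reports 1809 and ignores 1865, because the second sentence does not contain "born". A real page would also need its HTML tags stripped first, and abbreviations that end in a period (such as "Feb.") would defeat this simple pattern.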

After processing all the URLs, phase three of the application calls the getResult method to decide which of the years found is most likely the famous person's actual birth year.



// Phase three: report the result, or admit that none was found.
int resultYear = getResult();
if (resultYear == -1) {
    System.out.println("Could not determine when " + name + " was born.");
} else {
    System.out.println(name + " was born in " + resultYear);
}

In some cases, the program cannot determine the person's birth year. When that happens, getResult returns -1 and the application displays a message telling the user that the birth year could not be determined.
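This section does not show how getResult chooses among the years collected during the scan. One plausible approach, sketched here purely as an assumption, is to count how often each candidate year was seen and return the most frequent one, falling back to -1 when nothing was recorded. The class YearTally and its addYear method are hypothetical names; only the getResult name and the -1 convention come from the code above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tally of candidate years; not the article's actual class. */
final class YearTally {
    // Maps each candidate year to the number of times it was seen.
    private final Map<Integer, Integer> counts = new HashMap<>();

    /** Records one sighting of a candidate birth year. */
    void addYear(int year) {
        counts.merge(year, 1, Integer::sum);
    }

    /** Returns the most frequently seen year, or -1 if no year was recorded. */
    int getResult() {
        int bestYear = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                bestYear = entry.getKey();
            }
        }
        return bestYear;
    }
}
```

With this design, a year that appears on several independent pages outweighs a stray year picked up from a single page. Ties are broken arbitrarily by map iteration order, which a fuller implementation would probably want to handle explicitly.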


