Automate and Speed Up Web Searches with Bots : Page 2

Use web service APIs along with these tools and techniques to construct your own hybrid search bots and automate your web data-collection tasks.


Getting the Code
The year born bot uses two general-purpose classes introduced in my book "HTTP Programming Recipes for Java Bots":

  • ParseHTML—An HTML parser.
  • URLUtility—Utilities for building URL responses.
You can download those classes from my web site. The download includes a JAR file that contains the two classes, as well as complete source code. You'll also find numerous other bot programming examples in the download; however, the sample code for this article is not included in that download—you can download it here.

Searching Yahoo
Yahoo provides a web service API through which developers can access its search features. In the downloadable code, you'll find a YahooSearch class that wraps that API to simplify searching Yahoo.

The search function of the YahooSearch class begins by building the correct URL to the web service:

// build the URL
ByteArrayOutputStream bos = new ByteArrayOutputStream();
FormUtility form = new FormUtility(bos, null);
form.add("appid", "YahooDemo");
form.add("results", "100");
form.add("query", searchFor);
form.complete();

The query string contains three parameters: appid, results, and query. The appid specifies the id that you were given when you registered with Yahoo.

Author's Note: For the sample program, you can use the "YahooDemo" id, but make sure you obtain a real Yahoo ID from http://developer.yahoo.com/ before using any of this code for production purposes.

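The results parameter specifies how many results you want Yahoo to return. This value is an approximation; Yahoo often returns more results than requested. Finally, the query parameter holds the search term you are submitting, which the code takes from the searchFor variable. The code then appends the finished query string to the web service address: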

// submit the search
URL url = new URL(
    "http://search.yahooapis.com/WebSearchService/V1/webSearch?"
    + bos.toString());
bos.close();
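FormUtility comes from the book's download, so its internals are not shown here. As a rough illustration of the kind of query string this step produces, the following sketch builds one using only the standard java.net.URLEncoder. The class and method names (QueryStringSketch, buildQuery) and the search term "Albert Einstein" are hypothetical and not part of the article's code, and the sketch assumes UTF-8 encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the query-building step; not the book's FormUtility.
class QueryStringSketch {

    // Joins name=value pairs with '&', URL-encoding each name and value.
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // Assumption: UTF-8 is the encoding the service expects.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not occur.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("appid", "YahooDemo");
        params.put("results", "100");
        params.put("query", "Albert Einstein"); // example search term
        System.out.println(
            "http://search.yahooapis.com/WebSearchService/V1/webSearch?"
            + buildQuery(params));
    }
}
```

Encoding each value matters because search terms routinely contain spaces and characters such as & that would otherwise corrupt the query string.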

After submitting the search, the sample application parses the results, which are in XML. The only data in the XML stream relevant to this application are the URLs of the pages that Yahoo located. Each URL sits between an opening <url> tag and a closing </url> tag. The application uses the HTML parser to find these tags and records every URL it finds.



InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
boolean capture = false;

Using the ParseHTML class, loop over every character in the response. ParseHTML detects tags automatically and returns zero from read() in place of each one. When read() returns zero, retrieve the tag with getTag() and handle it:

// parse the results
int ch;
while ((ch = parse.read()) != -1) {
    if (ch == 0) {
        HTMLTag tag = parse.getTag();

When you find a <url> tag the following characters will be part of that URL, and you can collect them in a buffer:

        if (tag.getName().equalsIgnoreCase("url")) {
            buffer.setLength(0);
            capture = true;

When you find the ending </url> tag, the URL is complete, and you can add it to the URL list and continue processing other URLs:

        } else if (tag.getName().equalsIgnoreCase("/url")) {
            System.out.println(buffer.toString());
            result.add(new URL(buffer.toString()));
            buffer.setLength(0);
            capture = false;
        }

Otherwise, as long as you're capturing URL data, when you find a regular character rather than a tag, append the character to the buffer.

    } else {
        if (capture) {
            buffer.append((char) ch);
        }
    }
}
return result;

At the end of the loop, return the results.
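To try the capture-flag technique without the book's download, the sketch below mimics the same loop using only the standard library. It is not ParseHTML: it swaps in a very small hand-written scanner that treats everything between '<' and '>' as a tag name, and it ignores attributes, entities, and CDATA sections. The class name UrlTagCaptureSketch and the sample XML are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stdlib-only illustration of the capture-flag loop; not ParseHTML.
class UrlTagCaptureSketch {

    // Collects the text found between <url> and </url> tags (case-insensitive).
    static List<String> extractUrls(Reader in) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean capture = false;
        int ch;
        while ((ch = in.read()) != -1) {
            if (ch == '<') {
                // Simplification: read the raw tag text up to the closing '>'.
                StringBuilder tag = new StringBuilder();
                while ((ch = in.read()) != -1 && ch != '>') {
                    tag.append((char) ch);
                }
                String name = tag.toString().trim();
                if (name.equalsIgnoreCase("url")) {
                    // Opening tag: start a fresh capture.
                    buffer.setLength(0);
                    capture = true;
                } else if (name.equalsIgnoreCase("/url")) {
                    // Closing tag: the buffered text is one complete URL.
                    result.add(buffer.toString().trim());
                    buffer.setLength(0);
                    capture = false;
                }
            } else if (capture) {
                // Ordinary character inside <url>...</url>.
                buffer.append((char) ch);
            }
        }
        return result;
    }

    // Convenience overload for in-memory input.
    static List<String> extractUrls(String xml) {
        try {
            return extractUrls(new StringReader(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up response fragment for demonstration only.
        String xml = "<ResultSet>"
            + "<Result><Title>A</Title><Url>http://example.com/a</Url></Result>"
            + "<Result><Title>B</Title><Url>http://example.com/b?x=1</Url></Result>"
            + "</ResultSet>";
        for (String u : extractUrls(xml)) {
            System.out.println(u);
        }
    }
}
```

The sketch stores each URL as a String rather than a java.net.URL so that a malformed entry does not abort the whole run; the article's code constructs URL objects directly.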


