

Create a LAMP Search Engine Using Multithreaded Perl : Page 3

Explore the multithreaded capabilities of Perl while building a LAMP Web crawler with all the necessary components of a basic search engine.





Step 2: Code Snippets and Explanations
Now it's time to dive into the Perl crawler code, which includes the directives for threading:

    use threads;          # threading routines
    use threads::shared;  # variable sharing routines

You create threads by calling threads->new(). The first argument to threads->new() is a reference to the subroutine where the new thread will begin execution; any remaining arguments are passed to that subroutine. Start 10 crawler threads here (these are the workhorses):

    my @CrawlerThreads = ();
    for (0..9) {
        # grabs the content and links, writes to db
        $CrawlerThreads[$_] = threads->new(\&URLCrawler, $url_id_list, $s);
        print "Crawler " . ($_ + 1) . " created.\n";
    }
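As a minimal, self-contained sketch of the same pattern, the following creates a few threads, passes arguments to each, and collects their return values with join(). The demo_worker sub and its arguments are hypothetical stand-ins for illustration only; they are not the article's URLCrawler.

```perl
use strict;
use warnings;
use threads;

# Hypothetical stand-in for a crawler sub: it only reports its arguments.
sub demo_worker {
    my ($worker_id, $label) = @_;
    return "worker $worker_id handled $label";
}

my @workers;
for my $i (0 .. 2) {
    # Arguments after the code reference are handed to the sub in the new thread.
    push @workers, threads->new(\&demo_worker, $i + 1, "batch-$i");
}

# join() blocks until the thread finishes and returns the sub's return value.
my @results = map { $_->join() } @workers;
print "$_\n" for @results;
```

Keeping the thread objects in an array, as the crawler does, is what makes it possible to join (or otherwise manage) every thread later.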

  • The buildWordListFromDb{} sub loads all the words from your database dictionary table into memory. You use this throughout the lifecycle of the instance.
  • The URLArrayMonitor{} sub checks whether 99 or fewer URLs remain on the stack and no stop instruction has been received. If so, it gets another 200 (unscanned) URLs from the database and pushes them onto the stack.
  • The URLCrawler{} sub pops a URL off the stack and gets the content and title from the page.
  • The getWordsFromContent{} sub parses the Web page's content, grabs the words, and writes them to the ASSOC_URL_DICTIONARY database table.
  • The getLinks{} sub extracts the links found at this URL and inserts them into the database (the URLArrayMonitor thread will pick these up at a later time).
  • The MonitorStatus{} sub periodically checks the THREAD_INSTRUCTION database table for the instruction to begin killing the threads.
  • The text{} sub extracts the text from the parser.
To start the crawler, simply type the following:

    perl crawlThread.pl
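The interplay between the monitor and the crawler threads described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's code: the names @url_stack, $stop, refill_if_low, next_url, and fetch_unscanned_urls are hypothetical, and an in-memory list of example.com URLs stands in for the database query.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my @url_stack :shared;      # URL stack visible to every thread
my $stop      :shared = 0;  # would be set to 1 when a stop instruction arrives

# Hypothetical stand-in for the database query that returns unscanned URLs.
my @pending = map { "http://example.com/page$_" } 1 .. 500;
sub fetch_unscanned_urls {
    my ($count) = @_;
    return splice(@pending, 0, $count);
}

# Refill rule from the list above: when 99 or fewer URLs remain and no stop
# has been requested, push up to 200 more. Returns the number of URLs added.
sub refill_if_low {
    lock(@url_stack);       # lock is released at the end of this scope
    return 0 if $stop || @url_stack > 99;
    my @batch = fetch_unscanned_urls(200);
    push @url_stack, @batch;
    return scalar @batch;
}

# Crawler side: pop one URL under the lock (undef when the stack is empty).
sub next_url {
    lock(@url_stack);
    return pop @url_stack;
}

refill_if_low();            # stack starts empty, so this loads the first batch

my @crawlers;
for (1 .. 4) {
    push @crawlers, threads->new(sub {
        my $done = 0;
        # A real crawler would fetch and parse each page here.
        while (defined(my $url = next_url())) { $done++ }
        return $done;
    });
}

my $total = 0;
for my $thr (@crawlers) {
    my ($count) = $thr->join();
    $total += $count;
}
print "crawled $total URLs\n";
```

In this sketch the refill runs once in the main thread before the crawlers start, which keeps the example deterministic; in the crawler described above, the monitor runs in its own thread and tops up the stack while the crawlers are working.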
