
Create a LAMP Search Engine Using Multithreaded Perl

Until recently, many argued that Perl did not have stable multithreading capabilities. This article presents a Perl application for which multithreading makes sense: a Web crawler with all the necessary components of a basic search engine. The downloadable code includes the MySQL database creation scripts, the Perl code, and the PHP interface files.

The application requirements for the example are:

  1. All open source
  2. Small footprint
  3. Ability to score content
  4. Multithreaded application

To demonstrate the small footprint, the crawler, search engine, and database all run on a very old 166MHz Pentium with 32MB of RAM. It's not fast by any means, but the amount of performance you can get running Linux on such old hardware is amazing.

Figure 1 depicts the architecture used in this example, showing the different components that make up the search engine (e.g., the dictionary hash, the multithreaded crawler, and the PHP front end used to search the database).

Figure 1: Architecture Utilized for Search Engine

The article breaks out into the following steps:

  1. Preliminary Setup
  2. Code Snippets and Explanations
  3. Web Interface—Searching the Content

The Code Snippets and Explanations section describes the components listed in Figure 1.

Step 1: Preliminary Setup
To begin, you must set up your Linux system with the proper software:

  1. Apache
  2. MySQL
  3. PHP
  4. Perl (with threading support)

Apache
Although you may use other versions of Apache, the PHP application was developed using version 2.0.47.

MySQL
If you do not already have MySQL installed on your Linux system, you can obtain it directly from MySQL.com. If you run the crawler on the same box as your database, install the MySQL server, client, and development libraries, for example:

  • rpm -ivh MySQL-server-4.0.18-0.i386.rpm
  • rpm -ivh MySQL-client-4.0.18-0.i386.rpm
  • rpm -ivh MySQL-devel-4.0.18-0.i386.rpm

Once you’ve installed the MySQL server, you must create the database used in this example and apply some additional security. At the command prompt, type the following commands:

  1. mysql -u root
  2. create database search;
  3. GRANT ALL PRIVILEGES ON *.* TO search@localhost IDENTIFIED BY 'clamchowder' WITH GRANT OPTION;
  4. GRANT USAGE ON *.* TO search@localhost;

Tweak the above privileges to meet your specific needs and be sure to choose a more secure password.
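For example, a tighter grant that restricts the search account to the search database only (the password below is a placeholder) might look like:

  • GRANT SELECT, INSERT, UPDATE, DELETE ON search.* TO search@localhost IDENTIFIED BY 'a-stronger-password';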


Run the script (included in the code download) at the command prompt to create the physical database structure and load some initial words into your dictionary:

  • mysql -u root -p search < /search.sql

Once completed, you can issue the show tables command from within MySQL:

mysql> show tables;
+-----------------------+
| Tables_in_search      |
+-----------------------+
| assoc_url_dictionary  |
| dictionary            |
| thread_instruction    |
| url                   |
+-----------------------+
4 rows in set (0.01 sec)
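
The exact definitions are in search.sql from the code download, but roughly, the schema looks something like the sketch below. The column names here are illustrative assumptions, not the script's actual definitions:

    -- Sketch only; see search.sql for the real definitions
    CREATE TABLE url (
        url_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url     VARCHAR(255) NOT NULL,
        scanned TINYINT NOT NULL DEFAULT 0    -- 0 = not yet crawled
    );

    CREATE TABLE dictionary (
        word_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word    VARCHAR(64) NOT NULL
    );

    -- Links words to the pages they appear on, with a score for ranking
    CREATE TABLE assoc_url_dictionary (
        url_id  INT NOT NULL,    -- references url.url_id
        word_id INT NOT NULL,    -- references dictionary.word_id
        score   INT NOT NULL
    );

    -- Control table the crawler polls for start/stop instructions
    CREATE TABLE thread_instruction (
        instruction VARCHAR(32) NOT NULL
    );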

PHP

  1. Obtain PHP from PHP.net.
  2. Once downloaded, extract the files by typing: bzip2 -dc php-4.3.6.tar.bz2 | tar -xvf-
  3. After you've extracted the files, configure PHP with Apache 2 and MySQL support. Change the path in the 'apxs2' parameter if your Apache bin directory is different: ./configure --with-mysql --with-apxs2=/usr/local/apache2/bin/apxs
  4. Add "AddType application/x-httpd-php .php" to your Apache httpd.conf file. This lets Apache know what to do with PHP files.
  5. To actually compile the PHP source, type: make
  6. To install PHP: make install
  7. Now set up your php.ini. You may edit your .ini file to set different PHP options: cp php.ini-dist /usr/local/lib/php.ini
  8. Modify the "register_globals" parameter to "On" in /usr/local/lib/php.ini

Perl (with Thread Support)

  1. Download Perl from Perl.org.
  2. Unpack it.

Pay attention to the following items when you configure Perl on your system:

  1. Be sure to use the "-Dusethreads" option. This is necessary for multithreading in Perl. To configure Perl with threading support on your system, type: sh Configure -Dusethreads
  2. Choose to install libperl.so when prompted. This shared Perl library allows the Perl interpreter to be embedded in other applications, such as the Apache Web server.
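
Once the build and install finish, you can verify that thread support made it into your perl binary:

  • perl -V:useithreads

If threading is enabled, this prints useithreads='define'.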

After the Perl installation is complete, you can install the other needed modules by typing:

  • perl -MCPAN -e 'install HTML::Tagset'
  • perl -MCPAN -e 'install HTML::LinkExtor'
  • perl -MCPAN -e 'install HTML::Parser'
  • perl -MCPAN -e 'install Bundle::LWP'
  • perl -MCPAN -e 'install LWP::Parallel::UserAgent'
  • perl -MCPAN -e 'install Net::MySQL'
  • perl -MCPAN -e 'install DBI'
  • perl -MCPAN -e 'install DBD::mysql'

To clear up some possible confusion, the difference between the DBD::mysql and DBI modules is as follows:

  • DBI (Database Interface) is the module that gives Perl programs a common interface for attaching to databases.
  • DBD::mysql is the database driver (DBD) for MySQL specifically.

DBI uses the DBD as a translator to talk to MySQL.
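
Putting the two together, connecting to the search database created earlier looks something like this minimal sketch (the credentials match Step 1; adjust them for your installation):

    use strict;
    use DBI;    # database-independent interface; DBD::mysql does the MySQL talking

    my $dbh = DBI->connect(
        'DBI:mysql:database=search;host=localhost',
        'search', 'clamchowder',
        { RaiseError => 1 }
    ) or die $DBI::errstr;

    # Count the words loaded into the dictionary by search.sql
    my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM dictionary');
    print "Dictionary contains $count words\n";

    $dbh->disconnect;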

Step 2: Code Snippets and Explanations
Now it's time to dive into the Perl crawler code, which includes the directives for threading:

use threads;           # threading routines
use threads::shared;   # variable sharing routines

You create threads by calling threads->new(). The first argument to threads->new() is a reference to the subroutine where the new thread will begin execution. Start 10 crawler threads here (these are the workhorses):

my @CrawlerThreads;
for (0..9) {
    # grabs the content and links, writes to db
    $CrawlerThreads[$_] = threads->new(\&URLCrawler, $url_id_list, $s);
    print "Crawler " . ($_ + 1) . " created.\n";
}
  • The buildWordListFromDb{} sub loads all the words from your database dictionary table into memory. You use this throughout the lifecycle of the instance.
  • The URLArrayMonitor{} sub checks whether 99 or fewer URLs remain on the stack and the threads haven't been instructed to stop. If so, it fetches another 200 unscanned URLs from the database and pushes them onto the stack.
  • The URLCrawler{} sub pops a URL off the stack and gets the content and title from the page (a stripped-down sketch of this loop appears after this list).
  • The getWordsFromContent{} sub parses the Web page's content, grabs the words, and writes them to the ASSOC_URL_DICTIONARY database table.
  • The getLinks{} sub extracts the links found from this URL and inserts them into the database (the URLArrayMonitor thread will pick these up at a later time).
  • The MonitorStatus{} sub periodically checks the THREAD_INSTRUCTION database table for the instruction to begin killing the threads.
  • The text{} sub extracts the text from the parser.
  • To start the crawler, simply type the following:

    perl crawlThread.pl
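
The complete subroutines are in the code download. To illustrate the shape of the crawl loop, here is a stripped-down sketch of a URLCrawler-style worker; the shared @url_stack and the fetch-and-extract logic are simplifications for illustration, not the article's actual code:

    use threads;
    use threads::shared;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my @url_stack : shared;    # filled by the URLArrayMonitor thread

    sub URLCrawler {
        my $ua = LWP::UserAgent->new(timeout => 30);
        while (1) {
            my $url;
            {
                lock(@url_stack);      # one thread pops at a time
                $url = pop @url_stack;
            }
            unless (defined $url) {
                sleep 1;               # stack empty; the real code also checks for a stop instruction
                next;
            }

            my $response = $ua->get($url);
            next unless $response->is_success;

            # Collect absolute links so they can be queued in the database
            my $extractor = HTML::LinkExtor->new(undef, $url);
            $extractor->parse($response->content);
            my @links = map  { $_->[2] }
                        grep { $_->[0] eq 'a' && $_->[1] eq 'href' }
                        $extractor->links;
            # ...write the page's words and @links to the database here...
        }
    }

When MonitorStatus{} signals shutdown, the main program can wait for the workers to exit cleanly with $_->join() for @CrawlerThreads;.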

Step 3: Web Interface—Searching the Content
In this basic search engine example, the Perl application runs behind the curtain, while the PHP Web application provides the interface for searching your MySQL database (populated by the crawler) and returning user-friendly results. Figure 2 shows a screenshot of a search for the word "database" and the results.

Figure 2: Search Results for the Word "Database"

The PHP source code is included in the code download.
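
To give a flavor of what that interface does, a minimal search page might look like the following sketch. The real source is in the download; the table and column names here are illustrative assumptions consistent with the schema sketch above:

    <?php
    // Connect with the MySQL account created in Step 1
    $link = mysql_connect('localhost', 'search', 'clamchowder')
        or die('Could not connect: ' . mysql_error());
    mysql_select_db('search', $link);

    // Find the pages containing the search term, best-scored first
    $term   = mysql_escape_string($_GET['q']);
    $result = mysql_query(
        "SELECT u.url, a.score
           FROM dictionary d, assoc_url_dictionary a, url u
          WHERE d.word = '$term'
            AND a.word_id = d.word_id
            AND a.url_id  = u.url_id
          ORDER BY a.score DESC", $link);

    while ($row = mysql_fetch_assoc($result)) {
        echo '<a href="' . $row['url'] . '">' . $row['url'] . '</a><br>';
    }
    mysql_close($link);
    ?>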

Perl's Multithreaded Capabilities
You've seen how Perl's multithreading capabilities open up some exciting possibilities. A complete tutorial on threads could fill a book, but the examples provided should give you a good understanding and get you on your way. In addition, you now have the basic pieces of a Web search engine.

By running the crawlThread.pl program on multiple machines that all feed from the same MySQL database, you can scale the search engine horizontally. For additional information and authoritative answers on how Perl threads behave, consult the documentation bundled with the Perl distribution. Finally, before unleashing the code described in this article on the Internet, add some logic to the crawler to ignore pages that you do not want crawled.
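
One ready-made piece of that logic ships with libwww-perl (installed above via Bundle::LWP): LWP::RobotUA is a drop-in replacement for LWP::UserAgent that honors each site's robots.txt. A minimal sketch, with a placeholder agent name and contact address:

    use LWP::RobotUA;

    # Identify the crawler and its operator so site owners can reach you
    my $ua = LWP::RobotUA->new('search-crawler/1.0', 'you@example.com');
    $ua->delay(1/60);    # wait at least one second between requests to a host

    my $response = $ua->get('http://www.example.com/');
    print $response->is_success ? $response->content : $response->status_line;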
