
Create a LAMP Search Engine Using Multithreaded Perl

Until recently, many argued that Perl did not have stable multithreading capabilities. This article presents a Perl application for which multithreading makes sense: a Web crawler with all the necessary components of a basic search engine. The downloadable code includes the MySQL database creation scripts, the Perl code, and the PHP interface files.

The application requirements for the example are:

  1. All open source
  2. Small footprint
  3. Ability to score content
  4. Multithreaded application

To demonstrate the small footprint, the crawler, search engine, and database all run on a very old 166MHz Pentium with 32MB of RAM. It's not fast by any means, but the amount of performance you can get running Linux on such old hardware is amazing.

Figure 1 depicts the architecture used in this example, showing the different components that make up the search engine (e.g., the dictionary hash, the multithreaded crawler, and the PHP front end used to search the database).

Figure 1: Architecture Utilized for Search Engine

The article breaks out into the following steps:

  1. Preliminary Setup
  2. Code Snippets and Explanations
  3. Web Interface—Searching the Content

The Code Snippets and Explanations section describes the components listed in Figure 1.

Step 1: Preliminary Setup
To begin, you must set up your Linux system with the proper software:

  1. Apache
  2. MySQL
  3. PHP
  4. Perl (with threading support)

Apache
Although you may use other versions of Apache, the PHP application was developed using version 2.0.47.

MySQL
If you do not already have MySQL installed on your Linux system, you can obtain it directly from MySQL.com. If you run the crawler on the same box as your database, install the MySQL server, client, and development libraries, for example:

  • rpm -ivh MySQL-server-4.0.18-0.i386.rpm
  • rpm -ivh MySQL-client-4.0.18-0.i386.rpm
  • rpm -ivh MySQL-devel-4.0.18-0.i386.rpm

Once you’ve installed the MySQL server, you must create the database used in this example and apply some additional security. At the command prompt, type the following commands:

  1. mysql -u root
  2. create database search;
  3. GRANT ALL PRIVILEGES ON *.* TO search@localhost IDENTIFIED BY 'clamchowder' WITH GRANT OPTION;
  4. GRANT USAGE ON *.* TO search@localhost;

Tweak the above privileges to meet your specific needs and be sure to choose a more secure password.
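For example, a tighter grant that restricts the search account to the search database only (the password below is a placeholder) might look like:

  • GRANT SELECT, INSERT, UPDATE, DELETE ON search.* TO search@localhost IDENTIFIED BY 'a-stronger-password';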


Run the script (included in the code download) at the command prompt to create the physical database structure and load some initial words into your dictionary:

  • mysql -u root -p search < /search.sql

Once completed, you can issue the show tables command from within MySQL:

mysql> show tables;
+-----------------------+
| Tables_in_search      |
+-----------------------+
| assoc_url_dictionary  |
| dictionary            |
| thread_instruction    |
| url                   |
+-----------------------+
4 rows in set (0.01 sec)
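
The exact definitions are in search.sql from the code download, but roughly, the schema looks something like the sketch below. The column names here are illustrative assumptions, not the script's actual definitions:

    -- Sketch only; see search.sql for the real definitions
    CREATE TABLE url (
        url_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url     VARCHAR(255) NOT NULL,
        scanned TINYINT NOT NULL DEFAULT 0    -- 0 = not yet crawled
    );

    CREATE TABLE dictionary (
        word_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word    VARCHAR(64) NOT NULL
    );

    -- Links words to the pages they appear on, with a score for ranking
    CREATE TABLE assoc_url_dictionary (
        url_id  INT NOT NULL,    -- references url.url_id
        word_id INT NOT NULL,    -- references dictionary.word_id
        score   INT NOT NULL
    );

    -- Control table the crawler polls for start/stop instructions
    CREATE TABLE thread_instruction (
        instruction VARCHAR(32) NOT NULL
    );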

PHP

  1. Obtain PHP from PHP.net.
  2. Once downloaded, extract the files by typing: bzip2 -dc php-4.3.6.tar.bz2 | tar -xvf-
  3. After you've extracted the files, configure PHP with Apache 2 and MySQL support. Change the path in the 'apxs2' parameter if your Apache bin directory is different: ./configure --with-mysql --with-apxs2=/usr/local/apache2/bin/apxs
  4. Add "AddType application/x-httpd-php .php" to your Apache httpd.conf file. This lets Apache know what to do with PHP files.
  5. To actually compile the PHP source, type: make
  6. To install PHP: make install
  7. Now set up your php.ini. You may edit your .ini file to set different PHP options: cp php.ini-dist /usr/local/lib/php.ini
  8. Modify the "register_globals" parameter to "On" in /usr/local/lib/php.ini

Perl (with Thread Support)

  1. Download Perl from Perl.org.
  2. Unpack it.

Pay attention to the following items when you configure Perl on your system:

  1. Be sure to use the "-Dusethreads" option. This is necessary for multithreading in Perl. To configure Perl with threading support on your system, type: sh Configure -Dusethreads
  2. Choose to install libperl.so when prompted. This shared Perl library allows the Perl interpreter to be embedded in other applications, such as the Apache Web server.
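
Once the build and install finish, you can verify that thread support made it into your perl binary:

  • perl -V:useithreads

If threading is enabled, this prints useithreads='define'.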

After the Perl installation is complete, you can install the other needed modules by typing:

  • perl -MCPAN -e 'install HTML::Tagset'
  • perl -MCPAN -e 'install HTML::LinkExtor'
  • perl -MCPAN -e 'install HTML::Parser'
  • perl -MCPAN -e 'install Bundle::LWP'
  • perl -MCPAN -e 'install LWP::Parallel::UserAgent'
  • perl -MCPAN -e 'install Net::MySQL'
  • perl -MCPAN -e 'install DBI'
  • perl -MCPAN -e 'install DBD::mysql'

To clear up some possible confusion, the difference between the DBD::mysql and DBI modules is as follows:

  • DBI (Database Interface) is the module that gives Perl programs a common interface for attaching to databases.
  • DBD::mysql is the database driver (DBD) for MySQL specifically.

DBI uses the DBD as a translator to talk to MySQL.
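
Putting the two together, connecting to the search database created earlier looks something like this minimal sketch (the credentials match Step 1; adjust them for your installation):

    use strict;
    use DBI;    # database-independent interface; DBD::mysql does the MySQL talking

    my $dbh = DBI->connect(
        'DBI:mysql:database=search;host=localhost',
        'search', 'clamchowder',
        { RaiseError => 1 }
    ) or die $DBI::errstr;

    # Count the words loaded into the dictionary by search.sql
    my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM dictionary');
    print "Dictionary contains $count words\n";

    $dbh->disconnect;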

Step 2: Code Snippets and Explanations
Now it's time to dive into the Perl crawler code, which includes the directives for threading:

use threads;           # threading routines
use threads::shared;   # variable sharing routines

You create threads by calling threads->new(). The first argument to threads->new() is a reference to the subroutine where the new thread will begin execution. Start 10 crawler threads here (these are the workhorses):

my @CrawlerThreads;
for (0..9) {
    # grabs the content and links, writes to db
    $CrawlerThreads[$_] = threads->new(\&URLCrawler, $url_id_list, $s);
    print "Crawler " . ($_ + 1) . " created.\n";
}
  • The buildWordListFromDb{} sub loads all the words from your database dictionary table into memory. You use this throughout the lifecycle of the instance.
  • The URLArrayMonitor{} sub checks whether 99 or fewer URLs remain on the stack and the threads haven't been instructed to stop. If so, it fetches another 200 unscanned URLs from the database and pushes them onto the stack.
  • The URLCrawler{} sub pops a URL off the stack and gets the content and title from the page (a stripped-down sketch of this loop appears after this list).
  • The getWordsFromContent{} sub parses the Web page's content, grabs the words, and writes them to the ASSOC_URL_DICTIONARY database table.
  • The getLinks{} sub extracts the links found from this URL and inserts them into the database (the URLArrayMonitor thread will pick these up at a later time).
  • The MonitorStatus{} sub periodically checks the THREAD_INSTRUCTION database table for the instruction to begin killing the threads.
  • The text{} sub extracts the text from the parser.
  • To start the crawler, simply type the following:

    perl crawlThread.pl
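
The complete subroutines are in the code download. To illustrate the shape of the crawl loop, here is a stripped-down sketch of a URLCrawler-style worker; the shared @url_stack and the fetch-and-extract logic are simplifications for illustration, not the article's actual code:

    use threads;
    use threads::shared;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    my @url_stack : shared;    # filled by the URLArrayMonitor thread

    sub URLCrawler {
        my $ua = LWP::UserAgent->new(timeout => 30);
        while (1) {
            my $url;
            {
                lock(@url_stack);      # one thread pops at a time
                $url = pop @url_stack;
            }
            unless (defined $url) {
                sleep 1;               # stack empty; the real code also checks for a stop instruction
                next;
            }

            my $response = $ua->get($url);
            next unless $response->is_success;

            # Collect absolute links so they can be queued in the database
            my $extractor = HTML::LinkExtor->new(undef, $url);
            $extractor->parse($response->content);
            my @links = map  { $_->[2] }
                        grep { $_->[0] eq 'a' && $_->[1] eq 'href' }
                        $extractor->links;
            # ...write the page's words and @links to the database here...
        }
    }

When MonitorStatus{} signals shutdown, the main program can wait for the workers to exit cleanly with $_->join() for @CrawlerThreads;.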

Step 3: Web Interface—Searching the Content
In this basic search engine example, the Perl application runs behind the curtain, while the PHP Web application provides the interface for searching your MySQL database (populated by the crawler) and returning user-friendly results. Figure 2 shows a screenshot of a search for the word "database" and the results.

Figure 2: Search Results for the Word "Database"

The PHP source code is included in the code download.
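
To give a flavor of what that interface does, a minimal search page might look like the following sketch. The real source is in the download; the table and column names here are illustrative assumptions consistent with the schema sketch above:

    <?php
    // Connect with the MySQL account created in Step 1
    $link = mysql_connect('localhost', 'search', 'clamchowder')
        or die('Could not connect: ' . mysql_error());
    mysql_select_db('search', $link);

    // Find the pages containing the search term, best-scored first
    $term   = mysql_escape_string($_GET['q']);
    $result = mysql_query(
        "SELECT u.url, a.score
           FROM dictionary d, assoc_url_dictionary a, url u
          WHERE d.word = '$term'
            AND a.word_id = d.word_id
            AND a.url_id  = u.url_id
          ORDER BY a.score DESC", $link);

    while ($row = mysql_fetch_assoc($result)) {
        echo '<a href="' . $row['url'] . '">' . $row['url'] . '</a><br>';
    }
    mysql_close($link);
    ?>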

Perl's Multithreaded Capabilities
You've seen how Perl's multithreading capabilities open up some exciting possibilities. A complete tutorial on threads could fill a book, but the examples provided should give you a good understanding and get you on your way. In addition, you now have the basic pieces of a Web search engine.

By running the crawlThread.pl program on multiple machines that all feed from the same MySQL database, you can scale the search engine horizontally. For additional information and authoritative answers on how Perl threads behave, consult the documentation bundled with the Perl distribution. Finally, before unleashing the code described in this article on the Internet, add some logic to the crawler to ignore pages that you do not want crawled.
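
One ready-made piece of that logic ships with libwww-perl (installed above via Bundle::LWP): LWP::RobotUA is a drop-in replacement for LWP::UserAgent that honors each site's robots.txt. A minimal sketch, with a placeholder agent name and contact address:

    use LWP::RobotUA;

    # Identify the crawler and its operator so site owners can reach you
    my $ua = LWP::RobotUA->new('search-crawler/1.0', 'you@example.com');
    $ua->delay(1/60);    # wait at least one second between requests to a host

    my $response = $ua->get('http://www.example.com/');
    print $response->is_success ? $response->content : $response->status_line;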
