Step 1: Preliminary Setup
To begin, you must set up your Linux system with the proper software:
- Apache
- MySQL
- PHP
- Perl (with threading support)
Although you may use other versions of Apache, the PHP application was developed using version 2.0.47.
If you do not already have MySQL installed on your Linux system, you can obtain it directly from MySQL.com. If you run the crawler on the same box as your database, install the MySQL server, client, and development libraries, for example:
- rpm -ivh MySQL-server-4.0.18-0.i386.rpm
- rpm -ivh MySQL-client-4.0.18-0.i386.rpm
- rpm -ivh MySQL-devel-4.0.18-0.i386.rpm
Once you've installed the MySQL server, you must create the database used in this example and apply some additional security. At the command prompt, type the following commands:
mysql -u root
create database search;
GRANT ALL PRIVILEGES ON *.* TO search@localhost IDENTIFIED BY 'clamchowder' WITH GRANT OPTION;
GRANT USAGE ON *.* TO search@localhost;
Tweak the above privileges to meet your specific needs (for example, granting on search.* instead of *.* limits the account to this one database) and be sure to choose a more secure password.
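As one way to tighten the grant, the statement can be generated for a single database rather than for *.*. This is a minimal sketch, not part of the original download: the helper name scoped_grant is hypothetical, and it reuses the GRANT ... IDENTIFIED BY form shown in the commands above.

```shell
# Hypothetical helper: prints a GRANT statement limited to one database.
# Arguments: database name, user name, password.
scoped_grant() {
    db=$1 user=$2 pass=$3
    # Refuse passwords that would break out of the SQL string literal.
    case "$pass" in
        *"'"*|*\\*)
            echo "password must not contain quotes or backslashes" >&2
            return 1 ;;
    esac
    printf "GRANT ALL PRIVILEGES ON %s.* TO %s@localhost IDENTIFIED BY '%s';\n" \
        "$db" "$user" "$pass"
}

# Example use (assumes root can connect as in the commands above):
#   scoped_grant search search 'a-better-password' | mysql -u root
```

Printing the statement first makes it easy to review before piping it into the mysql client.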
Run the search.sql script (included in the code download) at the command prompt to create the physical database structure and load some initial words into your dictionary:
mysql -u root -p search < /search.sql
Once completed, you can issue the show tables command from within MySQL:
mysql> show tables;
+----------------------+
| Tables_in_search     |
+----------------------+
| assoc_url_dictionary |
| dictionary           |
| thread_instruction   |
| url                  |
+----------------------+
4 rows in set (0.01 sec)
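The same check can be scripted by comparing the table list against the four names shown above. This is a rough sketch: the helper name missing_tables is hypothetical, and it reads table names on standard input so it can be fed from the mysql client.

```shell
# Hypothetical helper: reads table names (one per line) on standard input
# and prints each expected table that is absent from that list.
missing_tables() {
    found=$(cat)
    for t in assoc_url_dictionary dictionary thread_instruction url; do
        # -x matches whole lines only, so "url" does not match
        # "assoc_url_dictionary".
        printf '%s\n' "$found" | grep -qx "$t" || printf '%s\n' "$t"
    done
}

# Example use (-N suppresses the column header; the password is prompted):
#   mysql -u root -p -N -e 'show tables' search | missing_tables
```

No output means all four tables are present.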
PHP
- Obtain PHP from PHP.net.
- Once downloaded, extract the files by typing:
bzip2 -dc php-4.3.6.tar.bz2 | tar -xvf-
- After you've extracted the files, configure PHP with Apache 2 and MySQL support. Change the path in the 'apxs2' parameter if your Apache bin directory is different:
./configure --with-mysql --with-apxs2=/usr/local/apache2/bin/apxs
- Add "AddType application/x-httpd-php .php" to your Apache httpd.conf file. This tells Apache how to handle PHP files.
- To actually compile the PHP source, type:
make
- To install PHP:
make install
- Now set up your php.ini. You may edit your .ini file to set different PHP options:
cp php.ini-dist /usr/local/lib/php.ini
- Set the "register_globals" parameter to "On" in /usr/local/lib/php.ini.
Perl (with Thread Support)
- Download Perl from Perl.org.
- Unpack it.
Pay attention to the following items when you configure Perl on your system:
- Be sure to use the "-Dusethreads" option. This is necessary for multithreading in Perl. To configure Perl with threading support on your system, type:
sh Configure -Dusethreads
- Choose to build the shared libperl.so when prompted. This shared library allows other programs, such as the Apache Web server, to embed the Perl interpreter.
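Once Perl is built and installed, it is worth confirming that thread support was actually compiled in. On a threaded build, perl -V:usethreads prints usethreads='define';. The sketch below wraps that check; the helper name is_threaded is hypothetical.

```shell
# Hypothetical helper: succeeds when the given `perl -V:usethreads` output
# reports a threaded build.
is_threaded() {
    case "$1" in
        *"usethreads='define'"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query whichever perl is first on the PATH and report the result.
if is_threaded "$(perl -V:usethreads 2>/dev/null)"; then
    echo "perl: thread support enabled"
else
    echo "perl: no thread support; rerun Configure with -Dusethreads"
fi
```

If an older system Perl is still first on your PATH, call the newly installed binary by its full path instead.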
After the Perl installation is complete, install the other required modules by typing:
perl -MCPAN -e 'install HTML::Tagset'
perl -MCPAN -e 'install HTML::LinkExtor'
perl -MCPAN -e 'install HTML::Parser'
perl -MCPAN -e 'install Bundle::LWP'
perl -MCPAN -e 'install LWP::Parallel::UserAgent'
perl -MCPAN -e 'install Net::MySQL'
perl -MCPAN -e 'install DBI'
perl -MCPAN -e 'install DBD::mysql'
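CPAN installs can fail partway through, so a quick load test of each module is a reasonable follow-up. This is a minimal sketch, with a hypothetical helper name (missing_modules); it checks LWP::UserAgent as a stand-in for Bundle::LWP, on the assumption that the bundle installed the core LWP modules.

```shell
# Hypothetical helper: tries to load each named module with the given
# interpreter and prints the names of the modules that fail to load.
missing_modules() {
    interp=$1; shift
    for m in "$@"; do
        "$interp" -M"$m" -e 1 >/dev/null 2>&1 || printf '%s\n' "$m"
    done
}

# Check the modules installed above; no output means every module loaded.
missing_modules perl HTML::Tagset HTML::LinkExtor HTML::Parser \
    LWP::UserAgent LWP::Parallel::UserAgent Net::MySQL DBI DBD::mysql
```

Passing the interpreter as the first argument makes it possible to test a specific Perl build, for example the threaded one you just installed.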
To clear up some possible confusion, the difference between the DBI and DBD-mysql modules is as follows:
- DBI (the Database Interface) is a module that enables Perl programs to connect to databases through a common API.
- DBD-mysql (DBD::mysql) is the database driver for MySQL.
DBI uses the DBD as a translator to talk to MySQL.