hanks to the driving technologies of the 90s?namely, the Internet and enterprise databases?data collection and storage have become easier and cheaper than ever before. This has caused many organizations to shift their focus to data extraction and analysis components. In this new era, “traditional” tabular reporting packages now seem lacking. Expectations like real-time information are hard to live up to?especially for many smaller organizations. Developments at the data extraction layer are lagging?despite the fact that newer data analysis paradigms, like OLAP and data mining, have been growing in popularity at the enterprise level. Moreover, many existing data extraction, reporting, and analysis processes depend on bulky, expensive, proprietary software, or else have been painstakingly developed with many person-hours. These processes also tend to be geared towards the enterprise. So what if you’re in need of an ad hoc solution that will quickly solve small, day-to-day problems? Perl and R can help.
What is R?
R is an open-source, object-oriented system for statistical computation and graphics compatible with the commercial S-Plus program. I stumbled across R a couple years ago when my employer was an SPSS shop. At the time, I didn’t feel the to learn about the product. Now I’m working for a smaller company (This system possesses tremendous potential for individuals and organizations of all sizes to analyze data and easily create insightful graphics. Plus, R is free, cross-platform, and relatively lightweight (R can be used to import data from a database or from a text file, or even to download data sets directly from a URL as if it were a simple text file on your local computer. SPSS base can’t do this! The only catch is that the URL must be to a page containing only plain text, not HTML. To workaround this, you can use an intermediary Web page that strips out the HTML and retains only the data, in delimited text format. If you come from the ASP/VB world, you know this type of “cleansing” is tricky. This is where Perl’s text processing capabilities come in handy.
|Figure 1: I used the Boston Red Sox stat page as my sample Web page.
A Working Example
Whenever I’m trying to learn any new programming language or statistical technique, I look at baseball stats. It often helps to apply things in a context that makes the most intuitive sense. My R + Perl example retrieves real-time batting statistics for any given team, downloads them into R, and then performs some basic graphing and statistics. The Boston Red Sox is the guinea pig team and here’s the sample Web page.
Perl Extracts the Data
The first order of business is to get Perl to fetch the HTML and parse its contents?while retaining only the HTML table that actually contains the players’ batting statistics. Perl’s excellent documentation paves the way. A simple search on Google Groups for “Perl extract HTML from Web page” will lead you to the core Perl LWP module (Perl’s online library). Use the single line $html=get($URL); to fetch the contents of the above URL into a string variable for further processing.
The next step is to parse the returned HTML inside of this $html variable. This is done using the HTML::TableExtract module (contributed by Matthew P. Sisk). This module zooms into any HTML table(s) of interest in the HTML string and strips out the formatting. It’s fairly straightforward to download this module and leverage the example. The table is the sixth one in the page and the only HTML table needed. You only want to start retaining text beginning with the third row in the table. The first two rows are used for formatting and contain HTML combo boxes. Likewise, you don’t want the last four rows because they contain totals and other formatting. The rows in this module actually get parsed into an array of table cell objects. Using Perl’s join() function, it’s easy to concatenate array elements into a string. To finish the script and run from your desktop, post it online, along with the TableExtract.pm file. Post it in the CGI-BIN folder and connect to it via you browser. Click here to see how mine looked.
|Figure 2: Post the script online along with the TableExtract.pm file.
R Stores the Data
Use R to download the data set located at the URL. The line bb reads the contents into an R data frame at the given URL. What’s a data frame? It’s a spreadsheet-type object with columns and rows as well as columns and, optionally, row labels.
Tell the data frame to expect column headers and the text file to be comma-delimited. Once the data has downloaded into the bb data frame, issue print(bb);, which displays the contents of the object. This tells you whether the data has come in as expected.
R Displays the Data
Next, request some simple graphs. Ask R to perform descriptive statistics on all of the columns in the bb data frame by simply calling the summary(); function. If you think about how such aggregate computations would have to be setup in other packages, for instance SQL query, it’s rather impressive.
|Figure 3: Here’s what the R output looks like.
Next up, use the lm() function to perform a simple linear regression call. For this example, ask R to predict RBI from SLG. Save the results into a linear model object which you can check by calling the same summary() function as above. Note that while a “data frame” object and a “linear model” object probably aren’t the same thing, R handles them both with the same generic function. R takes advantage of a lot of built-in, generic functions that are smart. They implicitly know the type of object being passed in and act accordingly.
R, like Perl, internalizes a lot of default function parameters and is astute in its assumptions. Consequently, similar to how Perl can name a Web page’s HTML tune in one line, R can produce a presentation quality plot by simply calling the plot() function. You can copy this output into a report or PowerPoint presentation. These graphs can also be saved out into many different formats, including postscript, PDF, PNG, BMP, or JPG.
Motivated readers that dive in will quickly discover that you can feed batch files to the terminal version of R by executing the command-line syntax:
Rterm -q --vanilla
/do this once your Windows system PATH variable is updated to include C:Program FilesR
|Figure 4: The R graphs.
This takes a SourceFile.r and feeds it to R. The output from your data analyses gets written out as text to OutFile.txt and any plots get saved out into a multipage postscript (.ps) file?one page per graph. I found an excellent open source program to view these post script files?as well as convert them to PDF files (or 101 other formats)?called GhostScript. There's actually two pieces to GhostScript on Windows. Ghostscript (the engine and accessible via command-line) and GSview (a graphical interface into Ghostscript).
This, of course, opens up even more possibilities! For example, I prefer to use TextPad for all of my coding independent of the programming language. Thus, I configured two new tools in TextPad?one for Perl and one for R.
Perl and R Working Together
This article has barely scratched the surface of these two open source languages. They can work synergistically to offer serious, viable solutions to complicated data extraction and analysis challenges. I encourage you not to wait two years before you explore Perl and R. They're both free, so what do you have to lose?