Perl + R: Open Source Programs Simplify Data Manipulation

Extracting data from huge data stores for real-time reporting and analysis can be frustrating, especially if your data processes rely on bulky, expensive apps designed for a large-scale enterprise. Perl and R are two open source technologies that can be used to simplify everyday data extraction needs, cheaply and easily.





Perl Extracts the Data
The first order of business is to get Perl to fetch the HTML and parse its contents, retaining only the HTML table that actually contains the players' batting statistics. Perl's excellent documentation paves the way. A simple search on Google Groups for "Perl extract HTML from Web page" will lead you to the LWP module (libwww-perl, Perl's web-client library). Use the single line $html=get($URL); to fetch the contents of the stats page URL into a string variable for further processing.

The next step is to parse the HTML now sitting in the $html variable. The HTML::TableExtract module (contributed by Matthew P. Sisk) does exactly this: it zooms in on the HTML table(s) of interest and strips out the formatting. It's fairly straightforward to download the module and adapt its example. The batting-statistics table is the sixth table on the page and the only one needed. Within it, retain text only from the third row onward: the first two rows are used for formatting and contain HTML combo boxes, and the last four rows contain totals and other formatting, so skip those as well. The module parses each row into an array of table-cell objects, and Perl's join() function makes it easy to concatenate those array elements into a string.

To run the finished script from somewhere other than your desktop, post it online along with the TableExtract.pm file. Put both in the CGI-BIN folder and connect to the script via your browser. Figure 2 shows how mine looked.
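As a rough illustration of the row-skipping and join() steps described above, here is a core-Perl sketch. The real script would use get() from LWP::Simple and HTML::TableExtract as discussed; this stand-in uses an inline HTML string and naive regexes (never a good idea on real-world HTML) so it runs with no CPAN modules, and the table contents are made up:

```perl
#!/usr/bin/perl
# Simplified stand-in for the LWP + HTML::TableExtract pipeline:
# $html plays the role of what get($URL) would return, and naive
# regexes stand in for HTML::TableExtract (a sketch, not a parser).
use strict;
use warnings;

my $html = <<'HTML';
<table>
<tr><td><select>...</select></td></tr>
<tr><td>Player</td><td>RBI</td></tr>
<tr><td>Jones</td><td>98</td></tr>
<tr><td>Smith</td><td>75</td></tr>
</table>
HTML

my @rows;
while ($html =~ m{<tr>(.*?)</tr>}gs) {
    my @cells = $1 =~ m{<td>(.*?)</td>}gs;   # one row's cell contents
    push @rows, join(",", @cells);           # concatenate cells, CSV-style
}
splice(@rows, 0, 2);   # drop the two formatting rows at the top
                       # (the real table would also drop its last
                       #  four totals rows, e.g. splice(@rows, -4))
print "$_\n" for @rows;
```

The join(",", @cells) call is what turns each parsed row back into a comma-delimited line suitable for R's read.table().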

Figure 2: Post the script online along with the TableExtract.pm file.

R Stores the Data
Use R to download the data set located at the URL. The line bb <- read.table(URL, header=TRUE, sep=","); reads the contents at the given URL into an R data frame. What's a data frame? It's a spreadsheet-like object with rows and columns, plus column labels and, optionally, row labels.

The header=TRUE and sep="," arguments tell read.table() to expect column headers and comma-delimited text. Once the data has downloaded into the bb data frame, issue print(bb); to display the object's contents. This tells you whether the data came in as expected.
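The read.table() step can be sketched as follows. A temporary file stands in for the live URL here (read.table accepts a URL string the same way), and the column names and values are invented for illustration:

```r
# Sketch of the read.table() step, using a temp file in place of the URL.
csv <- tempfile(fileext = ".csv")
writeLines(c("Player,AB,RBI,SLG",
             "Jones,510,98,0.512",
             "Smith,488,75,0.431"), csv)

bb <- read.table(csv, header = TRUE, sep = ",")  # same call as with the URL
print(bb)   # eyeball the contents of the data frame
str(bb)     # column types and dimensions at a glance
```

str() is a handy companion to print() for confirming that numeric columns really came in as numbers rather than text.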

R Displays the Data
Next, request some simple graphs. Ask R to compute descriptive statistics on every column of the bb data frame by simply calling summary(bb);. If you think about how such aggregate computations would have to be set up in other packages (as a SQL query, for instance), it's rather impressive.
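A minimal sketch of that one-call summary, using a small made-up data frame in place of the downloaded batting data:

```r
# One call gives min, quartiles, mean, and max for every numeric column.
bb <- data.frame(AB  = c(510, 488, 430),
                 RBI = c(98, 75, 61),
                 SLG = c(0.512, 0.431, 0.388))
summary(bb)   # compare with hand-writing GROUP BY aggregates in SQL
```

Each column gets its own block of statistics in the output, with no per-column setup required.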

Figure 3: Here's what the R output looks like.

Next up, use the lm() function to perform a simple linear regression. For this example, ask R to predict RBI from SLG. Save the results into a linear model object, which you can inspect by calling the same summary() function as above. Note that although a data frame and a linear model are different kinds of objects, R handles both with the same call. R relies heavily on smart, built-in generic functions: they detect the class of the object passed in and act accordingly.
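The regression step might look like this, again with invented SLG and RBI values standing in for the scraped data:

```r
# Predict RBI from SLG with lm(); summary() is generic, so the same
# call that described the data frame now describes the fitted model.
bb  <- data.frame(SLG = c(0.388, 0.431, 0.512, 0.575),
                  RBI = c(60, 75, 98, 118))
fit <- lm(RBI ~ SLG, data = bb)   # linear model object
summary(fit)                      # coefficients, R-squared, p-values
coef(fit)                         # just the intercept and slope
```

The formula RBI ~ SLG reads as "model RBI as a function of SLG"; the same formula notation feeds plot(), aggregate(), and many other R functions.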

R, like Perl, internalizes a lot of default function parameters and is astute in its assumptions. Consequently, much as Perl can grab a Web page's HTML in one line, R can produce a presentation-quality plot with a single call to the plot() function. You can copy this output into a report or PowerPoint presentation, and the graphs can also be saved out in many different formats, including PostScript, PDF, PNG, BMP, and JPG.
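Saving a plot to one of those formats just means wrapping the plot() call in a graphics-device call such as pdf(), png(), or postscript(). A sketch with the same invented data, writing to a temporary PDF:

```r
# plot() picks sensible defaults; opening a device first (pdf, png,
# postscript, ...) sends the figure to a file instead of the screen.
bb  <- data.frame(SLG = c(0.388, 0.431, 0.512, 0.575),
                  RBI = c(60, 75, 98, 118))
out <- tempfile(fileext = ".pdf")
pdf(out)                              # open a PDF graphics device
plot(bb$SLG, bb$RBI,
     xlab = "SLG", ylab = "RBI",
     main = "RBI vs. slugging percentage")
abline(lm(RBI ~ SLG, data = bb))      # overlay the regression line
dev.off()                             # close the device, flush the file
```

Swapping pdf(out) for png("figure.png") or postscript("figure.ps") changes only the output format, not the plotting code.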

Motivated readers who dive in will quickly discover that you can feed batch files to the terminal version of R with the following command-line syntax:

Rterm -q --vanilla < SourceFile.r > OutFile.txt

(Do this once your Windows system PATH variable has been updated to include C:\Program Files\R\rw1070\bin.)

Figure 4: The R graphs.

This takes SourceFile.r and feeds it to R. The output from your data analyses is written as text to OutFile.txt, and any plots are saved into a multipage PostScript (.ps) file, one page per graph. I found an excellent open source program called Ghostscript for viewing these PostScript files, as well as converting them to PDF (or many other formats). There are actually two pieces to Ghostscript on Windows: Ghostscript itself (the engine, accessible via the command line) and GSview (a graphical interface to it).

This, of course, opens up even more possibilities! For example, I prefer to use TextPad for all of my coding, independent of the programming language, so I configured two new tools in TextPad: one for Perl and one for R.

Perl and R Working Together
This article has barely scratched the surface of these two open source languages. They can work synergistically to offer serious, viable solutions to complicated data extraction and analysis challenges. I encourage you not to wait to explore Perl and R. They're both free, so what do you have to lose?

Tom Dierickx is a Data Analyst specializing in automating data processes and authoring data-driven solutions. He has a wide range of computer programming and database development experience along with a M.S. in Statistics from Arizona State University. All of the languages, tools, and techniques he enthusiastically pursues - whether technical or statistical - are simply an outgrowth of the passion he has for working with data.