Thanks to the driving technologies of the '90s (namely, the Internet and enterprise databases), data collection and storage have become easier and cheaper than ever before. This has caused many organizations to shift their focus to data extraction and analysis components. In this new era, "traditional" tabular reporting packages now seem lacking. Expectations like real-time information are hard to live up to, especially for many smaller organizations. Developments at the data extraction layer are lagging, despite the fact that newer data analysis paradigms, like OLAP and data mining, have been growing in popularity at the enterprise level. Moreover, many existing data extraction, reporting, and analysis processes depend on bulky, expensive, proprietary software, or else have been painstakingly developed over many person-hours. These processes also tend to be geared towards the enterprise. So what if you need an ad hoc solution that will quickly solve small, day-to-day problems? Perl and R can help.
What is R?
R is an open-source, object-oriented system for statistical computation and graphics, compatible with the commercial S-Plus program. I stumbled across R a couple of years ago when my employer was an SPSS shop. At the time, I didn't feel the need to learn about the product. Now I'm working for a smaller company (<50 employees) that simply can't afford to spend $1200 per seat. About a month ago, I finally looked into R. Now I'm hooked!
This system possesses tremendous potential for individuals and organizations of all sizes to analyze data and easily create insightful graphics. Plus, R is free, cross-platform, and relatively lightweight (<30 MB), making it a viable option to deploy anywhere: your laptop, an external client, or your home computer. Unfortunately, there's no GUI (yet) to aid in learning the language, but the documentation is excellent (R is largely compatible with S-Plus, so any examples for S-Plus should also work in R).
R can be used to import data from a database or from a text file, or even to download data sets directly from a URL as if it were a simple text file on your local computer. SPSS base can't do this! The only catch is that the URL must point to a page containing only plain text, not HTML. To work around this, you can use an intermediary Web page that strips out the HTML and retains only the data, in delimited text format. If you come from the ASP/VB world, you know this type of "cleansing" is tricky. This is where Perl's text processing capabilities come in handy.
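As a rough illustration of that cleansing step, here is a minimal Perl sketch that reduces an HTML table to tab-delimited text. The function name html_table_to_delimited and the sample markup are invented for illustration; they are not taken from any real stats page. The regex approach assumes a simple, well-formed table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the rows of an HTML table into tab-delimited text.
# This is a regex-based sketch; pages with nested tables or malformed markup
# would be better served by a proper HTML parser.
sub html_table_to_delimited {
    my ($html) = @_;
    my @lines;

    # Each <tr>...</tr> becomes one output line.
    while ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my $row = $1;
        my @cells;

        # Each <td> or <th> becomes one field.
        while ($row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis) {
            my $cell = $1;
            $cell =~ s/<[^>]+>//g;      # drop nested tags such as <a> or <b>
            $cell =~ s/&nbsp;/ /g;      # treat non-breaking spaces as spaces
            $cell =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @cells, $cell;
        }
        push @lines, join("\t", @cells) if @cells;
    }
    return join("\n", @lines) . "\n";
}

# Invented sample markup (placeholder names and numbers only).
my $sample = <<'HTML';
<html><body><h1>Batting</h1>
<table>
<tr><th>Player</th><th>AB</th><th>H</th></tr>
<tr><td><a href="#">Player A</a></td><td>10</td><td>3</td></tr>
<tr><td>Player B</td><td>8</td><td>&nbsp;2</td></tr>
</table></body></html>
HTML

print html_table_to_delimited($sample);
```

Served as plain text from an intermediary page, output like this is the kind of delimited data R can read straight from a URL, for instance with read.delim.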
Figure 1: I used the Boston Red Sox stat page as my sample Web page.
A Working Example
Whenever I'm trying to learn a new programming language or statistical technique, I look at baseball stats. It often helps to apply things in a context that makes the most intuitive sense. My R + Perl example retrieves real-time batting statistics for any given team, downloads them into R, and then performs some basic graphing and statistics. The Boston Red Sox are the guinea pig team; their stat page (Figure 1) is the sample Web page.