Perl + R: Open Source Programs Simplify Data Manipulation

Extracting data from huge data stores for real-time reporting and analysis can be frustrating, especially if your data processes rely on bulky, expensive apps designed for a large-scale enterprise. Perl and R are two open source technologies that can be used to simplify everyday data extraction needs, cheaply and easily.





Thanks to the driving technologies of the 90s (namely, the Internet and enterprise databases), data collection and storage have become easier and cheaper than ever before. This has caused many organizations to shift their focus to data extraction and analysis components. In this new era, "traditional" tabular reporting packages now seem lacking. Expectations like real-time information are hard to live up to, especially for many smaller organizations. Developments at the data extraction layer are lagging, despite the fact that newer data analysis paradigms, like OLAP and data mining, have been growing in popularity at the enterprise level. Moreover, many existing data extraction, reporting, and analysis processes depend on bulky, expensive, proprietary software, or else have been painstakingly developed over many person-hours. These processes also tend to be geared towards the enterprise. So what if you need an ad hoc solution that will quickly solve small, day-to-day problems? Perl and R can help.

What is R?
R is an open-source, object-oriented system for statistical computation and graphics, compatible with the commercial S-Plus program. I stumbled across R a couple of years ago when my employer was an SPSS shop. At the time, I didn't feel the need to learn about the product. Now I'm working for a smaller company (<50 employees) that simply can't afford to spend $1200 per seat. About a month ago, I finally looked into R. Now I'm hooked!

This system possesses tremendous potential for individuals and organizations of all sizes to analyze data and easily create insightful graphics. Plus, R is free, cross-platform, and relatively lightweight (<30MB), making it a viable option to deploy anywhere: your laptop, an external client, or your home computer. Unfortunately, there's no GUI (yet) to aid in learning the language, but the documentation is excellent (it's largely compatible with S-Plus, so any examples for S-Plus should also work in R).

R can be used to import data from a database or from a text file, or even to download data sets directly from a URL as if it were a simple text file on your local computer. SPSS base can't do this! The only catch is that the URL must point to a page containing only plain text, not HTML. To work around this, you can use an intermediary Web page that strips out the HTML and retains only the data, in delimited text format. If you come from the ASP/VB world, you know this type of "cleansing" is tricky. This is where Perl's text processing capabilities come in handy.
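As a minimal sketch of the import step, assuming the intermediary page returns tab-delimited text with a header row, R's standard read.delim() function can parse it. The column names (player, ab, h), the placeholder values, and the example.com address below are all assumptions made for illustration; none of them come from the original page.

```r
# Tab-delimited text of the kind an intermediary page might return after
# stripping the HTML. Column names and values are placeholders, not real stats.
cleaned <- "player\tab\th
player_a\t100\t30
player_b\t80\t20
player_c\t120\t42"

# read.delim() parses tab-delimited text with a header row.
# The text= argument stands in here for a file name or URL.
stats <- read.delim(text = cleaned, stringsAsFactors = FALSE)

# With a real plain-text page, the same call would take the URL directly
# (hypothetical address):
# stats <- read.delim("http://example.com/cleaned_stats.txt")

print(stats)
```

Passing the data through text= keeps the sketch runnable offline; swapping in a URL should not require any other change as long as the page is plain delimited text.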

Figure 1: I used the Boston Red Sox stat page as my sample Web page.

A Working Example
Whenever I'm trying to learn any new programming language or statistical technique, I look at baseball stats. It often helps to apply things in a context that makes the most intuitive sense. My R + Perl example retrieves real-time batting statistics for any given team, downloads them into R, and then performs some basic graphing and statistics. The Boston Red Sox is the guinea pig team and here's the sample Web page.
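To give a rough idea of the "basic graphing and statistics" step, here is a short sketch in R. It is not the article's actual code: the data frame, its column names, and every number in it are invented placeholders standing in for whatever the download step returns.

```r
# Placeholder batting lines; all values are invented for illustration only.
batting <- data.frame(
  player = c("player_a", "player_b", "player_c"),
  ab     = c(100, 80, 120),   # at-bats
  h      = c(30, 20, 42),     # hits
  stringsAsFactors = FALSE
)

# Batting average is hits divided by at-bats.
batting$avg <- batting$h / batting$ab

# Basic descriptive statistics for the new column.
print(summary(batting$avg))

# A simple bar chart of batting average by player.
barplot(batting$avg,
        names.arg = batting$player,
        ylab = "Batting average",
        main = "Batting average by player (placeholder data)")
```

When run non-interactively, R writes the chart to its default graphics device (a PDF file) rather than opening a window.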
