Data Cleansing: Powershell Versus UNIX Shell Scripting
While researching how to use the Windows Management Interface (WMI) to collect statistics for performance monitoring, I was lucky to stumble across some new tools improvements in Windows Server 2003 R2 that rival the ease-of-use of UNIX shells. These logarithmic improvements not only can help improve scripting automation tasks and teach you how to collect WMI data, but the WMI utilities don't require any written code. The scripting language improvement is called Powershell
, and the WMI management improvements are called WMICodeCreator and Scriptomatic.
The excitement of discovering Powershell was heightened by my recent experience working with UNIX shell scripting in conjunction with Informatica in the context of ETL processing. I had my eyes opened to standard operating procedures and the expectations of ETL developers in this environment. I couldn't help but contrast UNIX with what I knew about accomplishing the same tasks in Windows. To be precise, I thought the functionality I saw my teammates use for UNIX shell scripting was possible in Windows, but only with the previously described overhead.
The principal difference between what I saw and what I expected was the facility of piping commands so the data stream is transformed by each operation. The products of these chained commands exceed those of the individual contributing commands. And all this power came with an entry-level account in UNIX.
The UNIX scripting task that left me shaking my head was a data file that featured an interesting characteristic: the numeric part of the address line abutted the last character of the name (e.g., "Smith123 Main Street" or "Jones456 Elm Street"). My UNIX compadre told me this was no problem. He explained that we would pre-process the file and insert a comma and a space at each number with a Perl regular expression like this:
The expression inserts a comma in front of the first number it encounters in a string. You save it in a filein this example, it's called fileand redirect it to an output file.txt like this:
./file.pl inputFile.txt > outputFile.txt
My colleague wrote a file with that expression, and then read the text file using the expression file as a filter and another UNIX convention in the same category as pipe or redirection. Next, he wrote the product to a new file named "OUT." This file had the numbers shifted over so that the last name ended with a comma.
I was flabbergasted.
The good news is Powershell enables this functionality in Windows at least as directly and naturally as UNIX shell scripting does. To accomplish the same thing as the Perl routine, the expression would say "
–replace \d,$1". The parentheses go away, and you replace the normal regular expression tag for replace (
s/) with the Powershell cmdlet
-replace. The following screen shots (Figures 1-4) show just how brain-dead simple working with Powershell is compared with anything else.
Figure 1 shows how to invoke Powershell from the command window. You know you've got it when the PS prompt blinks.
|Figure 1. Invoking Powershell from the Command Window|
Figure 2 shows how to use Powershell's match cmdlet (say, commandlet) to execute the Bill Clinton test for existence. If a match is found in the string, the inbuilt variable
$matches contains the results of the match.
|Figure 2. Using Powershell's Match cmdlet to Execute the Bill Clinton Test for Existence|
Figure 3 sets the scene with a sample input file that has a problem: the address line numbers abut the name.
Figure 4 shows how to store an instance of the munged input line and use a .NET regular expression to fix the input stream. The English translation is "Find a number of indeterminate length."
|Figure 4. Store an Instance of the Munged Input Line and Use a .NET Regular Expression to Fix the Input Stream|
Powershell is light years ahead of Windows' preter-language, command processor commands (What is that stuff anyway? The stuff you write command files in?). It provides everything you ever wanted in a programming language: flow of control and strong typing of its variables and expressions to manipulate them to create scripts that leverage the .NET Framework in a much more accessible skin.