he version of DTS in SQL Server 2000 provides a bundle of pre-packaged tasks. Most are dedicated to moving data, but some are specific to SQL Server maintenance. The data tasks facilitate moving information between database servers (not limited to SQL Server), or moving other file types from FTP or text files into databases. Some tasks are specific to SQL Server maintenance (e.g., transferring databases or transferring user logins), but the bulk of the functionality is dedicated to moving data.
From a higher-level view, DTS tasks perform rudimentary jobs that you can use to get your work done. However, they got me only almost home because I was using them “right out of the box.” I didn’t understand how to connect them to fulfill my requirements. Once I learned how to coordinate several different types of tasks, I found that DTS allows access to other object models (ADO), which enabled me to reach my processing goals.
How can I import an indeterminate number of items from a directory into a staging table and stamp the records with the directory from which they came and the time of the import? How do I then parse each file name into fields in the staging table by identifying the elements contained in the imported file?
Use the DTS Execute Process task to execute a batch file that redirects the result of a dir command to a text file. Then use the DTS Connection for a Text File Source to pipe the contents of the text file into a Transform Data task. This fills a table that will hold the names of the files you want to process.
Business Problem: “After the Fact” Updates
In terms of resource usage, establishing a database connection between an application and a server is one of the most expensive things you can do. Yet because the interface (DTS Designer) makes selecting a connection and placing it in a package so easy, you may be tempted to just select a new connection, point it to a task, and go?ignoring the resulting resource glut. But the right thing to do and the easy thing to do are two different courses.
If you need to import a collection of objects, the easy thing to do is to open a connection and hook it up to a task that performs the operation (insert, update, find a substring, etc). However, the right thing to do is reuse the connection and loop through the collection, moving on to the next instance.
This connection principle lent itself well to a recent project I had to tackle at a client organization. My job was to build a solution that would automatically refresh check register data from Excel, kind of like replication does for databases. The problem was, as far as I knew, replication didn’t exist for Excel.
Using a selection of DTS tasks enabled me to rebuild a snapshot between two disparate data sources (Excel and the data warehouse), saving the organization hours of manual reconciliation between their claim data and their check data. By using DTS, I was able to import their data from Excel and compare it against the data warehouse to generate a list of only the checks that required investigation, typically less than $5,000 worth of checks on claims batches of more than $5 million.
Basically, I automated the business process of importing cleared checks from Excel workbooks and comparing them to claim records from the data warehouse. However, the unstable nature of claims, which can undergo a change in state after they are paid, complicates this process. For example, utilization management could decide that a doctor was overpaid, a claim was paid twice, or a claim was underpaid. These facts of life mean that checks often undergo changes after they are written. Likewise, a check could be lost voided or reissued.
I needed to account for this dynamic aspect in my solution. A snapshot of the check register was not adequate, because these changes in state occurred sometimes months after the claim was settled. Instead, the solution needed to refresh the check data every time a new batch of claims was reconciled, typically three times a week. So while replication would have been a good solution if the check data were also stored in a database, in this case the Finance department, which was responsible for the checking part of the process, managed the data in Excel. (What finance department doesn’t?)
The Task to Fulfill the Task
So after examining the pre-packaged tasks in the DTS Designer, I came across the Execute Process task, which lets you execute a process from DTS (kind of how
xp_cmdshell does, except without the usual restrictions or rights). Since this task reaches out to the OS, the rights to execute it are left to members of the administrator’s role. DTS lets you short-circuit this restriction by executing the command in the context of the service running the server. So you get a little immunity over the rights required to run OS jobs.
Figure 1 shows a top down view of the first leg of this process.
|Figure 1: Top Down View of Executive Process|
The step Run Checkloc.bat is an Execute Process task, something that runs a program or operating system command. Here, I’m using it to execute a batch file. I want to redirect the output of a DIR command into a text file, and then use the DTS text import task to push the list from the DIR command into a table. The batch file changes the current directory (in two steps, just for clarity’s sake) and then the DIR command is executed. It looks for all instances of files ending in XLS and redirects them to my local drive into a text file named Checkloc.txt. This file is input to the text file source connection later on in the package.
For easier deployment, you should record all your path references in DTS using the
\ServerNameSharename designation instead of referring to the mapped drive letter. The mapped drive is dependent on the location where the process runs, and the server may not have installed the mapping that the developer who created the package did when he or she created it. So preface your paths with the
\ServerNameShareName convention rather than using an explicit drive letter.
The commands shown in Figure 2 produce a text file that looks like the directory display at the command window. Figure 3 shows the output of the cmd file.
|Figure 2: Commands to Produce Text File|
By pointing the Text File Source task to this text file, I was able to direct DTS to use this as input into a table. I just brought a whole line from the text file into a staging table, then applied the Execute SQL Tasks one after another to clean up the row (using the substring function and an update command). As a result, I was left with only the directory name, the file name, and (by doing a substring select) a file type. File types were embedded in the file names, so HH represented one type of check payment and OV another. These were required for grouping the report that Finance creates at the end of their reconciliation process. Making data part of a file name generally is a bad practice, but the IT department doesn’t have a lot of influence over Finance’s naming conventions.
|Figure 3: The Output of the cmd File|
I input the text file into a transform data task (see Figure 4). The source for the transformation is the Text file I created from the batch file. The destination is the scrubbing table, which has one 255 varchar row holding the untransformed input and another column called Cleaned to hold the output of the transformation process. The transformation type is Copy Column, which does exactly what it says. I used T-SQL to parse the contents of Col001 into Cleaned, and then just replaced Col001 with the substring I needed out of Cleaned.
|Figure 4: The Transform Data Task|
After the Transform Data task, a sequence of Execute SQL tasks perform cleanup on the imported rows. The tasks’ names indicate their jobs. For example, the first Execute SQL task removes an End of Line character from the end of the row, which appears as a little box sticking out at the end of the file name. It stands for the carriage return line feed characters, which are not displayable. I also had to remove the excess lines that were imported from the directory listing. I used the line starting with the path and appended it to the front of the file name so that I could tell the source from which the Excel file came. The last two tasks parse the HH and OV out of the file name and put them in the first column.
At the end of this process, I had the contents of the directory holding the Excel spreadsheets in a table. My next job was to extract the contents of each of these files into a scrubbing table. That would leave me with the most current version of the check registers to validate the payments of claims for which the checks were issued.
As so happens in the course of development, ideas that would’ve improved or refined the initial method used to create a solution crop up after the fact. So its no surprise that after I got this month’s 10-Minute Solution working, I got a few other ideas that would have made it more elegant.
It would have been more elegant to replace the batch file with the Scripting Runtime File System Object (FSO). FSO has methods that know about the folder name, the folder’s files collection, and each file’s attributes (its name, date, and time). It can also read file names and rename them too.
The main advantage of using FSO would have been eliminating the text file altogether. By putting the FSO inside a for loop that would execute once for each file in its file collection, the operation that loads the text file and moves the cleaned file name to the location table could be reduced from a two-step process to a one-step process. Unfortunately, this idea came to me after the process was already in production. So I didn’t follow up on it, but it does strike me as a more elegant way to accomplish this same end. The file names and file types would remain available, and you could either use the file’s date and time, or else retrieve the update time using
getdate(). You could even make the process more lightweight by saving the file’s date and time and the time it was last loaded (the
getdate() value), and use these as kind of a timestamp. If the file date and time did not change since the last time the location table was populated, you would know that the Excel worksheet was not changed. Therefore, it would not need to re-imported. Ah, but hindsight is always at least 20/20.
In next month’s 10-Minute Solution, I will show you how to use the contents of this table of file names to feed two more downstream processes that open up each Excel file and insert their contents into a staging table. Then I’ll show you how to create secure processes that encapsulate these sub-processes into higher-level packages. Stay tuned.