

Home-grown Data Mining in SQL Server: Page 2





Assignment: Find a Needle in a Haystack
As a database programmer, I spend a lot of time behind the SELECT statement, writing queries for reports and working them into stored procedures for processing. Ninety percent of that effort goes into the following tasks:
  1. Designing the appropriate join conditions to limit the result set
  2. Refining the filter conditions in the WHERE clause to handle any additional criteria that do not depend on the join, such as the treatment of nulls or a selection of codes or other flags that tailor the result to exactly what the user requires
  3. Finding appropriate indexes to match the search conditions that users rely on most, to improve query performance
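The three tasks above can be sketched in one small example. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the `invoice` and `invoice_line` tables, the `status_code` column, and the code value `'P'` are all hypothetical.

```python
import sqlite3

# In-memory database; the schema and data below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE invoice_line (
        line_id INTEGER PRIMARY KEY,
        invoice_id INTEGER REFERENCES invoice(invoice_id),
        status_code TEXT,
        amount REAL
    );
    -- Task 3: index the columns that the join and the filter search on.
    CREATE INDEX ix_line_invoice_status
        ON invoice_line (invoice_id, status_code);
    INSERT INTO invoice VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice_line VALUES
        (10, 1, 'P', 100.0),
        (11, 1, NULL, 25.0),
        (12, 2, 'X', 40.0),
        (13, 2, 'P', 60.0);
""")

rows = conn.execute("""
    SELECT i.customer, l.line_id, l.amount
    FROM invoice AS i
    JOIN invoice_line AS l
      ON l.invoice_id = i.invoice_id          -- Task 1: join condition
    WHERE l.status_code IS NULL               -- Task 2: explicit null handling
       OR l.status_code IN ('P')              -- Task 2: selection of codes
    ORDER BY l.line_id
""").fetchall()

print(rows)  # [('Acme', 10, 100.0), ('Acme', 11, 25.0), ('Globex', 13, 60.0)]
```

The explicit `IS NULL` branch matters here: a bare `IN ('P')` would silently drop line 11, because a comparison against NULL never evaluates to true.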

The assumption underlying this work is that you know where the data you are looking for resides. You look for either a foreign key column to join to or a subset of values that define the result set. These are normally as plain as the nose on your face, but I had the opposite experience during a recent project. I knew the values I was looking for, but I had no idea where I would find them.

I was working in a new shop with dozens of servers that hosted hundreds of databases, which made for a very steep learning curve. I had to absorb the new enterprise's organization and assumptions, and I got very worked up when I found out that the columns I took to be foreign keys were built on the float data type. My assumption turned out to be only half right: the table where I observed this anomaly was in a staging database, used to prep data for one of the data marts. That explained the data being segregated, but I found no reasonable explanation for why the transactional system the data came from had been designed with float keys.
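Why float keys are alarming can be shown in a few lines. The sketch below uses Python floats, which are IEEE 754 double-precision values, the same representation SQL Server uses for float at its default precision. The key values are made up for illustration.

```python
# Hypothetical key values, chosen to sit at the edge of double precision.
key_a = 9007199254740992      # 2**53
key_b = 9007199254740993      # 2**53 + 1, a different integer key

# Stored as 64-bit floats, the two distinct keys collapse into one value,
# so a join on them would match rows that do not belong together.
collides = float(key_a) == float(key_b)

# Values that pass through arithmetic may no longer compare equal,
# so an equality join can also miss rows that do belong together.
matches = (0.1 + 0.2) == 0.3

print(collides)  # True
print(matches)   # False
```

Both failure modes are silent, which is why exact types such as integers or decimals are the usual choice for key columns.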

At any rate, I was armed with a set of four code values that designated the state of payment for a particular invoice line item:

  • One value designated a line item as a new invoice line item that was paid for on a particular check.
  • Another designated an invoice line as cancelled with no replacement invoice line.
  • A third designated an invoice line as cancelled with a new invoice line.
  • The last value designated an invoice line as originally paid but the subject of a voided and reissued check.

Clear as mud, right?

So what I had was four codes and no idea in which database or table I would find them. Naturally, my first move was to ask the person who assigned me this task where I could find them. He didn't know, and neither did any of the other developers close to me. I did manage to find out that the source of the codes was a third-party application that was in the finance department. However, the users in that department had no idea where or how the codes were maintained. They just knew what the codes meant when they saw them and used their third-party application to assign them to their transactions.

Without enough knowledge or time to learn the answer, I had to come up with a way to locate these codes so that I could use them in my own work. I could have just hardcoded the values, but I have been around long enough to have the "yes, these are all the codes we use" assumption come back to bite me: six months later, at the report phase of a project, I have found out that some vital code or flag had been overlooked, excluded, or introduced. The moral of the story is that if you have the choice, data-driven strategies are best. (Some time I'll tell you my story about the Medicare Incident Number.)
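One way to hunt for known values without knowing where they live is to walk the catalog and probe every column. The sketch below illustrates the idea with Python's sqlite3 module and is not the method developed in this article: the `find_values` helper, the table names, and the code values `NW`, `CX`, `CR`, and `VR` are all hypothetical. In SQL Server the column list would come from a catalog view such as INFORMATION_SCHEMA.COLUMNS instead of SQLite's `sqlite_master` and `PRAGMA table_info`.

```python
import sqlite3

def find_values(conn, wanted):
    """Return (table, column, value, row_count) for each column holding a wanted value."""
    hits = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    placeholders = ", ".join("?" for _ in wanted)
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [r[1] for r in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            query = (
                f'SELECT "{column}", COUNT(*) FROM "{table}" '
                f'WHERE "{column}" IN ({placeholders}) '
                f'GROUP BY "{column}" ORDER BY "{column}"'
            )
            for value, count in conn.execute(query, wanted):
                hits.append((table, column, value, count))
    return hits

# Invented sample data: only payment_line.pay_state holds the codes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE payment_line (line_id INTEGER, pay_state TEXT);
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO payment_line VALUES (1, 'NW'), (2, 'CX'), (3, 'NW'), (4, 'ZZ');
""")

hits = find_values(conn, ["NW", "CX", "CR", "VR"])
print(hits)  # [('payment_line', 'pay_state', 'CX', 1), ('payment_line', 'pay_state', 'NW', 2)]
```

The result also shows which of the expected codes never appear (`CR` and `VR` here) and leaves unexpected ones such as `ZZ` discoverable with a follow-up query on the same column, which is the advantage of a data-driven search over a hardcoded list.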
