Home-grown Data Mining in SQL Server

As a database programmer, when you don't know where the tables and columns you need to do your job are, you're dead in the water. You can browse through Enterprise Manager looking for tables whose columns are likely candidates, but columns often are tersely named in a less-than-intuitive manner, making the search for columns (either join columns for foreign keys or code columns for descriptive data) difficult and error prone.

The flipside of this scenario is an even more enigmatic predicament. You may know which values are required for an application or a report, but not know from where they originate. Schema information can provide some clues. You can acquire it either through the information schema views or, for the diehards among us, by joining sysobjects to whatever it is you are trying to find. However, if you're searching for data, not structural information, you won't find much help out there.
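Both routes expose the same catalog information. As a quick illustration (the table name `Invoice` here is only a stand-in, not a table from the article's database), the following two queries list a table's columns:

```sql
-- Portable route: the information schema views
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name = 'Invoice'

-- Diehard route: joining the system tables directly
-- (xtype = 'U' restricts the result to user tables)
SELECT o.name AS table_name, c.name AS column_name
FROM sysobjects o
JOIN syscolumns c ON c.id = o.id
WHERE o.name = 'Invoice' AND o.xtype = 'U'
```

The information schema views are the ANSI-standard, forward-compatible choice, which is why the solution below builds on them.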

This 10-Minute Solution demonstrates a solution that uses information schema views to guide you through a database with which you are unfamiliar. You can specify the data values you need, and this solution will extract the table and column names that house those values.

How do I find data values in an unfamiliar database when I don’t know their location?

Use information schema views to navigate through the database and extract the table and column names that house your values.

Assignment: Find a Needle in a Haystack
As a database programmer, I spend a lot of time behind the Select statement, writing queries for reports and working them into stored procedures for processing. Ninety percent of that effort goes into the following tasks:

  1. Designing the appropriate join conditions to limit the result set
  2. Refining the filter conditions in the where clause to take care of any additional criteria that are not dependent on the join, but have to do with either treatment of nulls or a selection of codes or other flags that tailor the result to exactly what the user requires
  3. Finding appropriate indexes to match the search conditions that users use the most to improve query performance

The assumption underlying this work is that you know where the data you are looking for resides. You look for either a foreign key column to join to or a subset of values that define the result set. These are normally as plain as the nose on your face, but I had the opposite experience during a recent project. I knew the values I was looking for, but I had no idea where I would find them.

I was working in a new shop with dozens of servers that had hundreds of databases, which made for a very steep learning curve. I had to absorb and reflect the new enterprise's organization and assumptions, and I got very worked up when I found out that keys I assumed to be foreign key columns were constructed on the float data type. My assumption turned out to be only half right; the table where I observed this anomaly lived in a staging database, used to prep data for one of the data marts. That explained the data being segregated, but I found no reasonable explanation for the transactional system where the data came from being designed with float keys.

At any rate, I was armed with a set of four code values that designated the state of payment for a particular invoice line item:

  • One value designated a line item as a new invoice line item that was paid for on a particular check.
  • Another designated an invoice line as cancelled with no replacement invoice line.
  • A third designated an invoice line as cancelled with a new invoice line.
  • The last value designated an invoice line as originally paid but the subject of a voided and reissued check.

Clear as mud, right?

So what I had was four codes and no idea in which database or table I would find them. Naturally, my first move was to ask the person who assigned me this task where I could find them. He didn’t know, and neither did any of the other developers close to me. I did manage to find out that the source of the codes was a third-party application that was in the finance department. However, the users in that department had no idea where or how the codes were maintained. They just knew what the codes meant when they saw them and used their third-party application to assign them to their transactions.

Without enough knowledge or time to learn the answer, I had to come up with a way to locate these codes so that I could use them in my own work. I also could have just hardcoded these values, but I have been around long enough to have the "yes, these are all the codes we use" assumption come back to bite me. Six months later, at the report phase of a project, I would find out that some vital code or flag had been overlooked, excluded, or introduced. The moral of the story is if you have the choice, data-driven strategies are best. (Ask me some time and I'll tell you my story about the Medicare Incident Number.)

Immediate Solution vs. Better Judgment
The first thing that sprang to mind was trying to use the undocumented system stored procedures, sp_MSForEachDB and sp_MSForEachTable, to query each table in each database on a particular server. I’m not sure whether this approach would have worked well. While a query both for metadata (domains, data types, etc.) and for the data in a table is possible, I couldn’t figure out how to limit the return values by using the TOP modifier in the select statement. TOP would have prevented the query from returning potentially millions of rows. After all, I knew what I was looking for but didn’t know how many records my statement could potentially return, so I needed some way to limit the result. I ended up abandoning that approach, but clearly I could use a fragment of the idea. I could use the information schema views to restrict my investigation at least by data type.
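For context, here is a sketch of that abandoned approach. sp_MSForEachTable is an undocumented procedure, so its exact behavior is version-dependent; it substitutes the ? placeholder with each table's name and runs the supplied command once per table:

```sql
-- Sketch of the abandoned approach: run one probe query per table.
-- Undocumented procedure; the ? token stands in for each table name.
-- There is no built-in way to filter the probed columns by data type
-- here, which is what made the approach awkward for this search.
EXEC sp_MSForEachTable
    'SELECT TOP 10 ''?'' AS table_name, * FROM ?'
```

Even with TOP in the command string, every column of every table gets returned, so the information schema views remain the better tool for narrowing the search.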

I crossed over to "the dark side" when, instead of taking the time to design a set-based solution for the problem, I considered applying nested cursors to manage the search. Nested cursors matched my original undocumented system stored procedure solution. I let the logical equivalent of the solution override my better judgment. As it turned out, a set-based solution wasn't all that critical, because the cursor solution was only slightly slower, and I needed the answer as soon as I could get it.

My strategy was to create an outer loop that would supply the table name, and then open an inner loop that would supply the column name. The inner loop would be restricted by an IN clause and a data type qualifier. I planned to build a string out of the two loops and then execute the string. If necessary, I could execute the string into a table so the results could be reviewed after the run, but in the solution’s first incarnation I was happy just to learn the names of the database, table, and column that contained my four values. I could get away with this because I was looking for character data. I didn’t know if it was char, nchar, varchar, or nvarchar, but at least I could eliminate the other twenty or so datatypes available to SQL Server.

Code for the Solution
I used the following code to query tables in a database for values I knew, but whose location I didn’t know:

DECLARE @tablename varchar(50)
DECLARE @columnname varchar(50)
DECLARE @datatype varchar(50)
DECLARE @strsql nvarchar(500)

SET NOCOUNT ON

-- Outer cursor: one row per table in the current database
DECLARE t1 CURSOR FOR
    SELECT table_name FROM information_schema.tables
OPEN t1
FETCH NEXT FROM t1 INTO @tablename
WHILE @@fetch_status = 0
BEGIN
    -- Inner cursor: the columns of the current table
    DECLARE c1 CURSOR FOR
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_name = @tablename
    OPEN c1
    FETCH NEXT FROM c1 INTO @columnname, @datatype
    WHILE @@fetch_status = 0
    BEGIN
        -- Probe only character columns for the known code values
        IF @datatype = 'VARCHAR' OR @datatype = 'NVARCHAR'
            OR @datatype = 'NCHAR' OR @datatype = 'CHAR'
        BEGIN
            SET @strsql = N'SELECT TOP 10 '
                + @tablename + N'.' + @columnname
                + N' FROM ' + @tablename
                + N' WHERE ' + @columnname
                + N' IN (''AC'',''VR'',''AB'')'
            EXEC sp_executesql @strsql
        END
        FETCH NEXT FROM c1 INTO @columnname, @datatype
    END
    CLOSE c1
    DEALLOCATE c1
    FETCH NEXT FROM t1 INTO @tablename
END
CLOSE t1
DEALLOCATE t1

SET NOCOUNT OFF

I began by creating placeholder variables to store the values fetched from the outer (table name) cursor. These feed the inner cursor that queries the table name's columns (in the line WHERE table_name = @tablename).

The code then tests for the datatype, so I don't bother asking about columns that are money or datetime, for example. The other nice thing is using sp_executesql to run the string created from the variables collected in the cursor. I included the TOP modifier in hopes of getting the best execution plan, rather than have an already inefficient process wait around for table scans. TOP is a shot in the arm that can reduce execution time:

SET @strsql = N'SELECT TOP 10 '
    + @tablename + N'.' + @columnname
    + N' FROM ' + @tablename
    + N' WHERE ' + @columnname
    + N' IN (''AC'',''VR'',''AB'')'
EXEC sp_executesql @strsql

Found: Four Needles in a Haystack
This idea could also be expressed as a set-based solution by wrapping a temporary table insert in a WHILE loop, a technique that predates cursors. Since that route requires a temporary table, I chose the cursor route to find my needle in a haystack. I hope you can find some use for it too.
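For completeness, the set-based alternative might be sketched as follows. This is my own reconstruction of the technique the article alludes to, not code from the original project: candidate character columns are collected into a temporary table, and a WHILE loop walks the rows one at a time in place of the nested cursors.

```sql
-- Sketch of the WHILE-loop alternative to the nested cursors.
-- Assumes the same three-code IN list used in the cursor version.
DECLARE @tablename sysname, @columnname sysname
DECLARE @strsql nvarchar(500)

-- Collect every character-typed column in one set-based pass
SELECT table_name, column_name
INTO #candidates
FROM information_schema.columns
WHERE data_type IN ('char', 'nchar', 'varchar', 'nvarchar')

-- Probe the candidates one at a time, deleting each as it is processed
WHILE EXISTS (SELECT * FROM #candidates)
BEGIN
    SELECT TOP 1 @tablename = table_name, @columnname = column_name
    FROM #candidates

    SET @strsql = N'SELECT TOP 10 '
        + @tablename + N'.' + @columnname
        + N' FROM ' + @tablename
        + N' WHERE ' + @columnname
        + N' IN (''AC'',''VR'',''AB'')'
    EXEC sp_executesql @strsql

    DELETE FROM #candidates
    WHERE table_name = @tablename AND column_name = @columnname
END

DROP TABLE #candidates
```

The probe itself is still row-by-row; the gain is that the candidate list is built in a single set-based statement and requires no cursor bookkeeping.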

