Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Creating Your Own Identity (or What Does it Mean to Be Normal?)

A theoretical discussion of the meaning of data normalization and related concepts.


advertisement
n this month's 10-Minute Solution, I'll try to combine some database theory with a few of the practices I've touched upon in earlier Solutions. The older material was about update triggers; this month I'll discuss one use for them.

Normalization is kind of a "least cost method" for determining the (or at least a) "proper" relation for a column. For the purposes of this discussion, I'll say that a relation is a collection of domains. A domain defines a data type and the operations allowed on that type. Minding types and the operations allowed on them prevents me from making nonsensical comparisons, such as "100 degrees is more orange than 100 decibels" or "blue is hotter than fifty pounds."

Parenthetically, you can think of the design decisions you make while normalizing as finding out the relative rates of change between data elements. For instance, order lines change faster than order numbers, a person's phone number changes more often than the person's name, etc. These are expressed as either adding more columns to the relation or adding more rows and creating a new relation. As a rule of thumb—for transactional systems anyway—adding rows is much preferred to adding columns.



Decomposition and Aggregation
While I was in the seminary, I learned that there are two ways to normalize a schema: by decomposition or aggregation.

Decomposition is probably best visualized if you picture everything you know about what you want your system to remember and place that into one giant table. Now look across the rows and see whether there are any determinants (predictors) in that row. By determinant I mean "given x, I also know y," so given Student Number, I also know Student Name. (By the way, there are probably many things I could isolate—hair color, birthday, sex, etc.—based on this discovered predictor.)

Aggregation goes the other way. Instead of starting from a universal table, you start with a list of attributes. By finding out which determinants are active, you clump like elements into something called a data group. Data groups can start out having multiple entry points that "point" to the next element in a relation. As a database designer, it is your mission to resolve them down to one entry point, which acts as the predictor for everything else in the data group. Once all the dependent elements are collected into data groups, go ahead and create tables out of them, with your "predictor" as the primary key.

At the end of both methods, you want to wind up with "dense" tables—that is, no null values if at all possible, although sometimes it is unavoidable. Also, you want to eliminate doubles or duplicates. Given that a primary key uniquely identifies a row, the notion of duplicates is antithetical to normalization.

Rules of Inference
There are six rules of inference that determine the relationship between one, two, or more determinants. They are valuable in framing the dependencies between columns so that you do not lose data while rearranging the columns from the universal table (i.e., nonloss decomposition).

First normal form means that the data is in a relation of columns that contains the same type of data and that there is a way to tell one row from another using a single column as a predictor for the row—namely, the primary key.

Second normal form means that if in that relation there are two columns used as the tiebreaker for the row, then all other elements in the row must depend on the combination of columns. If you find that there is a column that is dependent on only one part of the key, you must remove them (the column and its key) to their own relation.

Third normal form means that there cannot be any cross-talk between columns. In other words, there cannot be any embedded dependencies between columns. If you have a table with City, State, and Zip, you can see that there's an embedded dependency between City and State.

These three forms make up the foundation of transactional systems because they guarantee the accuracy of Insert, Update, and Delete. There are two other forms that bear upon Update/Delete anomalies, but these fourth and fifth normal forms have to do with eliminating cyclic or pairs' dependencies on non-key attributes. To be honest, I've only had to wrestle with these forms when repetition in the columns tipped me off that something was amiss and life would be simpler if I separated them.

What Is a Key?
To paraphrase Chris Date, one of the luminaries of database theory, the purpose of a database is to be able to distinguish one thing from another. In his discussions on the meaning of duplicate rows in a table, Date avers that duplicates make no sense, because duplication makes it impossible to tell one row (actually, one instance of the entity) from another.

For example, let's say you have a table of Employees and its structure is:

FirstName, LastName, Sex

It makes no sense to have two rows in the employee table, like this:

Georgopulos, Andrew, M Georgopulos, Andrew, M

because you'll never be sure which one of the two records you want. If you ever need to correct the spelling of the name, you'll need to correct both of them.

If I broaden the example to include more information—by including another column in the table—you might be able to use that column in combination with the other columns to act as a "tiebreaker" between the apparent duplicates. For example, Birthday might be used with FirstName and LastName. This means you're accepting the assumption that it is very unlikely that two people born on the same day will have the same name. Be aware that unlikely does not mean impossible, however. Social Security Number might be a candidate for our tiebreaker, but Social Security Numbers are not always unique! So one thing is apparent from this investigation: if you want uniqueness, you have to be more rigorous about distinguishing one thing from another.

The idea of a key comes from the idea of determinants. A key is a column or combination of columns whose value does not repeat in a row of the table. When more than one column acts this way, it's called a candidate key. Once you decide which candidate key is going to be the primary key, the other candidate keys get demoted to alternate keys. You've now made your first design decision.

The idea of a key is predictability: if you know x about something then you also know y about it. In other words, you can say that x determines y or that y is determined by x. Other transformations are possible between determinants. I am not going to examine the rest of them in this article, but there is a good treatment of the topic in Joe Celko's SQL for Smarties by Joe Celko (Academic Press, 1995; revised 1999).

Choosing the Right Candidate Key
The choice of one candidate over another more often than not has to do with how well each of the candidate keys verbalizes the business' rules. In cases where one column would identify a row just as well as another (both are guaranteed unique), one will be chosen because it fits the syntax of the business better than the other. For example, if you have a hotel and you can identify a room by its room number as well as you can by its telephone extension, you would probably pick RoomNumber as the key because it fits the context of a hotel application better than TelephoneNumber does. The point is that there is more than one "correct" design possible, but one of them probably describes the business more naturally than the others.

There probably is a combination of columns that represents facts about the entity that could uniquely identify an instance of it. If you included all the columns in the example, for instance, you could unambiguously distinguish one employee from another because it is impossible for two people to share the same first and last names, birth date, and Social Security Number. With that combination of business facts, you can be guaranteed that when you look at a row, you are looking at the one and only thing (employee) described by that row.

Good as this sounds, creating unique identifiers out of business facts is fraught with danger. The danger lies in the nature of information, because it changes. Birthdays don't change, but a person could have more than one Social Security Number issued or could change his or her name—or there could be a data-entry error.

Basing the identification of a unique instance of an entity on the columns that describe that instance is asking for trouble, because they can change. What is required, then, is a meaningless placeholder, preferably an integer that can be used as an abbreviation that stands for the whole row. (I say "preferably an integer" because integers don't need to be translated from a character set for the database, so there is less overhead in interpreting an integer key rather than a character key.)

These placeholders are called surrogate keys because they "stand for" the combination of columns that uniquely identifies the row. Note that this puts the weight of integrity checking on the application designer because it is possible to record a duplicate row in the table and have it differ only by its surrogate key—putting us right back in the "duplicate" boat.

Using Built-in Functions to Generate an ID
You can use the Identity type (or SELECT @@IDENTITY) in Microsoft SQL Server to generate a new ID for you. The database has the presence of mind to use integers sequentially, if it can, and discards "used" IDs of transactions that failed to commit. This unnerves some people, as it leads to gaps in the sequence of generated IDs. But if you think about it the serialization of the keys really shouldn't matter; only their uniqueness matters. Normal database maintenance procedures execute routines that correct gaps in the identity sequence (dbcc checkident). There are other issues with the identity property that require special handling, especially where BCP is involved, but that's a topic for another article. (BCP stands for Bulk Copy Program, a utility in SQL Server that converts tabular data into text files and alternatively converts text files into database tables.)

Building Your Own IDs
It is possible to create your own identity crank. This technique has the drawback of serializing updates and creating hot spots in the identity table, because the next ID needs to be written and retrieved from a table you create. However it has the advantage of being almost maintenance-free.

Let's take a look at the column definitions from a PUBS database and look for a table with an INT primary key. You can use the following statement to display the table names, the column names for each table, and the data type for each column:

select tab.name, col.name, typ.name from sysobjects tab, syscolumns col, systypes typ where tab.type = 'U' and tab.id = col.id and col.usertype = typ.usertype

Now that you know what data types are being used, you can use the JOBS table as an illustration:

select publishers.pub_name, jobs.job_desc, jobs.job_id, employee.fname, employee.lname, from jobs, publishers, employee where employee.pub_id = publishers.pub_id and employee.job_id = jobs.job_id

Note that jobs.job_id has an INT data type for its primary key. Because it's the only one in PUBS with an INT, I'll use it for this illustration. In order to capitalize on the INT data type that JOBS uses for its primary key, you need to create a table that has two fields, one an INT data type to track the next ID to use for the table and the other column to hold the table name. You can use the following declaration to create the new ID table:

create table id (newid int, tablename varchar(32) )

To set up this experiment, you need to populate the ID table with the name of a table, and the next ID to use as a primary key in that table. Let's find out the highest value used for the job_id by querying the job_id column in the JOBS table:

select max(job_id) from jobs

I got 14 for this query, which I'll increment by one because that is the next value to use for a new entry into the JOBS table. Now you need to update the ID table with the table name, and the next available ID for that table, so you can say:

insert into id values (15, "JOBS")

That sets you up to use the ID table with the JOBS table to generate a new ID for the next insert into the JOBS table:

CREATE PROC GetNewId AS Declare @tablename varchar(32), @NextKey int BEGIN TRAN Update ID SET NewID = NewID + 1 Where Tablename = @Tablename select @NextKey = NewID from ID where tablename = @tablename COMMIT TRAN

Now you can insert any new line into JOBS with:

GetNewID "JOBS"

Summary
In everyday practice, it is desirable to have a "next ID" generator. SQL Server provides such a facility, but you must be scrupulous in applying database maintenance to get any benefit from it. The "next ID" generator, which I call the identity property, has a few unsettling idiosyncrasies that can be circumvented. This month's Solution showed you how to create your own identity while removing the necessity of one more DBCC check from you.



   
Drew Georgopulos is the Database Architect for GoAmerica Communications. He moonlights as an Adjunct Professor of Computer Science at Marymount College of Fordham University in Tarrytown, NY teaching systems analysis and relational database design. He holds a certification in systems analysis and relational design from Columbia University, has presented at SQL Connections on SQLDMO, and is a member of the Worldwide Institute of Software Architects.
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap