Login | Register   
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Sound Off with Soundex

In this article, learn how to use phonetic algorithms such as soundex to make looking up names faster, easier, and more reliable.


advertisement
If you call a restaurant to make reservations, it doesn't really matter how they spell your name. If your name is Shawn Dwayne, it doesn't matter if they write down Shaun Duane, Sean Dwyane, or Shaan Dhuane as long as they can read their handwriting when you show up to eat.

Talking to a customer service rep, however, is very different story. Unless the rep spells your name correctly, the customer database will laugh merrily at attempts to find your record and possibly start a game of 20 questions trying to narrow down the search.

Yet somehow the better organized companies usually find customer records quickly and easily. How do they do that? Some use caller ID to find your record, at least if you're calling from your home phone. Others may employ psychics but there's actually a very simple technique that will let you find records quickly and painlessly most of the time without worrying about exact spelling: phonetic algorithms.



A phonetic algorithm encodes words to represent the way they sound rather than the way they're spelled. If the algorithm does a good job, then words that sound alike have the same encoding. Instead of looking up Sheawn Dhuawyne in the database, the computer looks up the sound of the name. If there are a few names that same sound the same, the computer lets the rep ask the customer a couple questions to make the correct choice. ("Are you the Sean Dwayne on 2652 Elm St?")

Start with Soundex

Soundex is a particular phonetic algorithm originally patented way back in 1918 when computers were treadle-powered. It was later modified and adopted by the US Census Bureau to help study their immense collection of data.

Each name's census soundex encoding consists of a letter and three digits. For example, my name, Stephens, is encoded as S315. Sometimes people add a dash after the letter to get S-315 but that's just decoration. To find a name's census soundex encoding, you can follow these steps:

1. Encode the letters using the table below, discarding vowels, H, W, and Y except in the first position:

2. If a name contains double letters, treat them as one.

3. If two adjacent letters have the same soundex code, treat them as one.

4. If two consonants are separated by a vowel or Y, use both consonants. (I.e. don't use rule 2 or 3 to collapse them.)

5. If two consonants are separated by H or W, treat them as one. (I.e. rule 3 applies.)

6. Replace the first code with the original name's first letter.

Letters Code
B F P V 1
C G J K Q S X Z 2
D T 3
L 4
M N 5
R 6

For my name, you get 23152 (S = 2, T = 3, P = 1, N = 5, S = 2). There are no adjacent duplicates so you replace the initial 2 with S and truncate to four characters to get S315. For another example, consider the name APPLEFAB. When you start encoding letters, you keep the initial A. I encode it as 0 so it has a placeholder. Then rule 1 gives you 011411 (A = 0, P = 1, P = 1, L = 4, F = 1, B = 1). The two Ps are double letters so we drop one. The F and B have the same code so rule 3 says to drop one but they are separated by A so rule 4 overrides that and we keep them both, giving the code 01411. Replacing the initial code 0 with A and truncating to 4 characters gives the final result A141.

You can learn more about the census soundex algorithm and see its official definition at The Soundex Indexing System.

The Soundex example program, which is available for download in C# and Visual Basic versions, uses the following code to implement this algorithm. I actually think the method followed by the code is a bit easier to follow than the official definition.

// Return the census soundex for a name. static public string CensusSoundex(string name) {    // Convert to upper case.    name = name.ToUpper();    // Encode the characters.    string encoded_name = "";    foreach (char ch in name)    {        if ("BFPV".Contains(ch))        {            encoded_name += '1';        }        else if ("CGJKQSXZ".Contains(ch))        {            encoded_name += '2';        }        else if ("DT".Contains(ch))        {            encoded_name += '3';        }        else if (ch=='L')        {            encoded_name += '4';        }        else if ("MN".Contains(ch))        {            encoded_name += '5';        }        else if (ch=='R')        {            encoded_name += '6';        }        else if ("HW".Contains(ch))        {            encoded_name += '-';        }        else        {            // Vowels and Y.            encoded_name += '0';        }    }    // Remove H and W, after the first character.    encoded_name = encoded_name[0] +        encoded_name.Substring(1).Replace("-", "");    // Remove adjacent duplicates.    encoded_name = Regex.Replace(encoded_name, @"(.)\1+", "$1");    // Replace the first code with the    // first letter in the original name.    encoded_name = name[0] +        encoded_name.Substring(1);    // Remove vowels and Y (after the first letter).    // These all have encoding 0.    encoded_name = encoded_name.Replace("0", "");    // Make sure we have exactly 4 characters.    return encoded_name.PadRight(4, '0').Substring(0, 4); }

The code starts by converting the name to upper case. It then loops through the name's characters, building a string that contains the encodings for each character. Instead of removing vowels, H, W, and Y, it encodes H and W as -, and vowels and Y as 0. While these characters must go eventually, they make the code a bit easier for now.

Next the code removes the encodings for H and W after the first character. The only reason I encoded them was to make it easy to leave the encoding if the name starts with H or W.

The code then uses the regular expression class Regex to remove adjacent duplicate codes. If you haven't used regular expressions before, you may want to spend a few minutes studying them. By using regular expressions, you can search for complicated patterns and make replacements within a string or file. For more information, see Microsoft's web page .NET Framework Regular Expressions.

The CensusSoundex function calls the Regex class's static Replace method passing it a pattern to find and a pattern with which to replace it. The pattern to find is (.)\1+. The . matches any single character. The parentheses make the matched character into a group. The sub-expression \1 means to repeat whatever is in group 1, which contains whatever the . matched. Finally the + means to repeat the thing immediately before it (group 1) one or more times. The result is that this expression matches any one character followed by one or more instances of the same character.

The replacement pattern is $1. This means to replace whatever is found with the value of the first group. (Yes, it's a little odd that \1 means the first group while you are matching but $1 means the first group while you are replacing.)

Taken all together, the call to Regex.Replace replaces multiple occurrences of any character with a single occurrence of that character.

Note that at this point the code has removed H and W from the encoding (except possibly as the first character) so rule 5 is satisfied.

Also note that the encoding includes a 0 for any vowel or Y so the replacement does not remove adjacent duplicates separated by a vowel or Y as required by rule 4.

Having compressed duplicates, the code replaces the encoding's first character with the original name's first letter. It removes the encodings for vowels and Y to get rid of them, and finishes by padding the result to four characters if necessary.

Sound Reasoning

Now that you know how to calculate soundex encodings, how do you use them to make locating customer records easier? First add a new Soundex field to your Customers table and store each customer's soundex encoding in that field.

Now when a customer calls and says his name is Balzac, you enter your best guess at the spelling. The program encodes Ballzack (or whatever you enter) to get B422, looks up records with that Soundex value, and displays them in a list. You can then ask the customer for clarification if needed ("Is that Honoré de Balzac?") and select the right one.

Note that Shawn Dwayne, Shaun Duane, Sean Dwyane, and Shaan Dhuane all encode as S535 so the customer service program will find them all. Also Balsac, Balzak, and Ballzack all encode to B422 so the program can find them, too.

The Census web site recommends that you also look at encodings for names that have prefixes such as Van, Di, Con, and so forth both with and without the prefix. For example, DeLorean may have been filed under Dalorean or Lorean. Of course that's for the census database and your data may have other characteristics. Note that some databases including SQL Server and Oracle provide a SOUNDEX function that you can use to put soundex codes in the Customers table and to figure out what value to search for when you enter Ballzack. You still need to use the field properly but at least you don't need to use your own code to calculate the soundex encoding.

Numeric Soundex

Soundex encodings contain a letter and three digits between 1 and 6. While these encodings are easy for a person to read, they would be slightly easier for a computer to process if they were numbers instead of strings.

The following code returns a numeric soundex encoding.

// Return the name converted into a numeric census soundex value. static public int NumericCensusSoundex(string name) {    // Convert into the normal census soundex format.    name = CensusSoundex(name);    // Convert the first letter into a number.    string string_result = ((int)name[0] - (int)'A').ToString();    // Add the rest of the string.    string_result += name.Substring(1);    // Convert the result into an integer.    return int.Parse(string_result); }

The code first calls the CensusSoundex function described previously. It then converts the initial letter to a number where A = 1, B = 2, etc. The code then appends the census soundex encoding's digits and converts the whole thing into am integer. The largest value the code might produce is 25,666, which easily fits in an integer. (It even fits in a short integer.)

Soundex isn't the only phonetic algorithm out there. Others have been devised to handle special cases more effectively. For example, the New York State Identification and Intelligence System Phonetic Code (NYSIIS) converts specific common combinations of letters into phonetic equivalents. Some of the conversions it makes change KN to N, PH to FF, and SCH to SSS when they appear at the beginning of names.

Other algorithms are optimized for particular types of names. For example, Daitch-Mokotoff Soundex is designed to handle Eastern European names and can differentiate between names such as Moskowitz and Moskovitz.

Some programs combine a phonetic algorithm such as soundex with a match rating algorithm to decide which candidate strings make the best matches. The program calculates the encoding for the new string and then uses match rating to decide how close that encoding is to the encodings of other potential matches. For more information on this approach, see Wikipedia's entry Match Rating Approach. (You might also experiment with my previous article Turn a Parsnip into a Turnip with Edit Distance Algorithms to measure the differences between encodings.

Phonetic algorithms won't solve all of your customer lookup problems. If the customer says Smith but you enter Jones, no algorithm in the world will save you.

Addendum

The virtual ink had barely dried on this article before alert reader Thomas Gibson asked, "Is the soundex logic you described the algorithm that the SQL Server SOUNDEX function implements?"

Because the census soundex algorithm is posted on the Web and easy to find, my first thought was that the two were probably the same. It would make sense to stick with a standard algorithm to make SQL Server more compatible with census data and with other database products.

But you never know, so I wrote a test program to compare the two. The program used my code to calculate the census soundex and then used code similar to the following snippet to invoke the SQL Server SOUNDEX function for the same word.

SqlCommand cmd = new SqlCommand(   "SELECT SOUNDEX('ASHCRAFT')", conn); string sqlserver_soundex = (string)cmd.ExecuteScalar();

This code creates a SqlCommand object to execute the query "SELECT SOUNDEX('ASHCRAFT')" on the SQL Server connection conn. It then calls the command's ExecuteScalar method to execute the query. ExecuteScalar returns the query's first column in the first row of the result, but this query only returns one row containing one column so that's all we need.

Instead of executing a hard-coded query, the program actually loops through a list of 225 "random" names and builds queries similar to the one shown in the previous code at run time. It compares the results given by my code and SQL Server, and displays any differences in a DataGridView control.

To my slight surprise, the two algorithms matched on most but not all of the names. For example, for the name Pfister the census soundex algorithm produces P236 but SQL Server returns P123.

After a little experimentation, I found that SQL Server's implementation removes the first letter from the name before encoding it so it doesn't collapse duplicate letters if the first letter has the same encoding as the second letter.

In the name Pfister, P and F both have the encoding 1. The census soundex algorithm encodes the string as 11236, collapses the first two values to get 1236, replaces the first code with P, and finishes with P236.

SQL Server's algorithm removes the initial P, encodes the rest of the name as 1236. It restores the initial P to get P1236, and truncates to four characters to finish with P123.

Having figured out how SQL Server calculates its SOUNDEX function, it was easy to implement a version of it in code. The SqlServerSoundex program, which is available for download and shown in Figure 1, compares the census soundex algorithm, SQL Server's SOUNDEX function, and my implementation of SQL Server's algorithms.

Figure 1. Soundex Vexed: The pink cells show names for which the census soundex and SQL Server SOUNDEX function produce different results.

The program's DataGridView displays a name if the census soundex value is different from SQL Server's value, or if SQL Server's value is different from my version of the SQL Server algorithm. It highlights differing values in pink so they are easy to spot.

You can see in Figure 1 that SQL Server disagrees with the census soundex in 4 of the 225 names I tested. You can also see that my version of the SQL Server algorithm matches SQL Server's implementation.

So which version should you use? If you want to calculate soundex values on the fly instead of storing them in the database, you should use the database's built-in function. For example, to find Pfister's customer record, you could execute a query similar to the following.

SELECT * FROM Customers   WHERE SOUNDEX(LastName) = SOUNDEX('PFISTER')

However, you'll probably get better performance if you pre-calculate soundex values and store them in the database so you can use the following query instead.
    SELECT * FROM Customers   WHERE SoundexName = 'P236'
Not only does the database not need to calculate the SOUNDEX function for every entry it examines, but you can index the SoundexName field for even faster lookup.

And if you're storing soundex codes in the database, it doesn't really matter which algorithm you use.


   
Rod Stephens is a consultant and author who has written more than a dozen books and two hundred magazine articles, mostly about Visual Basic. During his career he has worked on an eclectic assortment of applications for repair dispatch, fuel tax tracking, professional football training, wastewater treatment, geographic mapping, and ticket sales. His VB Helper web site receives more than 7 million hits per month and provides three newsletters and thousands of tips, tricks, and examples for Visual Basic programmers.
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap