Soundex is a hashing system for English words. From an English word, you generate a letter and three numbers that roughly describe how the word sounds. Similar sounding words will have similar codes. It might be used, for example, by 411 (phone information) to look up other spellings of a last name. It was used by the United States Census Bureau to find similar names in census records. Soundex was created by Robert C. Russell of Pittsburgh, Pennsylvania. He received U.S. patent 1,261,167 for it on April 2, 1918. The U.S. Patent and Trademark Office has the original Soundex patent (1,261,167) online. You might be interested in a history of the various versions of the Soundex coding system.
You can see soundex in action here.
A bit of warning about Soundex codes: although in theory they should always be the same for a given name, in practice they sometimes vary. There are a number of reasons. Sometimes implementations of the algorithm have bugs that only become apparent in a small number of cases. (I've seen a number of implementations with bugs.) Sometimes last names are entered into the system incorrectly (various computer systems think my last name is "Smet", "De", or "Desmet", which map to S530, D000, and D253 respectively). In addition, the Soundex system is really English oriented. There is no support for characters beyond the 26 letters used in the English language. As a result, names with unusual letters (like æ, ø, or Ð) are sometimes encoded in different ways by different people and programs.
Are you considering using Soundex for anything important? You might want to think again. Soundex is actually a pretty poor algorithm for doing fuzzy name comparisons. The specification has always been a bit fuzzy, so a single name might have different encodings depending on who did it. You might want to look at "Considering a Soundex-based Solution for an Important Application?" with its "10 major problems with Soundex and other key-based name match solutions." You might also want to look at "Cracking the Soundex Code", which lists some of the problems with using Soundex to search for genealogy records.
The first character of the code is simply the first letter of the word. The remaining numbers range from 1 to 6, indicating different categories of sounds created by the consonants following the first letter. If the word is too short to generate 3 numbers, 0 is added as needed. If the generated code is longer than 3 numbers, the extra numbers are thrown away.
Code | Letters | Description |
---|---|---|
1 | B, F, P, V | Labial |
2 | C, G, J, K, Q, S, X, Z | Gutturals and sibilants |
3 | D, T | Dental |
4 | L | Long liquid |
5 | M, N | Nasal |
6 | R | Short liquid |
SKIP | A, E, H, I, O, U, W, Y | Vowels (and H, W, and Y) are skipped |
There are several special cases when calculating a Soundex code, as the following examples show:
Word | Soundex |
---|---|
Washington | W252 |
Wu | W000 |
DeSmet | D253 |
Gutierrez | G362 |
Pfister | P236 |
Jackson | J250 |
Tymczak | T522 |
Ashcraft | A261 |
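
The letter table and the examples above can be tied together in a short Python sketch. This is an illustration only, not the program mentioned on this page, and the function name `soundex` is made up for the example. It assumes the commonly described handling of the special cases: letters with the same number that sit next to each other count once (including a letter that shares its number with the first letter), H and W do not break such a run, and vowels do. Dropping anything outside A to Z is also an assumption, since Soundex has no rule for such characters.

```python
# Map each consonant to its Soundex digit, following the table above.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return a 4-character Soundex code, or "" if the word has no A-Z letters."""
    # Assumption: characters outside A-Z (spaces, accents, æ, ø, ...) are dropped.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # the first letter's own digit can suppress the next letter
    for c in letters[1:]:
        if c in "HW":
            continue  # assumed rule: H and W do not separate letters with the same digit
        code = CODES.get(c, "")  # vowels and Y map to ""
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated digit after it counts again

    # Pad with zeros if too short, cut off anything beyond three digits.
    return (first + "".join(digits) + "000")[:4]


for name in ["Washington", "Wu", "DeSmet", "Gutierrez",
             "Pfister", "Jackson", "Tymczak", "Ashcraft"]:
    print(name, soundex(name))
```

Under these assumptions the sketch reproduces every code in the table above, as well as S530, D000, and D253 for "Smet", "De", and "Desmet". An implementation that treats H and W like vowels would produce A226 for Ashcraft instead of A261, which is one example of how encodings can differ between programs.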
By taking a Soundex code and filling in common letters, you can take a guess at the sound of the word. By comparing the code to a list of words with known Soundex codes, you can guess at common words. My program does both of these.
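The second idea, matching against a list of known words, might look roughly like the Python sketch below. It is not the program described here. The names `build_index` and `lookup` and the sample word list are hypothetical, and the compact encoder is included only so the snippet runs on its own; it assumes that H and W do not separate letters with the same number.

```python
from collections import defaultdict

# Letter-to-digit table: group 1 is BFPV, group 2 is CGJKQSXZ, and so on.
_CODES = {letter: str(digit)
          for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
          for letter in group}


def soundex(word):
    """Compact encoder (assumption: H and W do not break a run of equal digits)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result, prev = letters[0], _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def build_index(words):
    """Group known words by their Soundex code."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


def lookup(index, query):
    """Return the known words that share a Soundex code with the query."""
    return index.get(soundex(query), [])


# Hypothetical sample data, chosen only for illustration.
known = ["Smith", "Smyth", "Schmidt", "Jackson", "Jaxon", "Washington"]
index = build_index(known)

print(lookup(index, "Smythe"))   # ['Smith', 'Smyth', 'Schmidt']
print(lookup(index, "Jacksen"))  # ['Jackson', 'Jaxon']
print(lookup(index, "Wu"))       # []
```

Because many different words share a code, a lookup like this returns candidates rather than a single answer, and any spelling whose code differs even slightly is missed entirely.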
Additional details on the Soundex system came from "The Soundex Machine" at the National Archives and Records Administration.