Ever wondered how Soundex works?

10

Did you ever need the Oracle Soundex function and wondered how it works? It’s actually quite simple.

Soundex returns a character string which represents the phonetic representation of the inputstring. This representation is, according to the The Art of Computer Programming (by Donald E. Knuth) defined as follows:....

  1. Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, w, y.

  2. Assign numbers to the remaining letters (after the first) as follows:

    b, f, p, v = 1<br />c, g, j, k, q, s, x, z = 2<br />d, t = 3<br />l = 4<br />m, n = 5<br />r = 6<br />
  3. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w, then omit all but the first.

  4. Return the first four bytes padded with 0.

In fact, this specific algorithm is named the Russell Soundex, after Robert Russell and Margaret Odell who patented it back in 1918 and 1922. There are some improved or specific algorithms for the same purpose, like the Reverse Soundex, the Metaphone algorithm and the Daitch-Mokotoff Soundex (for Germanic or Slavic surnames!). All these variations are more complex then the Russell Soundex.

An example

Compare Lloyd and Ladd.

 

Step 1:

Lloyd becomes Lld, Ladd becomes Ldd (remove a,e,h, …)

Step 2:

Lld becomes L43, Ldd becomes L33    (replace letters by numbers)

Step 3:

L43 becomes L3, L33 becomes L3.   (remove doubles, including those in the first two letters)

Step 4:

Returns L300 for both words; according to the Soundex algorithm, Lloyd en Ladd are equal!

 

How does this work for other languages then?

There’s a bit of a problem. The Soundex in Oracle only works for words in the English language. For other languages, you need a variation of the algorithm.

The Dutch Russell Soundex is defined like this:

 

  1. Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, j, y
  2. Replace the following groups of letters:

    QU to KW

    SCH to SEE

    KS and KX to XX

    KC and CK to KK

    DT and TD to TT

    CH to GG

    SZ to SS

    IJ to YY

  3. Assign numbers to the remaining letters (after the first) as follows:
  4. b, p = 1
    c, g, s, k, z, q = 2
    d, t = 3
    f, v, w = 4
    l = 5
    m, n = 6
    r = 7
    x = 8

  5. If two or more letters with the same number were adjacent in the
    original name (before step 1), then omit all but the first.
  6. Return the first four bytes padded with 0.

As you see, the difference with the Oracle built-in Soundex en the alternative version is, that an extra step is introduced, and different groups of letters are assigned to the same number. This is because of the difference in pronounciation of Dutch and English words, as you might have already guessed.

Share.

About Author

10 Comments

  1. This is a great explanation. I can see this function having a problem with words beginning with W. This is just the first thing that came to mind. I tested ‘witch’ and ‘which’ and received the results W200 and W320 respectfully.

  2. With numbers, in particular, confusion is relatively unlikely – no need for a system like SOUNDEX. Note, however, that U.S. aviators often had to communicate alpha-numeric data over very noisy radio-links, leading them to develop the famous “Alpha Charlie Foxtrot” style of phonetic alphabet. There, they did find one ambiguity with spoken digits: 5 -vs- 9, leading to using “Five” and “Niner” (the latter’s two syllables unambiguously distinguishing it).

    Thus, if your Zip-Code comparison “SOUNDEX” is more-likely concerned with the spoken numerals than with the actual numeric data, “Five” and “Nine” are your only likely points of verbal confusion.

    On the other hand, if the questionable data was mostly transcribed from hand-written information, then you have several common points of confusion: 4 -vs- 9, 5 -vs- 6, and 7 -vs- 1 (with a too-large “flag” at the top).

    If you are going to do a “Fuzzy” check for alphanumeric code-sequences (serial “numbers”), then it gets even worse: -vs- , -vs- , -vs- -vs- -vs- , -vs- -vs- , -vs- -vs- , -vs- -vs- -vs- , etc. Good luck on that, and please publish your results if you do implement a “fuzzy writing” match-algorithm!

  3. Many of the SOUNDEX limitations stem from its origins as a pre-computerization filing mechanism. When I worked in the Title industry, we had to code these names by hand, using a little lookup-booklet (which was actually based on precaculated digrams and trigrams). Still, a laborious process for a “quick lookup” activity.

    In “modern” computerized usage, we are probably far overreaching the original intent. I suspect that it was originally designed to be used on a single surname (which may seem like multiple words, like Tate-Abury, or Fan Lo), but not generally to be used for general text or phrases (which inherently include punctuation). Thus, punctuation characters (O’Dell) and “inconsequential” spaces (Fan Lo) would be ignored. Similarly, numeric data would only appear in “sounded out” form: “John Asset III” = “John Asset 3″ = “John Asset the Third”. I believe (with nothing to back me up) that this latter would encode “Asset Third” rather than “Asset 3″, and would disregard the punctuation (space), resulting in either A233 or A236, depending on whether the “disregarded” space were seen as initially breaking up the “run:” of ‘T’ characters. I think it should, in fact, be A233 (as does Oracle 10g).

  4. I do not know the way Oracle SOUNDEX handles numbers. I’ve once written the algorithm in PL/SQL with the Dutch rules, and there numbers would have been left as is. My guess is that Oracle SOUNDEX either removes numbers or leaves them as is. In the first case you will not be able to use SOUNDEX as comparison function and you have to write some additional code. In the second case it’ll be quite useable. A simple test gives much answers:
    SQL> select soundex( ‘AB1234′) , soundex( ‘AB2234′ ) from dual;
    SOUN SOUN
    —- —-
    A100 A100
    Elapsed: 00:00:00.2
    SQL>
    I assume this is not what you want, so you’ll need to do additional coding to compare numerics (which is in fact, really really easy).

  5. This is great for understanding but I have a client that is trying to pass me numbers to be used by SOUNDEX (zip codes) to compare against a master list – I know that everything is translated into a number currently – does SOUNDEX support numerics? Does it just read it in as is?

  6. Even in English, Soundex is not that good for a lot of applications – I really don’t like the idea that all initial letters have distinct sounds. Metaphone too has its flaws
    There are a lot of more sophisticated algorithms in the field of computational lingusitcs and speech processing that give far better results.
    We worked on a project to screen adverts submitted through the internet to a publisher met legal standards. Here we had to implement a robust algorithm to detect the use of sound-alikes to get around the law (we also had to look out for numbers in words (’3′ for ‘E’, ’8′ for ‘ATE’) so we had little option to write our own.)

  7. Just found one more inconsistency:
    After step 1 all ‘U’ are removed, so the rule QU => KW in step2 doesn’t really make sense, right?
    Together with the previous remark about SCH–>SEE it starts to feel like step 2 could(should) be applied before step 1?

    I also think there are some possible optimisations that might be applied:
    - DT and TD already result to 33, no need to translate to TT first?
    - same for KC, CK == KK == 22

    But on the other hand, I get the feeling the KZ sequence should get the same treatment as KS (i.e. replace to XX)

    And one more question: I persume the translation rules of step 2 need to be applied recursively?
    Given the possibility of sequences like CKX, I persume those should end up being coded like XXX, rather then like KKX.

  8. Hi Patrick,

    Thx for posting this, much appreciated since indeed the classic (english) Russel Soundex isn’t always playing nice with Dutch names.

    I’m looking into implementing this Dutch version for use outside Oracle as well (so not in PL/SQL, but rather in Java) and I’m somewhat puzzled by the description of step 2:

    [1] replace SCH –> SEE
    question: what should happen with these introduced ‘e’ s later on in the following steps? If the ‘e’ s need to be dropped (as in step 1) then doesn’t it make more sense just to replace SCH –> S?

    [2] similar question about replacing to doubles like XX, KK, … etc
    since step4 will remove the doubles anyway, doesn’t it make more sense never to introduce them?

    Outside the scope of this pure algorithm, but maybe you happen to know:
    A lot of dutch/flemish names are prefixed with VAN, DE or VANDE (with or without spaces) Is it common practice just to drop those before calculating the soundex?

    I was also wondering where you found these dutch language rules?
    Point being: I’ld like to try pushing my Java implementation up as an addition of the jakarta-commons-codec (http://jakarta.apache.org/commons/codec/) package but would need to make sure that there are no license or patent issues preventing that.

    -marc=

  9. Patrick Sinke on

    No, the Soundex function in Oracle only supports English words.
    Knowing the described algorithm, it’s quite easy to write some
    PL/SQL that does a Soundex comparison for Dutch language though. It’s 40 or 50 lines of code, no more.

  10. Interesting to know that something that looks so complex in fact is simple and understandable. Does Oracle support the Dutch Soundex Rules or is it easy to create your own Dutch Soundex function within Oracle?