Ever wondered how Soundex works?

Have you ever needed the Oracle Soundex function and wondered how it works? It’s actually quite simple.

Soundex returns a character string that holds the phonetic representation of the input string. According to The Art of Computer Programming (by Donald E. Knuth), this representation is defined as follows:

  1. Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, w, y.
  2. Assign numbers to the remaining letters (after the first) as follows:
    b, f, p, v = 1
    c, g, j, k, q, s, x, z = 2
    d, t = 3
    l = 4
    m, n = 5
    r = 6
  3. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w, then omit all but the first.
  4. Return the first four characters, padded on the right with 0 if the result is shorter.
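To make the four steps concrete, here is a minimal Python sketch of them. It is not Oracle's implementation: the function name `soundex` and the table `CODES` are names chosen for this illustration only, and the sketch assumes plain a-z input (any other character is simply ignored).

```python
# Letter-to-digit table from step 2; letters not listed here get no number.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Sketch of the Russell Soundex steps listed above (illustrative only)."""
    # Assumption: only the letters a-z matter; everything else is dropped.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()          # step 1: retain the first letter
    prev = CODES.get(letters[0], "")     # the first letter counts for step 3
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:             # step 3: skip repeats of one number
                result += code
            prev = code
        elif ch not in "hw":
            prev = ""                    # a, e, i, o, u, y separate two equal numbers
        # h and w do not reset prev, so equal numbers around them still merge
    return (result + "000")[:4]          # step 4: first four characters, padded with 0


print(soundex("Lloyd"), soundex("Ladd"))
```

Following these rules, both names come out as `L300`. Keeping track of the previous number while walking through the original name is one way to honour the "adjacent in the original name" wording of step 3 without building the intermediate strings.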

In fact, this specific algorithm is named the Russell Soundex, after Robert Russell and Margaret Odell, who patented it back in 1918 and 1922. There are some improved or more specialised algorithms for the same purpose, like the Reverse Soundex, the Metaphone algorithm and the Daitch-Mokotoff Soundex (for Germanic or Slavic surnames!). All these variations are more complex than the Russell Soundex.

 

An example

Compare Lloyd and Ladd.

 

Step 1:

Lloyd becomes Lld, Ladd becomes Ldd (remove a, e, h, i, o, u, w, y after the first letter)

Step 2:

Lld becomes L43, Ldd becomes L33    (replace letters by numbers)

Step 3:

L43 becomes L3, L33 becomes L3 (remove doubles, including those in the first two letters: the second l of Lloyd has the same number as the retained L, so its 4 is dropped)

Step 4:

Padding with zeros returns L300 for both words; according to the Soundex algorithm, Lloyd and Ladd are equal!
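The walk-through above can be reproduced with a small Python sketch that returns the intermediate result of every step. The helper name `trace` and the table `CODES` are made up for this illustration, and the sketch assumes a non-empty name consisting of plain letters.

```python
# Letter-to-digit table from step 2 of the algorithm.
CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}


def trace(name):
    """Return the results of steps 1 to 4 for one name (hypothetical helper)."""
    word = name.lower()
    # Step 1: keep the first letter, drop a, e, h, i, o, u, w, y elsewhere.
    keep = [0] + [i for i in range(1, len(word)) if word[i] not in "aehiouwy"]
    step1 = "".join(word[i] for i in keep).capitalize()
    # Step 2: replace every remaining letter after the first by its number.
    step2 = step1[0] + "".join(CODES.get(word[i], "") for i in keep[1:])

    def prev_code(i):
        # Number of the nearest earlier letter in the original name, skipping h and w.
        j = i - 1
        while j > 0 and word[j] in "hw":
            j -= 1
        return CODES.get(word[j], "")

    # Step 3: drop a number if the preceding letter in the original had the same one.
    step3 = step1[0] + "".join(CODES.get(word[i], "") for i in keep[1:]
                               if CODES.get(word[i], "") != prev_code(i))
    # Step 4: first four characters, padded with 0.
    step4 = (step3 + "000")[:4]
    return [step1, step2, step3, step4]


print(trace("Lloyd"))  # ['Lld', 'L43', 'L3', 'L300']
print(trace("Ladd"))   # ['Ldd', 'L33', 'L3', 'L300']
```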

 

How does this work for other languages then?

There’s a bit of a problem. The Soundex in Oracle only works for words in the English language. For other languages, you need a variation of the algorithm.

The Dutch Russell Soundex is defined like this:

 

  1. Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, j, y.
  2. Replace the following groups of letters:
    QU to KW
    SCH to SEE
    KS and KX to XX
    KC and CK to KK
    DT and TD to TT
    CH to GG
    SZ to SS
    IJ to YY
  3. Assign numbers to the remaining letters (after the first) as follows:
    b, p = 1
    c, g, s, k, z, q = 2
    d, t = 3
    f, v, w = 4
    l = 5
    m, n = 6
    r = 7
    x = 8
  4. If two or more letters with the same number were adjacent in the original name (before step 1), then omit all but the first.
  5. Return the first four characters, padded on the right with 0 if the result is shorter.
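The Dutch steps can be sketched in Python as well, but the list leaves a few things open, so the code below rests on assumptions of mine. The group replacements are applied before any letters are removed, because groups such as SCH, CH, IJ and QU could never match once h, j and u are gone. Adjacency for the duplicate rule is judged on the name after those replacements, and the number of the first letter counts in that check. The names `dutch_soundex`, `GROUPS` and `DUTCH_CODES` are hypothetical and exist only in this example.

```python
# Group replacements from step 2, applied in the order listed.
GROUPS = [("QU", "KW"), ("SCH", "SEE"), ("KS", "XX"), ("KX", "XX"),
          ("KC", "KK"), ("CK", "KK"), ("DT", "TT"), ("TD", "TT"),
          ("CH", "GG"), ("SZ", "SS"), ("IJ", "YY")]

# Letter-to-digit table from step 3; a, e, h, i, o, u, j, y get no number.
DUTCH_CODES = {}
for letters, digit in (("BP", "1"), ("CGSKZQ", "2"), ("DT", "3"), ("FVW", "4"),
                       ("L", "5"), ("MN", "6"), ("R", "7"), ("X", "8")):
    for ch in letters:
        DUTCH_CODES[ch] = digit


def dutch_soundex(name):
    """Sketch of the Dutch variant under the assumptions stated above."""
    # Assumption: only the letters A-Z matter; everything else is dropped.
    word = "".join(ch for ch in name.upper() if "A" <= ch <= "Z")
    if not word:
        return ""
    # Assumption: replace the letter groups first, so SCH, CH, IJ and QU can still match.
    for old, new in GROUPS:
        word = word.replace(old, new)
    result = word[0]                      # retain the first letter
    prev = DUTCH_CODES.get(word[0], "")   # assumption: it counts for the duplicate rule
    for ch in word[1:]:
        code = DUTCH_CODES.get(ch, "")    # "" means the letter is removed
        if code and code != prev:
            result += code                # new number, not a repeat of its neighbour
        prev = code
    return (result + "000")[:4]           # first four characters, padded with 0


print(dutch_soundex("Smit"), dutch_soundex("Smid"), dutch_soundex("Schmidt"))
```

With this sketch, Smit, Smid and Schmidt all end up as `S630`. A different reading of the step order would give different results, so treat the output as an illustration rather than a reference.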

As you can see, the alternative version differs from the Oracle built-in Soundex in two ways: an extra step is introduced, and different groups of letters are assigned to the same number. This is because of the difference in pronunciation between Dutch and English words, as you might have already guessed.
