Ever wondered how Soundex works?
Did you ever need the Oracle Soundex function and wondered how it works? It’s actually quite simple.
Soundex returns a character string containing the phonetic representation of the input string. According to The Art of Computer Programming (by Donald E. Knuth), this representation is defined as follows:
- Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, w, y.
- Assign numbers to the remaining letters (after the first) as follows:
  b, f, p, v = 1
  c, g, j, k, q, s, x, z = 2
  d, t = 3
  l = 4
  m, n = 5
  r = 6
- If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w, then omit all but the first.
- Return the first four bytes padded with 0.
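To make the steps concrete, here is a minimal Python sketch of the rules listed above. It is not Oracle's implementation; the names `CODES` and `soundex` are made up for this illustration, and the sketch assumes plain ASCII input, skipping any other character.

```python
# Letter-to-digit table from the list above (names are illustrative only).
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Sketch of the Russell Soundex steps described above."""
    # Assumption: only the ASCII letters a-z take part; the rest is skipped.
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()            # retain the first letter
    prev = CODES.get(letters[0], "")       # its code still counts for doubles
    for ch in letters[1:]:
        if ch in "hw":
            continue                       # h and w do not separate equal codes
        code = CODES.get(ch, "")           # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code                        # a vowel resets the run of doubles
    return (result + "000")[:4]            # first four characters, padded with 0


print(soundex("Robert"))    # R163
print(soundex("Ashcraft"))  # A261
```

Instead of literally deleting the vowels first, the loop walks over the original name and remembers the previous code, which is one way to honour the "adjacent in the original name" wording of the third step.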
In fact, this specific algorithm is named the Russell Soundex, after Robert Russell and Margaret Odell, who patented it back in 1918 and 1922. There are some improved or more specialised algorithms for the same purpose, like the Reverse Soundex, the Metaphone algorithm and the Daitch-Mokotoff Soundex (for Germanic or Slavic surnames!). All these variations are more complex than the Russell Soundex.
An example
Compare Lloyd and Ladd.
Step 1:
Lloyd becomes Lld, Ladd becomes Ldd (remove a,e,h, …)
Step 2:
Lld becomes L43, Ldd becomes L33 (replace letters by numbers)
Step 3:
L43 becomes L3, L33 becomes L3. (remove doubles, including those in the first two letters)
Step 4:
Returns L300 for both words; according to the Soundex algorithm, Lloyd and Ladd are equal!
How does this work for other languages then?
There’s a bit of a problem. The Soundex in Oracle only works for words in the English language. For other languages, you need a variation of the algorithm.
The Dutch Russell Soundex is defined like this:
- Retain the first letter of the string and remove all other occurrences of the letters a, e, h, i, o, u, j, y
- Replace the following groups of letters:
  QU to KW
  SCH to SEE
  KS and KX to XX
  KC and CK to KK
  DT and TD to TT
  CH to GG
  SZ to SS
  IJ to YY
- Assign numbers to the remaining letters (after the first) as follows:
  b, p = 1
  c, g, s, k, z, q = 2
  d, t = 3
  f, v, w = 4
  l = 5
  m, n = 6
  r = 7
  x = 8
- If two or more letters with the same number were adjacent in the original name (before step 1), then omit all but the first.
- Return the first four bytes padded with 0.
As you can see, the alternative version differs from the Oracle built-in Soundex in two ways: it introduces an extra step, and it assigns different groups of letters to the same number. This is because of the difference in pronunciation between Dutch and English words, as you might have already guessed.
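The Dutch steps can be sketched in Python as well. Treat this as one possible reading rather than a reference implementation, because the description leaves a few things open. The sketch assumes that the group replacements run first, in the listed order and over the whole name (otherwise QU and IJ would already have lost a letter), that the retained first letter is the first letter after those replacements, and that any removed letter, h included, separates doubles. The names `DUTCH_GROUPS`, `DUTCH_CODES` and `dutch_soundex` are made up for this example.

```python
# Group replacements, in the order they are listed above (illustrative names).
DUTCH_GROUPS = [
    ("QU", "KW"), ("SCH", "SEE"), ("KS", "XX"), ("KX", "XX"),
    ("KC", "KK"), ("CK", "KK"), ("DT", "TT"), ("TD", "TT"),
    ("CH", "GG"), ("SZ", "SS"), ("IJ", "YY"),
]

# Letter-to-digit table of the Dutch variant.
DUTCH_CODES = {}
for letters, digit in (
    ("BP", "1"), ("CGSKZQ", "2"), ("DT", "3"), ("FVW", "4"),
    ("L", "5"), ("MN", "6"), ("R", "7"), ("X", "8"),
):
    for ch in letters:
        DUTCH_CODES[ch] = digit


def dutch_soundex(name: str) -> str:
    """Sketch of the Dutch variant under the assumptions stated above."""
    # Assumption: only the ASCII letters A-Z take part; the rest is skipped.
    text = "".join(c for c in name.upper() if "A" <= c <= "Z")
    # Assumption: replacements are applied before any letter is removed.
    for old, new in DUTCH_GROUPS:
        text = text.replace(old, new)
    if not text:
        return ""
    result = text[0]                          # retain the first letter
    prev = DUTCH_CODES.get(text[0], "")
    for ch in text[1:]:
        code = DUTCH_CODES.get(ch, "")        # a, e, h, i, o, u, j, y: no code
        if code and code != prev:
            result += code
        prev = code                           # an uncoded letter resets the run
    return (result + "000")[:4]               # first four characters, padded


print(dutch_soundex("Jansen"), dutch_soundex("Janssen"))    # J626 J626
print(dutch_soundex("Dijkstra"), dutch_soundex("Dykstra"))  # D837 D837
```

Under this reading, spelling variants such as Jansen/Janssen and Dijkstra/Dykstra end up with the same code, which is the point of the extra replacement step.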
This is a great explanation. I can see this function having a problem with words beginning with W. This is just the first thing that came to mind. I tested ‘which’ and ‘witch’ and received the results W200 and W320 respectively.