Comments on: Ever wondered how Soundex works? Friends of Oracle and Java Thu, 23 Apr 2015 12:54:46 +0000

By: Alex Wed, 30 Apr 2008 21:37:55 +0000 This is a great explanation. I can see this function having a problem with words beginning with W; that was just the first thing that came to mind. I tested ‘witch’ and ‘which’ and received the results W320 and W200 respectively.
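For reference, here is a minimal Python sketch of the classic American Soundex rules, which is one way to reproduce the ‘witch’/‘which’ comparison outside the database. It is an illustration under the textbook rules, not Oracle's actual implementation; the names `CODES` and `soundex` are made up for this example.

```python
# Classic American Soundex digit classes; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex code (sketch, not Oracle's exact code)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    out = first
    prev = CODES.get(first, "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not separate two equal codes
            prev = code
    return (out + "000")[:4]


print(soundex("witch"))  # W320
print(soundex("which"))  # W200
```

The two words differ because the T in ‘witch’ contributes a 3, while the H in ‘which’ contributes nothing; the shared initial W plays no part in the mismatch.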

By: Terry Roddy Wed, 19 Dec 2007 21:52:02 +0000 With numbers, in particular, confusion is relatively unlikely – no need for a system like SOUNDEX. Note, however, that U.S. aviators often had to communicate alpha-numeric data over very noisy radio-links, leading them to develop the famous “Alpha Charlie Foxtrot” style of phonetic alphabet. There, they did find one ambiguity with spoken digits: 5 -vs- 9, leading to using “Five” and “Niner” (the latter’s two syllables unambiguously distinguishing it).

Thus, if your Zip-Code comparison “SOUNDEX” is more-likely concerned with the spoken numerals than with the actual numeric data, “Five” and “Nine” are your only likely points of verbal confusion.

On the other hand, if the questionable data was mostly transcribed from hand-written information, then you have several common points of confusion: 4 -vs- 9, 5 -vs- 6, and 7 -vs- 1 (with a too-large “flag” at the top).
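One way to sketch such a handwriting-oriented comparison is to map each digit to a representative of its confusion group and compare the results. The groups below are only the three pairs named above (4/9, 5/6, 7/1); the function names are hypothetical and the approach is an assumption, not an established algorithm.

```python
# Confusable handwritten digit pairs taken from the comment: 4/9, 5/6, 7/1.
# Each digit maps to one representative of its group.
CONFUSION_CLASS = {"9": "4", "6": "5", "7": "1"}


def handwriting_key(code):
    """Map each digit to its confusion-class representative."""
    return "".join(CONFUSION_CLASS.get(ch, ch) for ch in code)


def maybe_same_zip(a, b):
    """True if two codes differ only in digits that are easily misread."""
    return len(a) == len(b) and handwriting_key(a) == handwriting_key(b)


print(maybe_same_zip("90417", "40411"))  # True: 9/4 and 7/1 are confusable
print(maybe_same_zip("90210", "90310"))  # False: 2 and 3 are not in a group
```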

If you are going to do a “Fuzzy” check for alphanumeric code-sequences (serial “numbers”), then it gets even worse: -vs- , -vs- , -vs- -vs- -vs- , -vs- -vs- , -vs- -vs- , -vs- -vs- -vs- , etc. Good luck on that, and please publish your results if you do implement a “fuzzy writing” match-algorithm!

By: Terry Roddy Wed, 19 Dec 2007 21:31:29 +0000 Many of the SOUNDEX limitations stem from its origins as a pre-computerization filing mechanism. When I worked in the Title industry, we had to code these names by hand, using a little lookup-booklet (which was actually based on precalculated digrams and trigrams). Still, it was a laborious process for a “quick lookup” activity.

In “modern” computerized usage, we are probably far overreaching the original intent. I suspect that it was originally designed to be used on a single surname (which may look like multiple words, such as Tate-Abury or Fan Lo), but not on general text or phrases (which inherently include punctuation). Thus, punctuation characters (O’Dell) and “inconsequential” spaces (Fan Lo) would be ignored. Similarly, numeric data would only appear in “sounded out” form: “John Asset III” = “John Asset 3” = “John Asset the Third”. I believe (with nothing to back me up) that the latter would encode “Asset Third” rather than “Asset 3”, and would disregard the punctuation (space), resulting in either A233 or A236, depending on whether the “disregarded” space is seen as breaking up the run of ‘T’ characters. I think it should, in fact, be A233 (as does Oracle 10g).
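The two candidate results can be reproduced with a small Python sketch of classic Soundex in which a flag decides whether a non-letter breaks a run of equal codes. The flag `separators_break_runs` is hypothetical, introduced only to show both readings; this is not a description of how Oracle is implemented.

```python
# Classic American Soundex digit classes.
CODES = {c: d for d, letters in
         (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
          ("4", "L"), ("5", "MN"), ("6", "R")) for c in letters}


def soundex(text, separators_break_runs=False):
    """Soundex sketch; non-letters are skipped and optionally reset the run."""
    out = ""
    prev = ""
    for c in text.upper():
        if not ("A" <= c <= "Z"):
            if separators_break_runs:
                prev = ""  # treat the separator as ending the current run
            continue
        code = CODES.get(c, "")
        if not out:
            out = c        # the first letter is kept verbatim
            prev = code
            continue
        if code and code != prev:
            out += code
        if c not in "HW":  # H and W do not separate two equal codes
            prev = code
    return (out + "000")[:4] if out else ""


print(soundex("Asset Third"))                              # A236
print(soundex("Asset Third", separators_break_runs=True))  # A233
```

With the space ignored entirely, the two T's collapse into a single 3 and the R supplies the final 6; when the space resets the run, both T's are coded and the R falls outside the four-character limit.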

By: Patrick Sinke Thu, 22 Feb 2007 11:58:04 +0000 I do not know how Oracle SOUNDEX handles numbers. I once wrote the algorithm in PL/SQL with the Dutch rules, and there numbers would have been left as is. My guess is that Oracle SOUNDEX either removes numbers or leaves them as is. In the first case you will not be able to use SOUNDEX as a comparison function and you will have to write some additional code. In the second case it’ll be quite useable. A simple test answers the question:
SQL> select soundex('AB1234'), soundex('AB2234') from dual;
---- ----
A100 A100
Elapsed: 00:00:00.2
I assume this is not what you want, so you’ll need to do additional coding to compare numerics (which is in fact, really really easy).
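A rough Python sketch of that additional coding: split each value into its letters and its digits, match the letters phonetically, and require the digits to be identical. The `phonetic` parameter is a hypothetical hook where a Soundex function could be plugged in; the default simply compares the letters as they are.

```python
import re


def split_code(value):
    """Split a value into its letters and its digits; anything else is dropped."""
    letters = "".join(re.findall(r"[A-Za-z]", value)).upper()
    digits = "".join(re.findall(r"[0-9]", value))
    return letters, digits


def same_code(a, b, phonetic=lambda s: s):
    """Match letters through `phonetic`, but require identical digits."""
    letters_a, digits_a = split_code(a)
    letters_b, digits_b = split_code(b)
    return phonetic(letters_a) == phonetic(letters_b) and digits_a == digits_b


print(same_code("AB1234", "AB2234"))   # False: the digits differ
print(same_code("AB1234", "ab-1234"))  # True: same letters, same digits
```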

By: Todd Whiteley Thu, 15 Feb 2007 19:57:09 +0000 This is great for understanding but I have a client that is trying to pass me numbers to be used by SOUNDEX (zip codes) to compare against a master list – I know that everything is translated into a number currently – does SOUNDEX support numerics? Does it just read it in as is?

By: Pete_S Thu, 29 Jun 2006 13:58:06 +0000 Even in English, Soundex is not that good for a lot of applications; I really don’t like the idea that all initial letters have distinct sounds. Metaphone too has its flaws.
There are a lot of more sophisticated algorithms in the field of computational linguistics and speech processing that give far better results.
We worked on a project to check that adverts submitted through the internet to a publisher met legal standards. Here we had to implement a robust algorithm to detect the use of sound-alikes to get around the law. We also had to look out for numbers in words (‘3’ for ‘E’, ‘8’ for ‘ATE’), so we had little option but to write our own.
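As an illustration of the numbers-in-words problem only, a pre-processing step could expand such digits before any phonetic matching is applied. The Python sketch below uses just the two substitutions mentioned above; the table and function name are hypothetical, and this is not the algorithm used in that project.

```python
# Only the two substitutions named in the comment; a real screen would need more.
SUBSTITUTIONS = (("8", "ATE"), ("3", "E"))


def normalise(word):
    """Expand digits that stand in for letters, returning upper-case text."""
    word = word.upper()
    for digit, letters in SUBSTITUTIONS:
        word = word.replace(digit, letters)
    return word


print(normalise("gr8"))   # GRATE
print(normalise("fr33"))  # FREE
```

A blanket replacement like this would also mangle genuine numbers, so it would need to be restricted to tokens that mix letters and digits.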

By: Marc Portier Mon, 22 May 2006 11:00:15 +0000 Just found one more inconsistency:
After step 1 all ‘U’s are removed, so the rule QU => KW in step 2 doesn’t really make sense, right?
Together with the previous remark about SCH –> SEE, it starts to feel like step 2 could (should) be applied before step 1?
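The ordering effect can be checked with a small Python sketch. It assumes, for illustration only, that step 1 drops the vowels A, E, I, O, U after the first letter and that step 2 consists of just the two substitutions discussed here; the actual Dutch rule set may differ.

```python
VOWELS = set("AEIOU")                            # assumption about step 1
SUBSTITUTIONS = (("QU", "KW"), ("SCH", "SEE"))   # the two rules discussed


def drop_vowels(word):
    """Assumed step 1: keep the first letter, drop later vowels."""
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def substitute(word):
    """Assumed step 2: apply each substitution once, in table order."""
    for old, new in SUBSTITUTIONS:
        word = word.replace(old, new)
    return word


for name in ("JACQUES", "VISSCHER"):
    step1_first = substitute(drop_vowels(name))
    step2_first = drop_vowels(substitute(name))
    print(name, step1_first, step2_first)
# JACQUES JCQS JCKWS
# VISSCHER VSSEER VSSR
```

Under these assumptions, running step 1 first means QU never matches, and the E's introduced by SCH –> SEE survive; running step 2 first avoids both effects.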

I also think there are some possible optimisations that might be applied:
– DT and TD already result in 33, no need to translate to TT first?
– same for KC, CK == KK == 22
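The redundancy is easy to confirm in Python. The sketch uses only the digit assignments implied above (D and T to 3, C and K to 2) and is not the full code table.

```python
# Digit codes implied by the comment: D, T -> 3 and C, K -> 2.
CODE = {"D": "3", "T": "3", "C": "2", "K": "2"}


def encode(letters):
    """Translate each letter to its digit code."""
    return "".join(CODE[c] for c in letters)


print(encode("DT"), encode("TD"), encode("TT"))  # 33 33 33
print(encode("KC"), encode("CK"), encode("KK"))  # 22 22 22
```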

But on the other hand, I get the feeling the KZ sequence should get the same treatment as KS (i.e. replace to XX)

And one more question: I presume the translation rules of step 2 need to be applied recursively?
Given the possibility of sequences like CKX, I presume those should end up being coded like XXX, rather than like KKX.

By: Marc Portier Mon, 22 May 2006 09:55:04 +0000 Hi Patrick,

Thx for posting this, much appreciated, since indeed the classic (English) Russell Soundex doesn’t always play nice with Dutch names.

I’m looking into implementing this Dutch version for use outside Oracle as well (so not in PL/SQL, but rather in Java) and I’m somewhat puzzled by the description of step 2:

[1] replace SCH –> SEE
question: what should happen with these introduced ‘e’ s later on in the following steps? If the ‘e’ s need to be dropped (as in step 1) then doesn’t it make more sense just to replace SCH –> S?

[2] similar question about replacing to doubles like XX, KK, … etc
since step 4 will remove the doubles anyway, doesn’t it make more sense never to introduce them?

Outside the scope of this pure algorithm, but maybe you happen to know:
A lot of Dutch/Flemish names are prefixed with VAN, DE or VANDE (with or without spaces). Is it common practice just to drop those before calculating the Soundex?

I was also wondering where you found these Dutch language rules?
Point being: I’d like to try contributing my Java implementation as an addition to the jakarta-commons-codec package, but would need to make sure that there are no license or patent issues preventing that.
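If one did choose to drop the prefixes, a cautious Python sketch might strip them only when they are followed by a space, since an unspaced match is ambiguous (a name like Dekker merely starts with the letters DE). The pattern and function name are hypothetical, and whether dropping prefixes is common practice is exactly the open question here.

```python
import re

# VAN DE / VANDE, VAN or DE at the start, followed by at least one space.
PREFIX = re.compile(r"^(?:VAN\s*DE|VAN|DE)\s+", re.IGNORECASE)


def strip_prefix(surname):
    """Drop one leading VAN/DE/VANDE prefix when it is set off by a space."""
    return PREFIX.sub("", surname.strip(), count=1)


print(strip_prefix("van de Velde"))  # Velde
print(strip_prefix("de Vries"))      # Vries
print(strip_prefix("Dekker"))        # Dekker (left unchanged)
```

Names written as one word, such as Vandevelde, are left untouched by this sketch; handling them would need a list of known surnames or some other disambiguation.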


By: Patrick Sinke Fri, 10 Mar 2006 15:38:45 +0000 No, the Soundex function in Oracle only supports English words.
Knowing the described algorithm, it’s quite easy to write some PL/SQL that does a Soundex comparison for the Dutch language though. It’s 40 or 50 lines of code, no more.

By: GerwinT Tue, 07 Mar 2006 07:16:21 +0000 Interesting to know that something that looks so complex in fact is simple and understandable. Does Oracle support the Dutch Soundex Rules or is it easy to create your own Dutch Soundex function within Oracle?