Fuzzy comparing (Dutch) person names with Java


I’m currently working on a project in which one ‘little’ piece of functionality is to find, in a list of persons, the one whose name is most similar to an input name. It’s not a major feature, so we didn’t want to bring in a search framework or some other kind of system. Searching the internet, I found a nice little piece of code for fuzzy comparisons [ref 1]. I changed it a bit to meet my requirements for person names, especially Dutch ones.

The function accepts two names as parameters and calculates a number between 0 and 1 that indicates the lexical similarity between the two names (0 means no match at all, 1 means a complete match).
What is specific to Dutch person names is that the abbreviations “v” and “v.d.” or “vd” are commonly used for the middle names “van”, “van der”, “van den” and “van de”. Also, only the letters [a-zA-Z] are taken into account in the calculation; every other character is treated as a separator.
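
To make the abbreviation rule concrete, here is a minimal, stand-alone sketch of the normalization step. The class name `DutchNameNormalizer` and the method name `normalize` are made up for this illustration; the replacement rules are the ones used in the code further down, and the sample names are invented.

```java
import java.util.Locale;

class DutchNameNormalizer {

    /** Upper-cases the name, keeps only A-Z and abbreviates the Dutch middle names. */
    static String normalize(String name) {
        if (name == null) {
            return "";
        }
        // Every run of characters other than A-Z becomes a single space.
        String s = name.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]+", " ");
        // Pad with spaces so that only whole words are matched below.
        s = " " + s + " ";
        s = s.replaceAll(" VAN ", " V ")
             .replaceAll(" DE[RN]? ", " D ")
             .replaceAll(" VD ", " V D ");
        return s.replaceAll(" +", " ").trim();
    }

    public static void main(String[] args) {
        String[] samples = {"Jan van der Berg", "Jan v.d. Berg", "Jan vd Berg"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + normalize(sample));
        }
    }
}
```

All three sample spellings end up as "JAN V D BERG", so they compare as identical afterwards. Note that accented letters are not in [A-Z] and are therefore treated as separators as well.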

Here is the code:

    import java.util.ArrayList;

    /**
     * Compares two person names on their adjacent letter pairs.
     *
     * @return lexical similarity value in the range [0,1]
     */
    public static double fuzzyNameCompare(String name1, String name2) {
        String str1 = stripString(name1);
        String str2 = stripString(name2);
        ArrayList<String> pairs1 = wordLetterPairs(str1);
        ArrayList<String> pairs2 = wordLetterPairs(str2);
        int intersection = 0;
        int union = pairs1.size() + pairs2.size();
        if (union == 0) {
            return 0.0; // both names are empty: avoid 0/0, which would yield NaN
        }
        for (String pair1 : pairs1) {
            // remove(Object) deletes only the first equal pair, so every pair
            // in pairs2 is matched at most once
            if (pairs2.remove(pair1)) {
                intersection++;
            }
        }
        return (2.0 * intersection) / union;
    }

    /**
     * @return the string in upper case, stripped of punctuation and with Dutch middle names abbreviated
     */
    private static String stripString(String s) {
        if (s == null) {
            return "";
        }
        s = s.trim();
        if (s.length() > 0) {
            s = s.toUpperCase().replaceAll("[^A-Z]+", " "); //to CAPS and remove punctuation
            s = " " + s + " "; //add leading and trailing space for next replaceAll statements
            //use typical Dutch abbreviations:
            s = s.replaceAll(" VAN ", " V ").replaceAll(" DE[RN]? ", " D ").replaceAll(" VD ", " V D ");
            s = s.replaceAll("[ ]+", " ").trim();
            //System.out.println("debug:" + s);
        }
        return s;
    }

    /**
     * @return an ArrayList of 2-character Strings (a one-letter word is added as is).
     */
    private static ArrayList<String> wordLetterPairs(String str) {
        ArrayList<String> allPairs = new ArrayList<String>();
        // Tokenize the string and put the tokens/words into an array
        String[] words = str.split("\\W");
        // For each word
        for (String word : words) {
            if (word.length() > 0) {
                // Find the pairs of characters
                for (String pair : letterPairs(word)) {
                    allPairs.add(pair);
                }
            }
        }
        return allPairs;
    }

    /**
     * @return an array of adjacent letter pairs contained in the input string
     */
    private static String[] letterPairs(String str) {
        int numPairs = str.length() - 1;
        String[] pairs;
        if (numPairs < 1) {
            // a one-letter word (such as the abbreviation V) counts as a pair itself
            pairs = new String[1];
            pairs[0] = str;
        } else {
            pairs = new String[numPairs];
            for (int i = 0; i < numPairs; i++) {
                pairs[i] = str.substring(i, i + 2);
            }
        }
        return pairs;
    }
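
To show how such a score can be used to pick a person from a list, here is a small self-contained sketch. The class `BestMatchDemo` and the methods `pairs`, `similarity` and `bestMatch` are hypothetical names for this illustration. It uses a compact version of the same letter-pair calculation and assumes the names have already been normalized (upper case, words separated by single spaces).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BestMatchDemo {

    /** Adjacent letter pairs of every word; a one-letter word counts as a pair itself. */
    static List<String> pairs(String normalized) {
        List<String> result = new ArrayList<>();
        for (String word : normalized.split(" ")) {
            if (word.length() == 1) {
                result.add(word);
            }
            for (int i = 0; i + 2 <= word.length(); i++) {
                result.add(word.substring(i, i + 2));
            }
        }
        return result;
    }

    /** 2 * (shared pairs) / (total pairs); 0.0 when neither name has any pairs. */
    static double similarity(String a, String b) {
        List<String> pairsA = pairs(a);
        List<String> pairsB = pairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0;
        }
        int shared = 0;
        for (String pair : pairsA) {
            // remove(Object) removes one occurrence, so a pair is never counted twice
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    /** Returns the candidate with the highest score, or null for an empty list. */
    static String bestMatch(String input, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(input, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> persons = Arrays.asList("JAN V D BERG", "JAN JANSEN", "PIET D VRIES");
        System.out.println(bestMatch("J V D BERG", persons));
    }
}
```

As a worked example: "JAN" has the pairs JA and AN, "JANS" has JA, AN and NS. Two pairs are shared out of five in total, so the score is 2 * 2 / 5 = 0.8. In practice you would probably also want a minimum score below which no person is considered a match; that threshold is not part of the original code.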

ref 1: http://www.katkovonline.com/2006/11/java-fuzzy-string-matching/

 


About Author

Emiel is a senior Java & SOA consultant at AMIS, Nieuwegein (The Netherlands).
