Determine the Language of a Document from the Letter Frequency – using Levenshtein Distance between sequences


Even though many languages share the same or a very similar alphabet, the use of letters in documents written in these languages is quite distinct. The letter “e” is very common, but it is not the most used letter in every language. In fact, letter frequency is quite specific to a language, and it can be used to determine the language of a document in a simple and fairly fast way.

The very simple steps are:

  • count occurrences of letters in a document
  • order the letters by number of occurrences into a string (most occurring letter first) (for example: eiaonlrtscdupmgvfhqbzéèûâjxêy)
  • compare the letter sequence with the known letter-occurrence-sequences for all languages and find the closest match; that gives the language of the document

For this we need two things:

  1. letter frequency sequence for all languages we want to test for – available at http://letterfrequency.org/letter-frequency-by-language/
  2. a method to compare the similarity of two sequences – and a great candidate for this is the Levenshtein Distance – introduced for example here: https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/

A step-by-step description of how to determine the language of documents programmatically is found in this Jupyter Notebook, where I have created a small Python-based program that counts the letters in a document, creates the letter sequence based on occurrence and compares this sequence to the known sequences for twenty different languages. This GitHub repo contains the code and document examples: https://github.com/lucasjellema/language-determination-analytics.

Walking through the Approach

1. Read Letter Frequency Data into Pandas Data Frame

Load the CSV data with letter sequences for 20+ languages (taken from http://letterfrequency.org/letter-frequency-by-language/) from a text file into a Pandas Data Frame and prepare it for visualization and further processing.

image
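As a rough illustration of this loading step, here is a minimal sketch. The notebook uses Pandas; this sketch swaps in the standard library `csv` module so that it runs without extra dependencies. The column names `language` and `sequence`, the function name `load_language_sequences` and the two sample rows are assumptions made for illustration; the actual file in the repo may be laid out differently.

```python
import csv
import io

# Hypothetical CSV layout: one row per language, with the letters ordered
# from most to least frequent. The rows below are illustrative placeholders,
# not data copied from letterfrequency.org.
SAMPLE_CSV = """language,sequence
English,etaoinshrdlcumwfgypbvkjxqz
Example,eiaonlrtscdupmgvfhqbzéèûâjxêy
"""


def load_language_sequences(csv_file):
    """Read (language, sequence) rows into a dict keyed by language name."""
    reader = csv.DictReader(csv_file)
    return {row["language"].strip(): row["sequence"].strip() for row in reader}


# io.StringIO stands in for an opened file, e.g. open(path, encoding="utf-8")
sequences = load_language_sequences(io.StringIO(SAMPLE_CSV))
for language, sequence in sequences.items():
    print(f"{language:10s} {sequence}")
```

A plain dict of language name to letter sequence is enough for the comparison later on; a Data Frame is mostly convenient for display in the notebook.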

2. Create a function that loads a document from file and counts the occurrences of letters

image
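A minimal sketch of such a counting function could look like the following. The function names `count_letters` and `count_letters_in_file` are hypothetical, and the choice to lowercase everything and to keep only alphabetic characters is an assumption of this sketch rather than a description of the notebook's code.

```python
from collections import Counter


def count_letters(text):
    """Count each letter in text, ignoring case, digits, spaces and punctuation."""
    # str.isalpha() is also True for accented letters such as é and ø,
    # so language-specific letters are counted too.
    return Counter(ch for ch in text.lower() if ch.isalpha())


def count_letters_in_file(path, encoding="utf-8"):
    """Read a text file and count the letters it contains."""
    with open(path, encoding=encoding) as handle:
        return count_letters(handle.read())


counts = count_letters("Nel mezzo del cammin di nostra vita")
print(counts.most_common(3))
```

`Counter` behaves like a dict, so the result can be passed on wherever a dict of letter counts is expected.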

3. Run the function for a specific document

The document text-file-italian.txt is processed. This document is taken from the Gutenberg project site (https://www.gutenberg.org); spoiler alert: it is from an Italian book. The resulting letter counts are converted from a dict into a Pandas Data Frame and presented:

image

4. Derive the ordered letter sequence to compare with known sequences for various languages

This string contains every distinct letter in the document, ordered by number of occurrences (most frequent letter first).

image
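Deriving the sequence from a dict of letter counts can be sketched as below. The function name `letter_sequence` is hypothetical, and breaking ties alphabetically is an assumption added here only to make the result deterministic.

```python
def letter_sequence(counts):
    """Return all letters as one string, most frequent letter first."""
    # Sort by descending count; letters with equal counts are ordered
    # alphabetically (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(letter for letter, _ in ordered)


print(letter_sequence({"a": 3, "e": 5, "i": 3, "o": 1}))  # eaio
```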

5. Create a function to calculate the distance or difference between two letter sequences – based on the Levenshtein Distance

Put very simply, the Levenshtein Distance expresses the difference between two sequences as the minimum number of actions (insert, delete, substitute) required to convert one sequence into the other. Two letter sequences are very close if few changes are required to morph one sequence into the other: bright and blight are very close (one substitution), and freight is also close to bright but a little further away than blight (two changes). This article describes the Levenshtein Distance method and an implementation in Python in some more detail. I have happily taken and slightly modified the Python function.

image

Try out the function for a few strings:

image
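For readers who want to experiment without the notebook, here is a minimal sketch of the Levenshtein Distance using the standard dynamic-programming approach. It is not the modified function referred to above; the name `levenshtein` and the two-row implementation are choices made for this sketch.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character edits between two strings."""
    # previous[j] is the distance between the part of source processed so far
    # (initially the empty string) and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("bright", "blight"))   # 1
print(levenshtein("bright", "freight"))  # 2
```

Keeping only two rows of the distance matrix is sufficient here, because only the final distance is needed and not the sequence of edits.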

6. Find the shortest distance between the Letter Sequence for the Document Under Scrutiny and the Ordered Letter Sequences for All Languages

Now that we can calculate the distance between two sequences, we can iterate over the letter-frequency-based sequences for all languages, calculate the distance from our document’s sequence to each of these and select the language for which the distance is smallest as the most likely language for the document:

image

Running this code results in:

image

That is absolutely correct.
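The selection step can be sketched as follows. The function name `rank_languages`, the placeholder language names and the sample sequences are all hypothetical; a compact Levenshtein implementation is included only so that the block runs on its own.

```python
def levenshtein(a, b):
    """Compact dynamic-programming Levenshtein distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_languages(document_sequence, language_sequences):
    """Return (language, distance) pairs, smallest distance first."""
    distances = {
        language: levenshtein(document_sequence, sequence)
        for language, sequence in language_sequences.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Placeholder reference data; real sequences would come from the CSV file.
known_sequences = {
    "language_a": "etaoinshrdlcumwfgypbvkjxqz",
    "language_b": "eiaonlrtscdupmgvfhqbzéèûâjxêy",
}
ranking = rank_languages("eiaonlrtscdupmgvfhqbz", known_sequences)
best_language, best_distance = ranking[0]
print(best_language, best_distance)
```

Returning the full ranking rather than only the best match makes it easy to see how close the runner-up is, which helps when two languages have very similar letter sequences.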

7. Some more examples

This function can process any text file and show us the most likely language for its contents:

image

I have invoked the function for a few additional documents. It does pretty well, and it does so very fast! It incorrectly identifies a Danish document (also taken from the Gutenberg project) as Swedish. I hear these two are very close? And given that the text is from the Gutenberg site, it is probably somewhat old (fashioned) and perhaps has a different letter frequency than today’s Danish does?

image

image

image

Credits

As always, I had some very firm shoulders to stand on when composing this article. In addition to everyone in the community who graciously shared their knowledge, I would like to thank my colleagues Onno Hartvelt, Jeffrey Resodikromo and Rosanna Denis for sparring and brainstorming around this article and several others.

Resources

Letter Frequencies per language: http://letterfrequency.org/

Levenshtein Distance – to compare series and their difference (in Python) – https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/

Letter frequency is language specific. For letter frequencies per language, see this page in Wikipedia: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages.

About Author

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director and Oracle Developer Champion. Solution architect and developer on diverse areas including SQL, JavaScript, Kubernetes & Docker, Machine Learning, Java, SOA and microservices, events in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 12c Handbook. Frequent presenter on user groups and community events and conferences such as JavaOne, Oracle Code, CodeOne, NLJUG JFall and Oracle OpenWorld.
