Even though many languages share the same or a very similar alphabet, the use of letters in documents written in these languages is quite distinct. The letter "e" is quite popular, but it is not the most used letter in every language. In fact, letter frequency is very specific to a language, and it can be used to determine the language of a document in a simple and pretty fast way.
The very simple steps are:
- count occurrences of letters in a document
- order the letters by number of occurrences into a string (most occurring letter first) (for example: eiaonlrtscdupmgvfhqbzéèûâjxêy)
- compare the letter sequence with the known letter-occurrence-sequences for all languages and find the closest match; that gives the language of the document
For this we require two important things:
- letter frequency sequence for all languages we want to test for – available at http://letterfrequency.org/letter-frequency-by-language/
- a method to compare the similarity of two sequences – and a great candidate for this is the Levenshtein Distance – introduced for example here: https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/
A step-by-step description of how to determine the language of documents programmatically is found in this Jupyter Notebook, where I have created a small Python-based program that counts the letters in a document, creates the letter sequence based on occurrence and compares this sequence to the known sequences for twenty different languages. This GitHub repo contains the code and document examples: https://github.com/lucasjellema/language-determination-analytics.
Walking through the Approach
1. Read Letter Frequency Data into Pandas Data Frame
Load the CSV data with letter sequences for 20+ languages (taken from http://letterfrequency.org/letter-frequency-by-language/) from a text file into a Pandas Data Frame and prepare it for visualization and further processing.
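As a minimal sketch of this loading step: the column names (language, sequence), the helper name load_letter_sequences and the three rows below are assumptions for illustration only. The rows are made-up placeholders, not the real sequences from letterfrequency.org, and the notebook in the repo may organize the file differently.

```python
import io

import pandas as pd

# Placeholder data: language names and sequences are invented for this sketch
# and are NOT the real sequences published on letterfrequency.org.
SAMPLE_CSV = """language,sequence
lang_a,etaoinshr
lang_b,eiaonlrts
lang_c,enisratdh
"""


def load_letter_sequences(source):
    """Read a CSV with 'language' and 'sequence' columns into a Data Frame
    indexed by language. `source` can be a file path or a file-like object."""
    df = pd.read_csv(source)
    # Normalize the sequences so later comparisons are case-insensitive.
    df["sequence"] = df["sequence"].str.strip().str.lower()
    return df.set_index("language")


sequences_df = load_letter_sequences(io.StringIO(SAMPLE_CSV))
print(sequences_df)
```

Passing a file path instead of the StringIO object reads the same structure from disk.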
2. Create a function that loads a document from file and counts the occurrences of letters
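A minimal sketch of such a function, not necessarily the one used in the notebook. The names count_letters and count_letters_in_file are hypothetical, and the choice to lowercase the text and keep every character for which str.isalpha() is true (so accented letters such as é count as separate letters) is an assumption.

```python
from collections import Counter


def count_letters(text):
    """Count alphabetic characters in text, ignoring case.
    Digits, whitespace and punctuation are skipped."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def count_letters_in_file(path, encoding="utf-8"):
    """Load a text file and count the letters it contains."""
    with open(path, encoding=encoding) as handle:
        return count_letters(handle.read())


# Try the counting logic on a short string instead of a file.
counts = count_letters("Nel mezzo del cammin di nostra vita")
print(counts.most_common(3))
```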
3. Run the function for a specific document
The document text-file-italian.txt is processed. This document is taken from the Project Gutenberg site (https://www.gutenberg.org); spoiler alert: it is from an Italian book. The letter counts are turned from a dict into a Pandas Data Frame and presented:
4. Derive the ordered letter sequence to compare with known sequences for various languages
This string contains all distinct letters in the document, ordered by number of occurrences, with the most frequent letter first.
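Deriving that string from the letter counts could look roughly like this. The function name letter_sequence is hypothetical, and breaking ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter


def letter_sequence(counts):
    """Return the letters ordered from most to least frequent as one string.
    Letters with equal counts are ordered alphabetically (an assumption)."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(letter for letter, _ in ordered)


counts = Counter("mississippi")
print(letter_sequence(counts))  # prints "ispm"
```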
5. Create a function to calculate the distance or difference between two letter sequences – based on the Levenshtein Distance
Put very simply, the Levenshtein Distance expresses the similarity between two sequences by determining the number of actions (insert, delete, substitute) required to convert one sequence into the other. Two letter sequences are very close if few changes are required to morph one sequence into the other (such as bright and blight, which are very close, and freight, which is also close to bright but a little further away than blight). The Stack Abuse article (https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/) describes the Levenshtein Distance method and an implementation in Python in some more detail. I have happily taken and slightly modified the Python function.
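For reference, a plain textbook dynamic-programming version of the Levenshtein Distance is sketched below. It is not the modified function from the notebook, and the name levenshtein is chosen here for illustration.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions
    needed to turn sequence a into sequence b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # substitute (or keep)
            )
    return dist[-1][-1]


print(levenshtein("bright", "blight"))   # 1
print(levenshtein("bright", "freight"))  # 2
```

The two calls reproduce the example from the text: blight is one substitution away from bright, while freight needs a substitution and an insert.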
Try out the function for a few strings:
6. Find the shortest distance between the Letter Sequence for the Document Under Scrutiny and the Ordered Letter Sequences for All Languages
Now that we can calculate the distance between two sequences, we can iterate over the letter-frequency-based sequences for all languages, calculate the distance from our document's sequence to each of these and select the language for which the distance is smallest as the most likely language for the document:
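A sketch of that selection step, under stated assumptions: the known sequences are held in a plain dict, the three entries are made-up placeholders rather than real language data, and rank_languages is a hypothetical name. A compact two-row Levenshtein implementation is included only to keep the example self-contained.

```python
def levenshtein(a, b):
    """Compact Levenshtein Distance that keeps only two rows of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete
                               current[j - 1] + 1,       # insert
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def rank_languages(document_sequence, known_sequences):
    """Return (distance, language) pairs, smallest distance first."""
    return sorted(
        (levenshtein(document_sequence, sequence), language)
        for language, sequence in known_sequences.items()
    )


# Placeholder sequences, invented for this sketch (not real language data).
KNOWN = {"lang_a": "etaoinshr", "lang_b": "eiaonlrts", "lang_c": "enisratdh"}

ranking = rank_languages("eiaonlrtc", KNOWN)
best_distance, best_language = ranking[0]
print(best_language, best_distance)  # lang_b 1
```

Returning the full ranking rather than only the winner makes it visible when two languages are nearly tied.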
Running this code results in:
7. Some more examples
This function can process any text file and show us the most likely language for its contents:
I have invoked the function for a few additional documents. It does pretty great, and it does so very fast! It incorrectly identifies a Danish document (also taken from the Gutenberg project) as Swedish. I hear these two are very close? And given that the text is from the Gutenberg site, it is probably somewhat old (fashioned) and perhaps has a different letter frequency than today's Danish does?
As always, I had some very firm shoulders to stand on when composing this article. In addition to everyone in the community who graciously shared their knowledge, I would like to thank my colleagues Onno Hartvelt, Jeffrey Resodikromo and Rosanna Denis for sparring and brainstorming around this article and several others.
Letter Frequencies per language: http://letterfrequency.org/
Levenshtein Distance – to compare series and their difference (in Python) – https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/
Letter frequency is language specific. For letter frequencies, see this Wikipedia page: https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages.