At Datastreamer we index a lot of HTML. On an average day we index about 5TB of HTML content and write about 600GB of that to our Elasticsearch index. As part of our indexing we perform data augmentation including language detection.
But how do we scale that and provide high quality language classification without any central point of failure?
Language classification (for text) is the ability for an algorithm to take input as raw text or HTML and provide a corresponding language code.
Sounds easy at first until you factor in that there are 50 major languages that you need to support (and plenty of languages used by smaller populations).
Additionally, it has to be fast, reliable, and easily distributed.
Obviously having humans perform the classification manually is not an option.
Even if you had plenty of humans to tell you if some text is in English it’s going to be difficult for them to identify a language with which they’re not familiar.
Our strategy is that we use Wikipedia to form a base corpus to perform the language classification.
Wikipedia is nice enough to publish both the HTML for articles across all languages as well as their HTTP logs.
We use Wikipedia’s HTTP logs to produce an index of the top 10k documents per language.
We then use the text of the articles to build a corpus for each language.
Now that we have a corpus (courtesy of Wikipedia) we can start analyzing the text for patterns to index the content.
Fortunately Zipf’s Law comes to our rescue:
Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word “of” accounts for slightly over 3.5% of words (36,411 occurrences), followed by “and” (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
Basically, natural language distribution of terms in English (and other langauges) will always follow Zipf’s Law.
You end up with a lot of popular words the second ranking word is usually about half the rank of the first.
A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015) in a log-log scale.
Note that the above plot is across ALL languages. I’ve always found it fascinating that the internal representation of human language is stable across all language.
This is almost certainly due to what Chomsky referred to as our I language (or internal language) and that fact that humans are hard wired for an internal language. This is probably formed via evolution and why all languages have the same mathematical structure.
What’s more interesting here is that not only do the words follow a Zipfian distribution, but so do the ngrams.
What’s an Ngram?
An N-gram is an N-length word or symbol tuple.
For example. A unigram is just one word or symbol at a time.
A bigram is two words or symbols, Trigrams are three, etc.
For our use we don’t require the words though, we can split the actual characters in the text up into bigrams and trigrams.
So for example, the phrase:
Could be broken into the following bigrams:
he el ll lo
And could be broken into the following trigrams:
hel ell llo
If you keep a histograph of these, they ALSO form a Zipfian distribution.
Now to compute language models all we have to do is take each language model and compute the Zipf distribution of the ngrams.
However, we don’t need all of them. We truncate them down to the top 500 ngrams which is MORE than enough.
This means we have a super small model for each language.
Each language in our index is less than 10k and fits into RAM.
This allows us to easily distribute it across our cluster.
Additionally, since the language model is stable, it RARELY needs updating.
In fact, models built on text from hundreds of years ago would work just fine on documents created today.
The final language classification stage is easy. You take an input document, compute the ngram distribution using the same algorithm you used to build the corpus.
Then you compute the delta between the input document and the target languages.
The target language with the closest score is the winner.
Additionally, we can fail if there isn’t a clear winner. All natural documents will form a unique signature so if you invent a fake language (like Dothraki) it won’t actually classify and you can return ‘undefined’
Now the main problem here is that this only works for romantic language (English, German, French, Spanish, etc) does NOT work for other languages like Arabic, Japanese, Thai, Korean, etc.
How do we handle that? Codepage detection.
This is really the simple model of looking at the character distribution within unicode and if the majority of text fits into a specific language region then we’re done.
We actually start off with code page detection and then if that fails we fall back to ngram detection.
This solves all the necessary issues for a language classifier when using it in production.
It’s fast. Easily distributed. Our language models are small, etc.
We have about 600 cores executing 24/7 performing classification using this technique and have been doing so for nearly 10 years now.