java Extremely Simple Language Identifier
jExSLI
jExSLI was developed by Kristina Gulordava as part of a summer internship project at the FBK HLT group.
jExSLI is a simple text language identifier. It can be used, for example, to determine the language of a text input given to your application. It is written in Java (compatible with any application written in Java 1.5 or later) and is distributed as a single jar file.
The initial list covers the 20 most commonly used languages and can be easily extended.
The tool applies a very simple text categorization approach based on the similarity of documents represented as vectors of terms with their tf*idf values. To exploit this idea, each language must itself be represented as such a vector; for this we use the most frequent words of the language and their frequencies. According to our evaluation, this is a reasonable approach that performs well not only for long texts but also for phrases longer than 5 words.
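The approach described above can be illustrated with a minimal sketch. This is not the jExSLI source code or its API: the class name LanguageIdSketch, the three toy profiles, and their frequency values are all made up for illustration, and the idf here is assumed to be computed over the language profiles themselves (log of the number of languages divided by the number of languages containing the word). The sketch uses Java 8 features for brevity, so it would not compile on Java 1.5.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; not the jExSLI implementation.
class LanguageIdSketch {

    // Hypothetical profiles: a few frequent words per language with made-up frequencies.
    static final Map<String, Map<String, Double>> PROFILES = new HashMap<>();
    static {
        PROFILES.put("en", profile("the", 0.06, "of", 0.03, "and", 0.03, "in", 0.02, "is", 0.01));
        PROFILES.put("it", profile("di", 0.04, "e", 0.03, "il", 0.02, "la", 0.02, "in", 0.015));
        PROFILES.put("de", profile("der", 0.04, "die", 0.035, "und", 0.03, "in", 0.02, "ist", 0.01));
    }

    // Builds a word -> frequency map from alternating word/frequency arguments.
    static Map<String, Double> profile(Object... wordsAndFreqs) {
        Map<String, Double> m = new HashMap<>();
        for (int i = 0; i < wordsAndFreqs.length; i += 2) {
            m.put((String) wordsAndFreqs[i], (Double) wordsAndFreqs[i + 1]);
        }
        return m;
    }

    // idf over language profiles; a word shared by every language (here "in") gets weight 0.
    static double idf(String word) {
        int df = 0;
        for (Map<String, Double> p : PROFILES.values()) {
            if (p.containsKey(word)) df++;
        }
        return df == 0 ? 0.0 : Math.log((double) PROFILES.size() / df);
    }

    // Turns term frequencies into tf*idf weights.
    static Map<String, Double> weigh(Map<String, Double> tf) {
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            v.put(e.getKey(), e.getValue() * idf(e.getKey()));
        }
        return v;
    }

    // Cosine similarity between two sparse vectors; 0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Returns the code of the most similar profile, or "unknown" if nothing matches.
    static String identify(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1.0, Double::sum);
        }
        Map<String, Double> query = weigh(tf);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : PROFILES.entrySet()) {
            double score = cosine(query, weigh(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(identify("the cat is in the house"));
        System.out.println(identify("der Hund und die Katze"));
    }
}
```

With realistic profiles of a few hundred frequent words per language, the same scoring loop applies unchanged; only the profile data grows, which is also how a new language would be added in this sketch.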