Incorrect detection of the Russian language: Russian text is identified as Hungarian.
I would like to take this ticket myself. Please do not start work on it; I will do the work.
The goal is to improve the results of language detection for text.
The simplest solution
The simplest solution: open textcat.jar \ org \ knallgrau \ utils \ textcat \ textcat.conf
and apply the patch textcat.conf.diff to it (see the files attached to this ticket).
Slightly more complex solution:
Unpack complex_solution.tar.bz2 (see the files attached to this ticket).
Open textcat.jar \ org \ knallgrau \ utils \ textcat \ textcat.conf
and apply the patch textcat.conf.diff from the unpacked complex_solution.tar.bz2 to it. Then copy the file russian-UTF8.lm (also from complex_solution.tar.bz2) to textcat.jar \ org \ knallgrau \ utils \ textcat \ language_fp \ russian-UTF8.lm
This will allow analysis of Russian text in UTF-8 encoding.
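To confirm that the copy step worked, the contents of the jar can be inspected, since a jar is an ordinary zip archive. The following is only a minimal sketch in Python: the helper name `list_fingerprints` is hypothetical, and the directory path is the one given in the steps above.

```python
import zipfile

# Directory inside textcat.jar that holds the language fingerprints,
# as named in the steps above.
FP_DIR = "org/knallgrau/utils/textcat/language_fp/"


def list_fingerprints(jar):
    """Return the sorted .lm file names found under FP_DIR.

    `jar` may be a path to the jar or an open binary file object.
    This helper is hypothetical and is not part of textcat.
    """
    with zipfile.ZipFile(jar) as zf:
        return sorted(
            name[len(FP_DIR):]
            for name in zf.namelist()
            if name.startswith(FP_DIR) and name.endswith(".lm")
        )
```

After the complex solution has been applied, `list_fingerprints("textcat.jar")` should include `russian-UTF8.lm` in its result.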
Thank you for your patch.
I have a question.
Can we add only the "russian-UTF8.lm" line to textcat.conf?
OpenSearchServer works only with UTF-8.
No, you cannot do that, because many documents in Russian are in cp1251 (windows-1251). As an example, I have attached a Word document. When WordExtractor is initialized (the same applies to PDFBox), no encoding is specified, so the text may be returned in the encoding of the original document. To be sure, I ran a test: I added only the russian-UTF8.lm line, and the text was again recognized as Hungarian. I strongly recommend specifying all 4 lines (koi8r, windows1251, iso8859_5 and utf-8); this way you can avoid mistakes in detecting the language of texts.
Example of the Russian text
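To illustrate why a fingerprint built for one encoding may not match the same text in another, here is a small Python sketch. It assumes that the comparison happens on raw byte n-grams; the sample word and the helper `byte_trigrams` are illustrative only and are not taken from textcat or from the attached files.

```python
from itertools import combinations

# The four encodings named in the comment above, under their Python codec names.
ENCODINGS = ["koi8_r", "cp1251", "iso8859_5", "utf-8"]


def byte_trigrams(data: bytes) -> set:
    """Return the set of all 3-byte windows in `data` (hypothetical helper)."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


# Illustrative sample word, not from the attached document.
sample = "Привет"
encoded = {name: sample.encode(name) for name in ENCODINGS}

# Show the raw bytes produced by each encoding.
for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Count the byte trigrams that each pair of encodings has in common.
for a, b in combinations(ENCODINGS, 2):
    shared = byte_trigrams(encoded[a]) & byte_trigrams(encoded[b])
    print(f"{a} vs {b}: {len(shared)} shared trigrams")
```

If the byte sequences have no n-grams in common, a profile for one encoding says nothing useful about text in another, which is consistent with the recommendation to list all four lines.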
We have to check that there is no license conflict with the "russian-UTF8.lm" file. What is the origin of this file? I saw that the most recent version of Textcat is in SpamAssassin.