Incorrect detection of the Russian language. Russian text is identified as Hungarian.
Please assign this ticket to me. Please do not start work on this ticket; I will do the work.
The goal is to refine the results of detecting the language of a text.
The simplest solution
The simplest solution: open textcat.jar\org\knallgrau\utils\textcat\textcat.conf
and apply the patch textcat.conf.diff to it (included in the files attached to this ticket)
Slightly more complex solution:
Unpack complex_solution.tar.bz2 (take it from the files attached to this ticket),
open textcat.jar\org\knallgrau\utils\textcat\textcat.conf
and apply the patch textcat.conf.diff from the unpacked complex_solution.tar.bz2 to it. Then copy the file russian-UTF8.lm (from complex_solution.tar.bz2) to textcat.jar\org\knallgrau\utils\textcat\language_fp\russian-UTF8.lm
This will allow analysis of Russian text in UTF-8 encoding.
Thank you for your patch.
I have a question.
Can we add only the "russian-UTF8.lm" line to textcat.conf?
OpenSearchServer works only in UTF-8.
No, you cannot do that, because many documents in Russian are in cp1251 (windows-1251). As an example, I attach a Word document. When WordExtractor is initialized (the same applies to PDFBox), no encoding is specified, so the text may be returned in the encoding of the original document. To be sure, I ran a test: I kept only the russianUTF8.lm line, and the text was again recognized as Hungarian. I strongly recommend specifying all 4 lines (koi8r, windows1251, iso8859_5 and utf-8); this way you can avoid mistakes in recognizing the language of texts.
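To illustrate why a fingerprint built for one encoding does not match the same text in another, here is a small sketch that encodes one Russian word ("Привет") in the four encodings mentioned above and shows that the byte sequences differ. It uses only the standard Java charset API and is not code from textcat or OpenSearchServer. The class name EncodingDemo and the sample word are made up for this example, and it assumes that the JDK in use provides the windows-1251, KOI8-R and ISO-8859-5 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingDemo {
    // Russian word for "hello", spelled with Unicode escapes so that the
    // encoding of this source file does not matter.
    public static final String SAMPLE = "\u041f\u0440\u0438\u0432\u0435\u0442";

    // Encode the text with the named charset.
    public static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Encode as windows-1251, then decode as if the bytes were UTF-8.
    // This mimics text that arrives in an encoding other than the expected one.
    public static String misread(String text) {
        byte[] cp1251 = encode(text, "windows-1251");
        return new String(cp1251, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "windows-1251", "KOI8-R", "ISO-8859-5"};
        for (String name : names) {
            // Each encoding yields a different byte sequence for the same word.
            System.out.println(name + ": " + Arrays.toString(encode(SAMPLE, name)));
        }
        System.out.println("misread text equals original: "
                + misread(SAMPLE).equals(SAMPLE));
    }
}
```

Since the bytes differ, the character n-grams seen by a detector differ as well whenever the text is not decoded with the right charset, which is consistent with the recommendation to keep one fingerprint line per encoding.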
Example of the Russian text
We have to check that there is no license conflict with the "russian-UTF8.lm" file. What is the origin of this file? I saw that the most recent version of Textcat is in SpamAssassin.
I am not sure I understand your question... What does this issue have to do with SpamAssassin? :) TextCat does not include a russianUTF8.lm file; I generated it myself with:
java -jar textcat-1.0.1.jar -createfp myfile.txt russianUTF8
The source file was a text of about 1.8 MB. Given that the authors made the functionality for generating .lm files openly available, it is reasonable to expect that there are no prohibitions on generating and using our own .lm files.
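For readers unfamiliar with what such a generated .lm file represents: TextCat-style tools are based on n-gram frequency profiles (the Cavnar and Trenkle approach). The sketch below shows the general idea of building such a profile from a text. It is a simplified illustration, not the code that textcat runs for -createfp: the class FingerprintSketch, its method names, and the parameters maxN and limit are assumptions made for this example, and real implementations typically also split text into words and pad them before counting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FingerprintSketch {
    // Count all character n-grams of length 1..maxN in the text.
    public static Map<String, Integer> countNgrams(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return up to `limit` n-grams, most frequent first.
    // Ties are broken alphabetically so that the result is deterministic.
    public static List<String> fingerprint(String text, int maxN, int limit) {
        Map<String, Integer> counts = countNgrams(text, maxN);
        List<String> ngrams = new ArrayList<>(counts.keySet());
        ngrams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ngrams.subList(0, Math.min(limit, ngrams.size())));
    }

    public static void main(String[] args) {
        // Toy input; a real fingerprint would be built from a large sample text.
        System.out.println(fingerprint("abracadabra", 2, 5));
    }
}
```

A profile built this way depends only on the characters of the sample text, which is why a larger and more representative sample generally gives a more useful fingerprint.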
I found a bug and am analyzing it now. For the time being, there will probably be problems with language detection when using the textcat library configuration that I recommended. I am now working out how to correct this error.
I found a way to fix it. I will send another patch soon.
Example of correct recognition of languages.
I apologize for "my" previous commit: it could not give correct recognition of almost all languages. The error arose because the version of textcat that existed at that moment does not recognize files in different encodings very well.
This time I decided to analyze the problem seriously. In the end, I decided to use the textcat library sources from the textcat developer's CVS. I built the textcat sources after first adjusting the packages, to avoid having to change the existing OSS Server source, and reworked the textcat configuration file, again so as not to disrupt OSS Server.
I then ran additional tests for detecting different languages: German, French, Russian and English. The tests were successful; I attach a screenshot as proof that the languages are recognized correctly.
Unfortunately, I could not upload the files to this ticket: for some reason, sourceforge.net returns HTTP error 417 for files larger than 230 KB.
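As background for tests like these, the sketch below shows how an n-gram based detector can pick a language: it compares the ranked n-grams of a document with each language profile and chooses the profile with the smallest "out-of-place" distance. This assumes the Cavnar and Trenkle measure that TextCat-style tools are based on; it is not the textcat library's own code. The class OutOfPlaceSketch, its method names, and the tiny three-entry profiles are hypothetical toy data made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutOfPlaceSketch {
    // Sum of rank differences between the document profile and a language
    // profile. An n-gram missing from the language profile costs the
    // maximum penalty (the size of the language profile).
    public static int distance(List<String> docProfile, List<String> langProfile) {
        int maxPenalty = langProfile.size();
        int total = 0;
        for (int i = 0; i < docProfile.size(); i++) {
            int j = langProfile.indexOf(docProfile.get(i));
            total += (j < 0) ? maxPenalty : Math.abs(i - j);
        }
        return total;
    }

    // Return the name of the language profile closest to the document.
    public static String closest(List<String> docProfile,
                                 Map<String, List<String>> langProfiles) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : langProfiles.entrySet()) {
            int d = distance(docProfile, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles: three Cyrillic letters (as Unicode escapes) and
        // three Latin letters. Real profiles hold hundreds of n-grams.
        Map<String, List<String>> profiles = new LinkedHashMap<>();
        profiles.put("russian", Arrays.asList("\u043e", "\u0430", "\u0435"));
        profiles.put("hungarian", Arrays.asList("e", "a", "t"));
        List<String> doc = Arrays.asList("\u043e", "\u0435", "\u0430");
        System.out.println(closest(doc, profiles));
    }
}
```

With this measure, Cyrillic text that reaches the detector as wrongly decoded characters shares almost no n-grams with a Russian profile, so some unrelated profile may end up with the smallest distance.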
The result is a library; here is the link: http://code.google.com/p/ftspc/downloads/detail?name=textcat-1.0.1-russian-utf8.jar&can=2&q=
I am also posting the project that was used to build TextCat. I understand that it could have been an Ant-based project, but I made a project in NetBeans, my favorite development environment. Here is the link: http://code.google.com/p/ftspc/downloads/detail?name=textcat-1.0.1-russian-utf8.zip&can=2&q=
I am also posting the project that was used to validate the work of TextCat. Here is the link: http://code.google.com/p/ftspc/downloads/detail?name=TextCatTest.zip&can=2&q=