#87 Russian language is detected incorrectly.

Milestone: v1.2
Status: open
Owner: KhArtN
Labels: None
Priority: 5
Updated: 2012-09-13
Created: 2011-02-25
Creator: KhArtN
Private: No

Russian text is detected incorrectly: it is identified as Hungarian.

Discussion

  • KhArtN

    KhArtN - 2011-02-25

    Please assign this ticket to me. Please do not start work on it; I will do the work myself.

     
  • KhArtN

    KhArtN - 2011-02-25

    Refined results of testing language detection on the text.

     
  • KhArtN

    KhArtN - 2011-02-25

    The simplest solution

     
  • KhArtN

    KhArtN - 2011-02-25

    The simplest solution: open textcat.jar\org\knallgrau\utils\textcat\textcat.conf
    and apply the patch textcat.conf.diff (included in the files attached to this ticket).

     
  • KhArtN

    KhArtN - 2011-02-25

    Complex solution

     
  • KhArtN

    KhArtN - 2011-02-25

    Slightly more complex solution:
    Unzip complex_solution.tar.bz2 (from the files attached to this ticket).
    Open textcat.jar\org\knallgrau\utils\textcat\textcat.conf
    and apply the patch textcat.conf.diff from the unpacked complex_solution.tar.bz2.
    Then copy the file russian-UTF8.lm (also from complex_solution.tar.bz2) to textcat.jar\org\knallgrau\utils\textcat\language_fp\russian-UTF8.lm

    This allows Russian text in UTF-8 encoding to be analyzed.

     
  • Emmanuel Keller

    Emmanuel Keller - 2011-02-26

    Thank you for your patch.
    I have a question.
    Can we add only the "russian-UTF8.lm" line to textcat.conf?
    OpenSearchServer runs only in UTF8.

     
  • KhArtN

    KhArtN - 2011-02-27

    No, you cannot do that, because many Russian documents are in cp1251 (windows-1251). As an example, I am attaching a Word document. When WordExtractor is initialized (the same applies to PDFBox), no encoding is specified, so the text may be returned in the encoding of the original document. To be sure, I ran a test with only the russianUTF8.lm line: the text was again recognized as Hungarian. I strongly recommend specifying all 4 lines (koi8r, windows1251, iso8859_5 and utf-8); this avoids mistakes when detecting the language of a text.
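The point about encodings can be illustrated with a small sketch. It assumes nothing about textcat itself: the class `RussianEncodings` and the method `hexByCharset` are hypothetical names used only for this illustration. It shows that the same Russian word becomes a different byte sequence in each of the four encodings, which is presumably why a fingerprint built from bytes in one encoding would not match text delivered in another.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: not part of textcat or OpenSearchServer.
public class RussianEncodings {
    // Java charset names for the four encodings mentioned in the comment above.
    // Charset.forName throws UnsupportedCharsetException if the JDK lacks one.
    public static final String[] CHARSETS = {"UTF-8", "windows-1251", "KOI8-R", "ISO-8859-5"};

    /** Encodes the text in each charset and returns the bytes as upper-case hex. */
    public static Map<String, String> hexByCharset(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : CHARSETS) {
            byte[] bytes = text.getBytes(Charset.forName(name));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            result.put(name, hex.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // The Russian word for "yes", written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String text = "\u0434\u0430";
        hexByCharset(text).forEach((cs, hex) -> System.out.println(cs + ": " + hex));
    }
}
```

For this two-letter word the sketch prints `D0B4D0B0` for UTF-8, `E4E0` for windows-1251, `C4C1` for KOI8-R and `D4D0` for ISO-8859-5: four byte sequences with nothing in common.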

     
  • Emmanuel Keller

    Emmanuel Keller - 2011-02-28

    We have to check that there is no license conflict with the "russian-UTF8.lm" file. What is the origin of this file? I saw that the most recent version of Textcat is in SpamAssassin.

     
  • KhArtN

    KhArtN - 2011-02-28

    I am not sure I understand your question... What is the connection with SpamAssassin? :) TextCat does not include a russianUTF8.lm file; I generated it myself with
    java -jar textcat-1.0.1.jar -createfp myfile.txt russianUTF8
    The source file was a text of about 1.8 MB. Given that the authors made the .lm generation functionality openly available, it is reasonable to expect that there are no restrictions on generating and using our own .lm files.
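For readers unfamiliar with what a fingerprint file holds, here is a rough sketch of the general idea: a ranked list of the most frequent character n-grams in a sample text. This is not textcat's implementation and does not reproduce the .lm file format. The class `NgramProfile`, the method `build`, the underscore padding and the alphabetical tie-break are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified n-gram frequency profile, not textcat's code.
public class NgramProfile {
    /**
     * Counts every character n-gram of length 1..maxN in the text and returns
     * the 'limit' most frequent ones, most frequent first.
     */
    public static List<String> build(String text, int maxN, int limit) {
        // Assumption: runs of whitespace become a single '_' and the text is
        // padded with '_' on both sides so that word boundaries are visible.
        String padded = "_" + text.trim().replaceAll("\\s+", "_") + "_";

        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }

        List<String> ranked = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically so that the
        // result is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        System.out.println(build("aaa b", 2, 3));
    }
}
```

A profile built this way from a large sample, such as the 1.8 MB text mentioned above, reflects the letter statistics of the language rather than the content of any particular document, which is why generating one from your own text is straightforward.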

     
  • KhArtN

    KhArtN - 2011-02-28

    I found a bug and am analyzing it now. For the moment, there will probably be problems with language detection when using the textcat library configuration that I recommended. I am working out how to fix this error.

     
  • KhArtN

    KhArtN - 2011-02-28

    I found a way to fix it. I will send another patch soon.

     
  • KhArtN

    KhArtN - 2011-02-28

    Example of correct recognition of languages.

     
  • KhArtN

    KhArtN - 2011-02-28

    I apologize for my previous commit: it prevented correct recognition of almost all languages. The error arose because the version of textcat that existed at the time does not recognize files in different encodings very well.

    I have now analyzed the problem seriously. In the end, I decided to use the textcat library sources from the textcat developer's CVS. I built the textcat sources after first adjusting the packages, so that the existing OSS Server source would not have to change, and I reworked the textcat configuration file, again so as not to disrupt OSS Server.

    I then ran additional tests for detection of several languages: German, French, Russian and English. The tests were successful; I attach a screenshot as evidence that the languages are recognized correctly.

    Unfortunately, I could not upload the files to this ticket: for files larger than 230 KB, sourceforge.net returns HTTP error 417.

    The resulting library is here: http://code.google.com/p/ftspc/downloads/detail?name=textcat-1.0.1-russian-utf8.jar&can=2&q=

    I am also posting the project that was used to build TextCat. I understand that it could have been an Ant-based project, but I made it in NetBeans, my favorite development environment. Here is the link: http://code.google.com/p/ftspc/downloads/detail?name=textcat-1.0.1-russian-utf8.zip&can=2&q=

    I am also posting the project that was used to validate TextCat. Here is the link: http://code.google.com/p/ftspc/downloads/detail?name=TextCatTest.zip&can=2&q=

     
