#87 Incorrect detection of the Russian language

v1.2
open
KhArtN
None
5
2012-09-13
2011-02-25
KhArtN
No

The Russian language is detected incorrectly: Russian text is identified as Hungarian.

Discussion

  • KhArtN
    2011-02-25

    I would like to take this ticket myself. Please do not start work on it; I will do the work.

     
  • KhArtN
    2011-02-25

    Attached to clarify the results of testing language detection on the text.

     
    Attachments
  • KhArtN
    2011-02-25

    The simplest solution

     
    Attachments
  • KhArtN
    2011-02-25

    The simplest solution: open textcat.jar/org/knallgrau/utils/textcat/textcat.conf
    and apply the patch textcat.conf.diff (attached to this ticket).

     
  • KhArtN
    2011-02-25

    Complex solution

     
  • KhArtN
    2011-02-25

    Slightly more complex solution:
    Unpack complex_solution.tar.bz2 (attached to this ticket).
    Open textcat.jar/org/knallgrau/utils/textcat/textcat.conf
    and apply the patch textcat.conf.diff from the unpacked complex_solution.tar.bz2. Then copy the file russian-UTF8.lm (also from complex_solution.tar.bz2) to textcat.jar/org/knallgrau/utils/textcat/language_fp/russian-UTF8.lm

    This will allow Russian text in UTF-8 encoding to be detected.
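
    The copy step can also be scripted. A jar is a zip archive, so the sketch below uses the JDK's zip file system provider to place russian-UTF8.lm inside textcat.jar. The class name AddLanguageProfile and the method addToJar are hypothetical, the file locations are assumed to be the current directory, and applying textcat.conf.diff is not covered here.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collections;

public class AddLanguageProfile {

    // Copies a local file into an existing jar at the given entry path
    // (entry path uses forward slashes and no leading slash).
    static void addToJar(Path jar, Path profile, String entryPath) throws IOException {
        URI uri = URI.create("jar:" + jar.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            Path target = fs.getPath("/" + entryPath);
            // Create the language_fp directory inside the jar if it is missing.
            if (target.getParent() != null) {
                Files.createDirectories(target.getParent());
            }
            Files.copy(profile, target, StandardCopyOption.REPLACE_EXISTING);
        } // closing the file system writes the changes back to the jar
    }

    public static void main(String[] args) throws IOException {
        // Assumed locations: both files sit in the current directory.
        addToJar(Paths.get("textcat.jar"),
                 Paths.get("russian-UTF8.lm"),
                 "org/knallgrau/utils/textcat/language_fp/russian-UTF8.lm");
    }
}
```

    Keep a backup of the original textcat.jar before running anything like this, since the archive is modified in place.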

     
  • Thank you for your patch.
    I have a question.
    Can we add only the "russian-UTF8.lm" line to textcat.conf?
    OpenSearchServer runs only in UTF-8.

     
  • KhArtN
    2011-02-27

    No, that will not work, because many Russian documents are encoded in cp1251 (windows-1251). As an example, I am attaching a Word document. When WordExtractor is initialized (the same applies to PDFBox), no encoding is specified, so the text may be returned in the encoding of the original document. To be sure, I ran a test: I kept only the russian-UTF8.lm line, and the text was again recognized as Hungarian. I strongly recommend specifying all 4 lines (koi8r, windows1251, iso8859_5 and utf-8); this avoids errors in detecting the language of a text.
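
    To illustrate why the encoding matters, the sketch below encodes one Russian word in the four encodings named in the comment above and prints the resulting bytes. The byte sequences differ completely, so a profile built from one encoding would not be expected to match text that arrives in another. The class RussianEncodingDemo and its methods are hypothetical, and whether TextCat actually receives such bytes depends on the extraction path, which this ticket does not show.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RussianEncodingDemo {

    // Java charset names for the four encodings listed in the comment.
    static final String[] CHARSETS = {"KOI8-R", "windows-1251", "ISO-8859-5", "UTF-8"};

    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Renders bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes windows-1251 bytes as if they were UTF-8, which is what
    // happens when the encoding of the source document is ignored.
    static String misreadAsUtf8(String text) {
        return new String(encode(text, "windows-1251"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u044F\u0437\u044B\u043A"; // the Russian word for "language"
        for (String name : CHARSETS) {
            System.out.println(name + ": " + hex(encode(word, name)));
        }
        System.out.println("misread equals original: " + misreadAsUtf8(word).equals(word));
    }
}
```

    The single-byte encodings produce four bytes for this word and UTF-8 produces eight, and the misread string no longer equals the original.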

     
  • We have to check that there is no license conflict with the "russian-UTF8.lm" file. What is the origin of this file? I saw that the most recent version of Textcat is in SpamAssassin.

     