
Multi-Language Analysis Problem

Anonymous
2016-12-27
2016-12-28
  • Anonymous

    Anonymous - 2016-12-27

    Hello, I want to run a document clustering analysis on a corpus containing tags in several languages (es, pt, ru, ja, pl, en, de, it, fr, ar, ..). The preprocessing settings only let me choose one language at a time, not UTF-8 with all languages combined.
    Do you have any idea how to do this analysis?
    Thank you for your help

     
  • HIGUCHI Koichi

    HIGUCHI Koichi - 2016-12-28

    Hello,

    Currently, KH Coder assumes that a file contains texts in one language.

    For example, if the file contains Japanese text, KH Coder uses ChaSen to extract words. If the file contains English text, KH Coder uses Stanford POS Tagger by default. So if a file contains text in multiple languages, KH Coder cannot extract words properly.

    Also, the Windows version of R cannot handle multilingual characters. If we set R’s “locale” to “Japanese”, R will not be able to handle characters outside the Japanese character encoding CP932.
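To get a feel for how much of a corpus falls outside CP932, something like the following Python sketch could help. It relies on Python's own `cp932` codec, which is only an approximation of what R does on Windows, and the function name `chars_outside_encoding` is a hypothetical helper, not part of KH Coder or R.

```python
from typing import List


def chars_outside_encoding(text: str, encoding: str = "cp932") -> List[str]:
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    missing: List[str] = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Keep first-seen order and skip repeats.
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # Sample strings are illustrative only: Japanese, then Arabic.
    for sample in ["日本語のテキスト", "مرحبا"]:
        print(repr(sample), chars_outside_encoding(sample))
```

An empty list means every character in the sample fits in CP932; anything listed would need to be transliterated or processed on another platform.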

    So there are at least the two problems above in handling multi-language data. To bypass the word extraction problem, you can tag every word in the data file like this:

    <word1> <word2> <word3>…
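A minimal Python sketch of that tagging step is shown here. It assumes the words are already separated by whitespace, which is not true for Japanese text, so a segmenter would be needed there first. The helper name `tag_words` is hypothetical.

```python
def tag_words(line: str) -> str:
    """Wrap each whitespace-separated token of `line` in angle brackets."""
    tagged = []
    for token in line.split():
        # Drop any angle brackets already in the token so the tags stay well formed.
        cleaned = token.replace("<", "").replace(">", "")
        if cleaned:
            tagged.append(f"<{cleaned}>")
    return " ".join(tagged)


if __name__ == "__main__":
    print(tag_words("hola mundo"))
```

Applying this line by line to the data file would produce text in the `<word1> <word2> <word3>…` form above. Punctuation attached to a token is kept inside the tag, so it may be worth stripping it beforehand.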

    To bypass the R issue, you can transliterate the data to plain ASCII with Unidecode:
    http://search.cpan.org/~sburke/Text-Unidecode-1.30/lib/Text/Unidecode.pm
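Text::Unidecode is a Perl module. As a rough illustration of the idea only, the Python sketch below uses the standard library to fold accented Latin letters to ASCII. It is a much weaker stand-in, not an equivalent: characters with no ASCII decomposition (Cyrillic, Japanese, Arabic, or letters such as "Ł") are left unchanged, so a real transliteration library would still be needed for those. The function name `ascii_fold` is hypothetical.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Replace characters by their ASCII base letters where Unicode defines one."""
    out = []
    # Compose first so that "e" + combining accent is treated as one character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Use the folded form only when it is non-empty pure ASCII;
        # otherwise keep the original character untouched.
        out.append(base if base and base.isascii() else ch)
    return "".join(out)


if __name__ == "__main__":
    for sample in ["café señor", "Łódź", "Привет"]:
        print(sample, "->", ascii_fold(sample))
```

Whatever the output still contains outside ASCII shows how much of the corpus this limited approach cannot cover.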

    Alternatively, you can avoid the R issue by using the Linux or Mac version of R.

    In any case, it should be possible, but it will not be easy.

     
