Multi-Language Analysis Problem

Forum: Open Discussion

Creator: Anonymous

Created: 2016-12-27

Updated: 2016-12-28

Anonymous - 2016-12-27

Hello, I want to run a cluster analysis of documents on a corpus containing tags in several languages (es, pt, ru, ja, pl, en, de, it, fr, ar, ...). Preprocessing lets me select one language at a time, but not UTF-8 with all languages combined.
Do you have any idea how to do this analysis?
Thank you for your help.

HIGUCHI Koichi - 2016-12-28

Hello,

Currently, KH Coder assumes that a file contains text in a single language.

For example, if the file contains Japanese text, KH Coder uses ChaSen to extract words. If the file contains English text, KH Coder uses the Stanford POS Tagger by default. So, if a file contains text in multiple languages, KH Coder cannot extract words properly.

Also, the Windows version of R cannot handle multilingual characters. If we set R’s “locale” to “Japanese”, R will not be able to handle characters outside the Japanese character encoding CP932.

So there are at least these two problems in handling multilingual data. To bypass the word extraction problem, you can tag all words in the data file like this:

<word1> <word2> <word3>…
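
As a rough illustration of that tagging step, here is a minimal Python sketch. It assumes the words have already been split by some language-specific tokenizer of your choice; the function name `tag_tokens` and the sample tokens are hypothetical, and whether KH Coder accepts every character inside a tag is not confirmed in this thread.

```python
def tag_tokens(tokens):
    """Wrap each pre-tokenized word in angle brackets, separated by spaces."""
    tagged = []
    for token in tokens:
        # Drop whitespace and angle brackets so every tag stays well-formed.
        word = "".join(ch for ch in token if ch not in "<>" and not ch.isspace())
        if word:
            tagged.append(f"<{word}>")
    return " ".join(tagged)


# Hypothetical sample: tokens produced beforehand by per-language tokenizers.
tokens = ["análisis", "кластер", "分析", "cluster"]
print(tag_tokens(tokens))
```

Splitting on spaces alone would not work for languages such as Japanese, which is why the sketch takes a ready-made token list rather than raw text.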

To bypass the R issue, you can Unidecode (transliterate to plain ASCII) the data.
http://search.cpan.org/~sburke/Text-Unidecode-1.30/lib/Text/Unidecode.pm
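
Text::Unidecode is a Perl module. For readers without Perl, the Python sketch below shows a much narrower substitute using only the standard library: it strips diacritics via Unicode NFKD decomposition. This is not equivalent to Unidecode, since it does not transliterate non-Latin scripts such as Russian, Japanese, or Arabic, and letters without a decomposition (for example the Polish ł) are left unchanged. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after NFKD decomposition; keep everything else."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_plain_ascii(text):
    """Report whether the text contains only ASCII characters."""
    return all(ord(ch) < 128 for ch in text)


for sample in ["Zürich café", "łódź", "кластер"]:
    converted = strip_diacritics(sample)
    print(sample, "->", converted, "| ASCII only:", is_plain_ascii(converted))
```

The `is_plain_ascii` check is there to flag text that would still need a real transliteration tool before being passed to R on Windows.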

Or you can bypass the R issue by using the Linux or Mac version of R.

Anyway, it may be possible, but it is not easy.
