Preprocessing text

  • Nobody/Anonymous

    Dear Koichi,
    I am trying to use KH Coder because I think it is a very interesting
    program. I am trying to use it with texts in Spanish, but I have not been
    able to. What kind of preprocessing is necessary so that the program can
    read Spanish text?
    Best regards,


  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-05-31

    Dear Marcial:

    Thank you for your post!

    First, you have to configure KH Coder to process Spanish text. Go to "Project" -> "Settings" in the menubar. Select "Stemming with Snowball" and "Spanish" in the settings window.

    Second, you have to prepare a plain text file (*.txt) as the "target file" for KH Coder, not a Word (*.doc, *.docx) or PDF (*.pdf) file. The encoding of the text file should be something like "Latin-1," "ISO 8859-1," "US-ASCII," or "Plain ASCII." Also, be sure that the text file doesn't include tab characters or any control characters other than "line feed."
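A minimal sketch of that preparation step in Python, in case it helps. The function names `clean_text` and `write_latin1` are made up for illustration and are not part of KH Coder. It assumes the text is already loaded as a Python string, replaces tabs with spaces, and writes "?" for any character that Latin-1 cannot represent.

```python
import re

# Control characters other than line feed (\n): the C0 range plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x09\x0b-\x1f\x7f]")


def clean_text(text: str) -> str:
    """Normalize line endings, turn tabs into spaces, drop other control characters."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\t", " ")
    return _CONTROL_CHARS.sub("", text)


def write_latin1(text: str, path: str) -> None:
    """Write the cleaned text as ISO 8859-1 (Latin-1) with bare line feeds."""
    # errors="replace" writes "?" for characters that Latin-1 cannot encode.
    with open(path, "w", encoding="latin-1", errors="replace", newline="\n") as f:
        f.write(clean_text(text))
```

For example, `write_latin1(raw_text, "target.txt")` would produce a file to register as the target file; the file name here is only a placeholder.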

    Best regards,

    If it still doesn't work, could you tell me the error message? Thanks.

    Last edit: HIGUCHI Koichi 2012-11-11
  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-06-01

    Also, the text file should not contain angle brackets: "<" or ">".

    (Angle brackets are for tagging only)
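One way to check a text for stray angle brackets before loading it is sketched below. The helper name `find_angle_brackets` is hypothetical, and the sketch only reports the offending lines; what to replace the brackets with is left to you.

```python
from typing import List, Tuple


def find_angle_brackets(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every line that contains '<' or '>'."""
    return [
        (number, line)
        for number, line in enumerate(text.split("\n"), start=1)
        if "<" in line or ">" in line
    ]
```

An empty result means the text contains no angle brackets.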

  • Nobody/Anonymous

    Thanks a lot for your answer.
    I have managed to preprocess my text, and I have a frequency list of words,
    but I am not able to produce concordances, and when I try to create
    co-occurrence networks, the following errors appear:

    Can't use string ("0") as a HASH ref while "strict refs" in use at
    /<C:\\khcoder\\kh_coder.exe>plotR/ line 263.

    plotR::network::new at /<C:\\khcoder\\kh_coder.exe>plotR/ line 263
    gui_window::word_netgraph::calc at
    /<C:\\khcoder\\kh_coder.exe>gui_window/ line 340
    gui_window::word_netgraph::ANON at
    /<C:\\khcoder\\kh_coder.exe>gui_window/ line 139
    Tk callback for .toplevel4.button1
    Tk::ANON at /<C:\\khcoder\\kh_coder.exe> line 250
    Tk::Button::butUp at /<C:\\khcoder\\kh_coder.exe>Tk/ line 175
    (command bound to event)

    Best regards,


  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-06-03

    Hello. Thank you for telling me the situation!

    About concordances, how about this procedure?

    1. Go to "Tools" -> "Words" -> "Frequency List" in the menubar.
    2. Click "OK"

    The frequency list should be displayed in an Excel window. Now try this:

    1. Copy a word from the frequency list (Excel). To do this, select a word in the frequency list displayed in the Excel window, then hit Ctrl+C.
    2. Go to "Tools" -> "Words" -> "KWIC Concordances" in the menubar of KH Coder.
    3. Click the "Word" entry, that is, the white entry field beside "Word:".
    4. Hit Ctrl+V to paste the word into the entry.
    5. Click "Search."

    Now you should be able to see concordance lines, I think. If not, let me know.

    About co-occurrence networks, could you tell me the first error message? "Can't use string..." appears in the second or third pop-up, doesn't it?

    Thank you.

    Last edit: HIGUCHI Koichi 2012-11-11
  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-06-03

    About co-occurrence networks, you may also try this:

    1. Go to "Tools" -> "Words" -> "Co-Occurrence Network" in the menubar.
    2. Select "Sentences" as the "Unit."
    3. Click the "OK" button.

    I am not sure, but this may help when there is only one paragraph in the
    target text file, because the default unit setting is "Paragraphs" and it
    needs multiple paragraphs to calculate co-occurrence.
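To illustrate why a single unit is not enough, here is a small sketch that computes a Jaccard coefficient for each word pair. Using Jaccard here is an assumption made for the example, and `jaccard_by_pair` is a hypothetical helper, not KH Coder's own code. Each unit is given as a list of words.

```python
from itertools import combinations


def jaccard_by_pair(units):
    """Jaccard coefficient for each word pair, given units as lists of words."""
    sets = [set(unit) for unit in units]
    vocab = sorted(set().union(*sets)) if sets else []
    result = {}
    for a, b in combinations(vocab, 2):
        both = sum(1 for s in sets if a in s and b in s)
        either = sum(1 for s in sets if a in s or b in s)
        result[(a, b)] = both / either  # either >= 1 because a occurs somewhere
    return result


# One unit only: every pair co-occurs in the same single unit.
one_unit = [["casa", "perro", "gato"]]

# Several units: pairs now differ in how often they appear together.
several_units = [["casa", "perro"], ["casa", "gato"], ["perro", "gato", "casa"]]
```

With `one_unit`, every pair gets the same coefficient of 1.0, so there is nothing to distinguish strong links from weak ones. With `several_units`, the coefficients vary between pairs, which is what a network needs.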

    Last edit: HIGUCHI Koichi 2012-11-11
  • Nobody/Anonymous

    I have tried your method for KWIC concordances and it is not working.
    For the co-occurrence networks, the error message "Can't use string..." is
    the third or fourth message.
    Before that, the messages are:

    simpleError in source(PERLINPUTFILE): embedded nul in string:"\0b\0a\0r"
    simpleError in eval.with.vis(expr, envir, enclos): could not find function



  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-06-04

    Hello. Thank you for your post!

    It seems that the text file still contains some control characters and that
    pre-processing is not quite complete.

    Could you send me the text file via e-mail so that I can check whether the
    problem is in the data or not?

    I will use the data only for debugging KH Coder. And once it's done, I will
    delete it right away. So please consider the option.

    Best regards.

  • HIGUCHI Koichi

    HIGUCHI Koichi - 2012-06-04

    If it is difficult to send me the data, you can try changing the character
    encoding of the text file. "Plain ASCII" or "US-ASCII" may help.
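If you would rather do this with a script, the sketch below is one possibility. It assumes that NUL bytes in the file mean it was saved as UTF-16, which would be consistent with an error such as "embedded nul in string" but is only a guess. The function name `reencode_if_utf16` is hypothetical, and the byte-order check without a BOM is a rough heuristic that only suits text starting with an ASCII character.

```python
def reencode_if_utf16(raw: bytes) -> bytes:
    """Return Latin-1 bytes, decoding as UTF-16 first when NUL bytes are present."""
    if b"\x00" not in raw:
        return raw  # no NUL bytes: leave the data untouched
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")  # the BOM tells the codec the byte order
    elif raw[0:1] == b"\x00":
        text = raw.decode("utf-16-be")  # guess: NUL comes first, so big-endian
    else:
        text = raw.decode("utf-16-le")  # guess: NUL comes second, so little-endian
    # Characters that Latin-1 cannot represent are written as "?".
    return text.encode("latin-1", errors="replace")
```

Read the file in binary mode (`open(path, "rb")`), pass the bytes through this function, and write the result back in binary mode (`"wb"`).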

    You can also clean up your text using the CLEAN function of Excel.

    After you make changes to the text file, you have to run "pre-processing"
    again.

    Then check the number of paragraphs displayed in the "Database Stats"
    section of the KH Coder window. This number should be consistent with the
    number of rows in Excel.
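To get a number to compare against, a rough count can be made outside KH Coder. The sketch below assumes that each non-empty line of the text file counts as one paragraph; that is an assumption about how the file is laid out, and `count_paragraphs` is a hypothetical helper.

```python
def count_paragraphs(text: str) -> int:
    """Count non-empty lines, treating each one as a paragraph."""
    return sum(1 for line in text.split("\n") if line.strip())
```

If this count is far from the figure shown in "Database Stats," stray line breaks or control characters in the file are a likely place to look.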

    Last edit: HIGUCHI Koichi 2012-11-11

