I wish that KH Coder were multi-threaded. I am currently analyzing a corpus containing 40K sentences, 10K paragraphs and 106 headings. Depending on what I am doing, execution can take up to 20 minutes. For instance, if I build a word co-occurrence network with TAGs only (resulting in 40 selected words), filter edges set to top 100, and the edge type set to words-headings, it generally takes up to 30 minutes to complete. I am sure that you are familiar with the issues. On Windows the UI thread locks. It would also be nice if some of the computation ran in threads other than the main execution thread. While I don't mind getting a coffee while it is executing, my sense is that if I had 10K headings I would wait for most of the day. I have learned to avoid feature configurations that produce large matrices, but with a much larger corpus I will undoubtedly find myself doing just this out of necessity.
So, my question. As far as the UI and background threads go: do you plan on using multi-threading in KH Coder in the future? Even better, it would be great if I could distribute execution across machines. Of course, this would require a number of fundamental changes to the architecture of KH Coder. I guess that the broader question is: what future do you envision for KH Coder? It is a great tool. Have you considered making the source code open? Perhaps others could help you with issues such as multi-threading.
I know that you are a scientist and your focus is conducting research and not producing a highly scalable commercial product. But it would really be good to hear that you had big plans for KH Coder! Of course, it is such a good tool that I will use it in any form!
BTW, I don't want you to think that all execution takes a long time. For instance, in the example above, when I specify the edge type as word-word, execution time is acceptable.
Thank you for taking the time to post your valuable feedback. I really appreciate it.
I am well aware of the issue. But unfortunately, I am not ready to implement parallel processing right now, and optimization of processing speed is currently not a priority in the development of KH Coder. I am sorry for the inconvenience it may cause. The good news is that KH Coder is already open source software. It is written in Perl, and you can download the source code (*-strb.zip or *.tar.gz) and edit it.
Anyway, if you have a really massive amount of data, I recommend performing random sampling to reduce the file size. If we perform random sampling correctly, we need only 2000 respondents to know Obama’s approval rating in the whole USA. The sampling error in this case will be only about 2 percentage points at the usual 95% confidence level. If we use statistics properly, we rarely need to put the whole "big data" into the analysis.
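To make the sampling advice concrete, here is a minimal Python sketch of both steps: estimating the sampling error for a given sample size, and drawing a random subset of a corpus before analysis. It is only an illustration and not part of KH Coder (which is written in Perl). The function names `margin_of_error` and `sample_units` are hypothetical, and the 95% confidence level (z = 1.96) and worst-case proportion p = 0.5 are assumptions, since the post does not state them.

```python
import math
import random


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion under simple random sampling.

    Assumptions (not stated in the thread): p=0.5 is the worst case,
    and z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def sample_units(units, k, seed=None):
    """Draw k units without replacement, keeping their original order.

    `units` could be paragraphs or whole documents; if k exceeds the
    number of units, all of them are returned.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = min(k, len(units))
    chosen = sorted(rng.sample(range(len(units)), k))
    return [units[i] for i in chosen]


if __name__ == "__main__":
    # Sampling error for 2000 respondents under the assumptions above.
    print(f"n=2000: +/- {margin_of_error(2000):.2%}")

    # Stand-in corpus of 10K paragraphs; replace with real text units.
    paragraphs = [f"paragraph {i}" for i in range(10_000)]
    subset = sample_units(paragraphs, 2000, seed=42)
    print(len(subset), "paragraphs kept")
```

Sampling whole paragraphs (or whole documents) rather than single sentences is probably the safer choice here, because co-occurrence analysis depends on the words that appear together within a unit.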
About plans, well, I think of many things. For example, there should be a detailed manual in English. And it would be nice to be able to analyze Arabic, Chinese, and Russian language data. Some advanced linguistic/statistical methods like negation detection or topic models could be implemented. Also, I would like to write a book that includes usage and application examples of KH Coder. Et cetera, et cetera. But please note that it would take a really long time to make these things happen. As you wrote, I am kind of a Sunday programmer :)
P.S. Computing a co-occurrence network of “words – variables / headings” takes much longer than “words – words”? I wasn’t aware of that, and I am going to look into it.