Hi guys. Can hunspell do decompounding? I mean splitting "helloworld" into "hello world". That example works, but a longer one, "sequoyahcountysheriff", doesn't. I'm normalizing corpora for language modeling, and they contain lots of URLs (~20%). I need to fix all that missing-space stuff. Any ideas?
Thanks in advance
Just wondering if you have found an answer to your question somewhere else, because I have the same problem.
Yep. I found a Stack Overflow thread on that: http://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words
The Viterbi algorithm works great for this task and completely solved my problem. I downloaded the Google unigrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) and used them as the data set for my decompounder.
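For anyone who wants to see the idea, here is a rough sketch of Viterbi-style splitting over unigram probabilities. It is not my actual decompounder: the COUNTS table is a made-up toy stand-in for the Google unigram counts, and the function names and the length-based penalty for unknown strings are just assumptions for illustration.

```python
import math

# Toy unigram counts, invented for illustration only.
# A real decompounder would load counts from a unigram data set instead.
COUNTS = {
    "hello": 500, "world": 800, "sequoyah": 5, "county": 300,
    "sheriff": 100, "he": 900, "she": 700, "riff": 3,
    "count": 400, "a": 2000, "i": 1500,
}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def word_logprob(word):
    """Log-probability of a single word under the unigram model."""
    count = COUNTS.get(word)
    if count is None:
        # Unknown string: assumed penalty that grows with its length.
        return -math.log(TOTAL) - len(word) * math.log(10)
    return math.log(count / TOTAL)


def segment(text):
    """Split text into the most probable word sequence."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)             # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(" ".join(segment("helloworld")))
print(" ".join(segment("sequoyahcountysheriff")))
```

With the toy counts above, this splits both examples from the first post at the word boundaries. With real unigram counts the quality depends mostly on how you score unknown strings, so that penalty is the part to tune.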
If you only have a small amount of data to split, you can try this: http://code.google.com/p/google-api-spelling-java/ It takes around 1 sec to check and fix one word, and I suppose that if you make too many requests you'll get banned, but I haven't checked that.
Looks promising, thank you…