Summary:
Hunspell (using the example program) seems to mark
words that start with a capital letter and contain
non-ASCII characters as misspelled, and then lists the
same word as one of the suggestions. This seems to happen
only when both the dictionary and the input string are
UTF-8 encoded.
Steps to reproduce:
1. Using the example program and (for example) the
Estonian UTF-8 encoded affix and dict files
(mug.imo.ee/speller/et_EE.zip), check a correctly spelled
word list in which the words start with capital letters
(one is included in that package).
Expected results:
The words should pass the spell check.
Actual results:
Every word is marked as misspelled, and each suggestion
list contains the same word.
Notes/Workarounds:
Everything is fine when using Latin-1, for example. The
problem is that some Estonian letters don't map into that
space. :-) I haven't tried this with other languages, but
one of the posts here about Russian may relate to the
same problem.
Logged In: YES
user_id=726595
Hi,
Your aff file contains the 3-byte UTF-8 signature header
(the BOM) before the SET parameter in the first line.
Unfortunately, Hunspell doesn't support this header yet.
Remove it, or leave an empty line at the beginning of the
aff file.
Best regards,
Laci
Logged In: YES
user_id=1155705
Hi,
You're exactly right. This fixed the problem for me.
When I converted the dict file to UTF-8, Hunspell
complained about the missing word count (the first line)
until I stripped the UTF-8 header.
Btw, for Mac/TextWrangler users, this means setting the
encoding to (UTF-8, No BOM).
Many, many thanks!
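
For anyone without an editor that can save without a BOM, here is a
minimal sketch in Python of stripping the 3-byte UTF-8 header discussed
above. It is not part of Hunspell; the function name strip_utf8_bom is
an assumption made up for this example.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM (bytes EF BB BF) from the file at `path`.

    Returns True if a BOM was found and removed, False if the file
    did not start with one (in which case it is left untouched).
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file without the first three bytes.
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Usage would be something like strip_utf8_bom("et_EE.aff") and
strip_utf8_bom("et_EE.dic"), assuming those are the names of the files
in the package. Work on copies of the files, since the function
rewrites them in place.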