#11 False negatives with leading capital letter and UTF-8

closed
None
5
2006-02-16
2006-02-16
filippl
No

Summary:
Hunspell (using the example program) seems to mark
words that start with a capital letter and contain
non-ASCII characters as misspelled, and then lists the
same word as one of the suggestions. This seems to happen
only when using a UTF-8 encoded dictionary and input string.

Steps to reproduce:
1. Using the example program and, for example, the
Estonian UTF-8 encoded affix and dictionary files
(mug.imo.ee/speller/et_EE.zip), check a correctly spelled
word list in which the words start with capital letters
(included in that package).

Expected results:
The words should pass the spell check.

Actual results:
Every word is marked as misspelled, and each suggestion
list contains the very word that was checked.

Notes/Workarounds:
Everything works when using Latin-1, for example. The
problem is that some Estonian letters cannot be represented
in Latin-1. :-) I haven't tried this with other
languages, but one of the posts here about Russian
may relate to the same problem.

Discussion

  • Logged In: YES
    user_id=726595

    Hi,

    Your aff file contains the 3-byte UTF-8 signature header (the
    byte order mark) before the SET parameter on the first line.

    Unfortunately, Hunspell does not support this header yet.
    Remove it, or leave an empty line at the beginning of the
    aff file.

    Best regards,

    Laci

     
    • assigned_to: nobody --> nemethl
    • status: open --> closed
     
  • filippl
    2006-02-16

    Logged In: YES
    user_id=1155705

    Hi,

    You're exactly right. This fixed the problem for me.
    When I converted the dict file to UTF-8, Hunspell would
    complain about the missing word count (the first line),
    until I stripped the UTF-8 header.

    Btw, for Mac/TextWrangler users, this means setting the
    encoding to "UTF-8, No BOM".

    Many, many thanks!
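
The workaround discussed above (removing the 3-byte UTF-8 header from the aff and dict files) can also be done with a small script instead of a text editor. The following is only a minimal sketch, not part of Hunspell: the helper name `strip_utf8_bom` and the file names in the usage comment are assumptions made for illustration.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 byte order mark (bytes EF BB BF), if present.

    Hypothetical helper, not a Hunspell API. Rewrites the file in place and
    returns True if a BOM was removed, False if the file had none.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do; file is left untouched
    p.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Example usage (file names are assumed, adjust to your dictionary):
# for name in ("et_EE.aff", "et_EE.dic"):
#     print(name, "BOM removed" if strip_utf8_bom(name) else "no BOM found")
```

Working on raw bytes avoids re-encoding the rest of the file, so only the three header bytes change. Keeping a backup copy of the dictionary files before running it is advisable, since the file is overwritten in place.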