Hunspell / Bugs (archive) / #11 False negatives with leading capital letter and UTF-8

#11 False negatives with leading capital letter and UTF-8

Status: closed

Owner: Németh László

Labels: None

Priority: 5

Updated: 2006-02-16

Created: 2006-02-16

Creator: filippl

Private: No

Summary:
Hunspell (using the example program) seems to mark
words that start with a capital letter and contain
non-ASCII characters as mis-spelled, and then lists the
same word as one of the suggestions. This seems to happen
only when using a UTF-8 encoded dictionary and string.

Steps to reproduce:
1. Using the example program and (for example) the
Estonian UTF-8 encoded affix and dict files
(mug.imo.ee/speller/et_EE.zip), check a properly spelled
wordlist in which the words start with capital letters
(one is included in that package).

Expected results:
The words should pass the spell check.

Actual results:
Every word is marked as mis-spelled and the suggestion
lists contain the same words.

Notes/Workarounds:
Everything is fine when using Latin-1, for example. The
problem is that some Estonian letters cannot be represented
in that encoding. :-) I haven't tried this with other
languages, but one of the posts here about Russian may be
related to the same problem.

Discussion

Németh László - 2006-02-16

Logged In: YES
user_id=726595

Hi,

Your aff file contains the 3-byte UTF-8 signature header (the
byte order mark, BOM) before the SET parameter in the first line.

Unfortunately, Hunspell does not support this header yet.
Remove it, or leave an empty line at the beginning of the
aff file.

Best regards,

Laci

Németh László - 2006-02-16

assigned_to: nobody --> nemethl

status: open --> closed

filippl - 2006-02-16

Logged In: YES
user_id=1155705

Hi,

You're exactly right. This fixed the problem for me.
When I converted the dict file to UTF-8, Hunspell would
complain about the missing word count (the first line),
until I stripped the UTF-8 header.

Btw, for Mac/TextWrangler users, this means setting the
encoding to (UTF-8, No BOM).
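
For setups without such an editor option, the same fix can be
sketched in a few lines of Python. This is only an illustration
of stripping the 3-byte UTF-8 header (bytes EF BB BF) from a
file, not part of Hunspell; the function name strip_utf8_bom and
the file names in the usage note are assumptions made for this
example.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM (EF BB BF) from a file, if present.

    Hypothetical helper for illustration only.
    Returns True if a BOM was found and removed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the first three bytes.
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Assuming the package unpacks to files named et_EE.aff and
et_EE.dic, calling strip_utf8_bom("et_EE.aff") and
strip_utf8_bom("et_EE.dic") would leave the SET line and the
word count as the first bytes of their respective files.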

Many, many thanks!
