#232 wordforms reports UTF-8 encoding errors for certain words

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2013-03-04

Created: 2013-03-04

Creator: ebukva

Private: No

I am using the wordforms tool to test a dictionary in development.

When I test a word that contains the letter "č" followed by any two ASCII letters, wordforms reports multiple errors:
"UTF-8 encoding error. Missing continuation byte in 0. character position:"

For example:
`wordforms -s test.aff test.dic radnički`

produces:
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e

The contents of the test.dic file are:
1
radnički

The contents of the test.aff file are:
SET UTF-8
SFX G Y 5
SFX G ti m ti
SFX G ti š ti
SFX G ti mo ti
SFX G ti te ti
SFX G iti e iti

If I try `wordforms -s test.aff test.dic radničk` (without the final "i"), it produces even more of the same errors. However, if I remove or add one more letter, it works as expected: both `wordforms -s test.aff test.dic radnič` and `wordforms -s test.aff test.dic radničkim` run without errors.

It actually does not matter whether the word being tested is in the .dic file or not. The errors appear for any input word of this shape. What does matter is the content of the .aff file: this particular combination of affix rules, and a few others, provokes the error in wordforms, while other affix flags do not.
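The length dependence described above can be explored with a small Python sketch. It tests one hypothesis only: that the suffix rules are applied to the word by removing as many bytes as the strip string has, without checking the rule's condition and without respecting UTF-8 character boundaries. This is an assumption made for illustration, not a description of how wordforms is actually implemented; the helper names `strip_suffix_bytes`, `is_valid_utf8`, and `broken_forms` are hypothetical. The (strip, add) pairs are taken from the SFX G rules in test.aff.

```python
# Hypothetical model, NOT the actual wordforms code: apply each suffix rule
# byte-wise and see which results are no longer valid UTF-8.

# (strip, add) pairs from the SFX G rules in test.aff; "\u0161" is "š".
RULES = [("ti", "m"), ("ti", "\u0161"), ("ti", "mo"), ("ti", "te"), ("iti", "e")]


def strip_suffix_bytes(word: str, strip: str, add: str) -> bytes:
    """Drop len(strip) bytes from the UTF-8 form of word, then append add."""
    raw = word.encode("utf-8")
    n = len(strip.encode("utf-8"))
    return raw[: len(raw) - n] + add.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def broken_forms(word: str) -> list:
    """Return the byte strings that are invalid UTF-8 after rule application."""
    forms = [strip_suffix_bytes(word, s, a) for s, a in RULES]
    return [f for f in forms if not is_valid_utf8(f)]


if __name__ == "__main__":
    # "\u010d" is "č", which is two bytes (0xC4 0x8D) in UTF-8.
    for word in ["radni\u010dki", "radni\u010dk", "radni\u010d", "radni\u010dkim"]:
        print(word, len(broken_forms(word)), "invalid form(s)")
```

Under this assumption the sketch yields invalid sequences for "radnički" and more of them for "radničk", and none for "radnič" or "radničkim", because only in the first two cases does a cut land between the two bytes of "č". That pattern agrees with the behaviour reported here, but it does not prove that this is the cause.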

I’m running hunspell 1.3.2 from the command line. The issue occurs only with the `wordforms` tool; `hunspell` itself checks the same .dic and .aff combination correctly.

Discussion

ebukva - 2013-03-04

Forgot to add: both files are UTF-8 files.

Németh László - 2013-03-04

Please check the encoding of your files, especially the s with háček. It seems to me that this is a conversion error caused by the character encoding differences between ISO-8859-2 and Windows-1250.

ebukva - 2013-03-05

I was originally working on an ISO-8859-2 dictionary when I encountered the issue. But the example files pasted above, in the details, I created from scratch and typed all contents by hand using Sublime Edit set to UTF-8 for both. I’m using Mac OS 10.8.2. The Terminal.app’s character encoding is also set to UTF-8.

running `file -I test.aff` results in:
test.aff: text/plain; charset=utf-8

running `file -I test.dic` results in:
test.dic: text/plain; charset=utf-8

Is this issue reproducible?
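Since `file -I` guesses the charset heuristically, a strict decode is a stronger way to rule out the ISO-8859-2 / Windows-1250 mix-up suggested above. The following is a minimal Python sketch, with hypothetical helper names `first_invalid_utf8_offset` and `check_file`; the sample line is the "š" rule from test.aff encoded three ways.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and report where strict UTF-8 decoding fails."""
    with open(path, "rb") as fh:
        return first_invalid_utf8_offset(fh.read())


if __name__ == "__main__":
    line = "SFX G ti \u0161 ti"  # "\u0161" is "š"
    for encoding in ("utf-8", "iso-8859-2", "cp1250"):
        raw = line.encode(encoding)
        # In ISO-8859-2 "š" is byte 0xB9, in Windows-1250 it is 0x9A;
        # neither is a valid start byte in UTF-8.
        print(encoding, raw, first_invalid_utf8_offset(raw))
```

Running `check_file("test.aff")` and `check_file("test.dic")` and getting `None` for both would confirm that the files themselves contain no invalid byte sequences.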
