
#232 wordforms reports UTF-8 encoding errors for certain words

open
nobody
None
5
2013-03-04
2013-03-04
ebukva
No

I am using the wordforms tool to test a dictionary in development.

When I test a word that contains the letter "č" followed by any two ASCII letters, wordforms reports multiple errors:
"UTF-8 encoding error. Missing continuation byte in 0. character position:"

for example:
`wordforms -s test.aff test.dic radnički`

produces:
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e
UTF-8 encoding error. Missing continuation byte in 0. character position:
?e
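For readers unfamiliar with the message: in UTF-8, "č" (U+010D) is encoded as two bytes, a lead byte 0xC4 followed by a continuation byte 0x8D. A decoder complains about a continuation byte when a lead byte is followed by something else, such as an ASCII letter. The Python sketch below only illustrates that class of error. Whether wordforms actually splits the word byte-wise in this way is an assumption, not something confirmed in this ticket, and the helper name `decode_error_reason` is made up for the illustration.

```python
# Illustration only: a split inside a multi-byte UTF-8 character leaves a
# lead byte without its continuation byte. No Hunspell code is involved.
from typing import Optional

encoded = "č".encode("utf-8")
print(encoded)  # b'\xc4\x8d'


def decode_error_reason(data: bytes) -> Optional[str]:
    """Return None if data is valid UTF-8, else the decoder's reason string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# Keep only the lead byte of "č" and follow it with an ASCII letter.
lead_byte_then_ascii = encoded[:1] + b"e"

print(decode_error_reason(encoded))               # valid sequence
print(decode_error_reason(lead_byte_then_ascii))  # broken sequence
```

A terminal would typically render such a broken sequence as a replacement character followed by the ASCII letter, which resembles the `?e` lines above.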

The contents of the test.dic file are:
1
radnički

The contents of the test.aff file are:
SET UTF-8
SFX G Y 5
SFX G ti m ti
SFX G ti š ti
SFX G ti mo ti
SFX G ti te ti
SFX G iti e iti
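To make the affix rules easier to read, here is a rough Python sketch of how the five `SFX G` lines could be interpreted: each rule strips an ending, adds a new one, and applies only if the word matches the condition. This is a simplification under stated assumptions, not Hunspell's implementation: the condition is treated as a literal word ending (real conditions can contain character classes), flags on dictionary entries are ignored, and the test word "raditi" is a hypothetical example that does not appear in the ticket. The point is that the work is done on characters, never on raw bytes.

```python
# Simplified, character-based reading of the SFX rules from test.aff.
# Assumption: the condition field is a plain literal ending.
from typing import List, NamedTuple


class SuffixRule(NamedTuple):
    flag: str
    strip: str
    add: str
    condition: str


AFF_TEXT = """\
SET UTF-8
SFX G Y 5
SFX G ti m ti
SFX G ti š ti
SFX G ti mo ti
SFX G ti te ti
SFX G iti e iti
"""


def parse_suffix_rules(aff_text: str) -> List[SuffixRule]:
    """Collect the rule lines; the header 'SFX G Y 5' has only four fields."""
    rules = []
    for line in aff_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[0] == "SFX":
            rules.append(SuffixRule(*fields[1:]))
    return rules


def apply_suffix_rules(word: str, rules: List[SuffixRule]) -> List[str]:
    """Apply every rule whose condition and strip string match the word end."""
    forms = []
    for rule in rules:
        if word.endswith(rule.condition) and word.endswith(rule.strip):
            stem = word[: len(word) - len(rule.strip)]
            forms.append(stem + rule.add)
    return forms


rules = parse_suffix_rules(AFF_TEXT)
print(apply_suffix_rules("raditi", rules))
print(apply_suffix_rules("radnički", rules))
```

Under this simplified reading, none of the rules matches "radnički", since it does not end in "ti" or "iti".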

If I try `wordforms -s test.aff test.dic radničk` (without the "i"), it produces even more of the same errors. However, if I remove or add one more letter, the errors disappear: both `wordforms -s test.aff test.dic radnič` and `wordforms -s test.aff test.dic radničkim` work as expected.

It does not actually matter whether the word being tested is in the .dic file or not. Errors appear on any input word. It does matter what is in the .aff file, though. This particular combination of flags and a few others will provoke the error in wordforms, but some other affix flags won't.

I’m running hunspell 1.3.2 from the command line. The issue is only with the `wordforms` tool; `hunspell` itself checks the same .dic and .aff combination properly.

Discussion

  • ebukva

    ebukva - 2013-03-04

    Forgot to add: both files are UTF-8 files.

     
  • Németh László

    Please check the encoding of your files, especially the s with háček. It seems to me that this is a conversion error caused by the character encoding differences between ISO-8859-2 and Windows-1250.
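As background to the encoding differences mentioned here, the short Python sketch below prints the byte values of "š" and "č" in ISO-8859-2, Windows-1250, and UTF-8, using Python's standard codec names `iso8859_2` and `cp1250`. It is only meant to show where the two legacy encodings disagree; it does not establish which encoding the reporter's files actually use.

```python
# Compare how the two letters are encoded in the three encodings discussed.
CODECS = ("iso8859_2", "cp1250", "utf-8")

encodings = {
    letter: {codec: letter.encode(codec) for codec in CODECS}
    for letter in ("š", "č")
}

for letter, by_codec in encodings.items():
    for codec, raw in by_codec.items():
        print(f"{letter}  {codec:<10} {raw.hex()}")
```

The two legacy encodings differ for "š" (0xB9 in ISO-8859-2, 0x9A in Windows-1250) but agree for "č" (0xE8 in both), while UTF-8 uses two bytes for each letter.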

     
  • ebukva

    ebukva - 2013-03-05

    I was originally working on an ISO-8859-2 dictionary when I encountered the issue. But the example files pasted above, in the details, I created from scratch and typed all contents by hand using Sublime Text set to UTF-8 for both. I’m using Mac OS 10.8.2. The Terminal.app character encoding is also set to UTF-8.

    running `file -I test.aff` results in:
    test.aff: text/plain; charset=utf-8

    running `file -I test.dic` results in:
    test.dic: text/plain; charset=utf-8

    Is this issue reproducible?
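Since `file -I` reports a detected charset rather than a strict validation result, a byte-exact check may help anyone trying to reproduce this. The following Python sketch decodes a whole file as UTF-8 and reports the offset of the first invalid byte, if any. The function name `first_invalid_utf8_offset` is hypothetical and not part of any Hunspell tooling.

```python
# Strict UTF-8 validation of a whole file, as a cross-check for `file -I`.
from pathlib import Path
from typing import Optional


def first_invalid_utf8_offset(path: str) -> Optional[int]:
    """Return None if the file is valid UTF-8, else the first bad byte offset."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Example usage (paths taken from the report above):
#   for name in ("test.aff", "test.dic"):
#       print(name, first_invalid_utf8_offset(name))
```

If both files return `None` and the errors still appear, that would point at how wordforms handles the input rather than at the files themselves.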