Belarusian has words with apostrophes. In the dictionary file, such words are written using U+0027. Hunspell in AOO accepts words with either U+0027 or U+2019 as correctly spelled (i.e. with з'яўляцца in the dictionary, both з'яўляцца and з’яўляцца are correct), which is fine because U+2019 is the preferred character for the apostrophe [1]. Hunspell in OmegaT, however, treats U+2019 as a word separator and checks each part of a word containing an apostrophe separately (з’яўляцца is processed as з and яўляцца; the first is correct per se, and the second is an error).
Discussing the issue in the OmegaT Yahoo! group revealed that the issue is not specific to Belarusian. At least, Didier Briel reports for French:
It's even stranger here, because Hunspell only recognizes the form with the straight apostrophe, but, when inserting a correction on a badly spelt word with an apostrophe (e.g., l'enfantt), it inserts it with the curly apostrophe (l’enfant) and then marks it as badly spelt. So I have to go back and replace the curly one with the straight one. That's rather infuriating.
I just checked in 2.6, and Hunspell inserts the correction with the curly apostrophe, but doesn't complain about the spelling. Can you check whether you have the same behaviour (Hunspell not complaining about curly apostrophes) in 2.6?
In my case, I just found it's a (French) tokenizer issue. If I select "Lucene 3.0" (which is the Porter algorithm), Hunspell does not complain about curly apostrophes. If I select something above (3.1 up to "current") for the tokenizer behaviour, I get the issue.
In view of the above, Hunspell in OmegaT must accept U+2019 as a valid apostrophe character for languages using apostrophes.
Related RFE, where the issue is on the source side:
- Make apostrophe a word boundary in glossary tool
https://sourceforge.net/p/omegat/feature-requests/469/
This is addressed in trunk, r6900.
U+2019 RIGHT SINGLE QUOTATION MARK is now normalized to U+0027 APOSTROPHE for spelling purposes.
The tokenizer used for the context menu was not the same as that used for editor marks (red underlines). They are now consistent (both use the project's target-language tokenizer).
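As a rough illustration of the normalization described above, here is a minimal Python sketch. It is not OmegaT's code (OmegaT is written in Java): the names `normalize_apostrophes` and `is_correct` and the in-memory `dictionary` set are hypothetical stand-ins for the real spellchecker lookup.

```python
# Hypothetical sketch of the idea only; not taken from the OmegaT source.
RIGHT_SINGLE_QUOTATION_MARK = "\u2019"
APOSTROPHE = "\u0027"


def normalize_apostrophes(word: str) -> str:
    """Map U+2019 to U+0027 so the word matches dictionary entries."""
    return word.replace(RIGHT_SINGLE_QUOTATION_MARK, APOSTROPHE)


def is_correct(word: str, dictionary: set) -> bool:
    """Look the word up only after its apostrophes have been normalized."""
    return normalize_apostrophes(word) in dictionary


# Stand-in for a Hunspell dictionary; entries use U+0027, as in the real file.
dictionary = {"з\u0027яўляцца", "аб\u0027ява"}

print(is_correct("з\u2019яўляцца", dictionary))  # typographic apostrophe
print(is_correct("з\u0027яўляцца", dictionary))  # straight apostrophe
```

Because the normalization happens only at lookup time, the text in the editor keeps whichever apostrophe the translator typed.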
Thanks! Seems fine to me (as built from source downloaded today).
Belarusian words with typographic apostrophes are now recognized as
correctly spelled. Time to flag the bug as fixed or should we wait for the
next release?
Best regards,
Dmitri Gabinski
Last edit: Aaron Madlon-Kay 2016-01-12
Thank you for the confirmation.
That's "open-fixed" until we release.
Didier
Closed in the released version 3.1.9 of OmegaT.
Didier
Please re-open this ticket. The problem is back in 3.6.0, at least in the snapshot downloaded today, 06 January 2016, from https://omegat.ci.cloudbees.com/job/omegat-trunk/
Can you give more details? I cannot reproduce it with French.
(I don't remember changes in /trunk that would impact this.)
Didier
I am afraid I have nothing to add to the original report.
Anyway, please find attached a very small test case: the same English word ‘announcement’ is translated as ‘аб’ява’ (‘abjava’), written with a straight ' in the first case (no spellcheck error detected) and with a curly ’ in the second case. In the second case a spellcheck error is detected specifically for the second part of the word, which is not a valid word per se; the first part can be a preposition in Belarusian, so it is valid in any case. The archive also contains the spellcheck dictionary used.
In Ukrainian, U+2019 is treated as a word separator when moving the caret, but misspelled words with U+2019 are properly recognized and proper spelling suggestions are given, though when a suggestion is accepted, U+0027 is used in the corrected word.
Hunspell works fine with U+2019, U+0027 and U+02BC (which is sometimes used as an apostrophe, but is likewise changed to U+0027 when a spelling correction is applied).
But in the project provided by Dmitri there is indeed a bug with U+2019, just as he describes (I checked with the dictionaries in the attached zip and with the ones in the Debian repositories).
--
Kos
I can confirm that the attached project exhibits the described behavior (the portion after U+2019 is underlined in red); however, this is not a bug, or rather it is outside the scope of this ticket.
The cause is that the source language tokenizer breaks at U+2019 so the word in its entirety is never seen by the spellchecker. To solve this, simply use the HunspellTokenizer, which will work fine since there is a supplied Hunspell spelling dictionary.
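The effect of the tokenizer choice can be sketched in Python as follows. Both regular expressions are hypothetical stand-ins written for this illustration; they are not the Lucene or Hunspell tokenizers that OmegaT actually uses, and only mimic "breaks at U+2019" versus "keeps the apostrophe inside the word".

```python
import re

# Hypothetical tokenizers for illustration; neither is OmegaT's or Lucene's code.
# Runs of letters only: U+2019 is punctuation, so it ends the token.
SPLITTING = re.compile(r"[^\W\d_]+")
# Runs of letters that may be joined by U+0027 or U+2019 inside the word.
KEEPING = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def tokenize(pattern, text):
    """Return every token the given pattern finds in the text."""
    return pattern.findall(text)


word = "аб\u2019ява"  # 'announcement', written with the typographic apostrophe

print(tokenize(SPLITTING, word))  # two tokens: the spellchecker never sees the whole word
print(tokenize(KEEPING, word))    # one token: the whole word reaches the spellchecker
```

With the first pattern the spellchecker is handed "аб" and "ява" separately, which matches the red underline on the part after U+2019 reported above.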
Thanks for the pointer. Indeed, changing the tokenizer to Hunspell fixed the problem. OK, it was a bug in me then.