OmegaT - multiplatform CAT tool / Bugs / #598 Incorrect handling of typographic apostrophe (U+2019)

Description has changed:

Diff:

--- old
+++ new
@@ -2,11 +2,11 @@

 Discussing the issue in the OmegaT Yahoo! group revealed that the issue is not specific to Belarusian. At least, Didier Briel reports for French:

-It's even stranger here, because Hunspell only recognizes the form with the straight apostrophe, but, when inserting a correction on a badly spelt word with an apostrophe (e.g., l'enfantt) inserts it with the curly apostrophe (l’enfant) and then mark it as badly spelt. So I have to go back and replace the curly one with the straight one. That's rather infuriating.
+>It's even stranger here, because Hunspell only recognizes the form with the straight apostrophe, but, when inserting a correction on a badly spelt word with an apostrophe (e.g., l'enfantt) inserts it with the curly apostrophe (l’enfant) and then mark it as badly spelt. So I have to go back and replace the curly one with the straight one. That's rather infuriating.

-I just checked in 2.6, and Hunspell inserts with the correction with the curly apostrophe, but doesn't complain about the spelling. Can you check you have the same behaviour (Hunspell not complaining about curly apostrophes) in 2.6?
+>I just checked in 2.6, and Hunspell inserts with the correction with the curly apostrophe, but doesn't complain about the spelling. Can you check you have the same behaviour (Hunspell not complaining about curly apostrophes) in 2.6?

-In my case, I just found it's a (French) tokenizer issue. If I select "Lucene 3.0" (which is the Porter algorithm), Hunspell does not complain about curly apostrophes. If I select something above (3.1 up to "current") for the tokenizer behaviour, I get the issue
+>In my case, I just found it's a (French) tokenizer issue. If I select "Lucene 3.0" (which is the Porter algorithm), Hunspell does not complain about curly apostrophes. If I select something above (3.1 up to "current") for the tokenizer behaviour, I get the issue

 In view of the above, Hunspell in OmegaT must accept U+2019 as a valid apostrophe character for languages using apostrophes.

Didier Briel - 2013-06-10

Related RFE, where the issue is on the source side:
- Make apostrophe a word boundary in glossary tool
https://sourceforge.net/p/omegat/feature-requests/469/

Didier Briel - 2015-02-10

assigned_to: Aaron Madlon-Kay

Group: 3.0 --> 3.1

Aaron Madlon-Kay - 2015-02-10

This is addressed in trunk, r6900.

U+2019 RIGHT SINGLE QUOTATION MARK is now normalized to U+0027 APOSTROPHE for spelling purposes.

The tokenizer used for the context menu was not the same as that used for editor marks (red underlines). They are now consistent (both use the project's target-language tokenizer).
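
The normalization described above can be sketched as follows. This is only an illustration, not the code committed in r6900: the class and method names are hypothetical, and it assumes the spellchecker is handed one word at a time as a plain `String`.

```java
public class ApostropheNormalizer {

    // Hypothetical helper (not OmegaT's actual API): map the typographic
    // apostrophe U+2019 to the straight apostrophe U+0027 before the word
    // is handed to the spelling dictionary. The text in the editor itself
    // is left untouched; only the lookup key changes.
    public static String normalizeForSpelling(String word) {
        return word.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        String curly = "l\u2019enfant"; // l’enfant, typed with U+2019
        String lookupKey = normalizeForSpelling(curly);
        System.out.println(lookupKey.equals("l'enfant")); // prints "true"
    }
}
```

With this kind of normalization, a dictionary that only lists the straight-apostrophe form still accepts a word typed with U+2019.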
- Gabix - 2015-02-11
  
  Thanks! Seems fine to me (as built from source downloaded today).
  Belarusian words with typographic apostrophes are now recognized as
  correctly spelled. Time to flag the bug as fixed or should we wait for the
  next release?
  
  Best regards,
  
  Dmitri Gabinski
  
  Last edit: Aaron Madlon-Kay 2016-01-12
  

Didier Briel - 2015-02-11

status: open --> open-fixed

Didier Briel - 2015-02-11

> Seems fine to me (as built from source downloaded today).

Thank you for the confirmation.

> Time to flag the bug as fixed or should we wait for the next release?

That's "open-fixed" until we release.

Didier

Didier Briel - 2015-03-11

status: open-fixed --> closed-fixed

Didier Briel - 2015-03-11

Closed in the released version 3.1.9 of OmegaT.

Didier

Gabix - 2016-01-06

Please re-open this ticket. The problem is back in 3.6.0, at least in the snapshot downloaded today (6 January 2016) from https://omegat.ci.cloudbees.com/job/omegat-trunk/

- Didier Briel - 2016-01-06
  
  Can you give more details? I do not reproduce it with French.
  (I don't remember changes in /trunk that would impact this.)
  
  Didier
  

Gabix - 2016-01-11

I am afraid I have nothing to add to the original report.

Anyway, find attached a very small test case: the same English word ‘announcement’ is translated as ‘аб’ява’ (‘abjava’), written with a straight ' in the first case (no spellcheck error detected) and with a curly ’ in the second case (a spellcheck error is detected, specifically on the second part of the word, which is not a valid word on its own; the first part can be a preposition in Belarusian, so it is valid in any case). The archive also contains the spellcheck dictionary used.

Apostrophe_Testcase.zip

Kos Ivantsov - 2016-01-11

In Ukrainian, U+2019 is treated as a word separator when moving the caret, but misspelled words with U+2019 are properly recognized and proper spelling suggestions are given. When a suggestion is accepted, however, U+0027 is used in the corrected word.
Hunspell works fine with U+2019, U+0027 and U+02BC (which is sometimes used as an apostrophe and is likewise changed into U+0027 when a spelling correction is applied).

Kos Ivantsov - 2016-01-11

But in the project provided by Dmitri there is indeed a bug with U+2019, just as he describes (I checked with the dictionaries in the attached zip and with the ones in the Debian repositories).

--
Kos

Aaron Madlon-Kay - 2016-01-12

I can confirm that the attached project exhibits the described behavior (the portion after U+2019 is underlined in red). However, this is not a bug, or rather it is outside the scope of this ticket.

The cause is that the source language tokenizer breaks at U+2019 so the word in its entirety is never seen by the spellchecker. To solve this, simply use the HunspellTokenizer, which will work fine since there is a supplied Hunspell spelling dictionary.
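
To illustrate the cause, here is a minimal sketch of two tokenization strategies. It is not OmegaT's or Lucene's tokenizer code; the class, method, and `keepCurlyApostrophe` flag are hypothetical and exist only to show why a split at U+2019 hides the full word from the spellchecker.

```java
import java.util.ArrayList;
import java.util.List;

public class ApostropheTokenizerSketch {

    // Hypothetical tokenizer: letters and the straight apostrophe always
    // belong to a word; U+2019 belongs to a word only when the flag is set.
    public static List<String> tokenize(String text, boolean keepCurlyApostrophe) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean wordChar = Character.isLetter(c)
                    || c == '\''
                    || (keepCurlyApostrophe && c == '\u2019');
            if (wordChar) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Belarusian word from the test case, written with U+2019.
        String word = "\u0430\u0431\u2019\u044f\u0432\u0430";
        // Breaking at U+2019 yields two fragments, each checked separately.
        System.out.println(tokenize(word, false).size()); // prints "2"
        // Treating U+2019 as word-internal yields the whole word.
        System.out.println(tokenize(word, true).size());  // prints "1"
    }
}
```

In the first case the spellchecker only ever sees the two fragments, so the second fragment is flagged even though the complete word is in the dictionary.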

Gabix - 2016-01-12

Thanks for the pointer. Indeed, changing the tokenizer to Hunspell fixed the problem. OK, it was a bug in me then.
