#598 Incorrect handling of typographic apostrophe (U+2019)

3.1
closed-fixed
None
5
2016-01-12
2013-06-10
Gabix
No

Belarusian has words with apostrophes. In the dictionary file, such words are written with U+0027. Hunspell in AOO accepts such words as correctly spelled with either U+0027 or U+2019 (i.e., with з'яўляцца in the dictionary, both з'яўляцца and з’яўляцца are correct), which is fine because U+2019 is the preferred character for the apostrophe [1]. Hunspell in OmegaT, however, treats U+2019 as a word separator and checks each part of a word containing an apostrophe separately: з’яўляцца is processed as з and яўляцца, the first being a correct word in itself and the second an error.
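
The difference between the two behaviours can be illustrated with a minimal Python sketch. It is not OmegaT or Hunspell code; the two regular expressions are assumptions that only mimic a tokenizer that breaks at U+2019 and one that keeps the apostrophe inside the word.

```python
import re

# Assumption: a tokenizer for which any non-word character,
# including U+2019, ends the current token.
SPLITTING = re.compile(r"\w+")

# Assumption: a tokenizer that keeps U+0027 or U+2019 inside a token
# when the character sits between word characters.
APOSTROPHE_AWARE = re.compile(r"\w+(?:['\u2019]\w+)*")

word = "з\u2019яўляцца"  # written with the typographic apostrophe

print(SPLITTING.findall(word))         # two fragments: 'з' and 'яўляцца'
print(APOSTROPHE_AWARE.findall(word))  # one token: the whole word
```

With the first pattern the spellchecker would only ever be offered the two fragments, which matches the behaviour described in this report.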

Discussing the issue in the OmegaT Yahoo! group revealed that the issue is not specific to Belarusian. At least, Didier Briel reports for French:

> It's even stranger here, because Hunspell only recognizes the form with the straight apostrophe, but when inserting a correction for a badly spelt word with an apostrophe (e.g., l'enfantt), it inserts it with the curly apostrophe (l’enfant) and then marks it as badly spelt. So I have to go back and replace the curly one with the straight one. That's rather infuriating.

> I just checked in 2.6, and Hunspell inserts the correction with the curly apostrophe, but doesn't complain about the spelling. Can you check that you have the same behaviour (Hunspell not complaining about curly apostrophes) in 2.6?

> In my case, I just found it's a (French) tokenizer issue. If I select "Lucene 3.0" (which is the Porter algorithm), Hunspell does not complain about curly apostrophes. If I select something above (3.1 up to "current") for the tokenizer behaviour, I get the issue.

In view of the above, Hunspell in OmegaT must accept U+2019 as a valid apostrophe character for languages using apostrophes.

[1] http://unicode.org/Public/UNIDATA/NamesList.txt

Discussion

  • Didier Briel

    Didier Briel - 2013-06-10
    • Description has changed: the three paragraphs quoted from Didier Briel were marked as quotations (each prefixed with ">"); the text is otherwise unchanged.
     
  • Didier Briel

    Didier Briel - 2013-06-10

    Related RFE, where the issue is on the source side:
    - Make apostrophe a word boundary in glossary tool
    https://sourceforge.net/p/omegat/feature-requests/469/

     
  • Didier Briel

    Didier Briel - 2015-02-10
    • assigned_to: Aaron Madlon-Kay
    • Group: 3.0 --> 3.1
     
  • Aaron Madlon-Kay

    This is addressed in trunk, r6900.

    1. U+2019 RIGHT SINGLE QUOTATION MARK is now normalized to U+0027 APOSTROPHE for spelling purposes.

    2. The tokenizer used for the context menu was not the same as that used for editor marks (red underlines). They are now consistent (both use the project's target-language tokenizer).
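
Point 1 can be sketched in Python as follows. This is not the code from r6900; the function names and the tiny dictionary are hypothetical, and a real implementation would query Hunspell instead of doing a set lookup.

```python
# Hypothetical stand-in for a Hunspell dictionary; entries use U+0027.
DICTIONARY = {"з'яўляцца", "l'enfant"}


def normalize_apostrophes(word: str) -> str:
    """Map U+2019 RIGHT SINGLE QUOTATION MARK to U+0027 APOSTROPHE."""
    return word.replace("\u2019", "\u0027")


def is_correct(word: str) -> bool:
    """Look the word up after normalization, so both apostrophe forms match."""
    return normalize_apostrophes(word) in DICTIONARY


for candidate in ("з'яўляцца", "з\u2019яўляцца", "l\u2019enfant", "l\u2019enfantt"):
    print(candidate, is_correct(candidate))
```

The normalized form is used only for the lookup; the text in the editor keeps whichever apostrophe the translator typed.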

     
    • Gabix

      Gabix - 2015-02-11

      Thanks! Seems fine to me (as built from source downloaded today). Belarusian words with typographic apostrophes are now recognized as correctly spelled. Time to flag the bug as fixed or should we wait for the next release?

      Best regards,

      Dmitri Gabinski

       

      Last edit: Aaron Madlon-Kay 2016-01-12
  • Didier Briel

    Didier Briel - 2015-02-11
    • status: open --> open-fixed
     
  • Didier Briel

    Didier Briel - 2015-02-11

    > Seems fine to me (as built from source downloaded today).

    Thank you for the confirmation.

    > Time to flag the bug as fixed or should we wait for the next release?

    That's "open-fixed" until we release.

    Didier

     
  • Didier Briel

    Didier Briel - 2015-03-11
    • status: open-fixed --> closed-fixed
     
  • Didier Briel

    Didier Briel - 2015-03-11

    Closed in the released version 3.1.9 of OmegaT.

    Didier

     
  • Gabix

    Gabix - 2016-01-06

    Please re-open this ticket. The problem is back in 3.6.0, at least in the snapshot downloaded today (6 January 2016) from https://omegat.ci.cloudbees.com/job/omegat-trunk/

     
    • Didier Briel

      Didier Briel - 2016-01-06

      Can you give more details? I cannot reproduce it with French.
      (I don't remember any changes in /trunk that would impact this.)

      Didier

       
  • Gabix

    Gabix - 2016-01-11

    I am afraid I see nothing to add to the original report.

    Anyway, please find attached a very small test case: the same English word ‘announcement’ is translated as ‘аб’ява’ (‘abjava’), written with a straight ' in the first case (no spellcheck error detected) and with a curly ’ in the second case. In the second case a spellcheck error is reported only for the second part of the word, which is not a valid word in itself; the first part can be a preposition in Belarusian, so it is valid in any case. The archive also contains the spellcheck dictionary used.

     
  • Kos Ivantsov

    Kos Ivantsov - 2016-01-11

    In Ukrainian, U+2019 is treated as a word separator when moving the caret, but misspelled words with U+2019 are properly recognized and proper spelling suggestions are given, although when a suggestion is accepted, U+0027 is used in the corrected word.
    Hunspell works fine with U+2019, U+0027 and U+02BC (which is sometimes used as an apostrophe, but is likewise changed into U+0027 when a spelling correction is applied).

     
  • Kos Ivantsov

    Kos Ivantsov - 2016-01-11

    But in the project provided by Dmitri there is indeed a bug with U+2019, just as he describes (I checked with the dictionaries in the attached zip and with the ones in the Debian repositories).

    --
    Kos

     
  • Aaron Madlon-Kay

    I can confirm that the attached project exhibits the described behavior (the portion after U+2019 is underlined in red); however, this is not a bug, or rather it is outside the scope of this ticket.

    The cause is that the source language tokenizer breaks at U+2019, so the spellchecker never sees the word in its entirety. To solve this, simply use the HunspellTokenizer, which will work fine since a Hunspell spelling dictionary is supplied.
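
Why normalizing U+2019 alone does not help in this project can be shown with a small Python sketch. Everything in it is an assumption for illustration: the two tokenizer functions only mimic a tokenizer that breaks at U+2019 and one that does not, and the two-entry dictionary stands in for the Belarusian Hunspell dictionary.

```python
import re

# Hypothetical dictionary: "аб" is valid on its own, "ява" is not.
DICTIONARY = {"аб", "аб'ява"}


def breaking_tokenizer(text):
    """Mimics a tokenizer that treats U+2019 as a word boundary."""
    return re.findall(r"\w+(?:'\w+)*", text)


def apostrophe_tokenizer(text):
    """Mimics a tokenizer that keeps U+0027 and U+2019 inside words."""
    return re.findall(r"\w+(?:['\u2019]\w+)*", text)


def spelling_errors(text, tokenizer):
    """Check token by token, normalizing U+2019 to U+0027 before lookup."""
    return [token for token in tokenizer(text)
            if token.replace("\u2019", "'") not in DICTIONARY]


curly = "аб\u2019ява"
print(spelling_errors(curly, breaking_tokenizer))     # ['ява']
print(spelling_errors(curly, apostrophe_tokenizer))   # []
print(spelling_errors("аб'ява", breaking_tokenizer))  # []
```

Because the word is already split before the lookup, the normalization step never sees an apostrophe to replace; only the choice of tokenizer changes the result.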

     
  • Gabix

    Gabix - 2016-01-12

    Thanks for the pointer. Indeed, changing the tokenizer to Hunspell fixed the problem. OK, it was a bug in me then.

     
