Re: [Hunspell-devel] NFC/NFD support in Hunspell

Soren Bro
2018-10-27
  • Soren Bro

    Soren Bro - 2018-10-27

    As I said, you seem to know what you're talking about and, for that reason
    alone I'd like to keep you around. :)

    Remember though, that we're not experts in HUNSPELL. We (I?) just started
    experimenting with it very recently. I also have a lot in my plate, as you
    may have deducted.

    I have searched for lists and IRC channels to ask questions but came up
    short. I'm guessing that's how you found your way here, to the Hermes mail
    list.

    I'm of the opinion that all SourceForge projects should have an IRC channel
    on FREENODE. No such luck with HUNSPELL so far, but we have one. It's on
    the FREENODE IRC server and it's called #hermesmail. It's not in any
    danger of going malthusian anytime soon, but at least I'm usually there.
    If not AFK. I use my phone as a hotspot, so when I'm not home I'm not online.
    Historical reasons.

    I'm perfectly comfortable with our exchanges being public on the list. It's
    just Gmail that messes with my recipients from time to time.

    Regards.


    On Fri, Oct 26, 2018 at 9:35 PM Sharon Correll sharon_correll@sil.org
    wrote:

    This question is really independent of UTF-8. You can represent both NFC
    and NFD in UTF-8 (just like you can represent both in UTF-16).

    For instance, according to my experiments, if I define my .dic file to
    include the word "blasé" where the é is represented by one character
    (U+00E9 = C3 A9 in UTF-8 - this is NFC), and then try to use it to check
    the data "blasé" with the é represented by two characters (U+0065 & U+0301
    = 65 CC 81 in UTF-8 - this is NFD), the word will be marked as misspelled.
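
    A minimal sketch of the mismatch described above, using Python's standard
    unicodedata module rather than Hunspell itself. The variable names are
    illustrative only. It shows that the two spellings of "blasé" are distinct
    code point sequences with exactly the UTF-8 bytes quoted in the message.

```python
import unicodedata

# "blasé" with a precomposed é (U+00E9): this is the NFC form.
nfc_word = unicodedata.normalize("NFC", "blas\u00e9")
# The same word with e (U+0065) followed by combining acute (U+0301): NFD.
nfd_word = unicodedata.normalize("NFD", nfc_word)

print(nfc_word.encode("utf-8").hex(" "))  # 62 6c 61 73 c3 a9
print(nfd_word.encode("utf-8").hex(" "))  # 62 6c 61 73 65 cc 81

# A plain string comparison treats the two forms as different words.
print(nfc_word == nfd_word)  # False
```

    Any lookup that compares strings code point by code point, without
    normalizing first, would be expected to behave the same way.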

    At least this is what my tests show. Is there something I could put in my
    Hunspell files to handle this? If not, then a spell-checker will only
    handle data in the normalization form that it is specifically defined for.
    This is unfortunate, since applications are ideally supposed to handle NFC
    and NFD as if they are equivalent. Defining a spell-checker to handle both
    could be a big pain and make the string list huge.
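
    One way around listing every word twice is for the calling application to
    normalize each word before lookup. The sketch below assumes a plain Python
    set as a stand-in for the dictionary; DICTIONARY_NFC, is_known and
    is_known_naive are hypothetical names, not part of Hunspell's API.

```python
import unicodedata

# Hypothetical stand-in for a word list stored in NFC (not Hunspell's API).
DICTIONARY_NFC = {unicodedata.normalize("NFC", w) for w in ["blas\u00e9", "caf\u00e9"]}


def is_known_naive(word: str) -> bool:
    """Look the word up exactly as received, with no normalization."""
    return word in DICTIONARY_NFC


def is_known(word: str) -> bool:
    """Normalize the incoming word to NFC before the lookup."""
    return unicodedata.normalize("NFC", word) in DICTIONARY_NFC


nfd_input = "blase\u0301"  # e + combining acute accent, i.e. NFD
print(is_known_naive(nfd_input))  # False
print(is_known(nfd_input))        # True
```

    With this approach the word list stays in a single normalization form and
    only the input is converted, so the list does not grow.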

    Maybe the assumption has been that a Hunspell spell-checker is only
    written to handle NFC, but there's nothing to enforce that, and it would
    surprise me if all applications that want to use Hunspell work that way. In
    fact I'm sure they don't.

    I see that the MAP mechanism can be used to define closely related
    sequences. But NFC and NFD are supposed to be equivalent, not just similar.
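
    If the conversion has to live in the Hunspell files themselves, the ICONV
    input-conversion table described in the hunspell(5) manual may be worth
    testing. Whether it fully covers combining sequences is an assumption that
    has not been verified here. The sketch below only generates candidate
    ICONV lines mapping decomposed sequences to precomposed letters;
    iconv_lines is a hypothetical helper name.

```python
import unicodedata


def iconv_lines(letters: str) -> list[str]:
    """Build ICONV lines mapping each letter's NFD form to its NFC form."""
    # Normalize first so the helper works whichever form the caller passes in.
    letters = unicodedata.normalize("NFC", letters)
    pairs = []
    for ch in letters:
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:  # skip letters that have no decomposition
            pairs.append((decomposed, ch))
    # The first line gives the number of definitions that follow.
    header = f"ICONV {len(pairs)}"
    return [header] + [f"ICONV {src} {dst}" for src, dst in pairs]


for line in iconv_lines("\u00e9\u00e8\u00ea\u00e7"):  # é è ê ç
    print(line)
```

    The letter list would have to cover every precomposed character the
    language uses, so this is a starting point for experiments rather than a
    general normalization solution.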

    On 10/26/2018 1:48 PM, sbrothy@gmail.com wrote:

    I'm quite sure diacritics and such are covered by UTF-8. After all, Arabic
    and Hebrew are. You're concerned about the transition though, if I read you
    correctly?

    We're sorta committed to UTF-8 as it is. Unless someone shows me a
    language more obscure than "Modern Greek (Polytonic)" or "Friulian", or the
    RTL ones like Arabic or Hebrew, I am not concerned. I'm quite sure they'll
    survive any potential shift. In fact I think it can only become better.
    Regardless of method.

    But that's just my completely unbacked optimism shining through. :)

    As always, you're welcome to a second opinion. Anyone?

    Regards,
    Soren

    On Friday, October 26, 2018, sbrothy@gmail.com wrote:

    Oh, you're talking about normalisation. I got confused by all the German
    links I ran into. Egg on my face. Let me get back to you on this one.

    Regards

    On Thursday, October 25, 2018, sbrothy@gmail.com wrote:

    What I think I forgot to mention, is that I'm trying to replace the
    spellchecking and it's a spaghetti-code nightmare.

    Regards.

    On Thu, Oct 25, 2018 at 11:00 PM sbrothy@gmail.com wrote:

    This may be an older list:

    https://github.com/elastic/hunspell/tree/master/dicts

    Regards.

    On Thu, Oct 25, 2018 at 10:51 PM sbrothy@gmail.com wrote:

    To be brutally honest, I considered replacing the GUI with wxWidgets
    to be portable and all, but only MFC has the "sexiness" expected by its
    users. Dockable toolbars and tabbed dockable windows etc. You can tell me
    all you want that as long as the functionality is the same it won't matter,
    but I'm not convinced.

    Which makes me kinda curious and worried about the Mac users.

    Regards.

    On Thu, Oct 25, 2018 at 10:31 PM sbrothy@gmail.com wrote:

    I guess this one's for me. If you mean whether HUNSPELL supports
    Korean or similar, the answer is yes. It even supports Hebrew. An RTL
    language. I don't know whether Korean is RTL, but replacing SPELL32.DLL
    isn't easy. To say the least....

    Regards,
    Soren

    On Thu, Oct 25, 2018 at 9:33 PM Sharon Correll sharon_correll@sil.org wrote:

    I apologize if I have overlooked something, but...is there any kind of
    NFC/NFD support in Hunspell currently? If not, it appears that a
    spell-checker designed for NFC data will not work if the client app
    sends it NFD, and vice versa.

    If there is no such support, it might be something that I would
    consider adding. It surprises me to think that this is not a
    significant need.

    Thanks,
    Sharon Correll
    SIL International


    Hunspell-devel mailing list
    Hunspell-devel@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/hunspell-devel

    --
    Søren Bro Thygesen

     
    • Soren Bro

      Soren Bro - 2018-10-27

      "on my plate", "deduced"?! And I'm the spellchecker?! Heh.

      Regards

      Re: [Hunspell-devel] NFC/NFD support in Hunspell
      https://sourceforge.net/p/hermesmail/discussion/general/thread/297bf941e4/?limit=25#3052


       
