HERMES Mail / Discussion / General Discussion: Re: [Hunspell-devel] NFC/NFD support in Hunspell

The cult-favourite eMail client formerly known as Qualcomm Eudora

Re: [Hunspell-devel] NFC/NFD support in Hunspell

Forum: General Discussion

Creator: Soren Bro

Created: 2018-10-27

Updated: 2018-10-27

Soren Bro - 2018-10-27

As I said, you seem to know what you're talking about and, for that reason
alone, I'd like to keep you around. :)

Remember though, that we're not experts in HUNSPELL. We (I?) just started
experimenting with it very recently. I also have a lot in my plate, as you
may have deducted.

I have searched for lists and IRC channels to ask questions but came up
short. I'm guessing that's how you found your way here, to the Hermes mail
list.

I'm of the opinion that all SourceForge projects should have an IRC channel
on FREENODE. No such luck with HUNSPELL so far, but we have one: it's
called #hermesmail. It's not in any
danger of going malthusian anytime soon, but at least I'm usually there,
unless I'm AFK. I use my phone as a hotspot, so when I'm not home I'm not online.
Historical reasons.

I'm perfectly comfortable with our exchanges being public on the list. It's
just Gmail that messes with my recipients from time to time.

Regards.

On Fri, Oct 26, 2018 at 9:35 PM Sharon Correll sharon_correll@sil.org
wrote:

This question is really independent of UTF-8. You can represent both NFC
and NFD in UTF-8 (just like you can represent both in UTF-16).

For instance, according to my experiments, if I define my .dic file to
include the word "blasé" where the é is represented by one character
(U+00E9 = C3 A9 in UTF-8 - this is NFC), and then try to use it to check
the data "blasé" with the é represented by two characters (U+0065 & U+0301
= 65 CC 81 in UTF-8 - this is NFD), the word will be marked as misspelled.
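The byte sequences quoted above can be reproduced with nothing but Python's standard `unicodedata` module. This is only an illustrative sketch of the two encodings; it does not call Hunspell, and the helper name `utf8_hex` is made up for the example.

```python
import unicodedata

def utf8_hex(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

nfc = "blas\u00e9"    # é as a single code point, U+00E9
nfd = "blase\u0301"   # e (U+0065) followed by combining acute accent (U+0301)

print(utf8_hex(nfc))  # 62 6C 61 73 C3 A9
print(utf8_hex(nfd))  # 62 6C 61 73 65 CC 81

# A plain comparison treats the two spellings as different strings...
print(nfc == nfd)                                 # False
# ...although they are equal once both are in the same normalization form.
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
print(unicodedata.normalize("NFD", nfc) == nfd)   # True
```

Any lookup that compares code points or bytes directly, without normalizing first, would see these as two unrelated words, which is consistent with the behaviour described in the experiment.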

At least this is what my tests show. Is there something I could put in my
Hunspell files to handle this? If not, then a spell-checker will only
handle data in the normalization form that it is specifically defined for.
This is unfortunate, since applications are ideally supposed to handle NFC
and NFD as if they are equivalent. Defining a spell-checker to handle both
could be a big pain and make the string list huge.
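To get a feel for how quickly the string list could grow, here is a rough Python sketch that enumerates the spellings of a word in which each accented character is either fully composed or fully decomposed. The function name `mixed_spellings` is hypothetical, and the count is only a lower bound: characters carrying several combining marks also have partially composed forms that this does not generate.

```python
import itertools
import unicodedata

def mixed_spellings(word):
    """Return the spellings of `word` where each character is written
    either precomposed (as in NFC) or fully decomposed (as in NFD)."""
    options = []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        # A set collapses the two choices when the character has no decomposition.
        options.append({ch, decomposed})
    return {"".join(parts) for parts in itertools.product(*options)}

print(len(mixed_spellings("blas\u00e9")))     # 2
print(len(mixed_spellings("d\u00e9j\u00e0"))) # 4
```

Each decomposable character doubles the number of entries, so listing every variant in the .dic file scales poorly compared with normalizing once on input.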

Maybe the assumption has been that a Hunspell spell-checker is only
written to handle NFC, but there's nothing to enforce that, and it would
surprise me if all applications that want to use Hunspell work that way. In
fact I'm sure they don't.

I see that the MAP mechanism can be used to define closely related
sequences. But NFC and NFD are supposed to be equivalent, not just similar.

On 10/26/2018 1:48 PM, sbrothy@gmail.com wrote:

I'm quite sure diacritics and such are covered by UTF-8. After all, Arabic
and Hebrew are. You're concerned about the transition though, if I read you
correctly?

We're sorta committed to UTF-8 as it is. Unless someone shows me a
language more obscure than "Modern Greek (Polytonic)" or "Friulian", or the
RTL ones like Arabic or Hebrew, I am not concerned. I'm quite sure they'll
survive any potential shift. In fact I think it can only become better.
Regardless of method.

But that's just my completely unbacked optimism shining through. :)

As always, you're welcome to a second opinion. Anyone?

Regards,
Soren

On Friday, October 26, 2018, sbrothy@gmail.com wrote:

Oh, you're talking about normalisation. I got confused by all the German
links I ran into. Egg on my face. Let me get back to you on this one.

Regards

On Thursday, October 25, 2018, sbrothy@gmail.com wrote:

What I think I forgot to mention is that I'm trying to replace the
spellchecking and it's a spaghetti-code nightmare.

Regards.

On Thu, Oct 25, 2018 at 11:00 PM sbrothy@gmail.com wrote:

This may be an older list:

https://github.com/elastic/hunspell/tree/master/dicts

Regards.

On Thu, Oct 25, 2018 at 10:51 PM sbrothy@gmail.com wrote:

To be brutally honest, I considered replacing the GUI with wxWidgets
to be portable and all, but only MFC has the "sexiness" expected by its
users. Dockable toolbars and tabbed dockable windows etc. You can tell me
all you want that as long as the functionality is the same it won't matter,
but I'm not convinced.

Which makes me kinda curious and worried about the Mac users.

Regards.

On Thu, Oct 25, 2018 at 10:31 PM sbrothy@gmail.com wrote:

I guess this one's for me. If you mean whether HUNSPELL supports
Korean or similar, the answer is yes. It even supports Hebrew, an RTL
language. I don't know whether Korean is RTL, but replacing SPELL32.DLL
isn't easy. To say the least....

Regards,
Soren

On Thu, Oct 25, 2018 at 9:33 PM Sharon Correll sharon_correll@sil.org wrote:

I apologize if I have overlooked something, but...is there any kind of
NFC/NFD support in Hunspell currently? If not, it appears that a
spell-checker designed for NFC data will not work if the client app
sends it NFD, and vice versa.
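One way a client app could sidestep the mismatch is to normalize every word to the dictionary's form before handing it to the checker. The Python sketch below assumes a generic `check(word) -> bool` callable; the names `make_normalizing_checker`, `raw_check`, and the toy `dictionary` set are hypothetical stand-ins and not part of any Hunspell API.

```python
import unicodedata
from typing import Callable

def make_normalizing_checker(check: Callable[[str], bool],
                             form: str = "NFC") -> Callable[[str], bool]:
    """Wrap `check` so its input is first converted to normalization `form`."""
    def checked(word: str) -> bool:
        return check(unicodedata.normalize(form, word))
    return checked

# Toy stand-in for a spell-checker whose word list is stored in NFC.
dictionary = {"blas\u00e9"}

def raw_check(word: str) -> bool:
    return word in dictionary

nfd_input = "blase\u0301"                 # the same word, sent as NFD
safe_check = make_normalizing_checker(raw_check, "NFC")

print(raw_check(nfd_input))   # False: exact match fails
print(safe_check(nfd_input))  # True: input was normalized to NFC first
```

This keeps the word list in a single form and puts the conversion on the caller's side. Whether the same effect can be had from inside the Hunspell files is exactly the open question in this message.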

If there is no such support, it might be something that I would
consider adding. It surprises me to think that this is not a
significant need.

Thanks,
Sharon Correll
SIL International

Hunspell-devel mailing list
Hunspell-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/hunspell-devel

--
Søren Bro Thygesen

- Soren Bro - 2018-10-27
  
  "on my plate", "deduced"?! And I'm the spellchecker?! Heh.
  
  Regards
  
