#92 Wrong word boundaries

Milestone: 0.18
Status: closed-fixed
Labels: spellcheck
Priority: 5
Updated: 2013-10-13
Created: 2012-12-28
Creator: Anonymous
Private: No

The characters like „ are parsed together with a word and are considered an error.

Discussion

  • Dimitar Zhekov

    Dimitar Zhekov - 2012-12-29

    The unfortunate truth is that GNU regex, Scintilla regex, SciTE/Geany "Find word" and
    probably the Scintilla/Geany lexers do not distinguish between letters, numbers and
    punctuation characters >= 0x100 - they are all in the "same" character class. So
    „“” are a general problem. Since spell checking in Geany is always done by an
    external program, you'd better try a different spell checker. I don't think the plugin
    author can fix it.

     
  • Lex Trotman

    Lex Trotman - 2013-07-24

    Use of the Unicode data in Glib would identify these characters as punctuation not letters.

    Regex search \pL does not classify them as letters either, so defining a "word" via regex would be viable. It could even be configurable so various meanings of "word" could be accommodated.
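
A rough sketch of the classification described above, using Python's unicodedata module as a stand-in for the GLib Unicode data (an assumption made for illustration only; the plugin itself is written in C). The names `quote_categories` and `is_letter` are hypothetical. It just looks up the Unicode general category of the quote characters from the report and checks whether they fall under "L" (letter), which is what \pL matches.

```python
import unicodedata

def is_letter(ch):
    # Roughly what \pL matches: any general category starting with "L".
    return unicodedata.category(ch).startswith("L")

# General categories of the quote characters from the bug report.
quote_categories = {ch: unicodedata.category(ch) for ch in "„“”"}

for ch, cat in quote_categories.items():
    print("U+%04X %s letter=%s" % (ord(ch), cat, is_letter(ch)))
```

All three land in punctuation categories (open, initial-quote and final-quote punctuation), so a word definition built on letter categories would leave them out.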

     
  • Dimitar Zhekov

    Dimitar Zhekov - 2013-07-24

    "Use of the Unicode data in Glib would identify these characters as punctuation not letters."

    That may be useful if the plugin (a) passes a list of words to the actual spell checker instead of plain text, and (b) uses its own definition of "word", different from Geany/Scintilla. The latter is certainly implementable.

    "Regex search \pL does not classify them as letters either, so defining a "word" via regex would be viable."

    Last time I checked, Geany/Scintilla classified anything >= 0x100 as "nothing", and GNU Regex did not support UTF-8. GRegex should work, while SciTE has the worst i18n support, at least under Windows.

     
  • Lex Trotman

    Lex Trotman - 2013-07-25

    On 25 July 2013 04:27, Dimitar Zhekov sheckley@users.sf.net wrote:

    "Use of the Unicode data in Glib would identify these characters as
    punctuation not letters."

    "That may be useful if the plugin (a) passes a list of words to the actual
    spell checker instead of plain text, and (b) uses its own definition of
    "word", different from Geany/Scintilla. The latter is certainly
    implementable."

    It does. See sc_speller_process_line():140; it uses the C library's ispunct()
    and isspace().

    "Regex search \pL does not classify them as letters either, so defining a
    "word" via regex would be viable."

    "Last time I checked, Geany/Scintilla classified anything >= 0x100 as
    "nothing", and GNU Regex did not support UTF-8. GRegex should work, while
    SciTE has the worst i18n support, at least under Windows."

    That was the result of trying Find with "use regular expression" in Geany;
    we have used GRegex for a while now.
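
To illustrate why splitting with ispunct()/isspace() on single bytes keeps „ attached to the word, here is a minimal sketch. It is not the plugin's code: `split_bytewise` and `ASCII_DELIMS` are hypothetical names, and it assumes C-locale behaviour, where only ASCII punctuation and whitespace bytes count as delimiters and no byte >= 0x80 ever does.

```python
import string

# ASCII punctuation and whitespace, i.e. what ispunct()/isspace() accept in the C locale.
ASCII_DELIMS = set((string.punctuation + string.whitespace).encode("ascii"))

def split_bytewise(line):
    """Split UTF-8 bytes into words, treating only ASCII delimiter bytes as boundaries."""
    words, current = [], bytearray()
    for byte in line:
        if byte in ASCII_DELIMS:
            if current:
                words.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        words.append(bytes(current))
    return words

sample = "Er sagte: „Hallo“, oder?".encode("utf-8")
words = [w.decode("utf-8") for w in split_bytewise(sample)]
print(words)
```

The multi-byte quotes are never seen as delimiters, so the spell checker would receive „Hallo“ as a single word.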


    Status: open
    Labels: spellcheck
    Created: Fri Dec 28, 2012 09:12 PM UTC by Anonymous
    Last Updated: Wed Jul 24, 2013 04:25 AM UTC
    Owner: Enrico Tröger

    The characters like „ are parsed together with a word and are considered
    an error.


    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/geany-plugins/bugs/92/

  • Dimitar Zhekov

    Dimitar Zhekov - 2013-07-25

    "Last time I checked, Geany/Scintilla classified anything >= 0x100 as
    "nothing", and GNU Regex did not support UTF-8. GRegex should work, while SciTE has the worst i18n support, at least under Windows."

    "That was the result of trying find with "use regular expression" in Geany, we have used GRegex for a while now."

    We do, and GRegex search is okay, but plain search with "match whole word" or "match from start of word" still suffers from the above problem. Some time ago, I discussed that in the ML with one of our leading devs, and he assured me that "word" in Geany is not really a word, but something more similar to a programming language identifier, or God knows what. :)

     
  • Lex Trotman

    Lex Trotman - 2013-07-25

    On 26 July 2013 02:16, Dimitar Zhekov sheckley@users.sf.net wrote:

    "Last time I checked, Geany/Scintilla classified anything >= 0x100 as
    "nothing", and GNU Regex did not support UTF-8. GRegex should work, while
    SciTE has the worst i18n support, at least under Windows."

    "That was the result of trying find with "use regular expression" in
    Geany, we have used GRegex for a while now."

    "We do, and GRegex search is okay, but plain search with "match whole word"
    or "match from start of word" still suffers from the above problem. Some
    time ago, I discussed that in the ML with one of our leading devs, and he
    assured me that "word" in Geany is not really a word, but something more
    similar to a programming language identifier, or God knows what. :)"

    A "word" in that context is a contiguous sequence of the characters in the
    wordchars setting in the filetypes files, which defaults to C identifier
    characters. Other "word" uses in Geany have filetype-specific character sets
    (usually C plus a few) hard-coded into Geany :( (e.g. autocomplete boundaries).

    None of these filetype-specific "word" definitions contain Unicode chars,
    so they are irrelevant to spell checking of human languages. Spell check
    therefore uses its own definition, but at the moment not one that works well
    for Unicode. Reading code points (not just octets) and replacing the C library
    calls with "g_unichar_isalpha() || g_unichar_ismark()" would go a long way
    to fixing it.
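
The suggestion above could be sketched roughly as follows. This is Python rather than the plugin's C, and the category test is only an approximation of "g_unichar_isalpha() || g_unichar_ismark()" (letter categories L* plus mark categories M*); `is_word_char` and `split_words` are hypothetical names, not functions from the plugin.

```python
import unicodedata

def is_word_char(ch):
    # Approximates g_unichar_isalpha(c) || g_unichar_ismark(c):
    # letters are categories L*, combining marks are categories M*.
    return unicodedata.category(ch)[0] in ("L", "M")

def split_words(line):
    """Split a decoded string into words, one code point at a time."""
    words, current = [], []
    for ch in line:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(split_words("Er sagte: „Hallo“, oder?"))
```

Working on code points means the quotes become word boundaries, and a combining mark stays with its base letter. A definition this strict also splits at apostrophes and digits, so a real implementation would probably need a few extra rules.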


  • Enrico Tröger

    Enrico Tröger - 2013-10-13

    Regarding the SpellCheck issue: this seems to be the same problem as #98, which is fixed in Git master.

     
  • Enrico Tröger

    Enrico Tröger - 2013-10-13
    • status: open --> closed-fixed
    • Group: --> 0.18
     
