
#4035 Invisible characters should work better

None
closed
nobody
5
2024-06-11
2022-10-28
No

Often when cutting and pasting Maxima expressions from other programs (Gmail, Word, etc.), various invisible characters get introduced and confuse the Maxima parser:

ex: 23;

incorrect syntax: 23 is not an infix operator
ex: 23;

(In the GitHub Markdown editor, these characters show up as a red dot in edit mode.)

In this example, there is an invisible zero-width space character (​) before 23, but the error message is mysterious. Other such characters include the zero-width joiner, the zero-width non-joiner, etc. Fortunately, Maxima does treat the non-breaking space (NBSP /  ) as a space.
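Since the offending character cannot be seen, it can help to scan the pasted text before handing it to Maxima. The following is a minimal Python sketch, not part of Maxima; the helper name `find_invisible` is hypothetical, and the choice to flag the Unicode general categories `Cf` (format) and `Zs` (space separator) is an assumption about which characters are worth reporting.

```python
import unicodedata

def find_invisible(text):
    """Return (index, code point, name) for each format character or
    non-ASCII space separator found in text."""
    hits = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue  # the ordinary ASCII space is fine
        # Cf covers zero-width space/joiner/non-joiner; Zs covers NBSP etc.
        if unicodedata.category(ch) in ("Cf", "Zs"):
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

# The example from this report, with a zero-width space before 23.
print(find_invisible("ex: \u200b23;"))
```

Run on the example input, this reports a ZERO WIDTH SPACE at index 4, which is the information the parser's error message leaves out.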

There are three reasonable possibilities here:

* Give an error
* Ignore it
* Treat it as a space

The error option is fail-safe: code won't inadvertently mean what it wasn't meant to mean.
But the ignore and space options are more convenient most of the time, since they'll probably (!) do what was intended.
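To make the trade-off concrete, here is a small Python sketch of the three policies applied to raw input text. It is only an illustration under stated assumptions: the function `apply_policy` and the set `INVISIBLE` are hypothetical names, the set contains just the three zero-width characters mentioned above, and nothing here reflects how the Maxima parser is actually written.

```python
# Zero-width space, zero-width non-joiner, zero-width joiner (illustrative subset).
INVISIBLE = frozenset("\u200b\u200c\u200d")

def apply_policy(text, policy):
    """Apply one of the three policies ("error", "ignore", "space") to text."""
    if policy == "error":
        for i, ch in enumerate(text):
            if ch in INVISIBLE:
                raise ValueError(
                    f"invisible character U+{ord(ch):04X} at position {i}")
        return text
    if policy == "ignore":
        # Drop the character entirely; neighbouring text is joined.
        return "".join(ch for ch in text if ch not in INVISIBLE)
    if policy == "space":
        # Substitute an ordinary space; neighbouring text is separated.
        return "".join(" " if ch in INVISIBLE else ch for ch in text)
    raise ValueError(f"unknown policy: {policy!r}")
```

The ignore and space options differ when the character sits between two name characters: for `"foo\u200bbar"`, ignore yields the single name `foobar`, while space yields `foo bar`, which a parser would read as two tokens.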

Complication: What about these characters within a quoted string?

Discussion

  • Robert Dodier

    Robert Dodier - 2022-11-01

    I think such characters should be treated as spaces by the parser. Also they should be preserved in quoted strings.

    I opened bug #4039 about a related topic, punctuation characters.

     
  • Robert Dodier

    Robert Dodier - 2022-11-06

    I'm working on a patch to treat Unicode space characters as whitespace. The characters to be handled are:

    NO-BREAK_SPACE
    EN_QUAD
    EM_QUAD
    EN_SPACE
    EM_SPACE
    THREE-PER-EM_SPACE
    FOUR-PER-EM_SPACE
    SIX-PER-EM_SPACE
    FIGURE_SPACE
    PUNCTUATION_SPACE
    THIN_SPACE
    HAIR_SPACE
    ZERO_WIDTH_SPACE
    NARROW_NO-BREAK_SPACE
    MEDIUM_MATHEMATICAL_SPACE
    ZERO_WIDTH_NO-BREAK_SPACE
    

    which I got from a list on the web (https://jkorpela.fi/chars/spaces.html).

    This is going to make handling every non-whitespace character a little bit slower. I haven't investigated, but if it turns out the effect is too much, we can cut down the list. I suspect the effect on the speed of the parser won't be an issue, but I don't know that for sure yet.
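For reference, the names listed above can be resolved to code points with a short Python sketch. This assumes each listed name corresponds to the Unicode character name with underscores replaced by spaces; the helper names `code_point` and `is_unicode_space` are hypothetical. A hashed set is used for the membership test as one way to keep the per-character cost small, which is an illustration of the speed concern rather than a measurement of Maxima's parser.

```python
import unicodedata

# Names as listed in the comment above (underscores instead of spaces).
LISP_STYLE_NAMES = [
    "NO-BREAK_SPACE", "EN_QUAD", "EM_QUAD", "EN_SPACE", "EM_SPACE",
    "THREE-PER-EM_SPACE", "FOUR-PER-EM_SPACE", "SIX-PER-EM_SPACE",
    "FIGURE_SPACE", "PUNCTUATION_SPACE", "THIN_SPACE", "HAIR_SPACE",
    "ZERO_WIDTH_SPACE", "NARROW_NO-BREAK_SPACE",
    "MEDIUM_MATHEMATICAL_SPACE", "ZERO_WIDTH_NO-BREAK_SPACE",
]

def code_point(lisp_name):
    """Look up the code point, assuming underscores stand for spaces."""
    return ord(unicodedata.lookup(lisp_name.replace("_", " ")))

SPACE_CODE_POINTS = frozenset(code_point(n) for n in LISP_STYLE_NAMES)

def is_unicode_space(ch):
    """Constant-time membership test against the 16 listed characters."""
    return ord(ch) in SPACE_CODE_POINTS

for name in LISP_STYLE_NAMES:
    print(f"U+{code_point(name):04X}  {name}")
```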

     
    • Gunter Königsmann

      If we avoid iterating over the Unicode string from the user separately for each of these characters, replacing all of them with spaces should be quite fast.

      Do we already handle TAB characters correctly? And would it make sense to collapse consecutive spaces into one when not in a string?

      In wxMaxima I always try to exempt strings from character changes, hoping that the next step isn't a parse_string().
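The single-pass idea, with quoted strings left untouched, can be sketched in Python as follows. This is an assumption-laden illustration, not the wxMaxima or Maxima code: the function name `normalize_spaces` is hypothetical, strings are assumed to be delimited by double quotes with backslash escapes, and comments are not treated specially.

```python
# The 16 characters from the list above: U+00A0, U+2000..U+200B,
# U+202F, U+205F, U+FEFF.
UNICODE_SPACES = frozenset(
    ["\u00a0", *map(chr, range(0x2000, 0x200C)), "\u202f", "\u205f", "\ufeff"]
)

def normalize_spaces(text):
    """Replace Unicode spaces with ' ' in one pass, skipping quoted strings."""
    out = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)          # preserve string contents verbatim
            if escaped:
                escaped = False     # this character was escaped; no special meaning
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False   # closing quote
        elif ch == '"':
            in_string = True        # opening quote
            out.append(ch)
        elif ch in UNICODE_SPACES:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

Each input character is examined once and tested against a hashed set, so the cost does not grow with the number of space characters on the list.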

       
  • Stavros Macrakis

    Does it not work to add these characters to *whitespace-chars*?

     
  • Robert Dodier

    Robert Dodier - 2022-11-07
    • status: open --> closed
     
  • Robert Dodier

    Robert Dodier - 2022-11-07

    Fixed by commit 682395f, which does indeed just put the Unicode space characters on *WHITESPACE-CHARS* (for Unicode-aware Lisps; no change otherwise) in src/nparse.lisp. Closing this ticket as fixed.

     
  • Robert Dodier

    Robert Dodier - 2024-06-11
    • labels: parser --> parser, unicode
     
