#4035 Invisible characters should work better

Milestone: None

Status: closed

Owner: nobody

Labels: parser, unicode

Priority: 5

Updated: 2024-06-11

Created: 2022-10-28

Creator: Stavros Macrakis

Private: No

Often when cutting and pasting Maxima expressions from other programs (gmail, Word, etc.), various invisible characters get introduced and confuse the Maxima parser:

ex: 23;

incorrect syntax: 23 is not an infix operator
ex: 23;

(In the GitHub Markdown editor, these characters show up as a red dot in edit mode.)

In this example, there is an invisible zero-width space character (U+200B) before 23, but the error message is mysterious. Other such characters include the zero-width joiner, the zero-width non-joiner, etc. Fortunately, Maxima does treat the non-breaking space (NBSP, U+00A0) as a space.
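
To make the problem visible before pasting into Maxima, here is a minimal Python sketch that lists the non-ASCII format and space characters in a piece of text. The function name `reveal_invisibles` is hypothetical, and the choice of Unicode categories (Cf and Zs) is an assumption about what counts as "invisible" here.

```python
import unicodedata

def reveal_invisibles(text):
    """Return (index, code point, Unicode name) for each non-ASCII character
    whose general category is Cf (format) or Zs (space separator)."""
    found = []
    for i, ch in enumerate(text):
        if ord(ch) > 0x7F and unicodedata.category(ch) in ("Cf", "Zs"):
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found

# The zero-width space is written as an escape so it is visible in the source.
pasted = "ex: \u200b23;"
for index, codepoint, name in reveal_invisibles(pasted):
    print(index, codepoint, name)
```

Run on the example above, this reports a ZERO WIDTH SPACE at index 4, which is the character the parser error does not mention.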

There are three reasonable possibilities here:

* Give an error
* Ignore
* Treat it as a space

The error option is fail-safe: code won't inadvertently mean what it wasn't meant to mean.
But the ignore and space options are more convenient most of the time, since they'll probably (!) do what was intended.
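
The three options can be compared side by side with a small Python sketch. This is only an illustration, not Maxima code: the names `INVISIBLE` and `apply_policy` are hypothetical, the character set is a small subset, and quoted strings are not considered at all.

```python
# Illustrative subset: zero-width space, non-joiner, joiner.
INVISIBLE = {"\u200b", "\u200c", "\u200d"}
POLICIES = ("error", "ignore", "space")

def apply_policy(text, policy):
    """Apply one of the three candidate policies to invisible characters."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    out = []
    for i, ch in enumerate(text):
        if ch not in INVISIBLE:
            out.append(ch)
        elif policy == "error":
            # Fail-safe: refuse the input and say where the character is.
            raise ValueError(f"invisible character U+{ord(ch):04X} at position {i}")
        elif policy == "space":
            out.append(" ")
        # policy == "ignore": drop the character
    return "".join(out)

print(repr(apply_policy("a\u200bb", "ignore")))  # one identifier: 'ab'
print(repr(apply_policy("a\u200bb", "space")))   # two tokens: 'a b'
```

The last two lines show why the choice matters: ignoring the character merges `a` and `b` into one name, while treating it as a space keeps them apart.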

Complication: What about these characters within a quoted string?

Discussion

Robert Dodier - 2022-11-01

I think such characters should be treated as spaces by the parser. Also they should be preserved in quoted strings.
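
A minimal Python sketch of that behavior, under stated assumptions: strings are delimited by double quotes with backslash as the escape character, and `UNICODE_SPACES` is an illustrative subset. The function name `normalize_outside_strings` is hypothetical.

```python
# Illustrative subset: no-break space, zero-width space, thin space.
UNICODE_SPACES = {"\u00a0", "\u200b", "\u2009"}

def normalize_outside_strings(text):
    """Replace Unicode spaces with ASCII spaces, except inside "..." strings."""
    out = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)            # string contents are preserved as-is
            if escaped:
                escaped = False       # this character was escaped by a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False     # closing quote
        elif ch == '"':
            in_string = True          # opening quote
            out.append(ch)
        elif ch in UNICODE_SPACES:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)

print(repr(normalize_outside_strings('x:\u200b"a\u200bb";')))
```

The zero-width space after the colon becomes an ordinary space, while the one inside the quoted string is left untouched.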

I opened bug #4039 about a related topic, punctuation characters.


Robert Dodier - 2022-11-06

I'm working on a patch to treat Unicode space characters as whitespace. The characters to be handled are:

NO-BREAK_SPACE
EN_QUAD
EM_QUAD
EN_SPACE
EM_SPACE
THREE-PER-EM_SPACE
FOUR-PER-EM_SPACE
SIX-PER-EM_SPACE
FIGURE_SPACE
PUNCTUATION_SPACE
THIN_SPACE
HAIR_SPACE
ZERO_WIDTH_SPACE
NARROW_NO-BREAK_SPACE
MEDIUM_MATHEMATICAL_SPACE
ZERO_WIDTH_NO-BREAK_SPACE

which I got from a list on the web (https://jkorpela.fi/chars/spaces.html).
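
For reference, the following Python sketch maps the names above to code points and shows each character's Unicode general category. It assumes the underscored names correspond to the standard Unicode names with spaces in place of underscores; `to_char` is a hypothetical helper.

```python
import unicodedata

NAMES = """NO-BREAK_SPACE EN_QUAD EM_QUAD EN_SPACE EM_SPACE THREE-PER-EM_SPACE
FOUR-PER-EM_SPACE SIX-PER-EM_SPACE FIGURE_SPACE PUNCTUATION_SPACE THIN_SPACE
HAIR_SPACE ZERO_WIDTH_SPACE NARROW_NO-BREAK_SPACE MEDIUM_MATHEMATICAL_SPACE
ZERO_WIDTH_NO-BREAK_SPACE""".split()

def to_char(underscored_name):
    """Look up a character by its Unicode name, written with underscores."""
    return unicodedata.lookup(underscored_name.replace("_", " "))

for name in NAMES:
    ch = to_char(name)
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  "
          f"isspace={ch.isspace()}  {name}")
```

In Python, the two zero-width entries are category Cf rather than Zs and are not reported as whitespace by `str.isspace()`, which suggests why an explicit list is more dependable here than a generic whitespace test.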

This is going to make handling every non-whitespace character a little bit slower. I haven't investigated, but if it turns out the effect is too much, we can cut down the list. I suspect the effect on the speed of the parser won't be an issue, but I don't know that for sure yet.
- Gunter Königsmann - 2022-11-06
  
  If we avoid iterating over the user's Unicode string separately for each of these characters, replacing all of them with spaces should be quite fast.
  
  Do we already handle TAB characters correctly? And would it make sense to collapse consecutive spaces into one when not in a string?
  
  In wxMaxima I always try to exempt strings from character changes, hoping that the next step isn't a parse_string().
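
The single-pass idea can be sketched in Python with a translation table, next to the per-character approach it avoids. This is an illustration only; `SPACE_LIKE`, `replace_single_pass` and `replace_per_character` are hypothetical names, and string literals are not exempted here.

```python
# The sixteen space-like characters discussed in this ticket, as escapes.
SPACE_LIKE = (
    "\u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006"
    "\u2007\u2008\u2009\u200a\u200b\u202f\u205f\ufeff"
)

# Built once: maps each code point to an ASCII space.
TABLE = {ord(ch): " " for ch in SPACE_LIKE}

def replace_single_pass(text):
    """One traversal of the text, one table lookup per character."""
    return text.translate(TABLE)

def replace_per_character(text):
    """One traversal of the text for each character in SPACE_LIKE."""
    for ch in SPACE_LIKE:
        text = text.replace(ch, " ")
    return text

sample = "ex:\u00a0\u200b23;"
print(repr(replace_single_pass(sample)))
```

Both functions give the same result; the difference is that the first walks the input once regardless of how long the character list grows.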
  

Stavros Macrakis - 2022-11-06

Does it not work to add these characters to *whitespace-chars*?


Robert Dodier - 2022-11-07

status: open --> closed

Robert Dodier - 2022-11-07

Fixed by commit 682395f, which does indeed just put the Unicode space characters on *WHITESPACE-CHARS* (for Unicode-aware Lisps; no change otherwise) in src/nparse.lisp. Closing this ticket as fixed.
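
To illustrate the general idea of extending a tokenizer's whitespace set, here is a toy Python sketch. It is not the code in src/nparse.lisp: `tokenize`, `WHITESPACE_CHARS` and the token rules are all hypothetical, and only `:` and `;` are treated as punctuation.

```python
ASCII_WHITESPACE = {" ", "\t", "\n", "\r", "\f"}
UNICODE_SPACES = set(
    "\u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006"
    "\u2007\u2008\u2009\u200a\u200b\u202f\u205f\ufeff"
)
# The extended set: ASCII whitespace plus the Unicode space characters.
WHITESPACE_CHARS = ASCII_WHITESPACE | UNICODE_SPACES

def tokenize(text, whitespace=WHITESPACE_CHARS):
    """Split text into tokens, skipping any character in `whitespace`."""
    tokens, current = [], []

    def flush():
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch in whitespace:
            flush()
        elif ch in ":;":
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens

pasted = "ex: \u200b23;"
print(tokenize(pasted))                    # extended whitespace set
print(tokenize(pasted, ASCII_WHITESPACE))  # ASCII-only whitespace set
```

With the extended set, the pasted input tokenizes the same way as plain `ex: 23;`. With the ASCII-only set, the zero-width space stays glued to the front of `23`, which is the kind of input that produced the confusing error in the original report.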


Robert Dodier - 2024-06-11

labels: parser --> parser, unicode
