The regexp:regexp-split function of the regexp module does not take into account the difference between character indices and encoded byte offsets when extracting substrings from a string encoded with an encoding that is not 1-1:
(defparameter latitude "6° 45' 22.90\" S")
(defparameter longitude "35° 7' 23.60\" E")
(setf CUSTOM:*MISC-ENCODING* charset:utf-8)
=> #<ENCODING CHARSET:UTF-8 :UNIX>
(regexp:regexp-split " " latitude)
=> ("6° " "5'" "22.90\"" "" "" "" "S")
(regexp:regexp-split " " longitude)
=> ("35° " "'" "23.60\"" "" "E")
If you instead use a 1-1 encoding, then it works correctly:
(setf CUSTOM:*MISC-ENCODING* charset:iso-8859-1)
=> #<ENCODING CHARSET:ISO-8859-1 :UNIX>
(regexp:regexp-split " " latitude)
=> ("6°" "45'" "22.90\"" "" "" "" "S")
(regexp:regexp-split " " longitude)
=> ("35°" "7'" "23.60\"" "" "E")
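The failure mode can be reproduced outside CLISP. In this sketch, Python stands in for the byte-oriented C library (an illustration of the offset mismatch, not the module's actual code):

```python
latitude = "6° 45' 22.90\" S"
data = latitude.encode("utf-8")

# '°' occupies two bytes in UTF-8, so byte and character offsets diverge:
assert latitude.index(" ") == 2   # character index of the first space
assert data.index(b" ") == 3      # byte offset of the same space

# Using the byte offset 3 as a character index into the Lisp string
# cuts one place too far, yielding the "6° " seen in the bug report:
assert latitude[:3] == "6° "
assert latitude[:2] == "6°"       # the correct first field
```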
Citizen X buys a cheap house near an airport. It's cheap because nobody wants to live and sleep near an airport. Then he complains about the noise at night. In some cases, the jurisdiction is such that indeed the airport is condemned to restrict flights at night.
Here, instead of implementing a regexp in Lisp (e.g. cl-ppcre), which would not suffer from that bug, a poorly advised idea was to interface to an external C library, which, as was clear from day 1, would lead to problems outside the 1:1 area. No surprise (Pascal, no offense): people file the obvious bug. And now it is expected that the bug be fixed.
The trouble is that working with strings is an illusion. Instead, the library operates with offsets. The foreign library has no idea of UTF-8 nor any other encoding. The first thing that needs to be done on the way back from the foreign function call would be to compute the Lisp string index from the foreign byte offset. IIRC, no such function is exported yet from CLISP for use within modules.
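The needed conversion can be sketched in Python for illustration (the real module would do this in C, via an Encoding_mblen-style loop; the function name here is invented):

```python
def byte_offset_to_char_index(s: str, byte_offset: int,
                              encoding: str = "utf-8") -> int:
    # Decode the encoded prefix up to byte_offset and count its characters.
    # Assumes byte_offset falls on a character boundary, as the rm_so/rm_eo
    # offsets of a valid match do.
    return len(s.encode(encoding)[:byte_offset].decode(encoding))

latitude = "6° 45' 22.90\" S"
assert byte_offset_to_char_index(latitude, 3) == 2   # first space
assert byte_offset_to_char_index(latitude, 0) == 0
```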
The second step -- a corner case -- would be to discuss whether to introduce and use "modified UTF-8"[*] in CLISP with the regexp library, i.e. encode ASCII NUL as C0 80. Then one could hope to match strings containing NUL characters.
[*] https://en.wikipedia.org/wiki/UTF-8#MUTF-8
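A minimal sketch of the C0 80 trick, in Python for illustration only (CLISP would implement this inside its encoder):

```python
def mutf8_encode(s: str) -> bytes:
    # In valid UTF-8 the byte 0x00 occurs only for U+0000, and the overlong
    # pair C0 80 never occurs, so a plain byte replacement is safe here.
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def mutf8_decode(b: bytes) -> str:
    return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

s = "a\x00b"
enc = mutf8_encode(s)
assert b"\x00" not in enc        # no raw NUL byte: safe for NUL-terminated C APIs
assert mutf8_decode(enc) == s    # round-trips losslessly
```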
And we haven't even approached the difficulties related to having UTF-8 in the regular expression itself (and the corresponding misinterpretation, since operators such as * or + would apply to the last code unit, not to the last character).
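This misinterpretation can be demonstrated with any byte-oriented matcher; Python's re on bytes plays that role here (an illustration, not the module's code):

```python
import re

s = "°°"                          # two degree signs
b = s.encode("utf-8")             # b'\xc2\xb0\xc2\xb0'

# Character level: '+' applies to the character '°'.
assert re.fullmatch("°+", s) is not None

# Byte level: in the encoded pattern b'\xc2\xb0+' the '+' applies only to
# the trailing byte 0xB0, so the encoded "°°" no longer matches.
assert re.fullmatch(b"\xc2\xb0+", b) is None
```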
Perhaps this should be the reason why we should not make the proposed correction, but instead signal an error if the misc-encoding is not 1-1 when using the regexp module.
--
Pascal J. Bourguignon
Related
Bugs:
#691
we could create misc_8bit_encoding (similar to foreign_8bit_encoding) and use that in regexp instead of misc_encoding.
Jörg - are you saying we should drop the regexp module?
can someone check whether the clisp pcre module suffers from the same bug?
thanks
> instead of implementing a regexp in Lisp ... which would not suffer from that bug
All character-by-character processing in clisp is slow, because clisp does not compile to machine code and does not exploit type declarations while compiling. String processing in clisp is only reasonably fast when the character-by-character work is done in C.
yep, that was my first reaction too.
so, our options are:
• drop regexp
• fix regexp to use wide chars
• use pcre (which probably does not suffer from this problem)
or:
• signal an error only when *misc-encoding* is not 1-1.
Last edit: Sam Steingold 2017-11-17
"wide chars" is no solution if we mean the same thing. IIRC regexp could be compiled with something like 16-bit wide characters. That changes nothing (1:2 instead of 1:1) and would even waste space when given UTF-8.
Using UTF-16 or UCS-2 doesn't appeal as a solution either, because not every Unicode character fits into 16 bits.
I forgot whether CLISP uses UCS-2 internally for its characters and strings. If yes, compiling regexp.c with a 16-bit character type would indeed be a good match.
Otherwise, as I said earlier, what's needed would be access to a function much like mblen() to convert a byte offset output by regexp from UTF-8 into the corresponding Lisp string index. That function must be usable within modules, e.g. it could update the offset pairs rm_so & rm_eo in regexi.c.
Why do you believe pcre to be exempt from this problem? It has exactly the same offset based interface. But Wait!
... Looking at the source, I see that the above mentioned function already exists: cpcre.c uses Encoding_mblen() with a fixed UTF-8 instead of misc_encoding. regexp could do the same (in a build with UNICODE).
Furthermore, I believe pcre can cope with NUL characters, because its interface function accepts a separate length parameter besides the string pointer, while regexp does not. pcre has the better interface.
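The difference between the two C calling conventions can be sketched like this (Python standing in for the C string handling):

```python
data = b"foo\x00bar"

# What a NUL-terminated interface (regexec on a char*) effectively sees:
nul_terminated_view = data.split(b"\x00", 1)[0]
assert nul_terminated_view == b"foo"      # everything after the NUL is lost

# What a (pointer, length) interface (as in pcre_exec) sees:
assert len(data) == 7                     # all bytes, including the NUL
```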
does regexp handle multibyte chars correctly?
iow, even if we do in regexp what we already do in pcre, we still need to make sure that the underlying regexp library DTRT.
but it's not yet available on Linux apparently.
Pascal is right, I'm rusty and forgot that no naïve regexp handles multibyte correctly, e.g. .{n,m} might stop amid multibyte sequences. However, I recall some regexp APIs do have a multibyte flag.
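Python's re on bytes shows the effect of a naive byte-oriented matcher (illustration only):

```python
import re

degree = "°".encode("utf-8")           # one character, two bytes: b'\xc2\xb0'

m = re.match(b".", degree)             # byte-oriented '.' matches one byte
assert m.group() == b"\xc2"            # stops amid the multibyte sequence

assert re.fullmatch(b".", degree) is None   # one byte '.' cannot cover '°'
assert re.fullmatch(".", "°") is not None   # character-level '.' can
```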
I'm not that much in favour of yet another encoding. These globals don't scale.
gnulib regex: I scanned the documentation and found no mention of multibyte or utf-8. Therefore, there is a problem, and I don't know what to do (deprecate in favour of pcre? wrap to pcre? use 8bit_encoding so ASCII works? ignore the problem, i.e. just make sure that returned indices match somewhere, e.g. round up (not good for correctness)?)
The important points are:
The available options are not only (citing Sam)
1. drop regexp
2. fix regexp to use wide chars
3. use pcre (which probably does not suffer from this problem)
4. ignore the issue, because it works on ASCII and that is all that matters. ;-)
but also
5. convert the indices (from Lisp string indices to byte indices upon entry, in the opposite direction upon return from the regex functions).
6. (citing Pascal) signal an error when *misc-encoding* is not 1-1 and the strings contain non-ASCII characters.
Jörg,
> I believe pcre can cope with NUL characters
This is completely unimportant. NUL characters are also forbidden in file names and XML documents, and no one complains about it.
> I'm rusty and forgot about no naïve regexp handling multibyte correctly
Your data is way out of date. Multibyte support (i.e. not only UTF-8 support, but also support for all the encodings used in locales) was added to GNU regex in 2002, by Isamu Hasegawa. It took until 2005 for it to become stabilized and reasonably bug-free. Then this implementation was added to gnulib, and this is what we currently use in clisp.
Pascal,
> On Darwin 17.2.0, there is an API for wide characters:
Good point. If there was a good-quality (meaning: POSIX compliant, reasonably bug-free, and fast) regex facility with wchar_t or ucs4_t strings as input or output, we could use that instead of the gnulib one.
* Is the license of this Darwin implementation so that we could use it in clisp?
* How does its quality measure up?
Option 1 is not good, because it means users have to use cl-ppcre, which is surely slow.
Option 2 is a non-trivial task, because the regex implementation uses many tables that are indexed by an 'unsigned char'. While you can have tables of size 256, the performance (memory-wise) becomes unacceptable when these tables are grown to size 65536 or 0x110000.
Option 3 is not a solution.
Options 4,6 would be acceptable if there is no other way out. (It was acceptable in 1995.)
Option 5 is the way to go, IMO. I've written such code for conversion of indices as part of GNU libunistring. It is a linear pass on the inputs and the results, therefore not a performance issue.
The header indicates Apple Public Source License and BSD licenses.
It’s a Darwin extension. It’s in the source package Libc:
https://opensource.apple.com/source/Libc/Libc-1158.1.2/regex/TRE/lib/
https://opensource.apple.com/tarballs/Libc/Libc-1158.1.2.tar.gz
Thanks for the pointer, Pascal.
> It’s a Darwin extension. It’s in the source package Libc:
> https://opensource.apple.com/source/Libc/Libc-1158.1.2/regex/TRE/lib/
This package looks promising: It has a test suite of reasonable size, is portable, and is maintained (at https://github.com/laurikari/tre/commits/master ).
If someone wants to create a module for using this library in clisp, this would be very welcome! Call it 'tregexp' or so. We can then compare (features, bugs, and speed) against the existing 'regexp' module.
YMMV. That TRE library explicitly mentions support for NUL on its title page (section "binary pattern and data support" in README.md).
Last edit: Jörg Höhle 2017-12-05
Here's a patch to compute the Lisp-side string index on the way back, according to the encoding used previously. My feeling is that it's only step 1, the second one being: force UTF-8 as the encoding, because that's the only one that gnulib regexp knows about, and the third would possibly be to export something like locale_encoding instead of misc_encoding, because that's actually what gnulib's regex considers (it searches for "UTF[-]8" in the locale setting).
Bruno, I don't understand your reference to libunistring at all. Did you forget that modules have access to Encoding_mblen? That's all that's needed. Sam, I believe that libunistring stuff is wasted effort.
Jörg, you have write access to the mercurial repository.
Please apply this patch (#1). It fixes the initial bug report in all configurations compiled with ENABLE_UNICODE (presumably >95% these days). The code (past and present) doesn't work in a build --without-unicode.
Please apply this patch (#2). It forces UTF-8 upon the string, because that's what gnulib's regexp recognizes (it has special code for it and handles multibyte UTF-8 correctly).
For the error message from regerror(), it still uses O(misc_encoding), because for the time being one can see it as the user-overridable locale_encoding. So we can postpone the debate on whether to introduce O(locale_encoding).
The regexp module still does not support --without-unicode.
Both patches are in the repository. Shall we close this bug report, or wait until a further patch enables compilation in a build --without-unicode? (Note that it didn't compile previously either, so it's arguably a different bug).
Also, perhaps I should have added a testcase, however there are already tests that fail (IIRC ext-clisp.tst) in --without-unicode because of embedded non-ASCII character literals, and #-UNICODE "..." is no help.
I added a regexp test case.