The regexp:regexp-split function of the regexp module does not take into account the difference between character indices and encoded byte offsets when extracting substrings from a string encoded with an encoding that is not 1-1:
(defparameter latitude "6° 45' 22.90\" S")
(defparameter longitude "35° 7' 23.60\" E")
(setf CUSTOM:*MISC-ENCODING* charset:utf-8)
=> #<ENCODING CHARSET:UTF-8 :UNIX>
(regexp:regexp-split " " latitude)
=> ("6° " "5'" "22.90\"" "" "" "" "S")
(regexp:regexp-split " " longitude)
=> ("35° " "'" "23.60\"" "" "E")
If you instead use a 1-1 encoding, then it works correctly:
(setf CUSTOM:*MISC-ENCODING* charset:iso-8859-1)
=> #<ENCODING CHARSET:ISO-8859-1 :UNIX>
(regexp:regexp-split " " latitude)
=> ("6°" "45'" "22.90\"" "" "" "" "S")
(regexp:regexp-split " " longitude)
=> ("35°" "7'" "23.60\"" "" "E")
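The failure mode can be reproduced outside CLISP. In this sketch, Python stands in for the byte-oriented C library (an illustration of the offset mismatch, not the module's actual code):

```python
latitude = "6° 45' 22.90\" S"
data = latitude.encode("utf-8")

# '°' occupies two bytes in UTF-8, so byte and character offsets diverge:
assert latitude.index(" ") == 2   # character index of the first space
assert data.index(b" ") == 3      # byte offset of the same space

# Using the byte offset 3 as a character index into the Lisp string
# cuts one place too far, yielding the "6° " seen in the bug report:
assert latitude[:3] == "6° "
assert latitude[:2] == "6°"       # the correct first field
```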
Citizen X buys a cheap house near an airport. It's cheap because nobody wants to live and sleep near an airport. Then he complains about the noise at night. In some cases, the jurisdiction is such that indeed the airport is condemned to restrict flights at night.
Here, instead of implementing a regexp in Lisp (e.g. cl-ppcre), which would not suffer from that bug, a poorly advised idea was to interface to an external C library, which, as was clear from day 1, would lead to problems outside the 1:1 area. No surprise (Pascal, no offense): people file the obvious bug. And now it is expected that the bug be fixed.
The trouble is that working with strings is an illusion. Instead, the library operates with offsets. The foreign library has no idea of UTF-8 nor any other encoding. The first thing that needs to be done on the way back from the foreign function call would be to compute the Lisp string index from the foreign byte offset. IIRC, no such function is exported yet from CLISP for use within modules.
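The needed conversion can be sketched in Python for illustration (the real module would do this in C, via an Encoding_mblen-style loop; the function name here is invented):

```python
def byte_offset_to_char_index(s: str, byte_offset: int,
                              encoding: str = "utf-8") -> int:
    # Decode the encoded prefix up to byte_offset and count its characters.
    # Assumes byte_offset falls on a character boundary, as the rm_so/rm_eo
    # offsets of a valid match do.
    return len(s.encode(encoding)[:byte_offset].decode(encoding))

latitude = "6° 45' 22.90\" S"
assert byte_offset_to_char_index(latitude, 3) == 2   # first space
assert byte_offset_to_char_index(latitude, 0) == 0
```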
The second step -- a corner case -- would be to discuss whether to introduce and use "modified UTF-8"[*] in CLISP with the regexp library, i.e. encode ASCII NUL as C0 80. Then one could hope to match strings containing NUL characters.
[*] https://en.wikipedia.org/wiki/UTF-8#MUTF-8
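A minimal sketch of the C0 80 trick, in Python for illustration only (CLISP would implement this inside its encoder):

```python
def mutf8_encode(s: str) -> bytes:
    # In valid UTF-8 the byte 0x00 occurs only for U+0000, and the overlong
    # pair C0 80 never occurs, so a plain byte replacement is safe here.
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def mutf8_decode(b: bytes) -> str:
    return b.replace(b"\xc0\x80", b"\x00").decode("utf-8")

s = "a\x00b"
enc = mutf8_encode(s)
assert b"\x00" not in enc        # no raw NUL byte: safe for NUL-terminated C APIs
assert mutf8_decode(enc) == s    # round-trips losslessly
```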
And we haven't even approached the difficulties related to having UTF-8 in the regular expression itself (and the corresponding misinterpretation, since operators such as * or + would apply to the last code unit, not to the last character).
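This misinterpretation can be demonstrated with any byte-oriented matcher; Python's re on bytes plays that role here (an illustration, not the module's code):

```python
import re

s = "°°"                          # two degree signs
b = s.encode("utf-8")             # b'\xc2\xb0\xc2\xb0'

# Character level: '+' applies to the character '°'.
assert re.fullmatch("°+", s) is not None

# Byte level: in the encoded pattern b'\xc2\xb0+' the '+' applies only to
# the trailing byte 0xB0, so the encoded "°°" no longer matches.
assert re.fullmatch(b"\xc2\xb0+", b) is None
```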
Perhaps this should be the reason why we should not make the proposed correction, but instead signal an error if the misc-encoding is not 1-1 when using the regexp module.
--
Pascal J. Bourguignon
Related
Bugs:
#691
we could create misc_8bit_encoding (similar to foreign_8bit_encoding) and use that in regexp instead of misc_encoding.
Jörg - are you saying we should drop the regexp module?
can someone check whether the clisp pcre module suffers from the same bug?
thanks
> instead of implementing a regexp in Lisp ... which would not suffer from that bug
All character-by-character processing in clisp is slow, because clisp does not compile to machine code and does not exploit type declarations while compiling. String processing in clisp is only reasonably fast when the character-by-character work is done in C.
yep, that was my first reaction too.
so, our options are:
• drop regexp
• fix regexp to use wide chars
• use pcre (which probably does not suffer from this problem)
or:
• signal an error only when *misc-encoding* is not 1-1.
Last edit: Sam Steingold 2017-11-17
"wide chars" is no solution if we mean the same thing. IIRC regexp could be compiled with something like 16-bit wide characters. That changes nothing (1:2 instead of 1:1) and would even waste space when given UTF-8.
Using UTF-16 or UCS-2 doesn't appeal as a solution either, because not every Unicode character fits into 16 bits.
I forgot whether CLISP uses UCS-2 internally for its characters and strings. If yes, compiling regexp.c with a 16-bit character type would indeed be a good match.
Otherwise, as I said earlier, what's needed would be access to a function much like mblen() to convert a byte offset output by regexp from UTF-8 into the corresponding Lisp string index. That function must be usable within modules, e.g. it could update the offset pairs rm_so & rm_eo in regexi.c.
Why do you believe pcre to be exempt from this problem? It has exactly the same offset based interface. But Wait!
... Looking at the source, I see that the above mentioned function already exists: cpcre.c uses Encoding_mblen() with a fixed UTF-8 instead of misc_encoding. regexp could do the same (in a build with UNICODE).
Furthermore, I believe pcre can cope with NUL characters, because its interface function accepts a separate length parameter besides the string pointer, while regexp does not. pcre has the better interface.
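The difference between the two C calling conventions can be sketched like this (Python standing in for the C string handling):

```python
data = b"foo\x00bar"

# What a NUL-terminated interface (regexec on a char*) effectively sees:
nul_terminated_view = data.split(b"\x00", 1)[0]
assert nul_terminated_view == b"foo"      # everything after the NUL is lost

# What a (pointer, length) interface (as in pcre_exec) sees:
assert len(data) == 7                     # all bytes, including the NUL
```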
does regexp handle multibyte chars correctly?
iow, even if we do in regexp what we already do in pcre, we still need to make sure that the underlying regexp library DTRT.
but it's not yet available on Linux apparently.
Pascal is right, I'm rusty and forgot that no naïve regexp handles multibyte correctly, e.g. .{n,m} might stop amid multibyte sequences. However, I recall some regexp APIs do have a multibyte flag.
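Python's re on bytes shows the effect of a naive byte-oriented matcher (illustration only):

```python
import re

degree = "°".encode("utf-8")           # one character, two bytes: b'\xc2\xb0'

m = re.match(b".", degree)             # byte-oriented '.' matches one byte
assert m.group() == b"\xc2"            # stops amid the multibyte sequence

assert re.fullmatch(b".", degree) is None   # one byte '.' cannot cover '°'
assert re.fullmatch(".", "°") is not None   # character-level '.' can
```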
I'm not that much in favour of yet another encoding. These globals don't scale.
gnulib regex: I scanned the documentation and found no mention of multibyte or utf-8. Therefore, there is a problem, and I don't know what to do (deprecate in favour of pcre? wrap to pcre? use 8bit_encoding so ASCII works? ignore the problem, i.e. just make sure that returned indices match somewhere, e.g. round up (not good for correctness)?)
The important points are:
The available options are not only (citing Sam)
1. drop regexp
2. fix regexp to use wide chars
3. use pcre (which probably does not suffer from this problem)
4. ignore the issue, because it works on ASCII and that is all that matters. ;-)
but also
5. convert the indices (from Lisp string indices to byte indices upon entry, in the opposite direction upon return from the regex functions).
6. (citing Pascal) signal an error when *misc-encoding* is not 1-1 and the strings contain non-ASCII characters.
Jörg,
> I believe pcre can cope with NUL characters
This is completely unimportant. NUL characters are also forbidden in file names and XML documents, and no one complains about it.
> I'm rusty and forgot about no naïve regexp handling multibyte correctly
Your data is way out of date. Multibyte support (i.e. not only UTF-8 support, but also support for all the encodings used in locales) was added to GNU regex in 2002, by Isamu Hasegawa. It took until 2005 for it to become stabilized and reasonably bug-free. Then this implementation was added to gnulib, and this is what we currently use in clisp.
Pascal,
> On Darwin 17.2.0, there is an API for wide characters:
Good point. If there was a good-quality (meaning: POSIX compliant, reasonably bug-free, and fast) regex facility with wchar_t or ucs4_t strings as input or output, we could use that instead of the gnulib one.
* Is the license of this Darwin implementation so that we could use it in clisp?
* How does its quality measure up?
Option 1 is not good, because it means users have to use cl-ppcre, which is surely slow.
Option 2 is a non-trivial task, because the regex implementation uses many tables that are indexed by an 'unsigned char'. While you can have tables of size 256, the performance (memory-wise) becomes unacceptable when these tables are grown to size 65536 or 0x110000.
Option 3 is not a solution.
Options 4,6 would be acceptable if there is no other way out. (It was acceptable in 1995.)
Option 5 is the way to go, IMO. I've written such code for conversion of indices as part of GNU libunistring. It is a linear pass on the inputs and the results, therefore not a performance issue.
The header indicates Apple Public Source License and BSD licenses.
It’s a Darwin extension. It’s in the source package Libc:
https://opensource.apple.com/source/Libc/Libc-1158.1.2/regex/TRE/lib/
https://opensource.apple.com/tarballs/Libc/Libc-1158.1.2.tar.gz
Thanks for the pointer, Pascal.
> It’s a Darwin extension. It’s in the source package Libc:
> https://opensource.apple.com/source/Libc/Libc-1158.1.2/regex/TRE/lib/
This package looks promising: It has a test suite of reasonable size, is portable, and is maintained (at https://github.com/laurikari/tre/commits/master ).
If someone wants to create a module for using this library in clisp, this would be very welcome! Call it 'tregexp' or so. We can then compare (features, bugs, and speed) against the existing 'regexp' module.
YMMV. That TRE library explicitly mentions support for NUL on its title page (section "binary pattern and data support" in README.md).
Last edit: Jörg Höhle 2017-12-05
Here's a patch to compute the Lisp-side string index on the way back, according to the encoding used previously. My feeling is that it's only step 1, the second one being: force UTF-8 as the encoding, because that's the only one that gnulib regexp knows about, and the third would possibly be to export something like locale_encoding instead of misc_encoding, because that's actually what gnulib's regex considers (it searches for "UTF[-]8" in the locale setting).
Bruno, I don't understand your reference to libunistring at all. Did you forget that modules have access to Encoding_mblen? That's all that's needed. Sam, I believe that libunistring stuff is wasted effort.
Jörg, you have write access to the mercurial repository.
Please apply this patch (#1). It fixes the initial bug report in all configurations compiled with ENABLE_UNICODE (presumably >95% these days). The code (past and present) doesn't work in a build --without-unicode.
Please apply this patch (#2). It forces UTF-8 upon the string, because that's what gnulib's regexp recognizes (it has special code for it and handles multibyte UTF-8 correctly).
For the error message from regerror(), it still uses O(misc_encoding), because for the time being one can see it as the user-overridable locale_encoding. So we can postpone the debate on whether to introduce O(locale_encoding).
The regexp module still does not support --without-unicode.
Both patches are in the repository. Shall we close this bug report, or wait until a further patch enables compilation in a build --without-unicode? (Note that it didn't compile previously either, so it's arguably a different bug).
Also, perhaps I should have added a testcase, however there are already tests that fail (IIRC ext-clisp.tst) in --without-unicode because of embedded non-ASCII character literals, and #-UNICODE "..." is no help.
I added a regexp test case.