The Tool Command Language implementation

#4735 Non-BMP characters folded back to BMP

Milestone: obsolete: 8.6b1

Status: closed-fixed

Owner: Jan Nijtmans

Labels: 43. Regexp (104)

Priority: 5

Updated: 2013-01-24

Created: 2010-10-11

Creator: Lars Hellström

Private: No

Checking whether a string is a valid XML name is conveniently done using a regexp, although doing this by the book is slightly cumbersome, as there are quite a few character ranges to include in the bracket expression. Still, the other day I set out to do it. One of the character ranges is from U+10000 to U+EFFFF, i.e., it consists entirely of characters outside the Basic Multilingual Plane (BMP) that Tcl is currently restricted to. Since the regexp engine provides the \U<8 hex digits> syntax for going beyond the BMP, I thought I could throw in this \U00010000-\U000EFFFF range as well for Forward Compatibility. Unfortunately, that did not work.

What appears to happen is that an \Uxxxxyyyy is treated as \uyyyy, since

% regexp {[\U00010020]} " "
1
% regexp {\U00010020} " "
1
% regexp {\U00010020} !
0
% regexp {[\U00010020-\U00010022]} !
1
% regexp {[\U00010020-\U00020020]} !
0
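
The results above are consistent with the engine keeping only the low 16 bits of each \U value. The sketch below models that arithmetic; the proc name foldToBmp is made up for illustration, and the truncation is an assumption inferred from the observed matches, not read from the regexp sources.

```tcl
# Assumed model: only the low 16 bits of a \U code point survive.
proc foldToBmp {codePoint} {
    return [expr {$codePoint & 0xFFFF}]
}

# The three code points used in the session above.
foreach cp {0x10020 0x10022 0x20020} {
    puts [format "U+%05X -> U+%04X" $cp [foldToBmp $cp]]
}
```

Under this model [\U00010020-\U00010022] behaves like [\u0020-\u0022], which contains "!" (U+0021), while [\U00010020-\U00020020] collapses to [\u0020-\u0020], which does not.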

What I would expect to happen is that \U sequences for characters outside the BMP are simply treated as something which doesn't match any string, but I understand that this is a bit esoteric and not of high priority. However, at the very least there should be tests (perhaps with "known bug" constraints) that e.g. \U00010020 does not match \u0020. If I understand reg.test correctly, there currently aren't *any* tests of \U where the match should fail.
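
A test of the kind asked for might look roughly like the following. This is only a sketch using the plain tcltest package: the test name nonbmp-1.1 is invented, and reg.test has its own helper procs that a real test would have to follow. The built-in knownBug constraint keeps the test skipped by default.

```tcl
package require tcltest
namespace import ::tcltest::test ::tcltest::cleanupTests

# Hypothetical test name; skipped unless the knownBug constraint is enabled.
test nonbmp-1.1 {\U escape beyond the BMP must not match U+0020} \
    -constraints knownBug \
    -body {
        regexp {\U00010020} " "
    } -result 0

cleanupTests
```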

Discussion

Donal K. Fellows - 2010-10-12

Right now, we're using a 16-bit type for characters *and* we don't do surrogate pairs either. Thus what you want can't _currently_ work. :-(

Lars Hellström - 2010-10-12

@dkf: Well, what I was expecting was merely that it should Do Nothing Gracefully. As long as the regexp engine only can be executed against 16-bit character codes, it seems obvious that a regular expression such as \U00010020 shouldn't get a match, since a 16-bit integer cannot assume the value 0x10020.

As it is, the regexp engine Silently Does Something Stupid, which is never good but this case is esoteric enough that I can live with it. There should however at least be tests to acknowledge that this is a Known Bug.

Alexandre Ferrieux - 2010-10-12

Surely you realize that "don't match" is harder than "barf at RE compile time" ;-)
IMHO the latter would be better than the current status (better warn the programmer than silently fail to match by design), but I'm not the RE maintainer.

Don Porter - 2010-10-12

um, no one is the RE maintainer

Lars Hellström - 2010-10-12

@ferrieux: At least inside a bracket expression, it could be just as easy:
1. Parse range as lower..upper (inclusively).
2. if (upper>0xFFFF) {upper = 0xFFFF}
3. If now lower<=upper then encode thus restricted range as usual, otherwise forget it was ever mentioned.
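
The three steps can be sketched at the Tcl level as follows. The proc clampRangeToBmp is purely illustrative and says nothing about how brackpart is actually structured; it only shows the clamping logic on integer code points.

```tcl
# Hypothetical helper modelling steps 1-3; not the C code in the regexp engine.
proc clampRangeToBmp {lower upper} {
    # Step 1: take the range as lower..upper (inclusive), as plain integers.
    set lower [expr {int($lower)}]
    set upper [expr {int($upper)}]
    # Step 2: cap the upper bound at the last BMP code point.
    if {$upper > 0xFFFF} {
        set upper 65535
    }
    # Step 3: keep the restricted range if it is non-empty, otherwise drop it.
    if {$lower <= $upper} {
        return [list $lower $upper]
    }
    return {}
}

puts [clampRangeToBmp 0xFF00 0x10020]   ;# range straddling the BMP boundary
puts [clampRangeToBmp 0x10020 0x20020]  ;# range entirely outside the BMP
```

The first call yields the restricted range 65280..65535; the second yields an empty list, i.e. the range is forgotten.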

I can see that such a change should probably happen somewhere inside the brackpart function, but it's too cryptic for me to see exactly how. OTOH, I don't see how one would tell it to barf at \U00010000 either.

Alexandre Ferrieux - 2010-10-12

Taking so much care of bracket expressions only to project onto the BMP also sounds like Silently Do Something Stupid to me. As a programmer, knowing that Tcl is (currently) unable to do anything meaningful outside the BMP, I'd rather be told immediately, at RE compile time, that \U12345678 is not something we can do anything meaningful with, be it inside a character range or not.
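
For illustration, a script-level approximation of "complain at compile time" could look like the sketch below. The proc checkBmpOnly is hypothetical and deliberately naive: it assumes a \U escape is a backslash, a U, and one to eight hex digits, and it does not handle an escaped backslash in front of a U. A real fix would belong in the regexp compiler itself.

```tcl
# Hypothetical pre-check: raise an error for any \U escape above U+FFFF.
proc checkBmpOnly {pattern} {
    foreach {escape hex} [regexp -all -inline {\\U([[:xdigit:]]{1,8})} $pattern] {
        # hex contains only hex digits here, so "0x$hex" parses as a number.
        if {[expr {"0x$hex" > 0xFFFF}]} {
            error "regexp escape $escape is outside the BMP"
        }
    }
    return $pattern
}

# Returns 1 (an error was raised) and leaves the message in msg.
puts [catch {checkBmpOnly {[\U00010020-\U00010022]}} msg]
puts $msg
```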

Jan Nijtmans - 2013-01-24

assigned_to: pvgoran --> nijtmans

status: open --> closed-fixed
Jan Nijtmans - 2013-01-24

See TIP #388:
The reference implementation just replaces any character in the range \U010000 - \U10ffff with
\ufffd, but as soon as Tcl has support for characters outside the BMP this range is reserved
for exactly that.

For example:
>tclsh86
% regexp {[\U00010020]} " "
0
% regexp {\U00010020} " "
0
% regexp {\U00010020} !
0
% regexp {[\U00010020-\U00010022]} !
0
% regexp {[\U00010020-\U00020020]} !
0
% regexp {[\U00010020]} \ufffd
1
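
The substitution described above can be modelled on integer code points as follows. This is a sketch of the described behaviour only, with a made-up proc name (substituteNonBmp); it is not taken from the reference implementation.

```tcl
# Hypothetical model: code points U+10000..U+10FFFF become U+FFFD.
proc substituteNonBmp {codePoint} {
    set codePoint [expr {int($codePoint)}]
    if {$codePoint >= 0x10000 && $codePoint <= 0x10FFFF} {
        return 65533  ;# U+FFFD REPLACEMENT CHARACTER
    }
    return $codePoint
}

foreach cp {0x20 0x10020 0x20020} {
    puts [format "U+%05X -> U+%04X" $cp [substituteNonBmp $cp]]
}
```

Under this model both ends of [\U00010020-\U00020020] map to \ufffd, which is why "!" no longer matches while \ufffd itself does.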
