#4735 Non-BMP characters folded back to BMP

obsolete: 8.6b1
closed-fixed
5
2013-01-24
2010-10-11
No

Checking whether a string is a valid XML name is conveniently done using a regexp, although doing this by the book is slightly cumbersome as there are rather many character ranges to include in the bracket expression. Still, the other day I set out to do it. One of the character ranges is from U+10000 to U+EFFFF, i.e., it consists entirely of characters outside the Basic Multilingual Plane that Tcl currently is restricted to. Since the regexp engine provides the \U<8 hex digits> syntax for going beyond the BMP, I thought I could throw in this \U00010000-\U000EFFFF range as well for Forward Compatibility. Unfortunately, that turned out not to be possible.

What appears to happen is that an \Uxxxxyyyy is treated as \uyyyy, since

% regexp {[\U00010020]} " "
1
% regexp {\U00010020} " "
1
% regexp {\U00010020} !
0
% regexp {[\U00010020-\U00010022]} !
1
% regexp {[\U00010020-\U00020020]} !
0
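
The results above are consistent with the code point simply being cut down to its low 16 bits. A minimal C sketch of that arithmetic, assuming a 16-bit character type; the helper name truncate_to_16 is hypothetical and is not taken from the Tcl sources:

```c
#include <stdint.h>

/* Hypothetical helper (not from the Tcl sources): keep only the low
 * 16 bits of a code point, which is what storing the value in a
 * 16-bit character type would do. */
static uint16_t truncate_to_16(uint32_t codepoint)
{
    return (uint16_t)(codepoint & 0xFFFFu);
}
```

Under this assumption 0x10020 becomes 0x0020 (space), so [\U00010020-\U00010022] behaves like [\u0020-\u0022] and matches "!", while [\U00010020-\U00020020] collapses to the single character \u0020 and does not.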

What I would expect to happen is that \U sequences for characters outside the BMP are simply treated as something which doesn't match any string, but I understand that this is a bit esoteric and not of high priority. However, at the very least there should be tests (perhaps with "known bug" constraints) that e.g. \U00010020 does not match \u0020. If I understand reg.test correctly, there currently aren't *any* tests of \U where the match should fail.

Discussion

  • Donal K. Fellows

    Right now, we're using a 16-bit type for characters *and* we don't do surrogate pairs either. Thus what you want can't _currently_ work. :-(

     
  • Lars Hellström

    Lars Hellström - 2010-10-12

    @dkf: Well, what I was expecting was merely that it should Do Nothing Gracefully. As long as the regexp engine can only be executed against 16-bit character codes, it seems obvious that a regular expression such as \U00010020 shouldn't get a match, since a 16-bit integer cannot assume the value 0x10020.

    As it is, the regexp engine Silently Does Something Stupid, which is never good but this case is esoteric enough that I can live with it. There should however at least be tests to acknowledge that this is a Known Bug.

     
  • Alexandre Ferrieux

    Surely you realize that "don't match" is harder than "barf at RE compile time" ;-)
    IMHO the latter would be better than the current status (better warn the programmer than silently fail to match by design), but I'm not the RE maintainer.

     
  • Don Porter

    Don Porter - 2010-10-12

    um, no one is the RE maintainer

     
  • Lars Hellström

    Lars Hellström - 2010-10-12

    @ferrieux: At least inside a bracket expression, it could be just as easy:
    1. Parse range as lower..upper (inclusively).
    2. if (upper>0xFFFF) {upper = 0xFFFF}
    3. If now lower<=upper then encode thus restricted range as usual, otherwise forget it was ever mentioned.

    I can see that such a change should probably happen somewhere inside the brackpart function, but it's too cryptic for me to see exactly how. OTOH, I don't see how one would tell it to barf at \U00010000 either.
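
    Steps 1 to 3 above could be sketched in C roughly as follows. This is only an illustration of the clamping idea; the name clamp_range_to_bmp and its interface are hypothetical and do not correspond to anything in brackpart or the rest of the regexp sources.

```c
#include <stdbool.h>
#include <stdint.h>

#define BMP_MAX 0xFFFFu  /* highest value a 16-bit character can hold */

/* Hypothetical helper, not part of the Tcl regexp engine.
 * Restricts the inclusive range [*lower, *upper] to the BMP.
 * Returns true if a non-empty range remains, false if the range
 * lies entirely outside the BMP and should be dropped. */
static bool clamp_range_to_bmp(uint32_t *lower, uint32_t *upper)
{
    if (*upper > BMP_MAX) {
        *upper = BMP_MAX;        /* step 2: cut off the upper end */
    }
    return *lower <= *upper;     /* step 3: keep only if non-empty */
}
```

    With this sketch the range 0x10000..0xEFFFF is dropped altogether, while a range such as 0xFF00..0x10010 is shortened to 0xFF00..0xFFFF.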

     
  • Alexandre Ferrieux

    Taking so much care of bracket expressions only to project onto the BMP also sounds like Silently Do Something Stupid to me. As a programmer, knowing that Tcl is (currently) unable to do anything meaningful outside the BMP, I'd rather be told immediately, at RE compile time, that \U12345678 is not something we can do anything meaningful with, be it inside a character range or not.
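
    The compile-time rejection described above might look something like the following C sketch. The error codes and the function name check_escape_in_bmp are hypothetical, chosen only for illustration; the real regexp compiler has its own error reporting, which is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical status codes, not the engine's real error constants. */
enum { RE_OK = 0, RE_ERR_OUT_OF_RANGE = 1 };

/* Hypothetical check to run on every \u or \U escape while the
 * pattern is being compiled: anything that does not fit in a
 * 16-bit character is reported instead of being silently altered. */
static int check_escape_in_bmp(uint32_t codepoint)
{
    return (codepoint > 0xFFFFu) ? RE_ERR_OUT_OF_RANGE : RE_OK;
}
```

    A caller would abort compilation as soon as the check fails, so that a pattern containing \U00010020 is refused whether the escape appears inside a bracket expression or not.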

     
  • Jan Nijtmans

    Jan Nijtmans - 2013-01-24
    • assigned_to: pvgoran --> nijtmans
    • status: open --> closed-fixed
     
  • Jan Nijtmans

    Jan Nijtmans - 2013-01-24

    See TIP #388:
    The reference implementation just replaces any character in the range \U010000 - \U10ffff with
    \ufffd, but as soon as Tcl has support for characters outside the BMP this range is reserved
    for exactly that.

    For example:
    >tclsh86
    % regexp {[\U00010020]} " "
    0
    % regexp {\U00010020} " "
    0
    % regexp {\U00010020} !
    0
    % regexp {[\U00010020-\U00010022]} !
    0
    % regexp {[\U00010020-\U00020020]} !
    0
    % regexp {[\U00010020]} \ufffd
    1
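
    The replacement rule described in this comment can be sketched in C as below. This is an illustration written from the description only, not code taken from the TIP #388 reference implementation, and the name fold_to_replacement is hypothetical. Values above 0x10FFFF are not covered by the comment, so the sketch leaves them untouched.

```c
#include <stdint.h>

/* Hypothetical helper, not the actual reference implementation.
 * Code points in the range 0x10000..0x10FFFF are replaced by
 * U+FFFD (REPLACEMENT CHARACTER); everything else is returned
 * unchanged. */
static uint32_t fold_to_replacement(uint32_t codepoint)
{
    if (codepoint >= 0x10000u && codepoint <= 0x10FFFFu) {
        return 0xFFFDu;
    }
    return codepoint;
}
```

    This agrees with the session above: \U00010020 becomes \ufffd, which matches neither " " nor "!", but does match the string \ufffd.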