#1345 Wrong utf-8 handling in ncgi::decode

closed-fixed
ncgi (23)
8
2013-01-30
2013-01-24
No

ncgi::decode uses regular expressions to handle utf-8 characters. However, the patterns are wrong. They do not cover all utf-8 sequences and also cover illegal sequences.

#These are wrong:
# regsub -all -- {%([A-Fa-f][A-Fa-f0-9])%([A-Fa-f][A-Fa-f0-9])%([A-Fa-f][A-Fa-f0-9])} \ $str {[encoding convertfrom utf-8 [DecodeHex \1\2\3]]} str
# regsub -all -- {%([A-Fa-f][A-Fa-f0-9])%([A-Fa-f][A-Fa-f0-9])} \ $str {[encoding convertfrom utf-8 [DecodeHex \1\2]]} str

#These are right:
regsub -all -- {%([Ee][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])} \ $str {[encoding convertfrom utf-8 [DecodeHex \1\2\3]]} str
regsub -all -- {%([CDcd][A-Fa-f0-9])%([89ABab][A-Fa-f0-9])} \ $str {[encoding convertfrom utf-8 [DecodeHex \1\2]]} str

Discussion

  • Metal Brewer

    Metal Brewer - 2013-01-24

    Additionally, if there should protection from overlong forms, the second one could say:
    {%([CDcd][A-Fa-f2-9]|[Dd][01])%([89ABab][A-Fa-f0-9])}

     
  • Andreas Kupries

    Andreas Kupries - 2013-01-29
    • assigned_to: welch --> andreas_kupries
    • priority: 5 --> 8
     
  • Andreas Kupries

    Andreas Kupries - 2013-01-29

    Quantifier, do you have test cases which demonstrate which characters the original regexes are not covering, and test cases demonstrating which illegal sequences are accepted ?

    If yes, please attach here.

     
  • Andreas Kupries

    Andreas Kupries - 2013-01-29

    Tried the changes and they pass the existing test suite, so we should not be worse off than we were before. Put them into the repository, on a branch --> "bug-3601995-ncgi". The moment we have test cases it should be ready for merging.

     
  • Metal Brewer

    Metal Brewer - 2013-01-30

    Correct utf-8 characters that are not accepted by original regexps, are any sequences that have continuation byte in range 0x80-0x9F:
    % http::formatQuery ą ;# a with ogonek, \u0105
    %C4%85
    % ncgi::decode %C4%85
    %C4%85
    % http::formatQuery † ;# dagger, \u2020
    %E2%80%A0
    % ncgi::decode %E2%80%A0
    %E2%80%A0
    %http::formatQuery \u2810 ;# a braille pattern
    %E2%A0%90
    %ncgi::decode %E2%A0%90 ;# accepted as two-byte sequence, third byte left hanging
    â %90

    Legal sequence that are erroneously accepted as utf-8 character can be for example two consecutive characters (regexp converts 'ó' and half of 'ę', leaving naked continuation byte):
    % http::formatQuery óę ;# o with acute, e with ogonek, \u00F3\u0119
    %C3%B3%C4%99
    % ncgi::decode %C3%B3%C4%99
    óÄ%99

    Illegal sequence accepted could be three-byte with last byte missing:
    % http::formatQuery Ⱡ
    %E2%B1%A0
    % ncgi::decode %E2%B1 ;# not a two byte sequence, but accepted as one.
    â±

     
  • Andreas Kupries

    Andreas Kupries - 2013-01-30

    Thank you. That should help me in coming up with test cases.

     
  • Andreas Kupries

    Andreas Kupries - 2013-01-30

    Test cases made, and verified that they fail before the change, and pass after.
    Commit [b6632970f6] @ bug-3601995-ncgi branch.
    Commit [66adc26a8b] @ tcllib-1-15-rc (merged to release work).
    bug-3601995-ncgi branch closed.

     
  • Andreas Kupries

    Andreas Kupries - 2013-01-30
    • status: open --> closed-fixed