From: Gustaf N. <ne...@wu...> - 2022-05-12 11:27:03
Dear David,

NaviServer is less strict than the W3C document, since it does not automatically send an error back. Such invalid characters can show up during the decode operations of ns_urldecode and ns_getform, so a custom application can catch these exceptions and try alternative encodings if necessary.

Since there is currently a large refactoring of Unicode handling going on in the Tcl community (with potentially different handling in Tcl 8.6, 8.7 and 9.0 ... hopefully there will be full Unicode support already in Tcl 8.7; the voting is happening right now), it is not a good idea for NaviServer to come up with its own special handling. These byte sequences have to be processed sooner or later by Tcl in its various versions.

I also do not think it is a good idea to swallow incorrect incoming data by transforming it on the fly; sooner or later this will cause user concerns (e.g. "why is this funny character in the user name?").

When a legacy application sends e.g. iso8859-encoded data, it should set the appropriate charset, and the data will be properly converted by NaviServer. If for whatever reason it is not feasible to get a proper charset, the NaviServer approach allows one to make a second attempt at decoding the data with a different charset.

all the best
-gn

On 12.05.22 11:05, David Osborne wrote:
>
> Thanks again Gustaf,
>
> I can see the W3C spec you reference seems quite unequivocal in saying
> an error message should be sent back when decoding invalid UTF-8 form
> data.
>
> But I was curious why other implementations appear to use the UTF-8
> replacement character (U+FFFD) instead, and found a bit of discussion
> in the Unicode standard itself [1] & [2].
>
> [1] specifically refers to the WHATWG (W3C) spec for encoding/decoding
> [3], which defines an "error" condition when decoding UTF-8 as being
> handled in one of two possible error modes, namely:
>
> * fatal - "return the error"
> * replacement - "Push U+FFFD (�) to output."
>
> This aligns with the behaviour of, say, Python's bytes.decode(), where
> the default is to raise an error for encoding errors ("strict" error
> handling), but you can optionally specify "replace" error handling,
> which will use the U+FFFD character instead. I can see this working in
> cases where we're told the data should be UTF-8, or where we're
> assuming by default that it's UTF-8.
>
> But I'm not sure how much work this would be to implement and whether
> it is seen as worthwhile to others?
>
> As it stands, we have legacy applications which POST data to us and
> which regularly (and, by now, expectedly) send invalid characters
> despite best efforts to fix this.
> I guess we could redirect the POSTs to another non-NaviServer system,
> sanitise the data there, then send it on to NaviServer, but it would
> be nice to be able to deal with it within NaviServer itself.
>
> [1] https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf
>     (Section 3.9 "U+FFFD Substitution of Maximal Subparts")
> [2] https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf
>     (Section 5.22 "U+FFFD Substitution in Conversion")
> [3] https://encoding.spec.whatwg.org/#decoder
> [4] https://docs.python.org/3/library/stdtypes.html#bytes.decode
>
>
> On Mon, 2 May 2022 at 13:30, Gustaf Neumann <ne...@wu...> wrote:
>
> Dear David and all,
>
> I looked into this issue, and I do not like the current situation
> either. In the current snapshot, a GET request with invalidly coded
> query variables is rejected, while a POST request leads just to a
> warning, and the invalid entry is omitted.
>
> W3C [1] says in the reference for multilingual form encoding:
>
>     If non-UTF-8 data is received, an error message should be sent back.
>
> This means that the only defensible logic is to reject the request as
> invalid in both cases. One can certainly send single-byte funny
> character data in URLs, which is invalid UTF-8 (e.g. "%9C" or "%E6"
> etc.), but for these requests the charset has to be specified, either
> via the content type or via the default URL encoding in the NaviServer
> configuration ... see example (2) below.
>
> As mentioned earlier, there are increasingly many attacks with invalid
> UTF-8 data (also by vulnerability scanners), so we have to be strict
> here.
>
> I will try to address the outstanding issues ASAP and then provide
> another RC.
>
> All the best
>
> -gn
>
> [1] https://www.w3.org/International/questions/qa-forms-utf-8
>
>
> # POST request with already encoded form data (x-www-form-urlencoded)
> $ curl -X POST -d "p1=a%C5%93Cb&p2=a%E6b" localhost:8100/upload.tcl
>
> # POST request with already encoded form data, but with a proper charset declared
> $ curl -X POST -H "Content-Type: application/x-www-form-urlencoded; charset=iso-8859-1" -d "p2=a%E6b" localhost:8100/upload.tcl
>
> # POST + x-www-form-urlencoded, but let curl do the encoding
> $ curl -X POST -d "p1=aœb" -d $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl
>
> # POST + multipart/form-data, let curl do the encoding
> $ curl -X POST -F "p1=aœb" -F $(echo -e 'p2=a\xE6b') localhost:8100/upload.tcl
>
> # GET request with already encoded query data
> $ curl -X GET "localhost:8100/upload.tcl?p1=a%C5%93Cb&p2=a%E6b"
>
>
>
> _______________________________________________
> naviserver-devel mailing list
> nav...@li...
> https://lists.sourceforge.net/lists/listinfo/naviserver-devel

-- 
Univ.Prof. Dr. Gustaf Neumann
Head of the Institute of Information Systems and New Media
of Vienna University of Economics and Business
Program Director of MSc "Information Systems"
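
[Editorial illustration] The "strict" vs. "replace" decoding behaviour discussed in this thread can be sketched with Python's bytes.decode() (reference [4] above). This is an illustrative sketch using the invalid p2=a%E6b payload from the curl examples, not NaviServer code:

```python
# Sketch: how the two WHATWG error modes look in Python's bytes.decode().
raw = b"p2=a\xe6b"  # the invalid-UTF-8 payload from the curl examples

# Default "strict" handling: invalid UTF-8 raises an error,
# matching the W3C advice to reject such requests.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict:", err.reason)

# "replace" handling: substitute U+FFFD, as in WHATWG's replacement mode.
print("replace:", raw.decode("utf-8", errors="replace"))  # replace: p2=a�b

# Second attempt with an explicit single-byte charset, as suggested above.
print("fallback:", raw.decode("iso8859-1"))  # fallback: p2=aæb
```

The fallback succeeds because every byte is a valid character in iso8859-1, which is also why a wrong charset silently produces the "funny characters" Gustaf warns about.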