From: Nicky J. <ni...@qc...> - 2025-08-14 12:35:55
Hello,

We've been investigating validating UTF-8 to prevent a few issues we've been encountering lately. I noted that validation of UTF-8 has been provided in 5.0.0 and have been investigating ns_valid_utf8 as well as the updated ns_getform as they would appear to be very helpful to us!

I put together a number of test cases of invalid UTF-8 and verified that these were invalid sequences using iconv. I repeated these tests with ns_valid_utf8 and it disagreed with iconv in 6 cases. To demonstrate these cases, I have added comprehensive unit tests and noted my findings in a pull request:

https://github.com/naviserver-project/naviserver/pull/14/files

Furthermore, I cannot get ns_getform to throw the NS_INVALID_UTF8 error when the form data contains invalid UTF-8. I have tried many different combinations of configurations and arguments. It is my understanding that if there is no fallbackcharset specified, and the config param formfallbackcharset has not been set, then an error should be thrown with invalid UTF-8 being present in the form.

I have used the nsd-config.tcl provided in the GitHub repo on a fresh install. The encoding params under ns/parameters have been set up so that only URLCharset is specified. OutputCharset and formfallbackcharset are both commented out.

    # Encoding settings
    #
    # ns_param OutputCharset utf-8
    ns_param URLCharset utf-8
    # ns_param formfallbackcharset iso8859-1

A simple proc will get the form data and return it to the client after logging the result of ns_valid_utf8:

    proc /test {} {
        set set_id [ns_getform]
        set test [ns_set get $set_id test]
        ns_log Notice "Valid UTF-8: [ns_valid_utf8 $test]"
        ns_return 200 text/html "<p>Received: $test</p>"
    }
    ns_register_proc POST /test /test
    ns_register_proc GET /test /test

POST and GET requests with invalid UTF-8 sequences in the form data are made using curl:

    curl -X POST "http://127.0.0.1:8080/test" \
         -H "Content-Type: application/x-www-form-urlencoded" \
         --data-binary "$(printf 'test=test\x80test')" --output -
    <p>Received: testtest</p>

    curl "http://127.0.0.1:8080/test?test=$(printf 'test\x80test')" --output -
    <p>Received: testtest</p>

No errors have been thrown. In both cases, ns_valid_utf8 has determined that the value for the form variable "test" is invalid. The log statement from Ns_ParseRequest appears to acknowledge that there is an invalid sequence in the query string on the GET request too:

POST request:

    [14/Aug/2025:11:56:33][2408.7f1b52ffd6c0][-conn:default:default:4:4-] Notice: Valid UTF-8: 0

GET request:

    [14/Aug/2025:11:58:03][2408.7f1b52ffd6c0][-driver:http:0-] Warning: Ns_ParseRequest: line <GET /test?test=test\x80test HTTP/1.1> contains 8-bit character data. Future versions might reject it.
    [14/Aug/2025:11:58:03][2408.7f1b52ffd6c0][-conn:default:default:4:5-] Notice: Valid UTF-8: 0

These tests have been repeated with different configurations and arguments. For example, passing the encoding and/or the fallbackcharset to ns_getform:

    ns_getform utf-8
    ns_getform -fallbackcharset ""
    ns_getform -fallbackcharset "" utf-8
    ns_getform -fallbackcharset "utf-8" utf-8

Configuration changes included uncommenting the formfallbackcharset param and setting it to the empty string or to "utf-8".

Is this a misunderstanding on my part of how ns_getform should work?
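
For anyone wanting to reproduce the iconv cross-check mentioned earlier, a minimal sketch might look like the following. It assumes a POSIX shell and an iconv that exits non-zero on an illegal input sequence (glibc iconv behaves this way). is_valid_utf8 is only an illustrative helper name, not something NaviServer provides, and the bytes are written as octal escapes because \x is not portable in printf.

```shell
#!/bin/sh
# Sketch: report 1 if a byte string is valid UTF-8, 0 otherwise, by
# round-tripping it through iconv. is_valid_utf8 is a hypothetical helper.
is_valid_utf8() {
    # $1 is used as a printf format string, so octal escapes such as \200
    # are expanded to raw bytes before they reach iconv.
    # Assumption: iconv exits non-zero on an illegal or incomplete sequence.
    if printf "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# Lone continuation byte 0x80, the same byte used in the curl requests.
is_valid_utf8 'test\200test'

# U+00E9 encoded as the two bytes 0xC3 0xA9, which is well-formed UTF-8.
is_valid_utf8 'test\303\251test'
```

The 0/1 output is chosen to mirror what ns_valid_utf8 logs, so the two results can be compared side by side for each test sequence.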

If there is any more information that I can provide then please let me know.

Kind regards,
Nicky

--
Qcode Software
*Nicky Johnstone | Software Engineer*
*Email:* ni...@qc... | *Phone:* 01463 896 487
www.qcode.co.uk