From: William S. <sp...@rh...> - 2009-09-03 20:03:25
Okay, it is obvious that there is a failure to understand what I need, and conversely that I have been proposing solutions that are incompatible with existing yaml. Here is a new proposal that I believe is entirely compatible and can be implemented by a libyaml-using program (though I would still use my own library to avoid the overhead of multiple passes through the string):

UTF-8 TAG PROPOSAL:

The tag "!!utf8" means the string must be interpreted as UTF-8, but invalid bytes are preserved. Ideally the writer only produces this tag if the string really contains invalid encoding:

* Invalid UTF-8 bytes must be written as "%nn" in the text, where nn is the hex value. The UTF-8 encodings of U+D800..U+DFFF and of code points above U+10FFFF are considered invalid and are thus written as one %nn sequence per byte (3 for a surrogate).
* A '%' sign must be written as "%25" if it would otherwise be misread by the parser.
* A writer *MAY* write any byte >= 0x80 and '%' as "%nn".
* The reader turns a '%' followed by "25" into a '%' sign, and a '%' followed by "80".."FF" hex into a raw byte. Otherwise the '%' is literal; this includes a '%' followed by hex less than 0x80.

UTF-16:

* Invalid UTF-16 is also written with this tag, with each unmatched surrogate half written as "%nn%nn%nn", where nn are the values of the bytes of the "obvious" UTF-8 encoding. (This is lossless, in case you are wondering.)
* Reading UTF-16 must undo the above, and also convert any %nn sequence that is valid UTF-8 into the UTF-16 equivalent. Other %nn sequences can cause an error, or be changed into something "safe" like the replacement character or 0xDCxx.

ADVANTAGES:

* The primary advantage of this scheme is that unaware yaml processors will not mangle the % quoting when copying the file.
* The valid UTF-8 and ASCII portion of the string is readable and editable in a text editor. This encourages users to convert to and use valid Unicode. Most other proposals have the opposite effect.
* Similar to the "binary" base64 proposal.
* Matches how yaml tags are written and how bytes are quoted in URLs.
* Uses 3 characters rather than the 4 used by "\XNN".

DISADVANTAGES:

* Adds a second escape character rather than re-using '\'.
* The tag is required, just like "binary". This means the tag cannot be used for its original purpose, and the file cannot be converted to JSON.
* %-encoded URLs and printf formats with a width after the '%' are mangled.
* Without modifying libyaml, it requires a slow second pass to examine strings, and potentially the allocation of another temporary buffer to hold the converted string.

FINAL COMMENTS:

Like the base64 proposal, this can be done entirely by the program calling yaml. However, it helps considerably if everybody can agree on the tag and the various nuances of the encoding, so that there are not a hundred variations.

I do seem to be having difficulty conveying why this is necessary, but you have to believe me that we will NEVER see Unicode used uniformly for byte files unless this is supported. For a simple example, without this it is impossible to make a yaml file that takes a list of files to rename, and use it to rename invalid filenames into proper Unicode. The only way to fix it is for the program to interpret *all* input filenames as ISO-8859-1. Thus you are actively forcing everybody to not use Unicode, despite believing you are doing the contrary! If you don't believe this is a problem, please look at the millions of pieces of sample Python code that force ISO-8859-1 encoding on all input in order to avoid errors, and look at what Python 3.0 did to environment variables.
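
To make the UTF-8 writer and reader rules above concrete, here is a minimal Python sketch of how a calling program might apply them. It is not part of libyaml or of any yaml library: the function names escape_bytes and unescape_text are hypothetical, the writer escapes only what it has to (invalid bytes, plus any '%' the reader would otherwise misread), and accepting lowercase hex digits in the reader is an assumption the proposal does not settle. The UTF-16 rules are not covered.

```python
import re
from typing import Tuple

# Reader rule: '%' followed by "25", or by hex "80".."FF", is an escape.
# Accepting lowercase hex digits is an assumption, not part of the proposal.
_ESCAPE = re.compile(rb"%(25|[89A-Fa-f][0-9A-Fa-f])")
# Same pattern without the '%', used to test what follows a literal '%'.
_MISREAD = re.compile(r"25|[89A-Fa-f][0-9A-Fa-f]")


def escape_bytes(raw: bytes) -> Tuple[str, bool]:
    """Hypothetical writer: return (text, needs_tag) for a byte string."""
    try:
        # Valid UTF-8 needs neither the tag nor any escaping.
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        pass
    # surrogateescape maps each invalid byte 0xNN to the code point U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for i, ch in enumerate(text):
        if "\udc80" <= ch <= "\udcff":
            # Invalid byte: write it as %nn.
            out.append("%%%02X" % (ord(ch) - 0xDC00))
        elif ch == "%" and _MISREAD.match(text, i + 1):
            # A literal '%' that the reader would misread: write %25.
            out.append("%25")
        else:
            out.append(ch)
    return "".join(out), True


def unescape_text(text: str) -> bytes:
    """Hypothetical reader: undo the escaping of a string tagged !!utf8."""
    data = text.encode("utf-8")
    # "%25" becomes '%' (0x25); "%80".."%FF" become raw bytes.
    # Any other '%' does not match the pattern and stays literal.
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


if __name__ == "__main__":
    raw = b"caf\xe9 100% %25"
    text, needs_tag = escape_bytes(raw)
    print(needs_tag, text)
    print(unescape_text(text) == raw)
```

The valid part of the string stays readable in the output, and the reader restores the original bytes exactly. A caller would attach the tag only when the second return value is true.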