[Yaml-core] New invalid UTF-8 proposal using tags and %nn

Okay, it is obvious there is a failure to understand what I need, and 
conversely that I have been proposing solutions that are incompatible 
with existing yaml.

Here is a new proposal that I believe is entirely compatible, and can be 
implemented by a libyaml-using program (though I would still use my own 
library to avoid the overhead of multiple passes through the string):

UTF-8 TAG PROPOSAL:

The tag "!!utf8" means the string must be interpreted as UTF-8 but that 
invalid bytes are preserved. Ideally the writer only produces this tag 
if the string really contains invalid encoding:

* Invalid UTF-8 bytes must be written as "%nn" in the text where nn is 
the hex value. The UTF-8 encodings of U+D800..U+DFFF and > U+10FFFF are 
considered invalid and are thus written as one %nn sequence per byte 
(3 for U+D800..U+DFFF, 4 or more for > U+10FFFF).

* A '%' sign must be written as "%25" if it would otherwise be misread 
by the parser, that is, if it is followed by "25" or by hex "80".."FF".

* A writer *MAY* write any byte >= 0x80 and '%' as "%nn".

* The reader turns a '%' followed by "25" into a '%' sign, and '%' 
followed by "80".."FF" hex into a raw byte. Otherwise the '%' is 
literal. This includes '%' followed by hex less than 0x80.
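
The rules above can be sketched in a few lines of Python. This is only 
an illustration, not a reference implementation: the function names are 
made up, accepting lower-case hex digits is my assumption (the proposal 
only says "80".."FF"), and Python's "surrogateescape" error handler is 
used as a convenient way to carry the invalid bytes through a str.

```python
import re

# A '%' is ambiguous only if it is followed by "25" or by hex 80..FF.
_AMBIGUOUS_PERCENT = re.compile(r"%(?=25|[89A-Fa-f][0-9A-Fa-f])")
_ESCAPE = re.compile(r"%(25|[89A-Fa-f][0-9A-Fa-f])")


def encode_utf8_tag(raw: bytes) -> str:
    """Writer: turn arbitrary bytes into the text form of a !!utf8 scalar."""
    # Python's strict UTF-8 decoder also rejects encoded surrogates and
    # code points above U+10FFFF; each offending byte 0xNN becomes U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Escape ambiguous '%' signs first, so the escapes added next are safe.
    text = _AMBIGUOUS_PERCENT.sub("%25", text)
    # Write every preserved invalid byte as %NN.
    return re.sub(
        r"[\udc80-\udcff]",
        lambda m: "%%%02X" % (ord(m.group()) - 0xDC00),
        text,
    )


def decode_utf8_tag(text: str) -> bytes:
    """Reader: recover the original bytes from the text form."""
    def unescape(m):
        value = int(m.group(1), 16)
        # "%25" is a percent sign; 80..FF is a raw byte (carried as U+DCNN).
        return "%" if value == 0x25 else chr(0xDC00 + value)

    # Any other '%' does not match the pattern and is left literal.
    return _ESCAPE.sub(unescape, text).encode("utf-8", errors="surrogateescape")
```

With this sketch, the ISO-8859-1 bytes of "café" are written as 
"caf%E9", valid UTF-8 is written unchanged, and "50% off" needs no 
escaping at all because the '%' is not followed by hex.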

UTF-16:

* Invalid UTF-16 is also written with this tag, with each unmatched 
surrogate half written as "%nn%nn%nn" where nn are the values of the 
bytes of the "obvious" UTF-8 encoding. (This is lossless, in case you 
are wondering.)

* Reading UTF-16 must undo the above, and also convert any %nn sequence 
that is valid UTF-8 into the UTF-16 equivalent. Other %nn sequences can 
cause an error, or be changed into something "safe" like the replacement 
character or 0xDCxx.
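
A possible sketch of the UTF-16 side, again in Python and again with 
made-up names. It takes UTF-16-LE code units as bytes, and for "other" 
%nn sequences it picks the replacement-character option. The error 
handler name "utf8tag-lenient" is invented for this example.

```python
import codecs
import re


def _lenient(exc):
    """Decode error handler: keep encoded surrogates, replace other bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    chunk = exc.object[exc.start:exc.start + 3]
    if (len(chunk) == 3 and chunk[0] == 0xED
            and 0xA0 <= chunk[1] <= 0xBF and 0x80 <= chunk[2] <= 0xBF):
        # The "obvious" 3-byte encoding of a surrogate half U+D800..U+DFFF.
        code = ((chunk[0] & 0x0F) << 12) | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
        return chr(code), exc.start + 3
    return "\ufffd", exc.start + 1


codecs.register_error("utf8tag-lenient", _lenient)


def write_utf16(units: bytes) -> str:
    """Writer: UTF-16-LE code units (possibly invalid) to tagged text."""
    text = units.decode("utf-16-le", errors="surrogatepass")
    # Escape any '%' that the reader would otherwise treat as an escape.
    text = re.sub(r"%(?=25|[89A-Fa-f][0-9A-Fa-f])", "%25", text)

    def escape(m):
        encoded = m.group().encode("utf-8", errors="surrogatepass")
        return "".join("%%%02X" % byte for byte in encoded)

    # Only unmatched halves are still surrogates here; pairs were combined.
    return re.sub(r"[\ud800-\udfff]", escape, text)


def read_utf16(text: str) -> bytes:
    """Reader: tagged text back to UTF-16-LE code units."""
    def unescape(m):
        value = int(m.group(1), 16)
        return "%" if value == 0x25 else chr(0xDC00 + value)

    marked = re.sub(r"%(25|[89A-Fa-f][0-9A-Fa-f])", unescape, text)
    raw = marked.encode("utf-8", errors="surrogateescape")
    # Valid UTF-8 (escaped or not) becomes ordinary characters, encoded
    # surrogate halves are restored, anything else becomes U+FFFD.
    decoded = raw.decode("utf-8", errors="utf8tag-lenient")
    return decoded.encode("utf-16-le", errors="surrogatepass")
```

Here a lone U+D800 between "a" and "b" is written as "a%ED%A0%80b" and 
read back unchanged, while "%C3%A9" is read as the single UTF-16 code 
unit for U+00E9.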

ADVANTAGES:

* The primary advantage of this scheme is that unaware yaml processors 
will not mangle the % quoting when copying the file.

* The valid UTF-8 and ASCII portion of the string is readable and 
editable in a text editor. This encourages users to convert to and use 
valid Unicode. Most other proposals have the opposite effect.

* Similar to the "binary" base64 proposal.

* Matches how yaml tags are written and how bytes are quoted in URLs.

* Uses 3 characters rather than the 4 used by "\XNN".

DISADVANTAGES:

* Adds a second escape character rather than re-using '\'.

* Tag is required, just like "binary". This means tag cannot be used for 
its original purpose, and the file cannot be converted to JSON.

* %-encoded URLs and printf formats with a width after the '%' are mangled.

* Without modifying libyaml, requires a slow second pass to examine 
strings and potentially allocation of another temporary buffer to hold 
the converted string.
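
To make the URL and printf disadvantage concrete, here is a small 
Python sketch of one way to read it. The helper names and the example 
URL are invented, and the reader is the same simplified one assumed 
elsewhere in this proposal: a value pasted into a tagged string without 
escaping is read as raw bytes, and a correctly escaped value no longer 
looks like the original in the file.

```python
import re


def escape_percent(text: str) -> str:
    """What a writer must do to a '%' that would otherwise be misread."""
    return re.sub(r"%(?=25|[89A-Fa-f][0-9A-Fa-f])", "%25", text)


def read_tagged(text: str) -> bytes:
    """Simplified reader: %25 is '%', %80..%FF is a raw byte."""
    def unescape(m):
        value = int(m.group(1), 16)
        return "%" if value == 0x25 else chr(0xDC00 + value)

    marked = re.sub(r"%(25|[89A-Fa-f][0-9A-Fa-f])", unescape, text)
    return marked.encode("utf-8", errors="surrogateescape")


printf_format = "%80s"                 # width 80, hypothetical example
url = "http://example.com/caf%C3%A9"   # hypothetical %-encoded URL

# Pasted in unescaped, both are silently changed by the reader.
unescaped_format = read_tagged(printf_format)   # b"\x80s"
unescaped_url = read_tagged(url)                # ends in the bytes C3 A9

# Escaped correctly they survive, but the file shows "%2580s" and
# "caf%25C3%25A9" instead of the familiar forms.
escaped_format = escape_percent(printf_format)
escaped_url = escape_percent(url)
```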

FINAL COMMENTS:

Like the base64 proposal, this can be entirely done by the program 
calling yaml. However it helps considerably if everybody can agree on 
the tag and various nuances of the encoding, so there are not a hundred 
variations.

I do seem to be having difficulty conveying why this is necessary, but 
you have to believe me that we will NEVER see Unicode used uniformly for 
byte files unless this is supported.

For a simple example, without this it is impossible to make a yaml file 
that takes a list of files to rename, and use it to rename invalid 
filenames into proper Unicode. The only way to fix it is for the program 
to interpret *all* input filenames as ISO-8859-1. Thus you are actively 
forcing everybody to not use Unicode, despite believing you are doing 
the contrary! If you don't believe this is a problem, please look at the 
millions of pieces of sample Python code that force ISO-8859-1 encoding 
on all input in order to avoid errors, and look at what Python 3.0 did 
to environment variables.