Re: [Yaml-core] YAML2

PLEASE!!! This is the main reason we cannot use unaltered YAML:

SUPPORT FOR INVALID UTF-8 AND UTF-16

There needs to be a way to put an arbitrary byte sequence into a scalar 
without losing the ability to make valid byte sequences human-readable.

Currently YAML can only put byte sequences that are valid UTF-8 into 
scalars, unless some transformation is applied that makes some (often 
all) Unicode unreadable in the YAML input. This has the 
counter-productive effect of *discouraging* use of Unicode on any 
byte-based backend that does not guarantee its byte sequences are valid 
UTF-8. Examples are all byte-based file formats, most internet 
protocols, Unix filenames, and Windows resource identifiers.

My recommendation follows. However, any solution that allows an arbitrary 
byte stream to be produced, while still allowing valid UTF-8 bytes to be 
represented by the correct Unicode character in the YAML source, would 
be acceptable:

1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the 
given value. This is only different from current YAML for 0x80-0xFF. The 
sequence \u00NN must be used for actual Unicode code points in this range.

2. An API that requests YAML scalars as UTF-8 gets these bytes, inserted 
between the UTF-8 encoding of all other characters, as raw data.

3. An API that requests scalars in some other form, such as UTF-16, gets 
these bytes as unchanged code units. This makes \xNN work identically to 
current YAML/JSON when the UTF-16 API is used. It may also allow invalid 
forms of other encodings to be supported.
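
To make the three points above concrete, here is a rough Python sketch. 
It is not a YAML parser: it only looks at the body of a double-quoted 
scalar and only understands the escapes \xNN, \uNNNN and \\. The 
function names (scalar_as_utf8, scalar_as_utf16_units) are made up for 
illustration and do not belong to any existing YAML library.

```python
import re

# Matches the three escapes this sketch understands: \xNN, \uNNNN and \\.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|(\\))")


def _tokens(text):
    """Yield ("raw", int) for each \\xNN and ("str", s) for everything else."""
    pos = 0
    for m in _ESCAPE.finditer(text):
        if m.start() > pos:
            yield "str", text[pos:m.start()]
        if m.group(1) is not None:
            yield "raw", int(m.group(1), 16)        # point 1: a raw byte
        elif m.group(2) is not None:
            yield "str", chr(int(m.group(2), 16))   # a real code point
        else:
            yield "str", "\\"                       # escaped backslash
        pos = m.end()
    if pos < len(text):
        yield "str", text[pos:]


def scalar_as_utf8(text):
    """Point 2: raw bytes are inserted between the UTF-8 of other characters."""
    out = bytearray()
    for kind, value in _tokens(text):
        if kind == "raw":
            out.append(value)
        else:
            out += value.encode("utf-8", "surrogatepass")
    return bytes(out)


def scalar_as_utf16_units(text):
    """Point 3: raw bytes come back as unchanged 16-bit code units."""
    units = []
    for kind, value in _tokens(text):
        if kind == "raw":
            units.append(value)
        else:
            data = value.encode("utf-16-le", "surrogatepass")
            units.extend(int.from_bytes(data[i:i + 2], "little")
                         for i in range(0, len(data), 2))
    return units
```

With this sketch, scalar_as_utf8(r"caf\xE9") yields the invalid UTF-8 
bytes b"caf\xe9", while scalar_as_utf8(r"caf\u00E9") yields the valid 
encoding b"caf\xc3\xa9". Both forms give the same code units from 
scalar_as_utf16_units, which is the current YAML/JSON behavior.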

In addition, invalid UTF-16 must also be supported. Support for invalid 
UTF-16 is more common, due to its use on Windows, where even programmers 
otherwise unaware of the problem discover that they cannot work without 
supporting it. Technically the YAML spec does not allow invalid UTF-16, 
but my proposal here formalizes the actual support that is in most 
(all?) YAML and JSON implementations:

1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF 
represents a "raw UTF-16 code unit".

2. An API that requests UTF-16 or other 16-bit code units will get these 
codes unchanged.

3. An API that requests bytes will get three bytes for each of these. 
These three bytes match the encoding you get from UTF-8 if you extend it 
to these invalid code points.
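
The three bytes in point 3 can be computed with the ordinary three-byte 
UTF-8 bit layout. A small Python sketch follows; the function name 
surrogate_to_bytes is made up for illustration. Python's own 
"surrogatepass" error handler happens to produce the same bytes, which 
the usage note below relies on as a cross-check.

```python
def surrogate_to_bytes(unit):
    """Encode a lone UTF-16 surrogate code unit as three UTF-8-style bytes."""
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit: %#06x" % unit)
    return bytes([
        0xE0 | (unit >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((unit >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (unit & 0x3F),         # 10xxxxxx: low 6 bits
    ])
```

For example, surrogate_to_bytes(0xD800) gives b"\xed\xa0\x80", the same 
result as chr(0xD800).encode("utf-8", "surrogatepass").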