Re: [Yaml-core] YAML2

William, I put the content of this email here:
https://github.com/yaml2/YAML2/wiki/Unicode-Strictness

Oren, I started listing some bad parts. I am making a page for each bad
point so that they can each have their own long discussion.

On Fri, Oct 28, 2011 at 3:49 PM, William Spitzak <sp...@rh...> wrote:

> PLEASE!!! This is the main reason we cannot use unaltered YAML:
>
> SUPPORT FOR INVALID UTF-8 AND UTF-16
>
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.
>
> Currently YAML is limited to only putting byte sequences that are valid
> UTF-8 into scalars, unless some transformation is done that makes some
> (often all) Unicode unreadable in the YAML input. This has the
> counter-productive effect of *discouraging* use of Unicode on any backend
> that uses bytes where there is no guarantee that the backend limits the byte
> sequences to valid UTF-8. Examples are all byte-based file formats, most
> internet protocols, Unix filenames, and Windows resource identifiers.
>
> My recommendation is here. However any solution that allows an arbitrary
> byte stream to be produced, while allowing valid UTF-8 bytes to be
> represented by the correct Unicode character in the YAML source, would be
> acceptable:
>
> 1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the
> given value. This is only different from current YAML for 0x80-0xFF. The
> sequence \u00NN must be used for actual Unicode code points in this range.
>
> 2. An api that requests YAML scalars as UTF-8 gets these bytes, inserted
> between the UTF-8 encoding of all other characters, as raw data.
>
> 3. An api that requests scalars in some other form, such as UTF-16, gets
> these bytes as unchanged code units. This makes \xNN work identically to
> current YAML/JSON when the UTF-16 api is used. It may also allow invalid
> forms of other encodings to be supported.
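
To make the three points above concrete, here is a rough Python sketch of
how I read them. It is not taken from any existing YAML library: the
function names and the tiny escape grammar (only \xNN and \uNNNN, no
escaped backslashes, no surrogates) are assumptions for illustration only.

```python
import re

# Hypothetical escape grammar for this sketch: only \xNN and \uNNNN are
# recognized; escaped backslashes and other YAML escapes are ignored.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def _tokens(text):
    # Yield ("text", str), ("byte", int) or ("char", int) pieces in order.
    pos = 0
    for m in _ESCAPE.finditer(text):
        if m.start() > pos:
            yield "text", text[pos:m.start()]
        if m.group(1) is not None:
            yield "byte", int(m.group(1), 16)  # \xNN: raw byte (point 1)
        else:
            yield "char", int(m.group(2), 16)  # \uNNNN: real code point
        pos = m.end()
    if pos < len(text):
        yield "text", text[pos:]


def scalar_as_bytes(text):
    # Point 2: a byte-oriented API gets \xNN as one raw byte, placed
    # between the UTF-8 encoding of everything else.
    out = bytearray()
    for kind, value in _tokens(text):
        if kind == "text":
            out += value.encode("utf-8")
        elif kind == "byte":
            out.append(value)
        else:
            # Surrogate code points are outside the scope of this sketch
            # and would raise UnicodeEncodeError here.
            out += chr(value).encode("utf-8")
    return bytes(out)


def _utf16_units(s):
    # Split the UTF-16 encoding of s into a list of 16-bit code units.
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def scalar_as_utf16_units(text):
    # Point 3: a UTF-16 API gets \xNN as one unchanged code unit, so
    # \xE9 and \u00E9 are indistinguishable there, as in current YAML.
    units = []
    for kind, value in _tokens(text):
        if kind == "text":
            units.extend(_utf16_units(value))
        else:
            units.append(value)
    return units
```

With this reading, caf\xE9 comes out of the byte API as four bytes ending
in a lone 0xE9 (invalid UTF-8), while caf\u00E9 comes out as five bytes
ending in 0xC3 0xA9. Both give the same code units through the UTF-16 API.
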
>
> In addition invalid UTF-16 must also be supported. Support of invalid
> UTF-16 is more common, due to its use on Windows and therefore the
> realization by otherwise ignorant programmers of the inability to work
> without supporting them. Technically the YAML spec does not allow invalid
> UTF-16, but my proposal here formalizes the actual support that is in most
> (all?) YAML and JSON implementations:
>
> 1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF
> represents a "raw UTF-16 code unit".
>
> 2. An api that requests UTF-16 or other 16-bit code units will get these
> codes unchanged.
>
> 3. An api that requests bytes will get three bytes for each of these; the
> three bytes match the encoding you get from UTF-8 if you extend it to these
> invalid code points.
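
The three-byte form in point 3 is the ordinary UTF-8 three-byte bit layout
applied to a surrogate value. A small Python sketch, with hypothetical
function names of my own, assuming that reading of the proposal:

```python
def surrogate_as_bytes(unit):
    # Encode a lone UTF-16 surrogate with the 3-byte UTF-8 bit layout
    # (1110xxxx 10xxxxxx 10xxxxxx), as point 3 describes.
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit")
    return bytes([
        0xE0 | (unit >> 12),          # top 4 bits
        0x80 | ((unit >> 6) & 0x3F),  # middle 6 bits
        0x80 | (unit & 0x3F),         # low 6 bits
    ])


def surrogate_as_utf16le(unit):
    # Point 2: a 16-bit API gets the code unit unchanged
    # (little-endian byte order is chosen here only for illustration).
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit")
    return unit.to_bytes(2, "little")
```

Python's own codec produces the same three bytes when asked to let
surrogates through: chr(0xD800).encode("utf-8", "surrogatepass") matches
surrogate_as_bytes(0xD800).
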
>