Re: [Yaml-core] YAML2

> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.

Mixing byte data into what is ostensibly Unicode seems like a bad
idea. A given part of the document should be either Unicode or
bytes, not both.

That said, if you mix Unicode and bytes in the same file, it ceases to
be exactly readable in a standard text editor. So I don't like that
either. Maybe two separate types of YAML file (text and
binary/compressed)? More than one protocol has had a "binary" version
made of it.

Devin

On Fri, Oct 28, 2011 at 6:49 PM, William Spitzak <sp...@rh...> wrote:
> PLEASE!!! This is the main reason we cannot use unaltered YAML:
>
> SUPPORT FOR INVALID UTF-8 AND UTF-16
>
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.
>
> Currently YAML is limited to only putting byte sequences that are valid
> UTF-8 into scalars, unless some transformation is done that makes some
> (often all) Unicode unreadable in the YAML input. This has the
> counter-productive effect of *discouraging* use of Unicode on any
> backend that uses bytes where there is no guarantee that the backend
> limits the byte sequences to valid UTF-8. Examples are all byte-based
> file formats, most internet protocols, Unix filenames, and Windows
> resource identifiers.
>
> My recommendation is here. However any solution that allows an arbitrary
> byte stream to be produced, while allowing valid UTF-8 bytes to be
> represented by the correct Unicode character in the YAML source, would
> be acceptable:
>
> 1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the
> given value. This is only different from current YAML for 0x80-0xFF. The
> sequence \u00NN must be used for actual Unicode code points in this range.
>
> 2. An api that requests YAML scalars as UTF-8 gets these bytes, inserted
> between the UTF-8 encoding of all other characters, as raw data.
>
> 3. An api that requests scalars in some other form, such as UTF-16, gets
> these bytes as unchanged code units. This makes \xNN work identically to
> current YAML/JSON when the UTF-16 api is used. It may also allow invalid
> forms of other encodings to be supported.
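For concreteness, the three points above can be sketched roughly as follows. This is only an illustration of the quoted proposal, not code from any YAML library: the function names `scalar_to_utf8_bytes` and `scalar_to_utf16_units` are made up, and the sketch assumes the scalar contains no escapes other than \xNN and \uNNNN (an escaped backslash, for instance, is not handled).

```python
import re
import struct

# Hypothetical helper names; only \xNN and \uNNNN escapes are recognized here.
_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})')


def scalar_to_utf8_bytes(source: str) -> bytes:
    """Point 2: a byte-oriented API gets \\xNN as one raw byte."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(source):
        # Literal text between escapes is encoded as ordinary UTF-8.
        out += source[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:
            # \xNN: emit the byte unchanged, even if the result is invalid UTF-8.
            out.append(int(m.group(1), 16))
        else:
            # \uNNNN: a real code point, encoded as UTF-8.
            # (Surrogate values would raise here; they are a separate case.)
            out += chr(int(m.group(2), 16)).encode("utf-8")
        pos = m.end()
    out += source[pos:].encode("utf-8")
    return bytes(out)


def _utf16_units(text: str) -> list:
    """Split text into a list of 16-bit code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def scalar_to_utf16_units(source: str) -> list:
    """Point 3: a UTF-16 API gets \\xNN as an unchanged code unit."""
    units = []
    pos = 0
    for m in _ESCAPE.finditer(source):
        units += _utf16_units(source[pos:m.start()])
        # Both escape forms yield the same code unit in this view.
        units.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    units += _utf16_units(source[pos:])
    return units
```

Under these assumptions, `caf\xE9` yields the four bytes `63 61 66 E9` from the byte API, while `caf\u00E9` yields `63 61 66 C3 A9`. The UTF-16 view returns the same code units for both.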
>
> In addition invalid UTF-16 must also be supported. Support of invalid
> UTF-16 is more common, due to its use on Windows and therefore the
> realization by otherwise ignorant programmers of the inability to work
> without supporting them. Technically the YAML spec does not allow
> invalid UTF-16, but my proposal here formalizes the actual support that
> is in most (all?) YAML and JSON implementations:
>
> 1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF
> represents a "raw UTF-16 code unit".
>
> 2. An api that requests UTF-16 or other 16-bit code units will get these
> codes unchanged.
>
> 3. An api that requests bytes will get three bytes for each of these;
> the three bytes match the encoding you get from UTF-8 if you extend it
> to these invalid code points.
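A minimal sketch of the three-byte form described in point 3, assuming "extending UTF-8" means applying the ordinary three-byte UTF-8 bit layout to a code unit in the surrogate range. The function name `lone_surrogate_to_bytes` is hypothetical. Each code unit is treated on its own; whether a valid high/low pair should instead be combined into one character is not something the quoted text spells out.

```python
def lone_surrogate_to_bytes(unit: int) -> bytes:
    """Encode one UTF-16 surrogate code unit with the 3-byte UTF-8 layout."""
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit: %#06x" % unit)
    return bytes([
        0xE0 | (unit >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((unit >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (unit & 0x3F),         # 10xxxxxx: low 6 bits
    ])


# Python's "surrogatepass" error handler produces the same bytes.
assert lone_surrogate_to_bytes(0xD800) == chr(0xD800).encode("utf-8", "surrogatepass")
```

For example, \uD800 becomes the bytes `ED A0 80` and \uDFFF becomes `ED BF BF`. A strict UTF-8 decoder rejects both, which is why the proposal has to define them explicitly.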
>