Hi Devin,

I added your comments to https://github.com/yaml2/YAML2/wiki/Unicode-Strictness

This discussion would be more productive if people actually wrote some test cases for what they are talking about. It's easy to be misunderstood on this type of issue, but much less so with actual test files.

The best way to do this would be to simply create a repo on GitHub and push some files up. Even if it's not completely running code, it would be helpful. Of course, running code in something like PyYAML would be even better.

I really don't want to have YAML2 discussions without actual tests to make people's points. William, it would be great if you could post some files that clearly show your concerns. Otherwise it just feels like conjecture.

Ingy

On Fri, Oct 28, 2011 at 7:54 PM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.

Mixing byte data into what is ostensibly Unicode seems like a bad
idea. Either have a particular part of the document be Unicode, or
bytes, not both.

That said, if you mix Unicode and bytes in the same file, it ceases to
be exactly readable in a standard text editor. So I don't like that
either. Maybe two separate types of YAML file (text and
binary/compressed)? More than one protocol has had a "binary" version
made of it.

Devin

On Fri, Oct 28, 2011 at 6:49 PM, William Spitzak <spitzak@rhythm.com> wrote:
> PLEASE!!! This is the main reason we cannot use unaltered YAML:
>
> SUPPORT FOR INVALID UTF-8 AND UTF-16
>
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.
>
> Currently YAML is limited to only putting byte sequences that are valid
> UTF-8 into scalars, unless some transformation is done that makes some
> (often all) Unicode unreadable in the YAML input. This has the
> counter-productive effect of *discouraging* use of Unicode on any
> backend that uses bytes where there is no guarantee that the backend
> limits the byte sequences to valid UTF-8. Examples are all byte-based
> file formats, most internet protocols, Unix filenames, and Windows
> resource identifiers.
>
> My recommendation follows. However, any solution that allows an arbitrary
> byte stream to be produced, while allowing valid UTF-8 bytes to be
> represented by the correct Unicode character in the YAML source, would
> be acceptable:
>
> 1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the
> given value. This is only different from current YAML for 0x80-0xFF. The
> sequence \u00NN must be used for actual Unicode code points in this range.
>
> 2. An API that requests YAML scalars as UTF-8 gets these bytes, inserted
> between the UTF-8 encoding of all other characters, as raw data.
>
> 3. An API that requests scalars in some other form, such as UTF-16, gets
> these bytes as unchanged code units. This makes \xNN work identically to
> current YAML/JSON when the UTF-16 API is used. It may also allow invalid
> forms of other encodings to be supported.
>
> In addition, invalid UTF-16 must also be supported. Support of invalid
> UTF-16 is more common, due to its use on Windows and the resulting
> realization, even by otherwise ignorant programmers, that they cannot
> work without supporting it. Technically the YAML spec does not allow
> invalid UTF-16, but my proposal here formalizes the actual support that
> is in most (all?) YAML and JSON implementations:
>
> 1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF
> represents a "raw UTF-16 code unit".
>
> 2. An API that requests UTF-16 or other 16-bit code units will get these
> codes unchanged.
>
> 3. An API that requests bytes will get three for each of these. These
> three bytes match the encoding you get from UTF-8 if you extend it to
> these invalid code points.
>
> _______________________________________________
> Yaml-core mailing list
> Yaml-core@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/yaml-core
>
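
As a starting point for the kind of test file asked for above, the byte-oriented
side of the quoted proposal can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not part of any YAML
implementation: the function name scalar_to_bytes is hypothetical, it looks at
only the \xNN and \uNNNN escapes (no \\, \n, \U, and so on), and it does not
combine surrogate pairs, since the proposal says each surrogate code unit
yields three bytes.

```python
import re

# Hypothetical helper, not a PyYAML API: matches only \xNN and \uNNNN.
_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})')


def scalar_to_bytes(source: str) -> bytes:
    """Return the bytes a "UTF-8 API" would see under the proposed rules."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(source):
        # Literal text between escapes is encoded as ordinary UTF-8.
        out += source[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:
            # \xNN: one raw byte, including the 0x80-0xFF range.
            out.append(int(m.group(1), 16))
        else:
            # \uNNNN: a code point. "surrogatepass" extends UTF-8 to
            # 0xD800-0xDFFF, giving three bytes per lone surrogate.
            out += chr(int(m.group(2), 16)).encode("utf-8", "surrogatepass")
        pos = m.end()
    # Trailing literal text after the last escape.
    out += source[pos:].encode("utf-8")
    return bytes(out)


print(scalar_to_bytes(r"caf\xe9"))    # b'caf\xe9'      (raw byte, invalid UTF-8)
print(scalar_to_bytes(r"caf\u00e9"))  # b'caf\xc3\xa9'  (code point U+00E9)
print(scalar_to_bytes(r"\ud800"))     # b'\xed\xa0\x80' (lone surrogate, 3 bytes)
```

The first two calls show the distinction the proposal draws between \xNN and
\u00NN in the 0x80-0xFF range; the third shows the three-byte form for an
unpaired surrogate.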
