Thread: [Yaml-core] FW: Re: Invalid UTF-8


I think I forgot to include the group in on this one ^^

  _____  

From: Bl...@gm... [mailto:Bl...@gm...] 
Sent: Thursday, September 03, 2009 3:37 PM
To: William Spitzak
Subject: Re: Re: [Yaml-core] Invalid UTF-8

On Sep 3, 2009 2:22pm, William Spitzak <sp...@rh...> wrote:
> I believe you are suggesting that the filenames be written in base64 or
> something so that the raw bytes are preserved.

Took me a little while to figure out why you thought that. By "filenames"
you mean the scalar value (filenames being only one example of what might be
contained in that value). And you thought I was suggesting those values be
represented using base64 on the stream.

On Sep 3, 2009 2:22pm, William Spitzak <sp...@rh...> wrote:
> This is not what I want, since if I wanted the file to be unreadable I
> would just use a binary dump and not bother with yaml at all!
>
> Rule # 1: If the string is *VALID* UTF-8 I want the *SAME* Unicode in the
> file! Every single suggestion that does not follow this rule is useless and
> in fact extremely damaging to attempts to get software to use Unicode!

I understood that you didn't want a binary dump or a base64 encoding when
you are storing a valid Unicode string as a scalar. The full paragraph that
I think you may be looking at read like this:

"Now, if the concern were just how to transmit the invalid data, then this
could all be accomplished using a new data type with a format that supports
the encoding of raw bytes. The data type, would again, have to be something
other than the normal YAML string, but it would still be stored as a mostly
readable scalar in the YAML file. Yaml already supports doing this."

The key phrase is "would still be stored as a mostly readable scalar in the
YAML file". In other words, if it doesn't need to be escaped, it wouldn't
be. Each data type in YAML has its own rules for how scalars of that type
are validated and parsed. Many take a value that is not itself Unicode text
and encode it as text. The boolean, integer, float, and binary data types are
all examples of this (it would take me a while to look up the formal name of
each in YAML). Basically, for our discussion, each scalar node has three
components: its type, its value, and its representation. Its representation
is determined by its type. That type doesn't have to be the YAML string or
YAML binary type. The YAML string is defined to be a valid Unicode string,
so it is inappropriate for invalid data (changing that definition would
break existing applications), and the YAML binary type encodes using base64,
which you don't want. That is why I said "new data type" (which is something
you can define independently of the YAML spec anyway).
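
To make the type / value / representation split concrete, here is a minimal
Python sketch. It is only an illustration and not how any particular YAML
library is implemented: the SketchScalar class and its representation()
method are hypothetical names, and only the tag URIs come from the YAML type
repository.

```python
import base64
from dataclasses import dataclass

# Tag URIs from the YAML type repository.
STR = "tag:yaml.org,2002:str"
INT = "tag:yaml.org,2002:int"
BOOL = "tag:yaml.org,2002:bool"
BINARY = "tag:yaml.org,2002:binary"


@dataclass
class SketchScalar:
    """Hypothetical scalar node: a type (tag) plus an application value."""
    tag: str
    value: object

    def representation(self) -> str:
        """Return the text that would appear in the stream for this node.

        The type alone decides how the value is turned into text.
        """
        if self.tag == STR:
            return self.value                      # already Unicode text
        if self.tag == INT:
            return str(self.value)                 # number -> digits
        if self.tag == BOOL:
            return "true" if self.value else "false"
        if self.tag == BINARY:
            # Raw bytes -> base64 text (not human readable).
            return base64.b64encode(self.value).decode("ascii")
        raise ValueError("no representation rule for tag %r" % self.tag)


if __name__ == "__main__":
    for node in (SketchScalar(STR, "readme.txt"),
                 SketchScalar(INT, 42),
                 SketchScalar(BOOL, True),
                 SketchScalar(BINARY, b"\xff\xfe")):
        print(node.tag, "->", node.representation())
```

A new data type would simply be one more tag with its own rule for turning
the value into text.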

But that's why I took the time to step through each component of this
process. One of the things I was trying to understand was whether you want
strings that can contain unexpected byte sequences or scalar representations
that can contain unexpected byte sequences. The first is concerned with what
is returned to your application, whereas the second is concerned with what
appears in the YAML stream (or file). And from your
recent comments, it sounds to me like the grievance is more with the
limitations of the string data type (as defined by YAML) than it is with its
scalar representation. You obviously also don't want to use the binary data
type because of the way it is represented, so when I talk about a "new data
type" that is what I'm discussing. Something that looks exactly like a
string except for the unexpected bytes, which are escaped.
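
As a rough illustration of "looks exactly like a string except for the
unexpected bytes, which are escaped", here is a minimal Python sketch of what
such a type's representation rule might do. The function names and the \xNN
escape syntax are assumptions made for this example, not anything defined by
the YAML spec. Valid UTF-8 passes through unchanged; each byte that is not
part of valid UTF-8 becomes \xNN; a literal backslash is doubled so that the
mapping can be reversed.

```python
def escape_bytes(raw: bytes) -> str:
    """Turn raw bytes into mostly readable text (hypothetical scheme)."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to the lone
    # surrogate U+DC80..U+DCFF, which valid UTF-8 can never produce.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02X" % (cp - 0xDC00))  # the original bad byte
        elif ch == "\\":
            out.append("\\\\")                     # keep escapes unambiguous
        else:
            out.append(ch)                         # valid Unicode, untouched
    return "".join(out)


def unescape_bytes(text: str) -> bytes:
    """Reverse escape_bytes, recovering the exact original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\\\", i):
            out += b"\\"
            i += 2
        elif text.startswith("\\x", i):
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        elif text[i] == "\\":
            raise ValueError("unknown escape at offset %d" % i)
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    name = b"report-\xff.txt"          # one byte that is not valid UTF-8
    shown = escape_bytes(name)
    print(shown)
    print(unescape_bytes(shown) == name)
```

The sketch ignores control characters and any quoting the surrounding YAML
scalar style would require; it only shows that the raw bytes can be preserved
while valid Unicode stays readable.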

Is that what you are looking for?
