From: BlueGM <bl...@gm...> - 2009-09-03 22:16:28
|
I think I forgot to include the group on this one ^^

_____
From: Bl...@gm... [mailto:Bl...@gm...]
Sent: Thursday, September 03, 2009 3:37 PM
To: William Spitzak
Subject: Re: Re: [Yaml-core] Invalid UTF-8

On Sep 3, 2009 2:22pm, William Spitzak <sp...@rh...> wrote:
> I believe you are suggesting that the filenames be written in base64 or something so that the raw bytes are preserved.

Took me a little while to figure out why you thought that. By "filenames" you mean the scalar value (filenames being only one example of what might be contained in that value). And you thought I was suggesting those values be represented using base64 on the stream.

On Sep 3, 2009 2:22pm, William Spitzak <sp...@rh...> wrote:
> This is not what I want, since if I wanted the file to be unreadable I would just use a binary dump and not bother with yaml at all!
>
> Rule # 1: If the string is *VALID* UTF-8 I want the *SAME* Unicode in the file! Every single suggestion that does not follow this rule is useless and in fact extremely damaging to attempts to get software to use Unicode!

I understood that you didn't want a binary dump or a base64 encoding when you are storing a valid Unicode string as a scalar. The full paragraph that I think you may be looking at read like this:

"Now, if the concern were just how to transmit the invalid data, then this could all be accomplished using a new data type with a format that supports the encoding of raw bytes. The data type, would again, have to be something other than the normal YAML string, but it would still be stored as a mostly readable scalar in the YAML file. Yaml already supports doing this."

The key phrase is "would still be stored as a mostly readable scalar in the YAML file". In other words, if it doesn't need to be escaped, it wouldn't be.

Each data type in YAML has its own rules for how scalars of that type are validated and parsed. Many take a value that is not itself a Unicode value and encode it. The boolean, integer, float, and binary data types are all examples of this (it would take me a while to look up the formal name of each in YAML).

Basically, for our discussion, each scalar node has three components: its type, its value, and its representation. Its representation is determined by its type. That type doesn't have to be the YAML string or YAML binary type. The YAML string is defined to be a valid Unicode string and so it is inappropriate (changing that definition would break existing applications), and the YAML binary encodes using base64, which you don't want. That is why I said "new data type" (which is something you can define independently of the YAML spec anyways).

But that's why I took the time to step through each component of this process. One of the things I was trying to understand was whether you want it so that you can have strings that contain unexpected byte sequences or scalar representations that can have unexpected byte sequences. The first is concerned with what is returned to your application whereas the other is concerned with what appears in the YAML stream (or file). And from your recent comments, it sounds to me like the grievance is more with the limitations of the string data type (as defined by YAML) than it is with its scalar representation.

You obviously also don't want to use the binary data type because of the way it is represented, so when I talk about a "new data type" that is what I'm discussing: something that looks exactly like a string except for the unexpected bytes, which are escaped. Is that what you are looking for?
|
From: William S. <sp...@rh...> - 2009-09-03 22:58:27
|
I wrote a proposal that I think will be more to the group's liking. What I am proposing now is a new tag, similar to "binary", which I called "utf8". I'm afraid my proposal is rather long-winded; anybody who wants to shorten it or clarify it, go ahead!

Basically, invalid UTF-8 is written like this. In this example, both a correct UTF-8 Aacute and a 1-byte error are in the string:

  - !!utf8 "Aacute = Á, UTF-8 error = %80"

Valid UTF-8 multi-byte characters can also be written with %nn. This may be useful for getting UTF-8 out of an editor that insists on producing some other encoding, so that ASCII letters are the only ones that work. For a valid string there are many equivalent ways of writing it, though the last here is preferred:

  - !!utf8 "Aacute = %C3%81"
  - !!utf8 "Aacute = Á"
  - !!utf8 "Aacute = \xC1"
  - "Aacute = \xC1"
  - "Aacute = Á"

The main goal is to allow lossless storage of arbitrary byte streams but not discourage use of UTF-8 in these streams. Users should be able to read any valid UTF-8 and insert valid UTF-8 using a Unicode-aware text editor. Not allowing this causes users to treat the source as being in some other encoding, such as ASCII only, and prevents them from ever switching to UTF-8.

I changed our software to use this % encoding, although I am currently using the fact that the text is double-quoted rather than the tag to indicate if this is needed.

I need some agreement on questionable aspects of my design before I continue:

1. The exact name of the tag. I chose "utf8" because there is no guarantee the string is invalid.

2. My idea that only %25 and %80-%FF are interpreted; %20, for instance, is not a space but instead '%','2','0'. This is to make it less mangling of %-escaped URLs.

3. Any case requirements on the hex letters (I made it accept both, just like URL encoding and the \x in YAML).

4. Exactly how to escape a '%', though I used %25 just like URL encoding.
|
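[Editorial sketch, not part of the original message.] The escaping rule in William's draft (only %25 and %80-%FF are interpreted, everything else is literal) can be illustrated with a few lines of Python. The function names `encode_utf8_tag` and `decode_utf8_tag` are hypothetical, and the choice to escape a literal '%' only when it would otherwise be read as an escape is an assumption drawn from point 2 of the draft, not something the proposal spells out.

```python
import re

# Assumption from the draft: only %25 and %80..%FF are escapes; "%20" stays literal.
_ESCAPE = re.compile(r"%(25|[89A-Fa-f][0-9A-Fa-f])")


def encode_utf8_tag(raw: bytes) -> str:
    """Turn arbitrary bytes into scalar text for the proposed !!utf8 tag."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for i, ch in enumerate(text):
        if 0xDC80 <= ord(ch) <= 0xDCFF:
            out.append("%%%02X" % (ord(ch) - 0xDC00))  # invalid byte -> %NN
        elif ch == "%" and _ESCAPE.match(text, i):
            out.append("%25")  # this '%' would otherwise be read as an escape
        else:
            out.append(ch)  # valid UTF-8 stays readable as-is
    return "".join(out)


def decode_utf8_tag(text: str) -> bytes:
    """Recover the original bytes from the scalar text."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")  # literal text between escapes
        out.append(int(m.group(1), 16))             # the escaped byte itself
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)
```

Under these assumptions, a valid Aacute passes through unchanged, a stray 0x80 byte becomes `%80`, and a literal `%20` is left alone, while a literal `%80` has to be written as `%2580`.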
From: BlueGM <bl...@gm...> - 2009-09-03 23:15:18
|
You read my mind while I was writing my last message. I wondered why you didn't encode bytes below 0x80. I'm unsure in my opinion if it is more useful to avoid mangling urls or more useful to enable those bytes in non-quoted scalars, however. If the url were by itself, it could still be set to the normal YAML string type and so avoid the mangling, but if it was part of a larger block, it could be an issue. Yes, I think you're right because that use case would be more common. |
From: Josh b. J. <jbe...@wh...> - 2009-09-04 00:29:35
|
On 9/3/09 3:58 PM, "William Spitzak" <sp...@rh...> wrote:
> I wrote a proposal that I think will be more to the group's liking. What
> I am proposing now is a new tag, similar to "binary", which I called
> "utf8". I'm afraid my proposal is rather long-winded; anybody who wants
> to shorten it or clarify it, go ahead!

Consider using a type name like utf-8-lenient to communicate that you have a UTF-8-derived encoding instead of a name that can be confused with the normative spec. I quote from Perl's own Encode, which describes the Perlish difference between UTF-8 and utf8. I consider this perly name a mistake in that it doesn't explain itself to the reader unless they happen to have read this section of the Encode manual.

UTF-8 vs. utf8

    ....We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

That has been the perl's notion of UTF-8 but official UTF-8 is more strict; Its ranges is much narrower (0 .. 10FFFF), some sequences are not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et al).

Now that is overruled by Larry Wall himself.

    From: Larry Wall <la...@wa...>
    Date: December 04, 2004 11:51:58 JST
    To: per...@pe...
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <200...@wa...>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my head.

    Also for what it's worth, Perl 6 will mostly default to strict but make it easy to switch back to lax.

    Larry

Do you copy? As of Perl 5.8.7, B<UTF-8> means strict, official UTF-8 while B<utf8> means liberal, lax, version thereof. And Encode version 2.10 or later thus groks the difference between C<UTF-8> and C<utf8>.

    encode("utf8", "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

C<UTF-8> in Encode is actually a canonical name for C<utf-8-strict>. Yes, the hyphen between "UTF" and "8" is important. Without it Encode goes "liberal"

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.

Josh
|
From: Oren Ben-K. <or...@be...> - 2009-09-04 09:53:11
|
On Thu, 2009-09-03 at 17:19 -0700, Josh ben Jore wrote:
> Do you copy? As of Perl 5.8.7, B<UTF-8> means strict, official UTF-8
> while B<utf8> means liberal, lax, version thereof. And Encode version
> 2.10 or later thus groks the difference between C<UTF-8> and C<utf8>.
>
> encode("utf8", "\x{FFFF_FFFF}", 1); # okay
> encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

Just like "only *p*erl can parse *P*erl" and so on. Sigh.

This is "true, but mostly irrelevant". The proposed tag is not compatible with Perl's "universal" utf8 encoding, because Perl's utf8 encoding does not process %nn escape sequences.

This is why I think it is best to call this tag !!utf-u. The 'u' could be taken to be either 'Universal' or 'Url-encoded' or both. This is exactly what it is: a (universal) UTF encoding combined with URL encoding.

As for %DEFAULT...SCALAR directives (suggested by BlueGM): Adding a new standard tag is easy and has no effect on the spec. We just add it to the tag repository and we are done. Adding a directive, on the other hand, is a *huge* deal, and brings us to YAML 1.3 territory. So this is not on the table for a "long while" at this point.

In addition, I think this directive is serious overkill. It is a blunt instrument and I can see many problems with it. A much better approach would be to work on a schema language for YAML that would allow one to specify how tags are associated with nodes in a much more controlled and fine-grained manner.

As far as '%20' goes: I strongly believe that '%20' should be a space and not '%' '2' '0' inside such !!utf-u. The only argument against it was mangling of URLs, and it fails on two accounts.

First, there's simply no reason to ever use this tag for encoding URLs. All the URLs (actually all the URIs) in the world can be easily processed as normal YAML strings. They have their own (%nn) built-in escape mechanism. { path: "file://foo%ff" } _already_ works, without having to annotate it with !!utf-u. So why would you ever want to?

Second, even if you decided to pass a URL inside a !!utf-u tag (for some strange reason), the fact that %20 would be preserved but %80 would be mangled to %2580 is _extremely_ confusing. Requiring % to always be escaped as %25 is a no-brainer; it is simple, consistent and follows the rule of least surprise.

Have fun,

Oren Ben-Kiki
|
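[Editorial sketch, not part of the original message.] Oren's rule, where every %nn is an escape and a literal '%' is always written as %25, can be sketched in Python as follows. The names `encode_utf8u` and `decode_utf8u` are hypothetical; decoding leans on the standard library's `urllib.parse.unquote_to_bytes`, which is an implementation convenience assumed here rather than anything the thread prescribes.

```python
from urllib.parse import unquote_to_bytes


def encode_utf8u(raw: bytes) -> str:
    """Bytes -> scalar text: valid UTF-8 stays readable, '%' and bad bytes are escaped."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "%":
            out.append("%25")                      # always escaped, no special cases
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("%%%02X" % (cp - 0xDC00))   # invalid byte -> %NN
        else:
            out.append(ch)
    return "".join(out)


def decode_utf8u(text: str) -> bytes:
    """Scalar text -> bytes: every %NN pair is an escape, including %20."""
    # unquote_to_bytes encodes non-ASCII characters as UTF-8 and expands %NN pairs.
    return bytes(unquote_to_bytes(text))
```

With this rule a URL embedded in such a scalar is escaped uniformly: `file://foo%ff` is written as `file://foo%25ff`, and both `%20` and `%80` in the original bytes get the same `%25` prefix, which is the consistency argument made above.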
From: William S. <sp...@rh...> - 2009-09-04 17:05:29
|
Oren Ben-Kiki wrote:
> As for %DEFAULT...SCALAR directives (suggested by BlueGM):

I think a huge problem with the proposal is that it makes YAML "modal". If you cut a piece of text out and insert it elsewhere in the file, it might not mean the same thing (yet still parse) because you put it before/after the directive.

I'm thinking that allowing a *set* of tags might work, though I am not sure if this will parse. For instance:

  - !!MyClass "ascii representation"
  - !!MyClass !!binary "adf89asdf00asdfasdf..."

This has some appeal because YAML already allows "zero or one" tag in the file. So there already is a variable-sized array.

> As far as '%20' goes: I strongly believe that '%20' should be a space
> and not '%' '2' '0' inside such !!utf-u. The only argument against it
> was mangling of URLs, and it fails on two accounts.

I agree with this. Your explanation is good:

> First, there's simply no reason to ever use this tag for encoding URLs.
> All the URLs (actually all the URIs) in the world can be easily
> processed as normal YAML strings. They have their own (%nn) built-in
> escape mechanism. { path: "file://foo%ff" } _already_ works, without
> having to annotate it with !!utf-u. So why would you ever want to?
>
> Second, even if you decided to pass a URL inside a !!utf-u tag (for some
> strange reason), the fact that %20 would be preserved but %80 would be
> mangled to %2580 is _extremely_ confusing. Requiring % to always be
> escaped as %25 is a no-brainer; it is simple, consistent and follows the
> rule of least surprise.

I think I was confused by thinking that *all* strings would be quoted this way. It is only necessary if it has some invalid UTF-8 in it. A URL would never have it (as it would *already* be expanded to %nn). And I suspect a printf format should not have any in it either, as that means it is producing invalid UTF-8 on the output.
|
From: Brad R <bl...@gm...> - 2009-09-04 21:21:06
|
On Fri, Sep 4, 2009 at 5:53 AM, Oren Ben-Kiki <or...@be...> wrote:
>
> As for %DEFAULT...SCALAR directives (suggested by BlueGM): Adding a new
> standard tag is easy and has no effect on the spec. We just add it to
> the tag repository and we are done. Adding a directive, on the other
> hand, is a *huge* deal, and brings us to YAML 1.3 territory. So this is
> not on the table for a "long while" at this point.
>
> In addition, I think this directive is serious overkill. It is a blunt
> instrument and I can see many problems with it. A much better approach
> would be to work on a schema language for YAML that would allow one to
> specify how tags are associated with nodes in a much more controlled and
> fine-grained manner.
>

Yes, the %DEFAULTSCALAR directives were an idea for how to make the document prettier in the future. The immediate problem requires a new data type. The idea for the directive was in response to a comment that the tags would clutter the document, something I've encountered as well.

I also agree with you that it is a "blunt instrument" that would, in fact, only solve some problems. Your idea of using a schema language is much better, if more involved. In the meantime, applications can usually, if the processor (usually from a library) supports it, assume a schema when tags are not explicit. The problem with that, of course, is that a generic YAML application, such as an editor, would then not have all that information available and would have to treat the scalars as strings.

Is there currently a movement under way to define a schema language?

> As far as '%20' goes: I strongly believe that '%20' should be a space
> and not '%' '2' '0' inside such !!utf-u. The only argument against it
> was mangling of URLs, and it fails on two accounts.
>
> First, there's simply no reason to ever use this tag for encoding URLs.
> All the URLs (actually all the URIs) in the world can be easily
> processed as normal YAML strings. They have their own (%nn) built-in
> escape mechanism. { path: "file://foo%ff" } _already_ works, without
> having to annotate it with !!utf-u. So why would you ever want to?
>
> Second, even if you decided to pass a URL inside a !!utf-u tag (for some
> strange reason), the fact that %20 would be preserved but %80 would be
> mangled to %2580 is _extremely_ confusing. Requiring % to always be
> escaped as %25 is a no-brainer; it is simple, consistent and follows the
> rule of least surprise.

Actually, having a URL inside one of these scalars would not be that strange. Say, for instance, that we had a YAML document that represented an e-mail message and the URL was part of the body of that e-mail message (user A sends a message to user B saying, "hey, you have to check out this site: http://www.my%20cool%20site.com", for example). The document may very well be using the new data type for its body because e-mail is expected to have different encodings, but (in English-speaking countries, at least) will almost always contain mostly ASCII characters.

Still, I'm also in favor of having the % sign always signal an escape sequence. My own reason, stated more clearly than what I said before, is so that control characters could be escaped when not using a double-quoted scalar (before I only said bytes below 0x80). I backed off because of the URLs, though. I'm not sure which is more valuable in general.
|
From: Oren Ben-K. <or...@be...> - 2009-09-04 22:25:11
|
On Fri, 2009-09-04 at 17:20 -0400, Brad R wrote:
> Yes, the %DEFAULTSCALAR directives were an idea for how to make the
> document prettier in the future. The immediate problem requires a new
> data type. The idea for the directive was in response to a comment
> that the tags would clutter the document, something I've encountered
> as well.

This is a real problem but there's no helping it at this point in time. That said, I agree that for some specific YAML files there would be a lot of !!utf8u tags, but if you only use them when actually needed, across the universe of YAML files as a whole, I suspect this would impact only a very small number of files.

> I also agree with you that it is a "blunt instrument" that would, in
> fact, only solve some problems. Your idea of using a schema language
> is much better, if more involved. In the meantime, applications can
> usually, if the processor (usually from a library) supports it, assume
> a schema when tags are not explicit. The problem with that, of course,
> is that a generic YAML application, such as an editor, would then not
> have all that information available and would have to treat the
> scalars as strings.

Well, you could say there are three types of YAML files in the world.

1. Those edited by notepad are less of a problem; the user, by definition, needs to be aware of the "schema" (even if it is never formally defined).

2. Those created by a specialized program, using the equivalent of "printf" statements. Such programs embody the schema (as executable code if nothing else) so again, there is no real problem.

3. Those created by a program calling "Yaml.Dump(something)". Here things get tricky. Realistically, there should be some setup code where the application informs the YAML library that some strings (or all strings) should pass through some filter (which possibly looks at the string content) to decide whether to emit them as !!utf8u instead of !!str. This requires some generic library API. If this turns out to be a common use case (which I personally doubt :-), the library API can evolve to make this specific operation as easy as you can want.

BTW, this problem is not unique to !!utf8u. You face it with formatting of numbers and dates, choice of scalar types, whether or not to sort mapping keys, and other related issues. Breaking it down into the above three cases helps zero in on the real issue, which is that YAML was intended to be a human-editable format, and _automatically_ generating a "pretty" YAML file is a nontrivial operation.

I'd love to see a powerful "YAML beautifier". Being anal about the data model helps such a tool a *lot*. But writing such a tool still remains quite a challenge.

> Is there currently a movement under way to define a schema language?

We wish :-( I'm trying to find the time to fix the errata in the current spec and bring YamlReference up to par. Xitology still needs to validate libyaml and we need to somehow get rid of syck (say, by turning libyaml into a drop-in replacement by using wrapper code). Defining a schema language is _very_ hairy, although we have some ideas on how to proceed there.

> ... there's simply no reason to ever use this tag for encoding
> URLs.
> Actually, having a URL inside one of these scalars would not be that
> strange. Say for instance that we had a YAML document that represented
> an e-mail message...

Ok, I take it back. I'd say there's _hardly_ ever a reason to use this tag for encoding URLs. One should never make absolute statements! :-)

> Still, I'm also in favor of having the % sign always signal an escape
> sequence.

Ok then :-)

Have fun,

Oren Ben-Kiki
|
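[Editorial sketch, not part of the original message.] The "filter" described for case 3, one that looks at the content of a value and decides which tag to emit, could be as small as the Python function below. The name `choose_tag` is hypothetical and no existing YAML library API is implied; how such a filter would be registered with a dumper is exactly the part the message leaves open.

```python
def choose_tag(raw: bytes) -> str:
    """Pick a tag for a byte string: plain !!str when it is valid UTF-8."""
    try:
        raw.decode("utf-8")  # Python's codec is strict: bad bytes raise an error
    except UnicodeDecodeError:
        return "!!utf8u"     # needs the escaping tag discussed in this thread
    return "!!str"           # ordinary YAML string, no extra tag clutter


# Usage: only the value containing a stray 0x80 byte gets the special tag.
for value in (b"plain ascii", "Aacute = \u00c1".encode("utf-8"), b"bad \x80 byte"):
    print(choose_tag(value), value)
```

Because the tag is chosen per value, files that contain only valid UTF-8 never show the new tag at all, which matches the expectation above that few files would be affected.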
From: Zenaan H. <ze...@fr...> - 2009-09-05 05:45:57
|
On Fri, Sep 04, 2009 at 03:25:41PM -0700, Oren Ben-Kiki wrote:
> BTW, this problem is not unique to !!utf8u. You face it with formatting

+1 for !!utf8u.

Hyphens have class-name:tag-name 1:1 correspondence issue. Reject.

Prefix/ primary name component:
- utf8 rather than utf, since we are at this point in discussion realising relatively close alignment with utf8, vs utf16/32.
- utf8 + some scattered binary migration pains.

We require a postfix:
- x is ambiguous with perl which is different again.
- u I like.

Tag case/ capitalization:
- all uppercase: for tags? yuck!
- all lowercase: fine by me
- all either case: perhaps practically useful? Someone elses
- mixed case: UTF8u? not desirable to me.

cheers
zenaan

--
Homepage: www.SoulSound.net -- Free Australia: www.UPMART.org
Please respect the confidentiality of this email as sensibly warranted.
|