From: Kirill S. <xi...@ga...> - 2006-06-21 19:47:38
The current rules for tabs in YAML are somewhat unclear to me. The spec says:

    ...spaces are used for indentation and separation between tokens. To maintain portability, tab characters must not be used in these cases, since different systems treat tabs differently...

Although I understand why tab characters cannot be used to denote indentation, the reasoning behind the second part of the rule is not clear to me. I don't see any reason why tabs cannot be used to separate tokens.

Moreover, I believe it was Ingy who said on #yaml that some JSON processors use tabs for pretty-printing. Thus support for tabs that separate tokens may be required for JSON compatibility.

Third, these rules are not strictly followed in the spec productions. For instance, plain scalars cannot contain tab characters, but tabs can be used at the end of a line of a multi-line plain scalar. Another example is block scalars, in which tabs may occur on the first line of the scalar.

The issues are illustrated with the following examples (the tab character is denoted by '>>').

--- #1 - >> multi line plain scalar
--- #2 - multi >> line plain scalar
--- #3 - multi line >> plain scalar
--- #4 - multi line >> plain scalar
--- #5 - multi line plain scalar >>
--- #6 - | >> >> multi line >> block scalar

Could you tell which of the above documents are well-formed? The answer is #3 and #6.

The result looks non-intuitive. Each of the examples #1-#5 can be parsed without any problems, and it's not clear what is so special about #3. The last example is also confusing because it feels like it violates the rule that tabs cannot be used for indentation. Although the tabs in #6 do not form indentation, they can easily be confused with it.

I'm not aware of any YAML processor that strictly follows the current tab rules. Both Syck and PyYAML have their own strange quirks.

My proposal is to introduce the following rule:

    Tab characters are allowed in any place where they cannot be confused with indentation spaces.

This rule can also be expressed in the form of a commutative diagram. Suppose that a well-formed YAML stream is given. Consider two processes:

1. Replace all tab characters in the stream with X spaces and construct the representation tree.
2. Construct the representation tree for the original stream and replace all tab characters with X spaces in all scalar nodes.

Then both processes will lead to identical representation trees.

This rule will make the examples #1-#5 well-formed, but will make #6 ill-formed. The latter may break some existing YAML documents.

By indentation spaces I mean the spaces at the beginning of the line or after the indicators '-', '?', ':' that form an indented block.

Although this proposal might have its own problems, I feel that it is, at least, no worse than the current rules. What do you think about it?

Note that I've tried to make the rule as permissive as possible. Another, more restrictive rule, like forbidding #1-#6, might be more suitable though.

-- xi
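The commutative-diagram form of the proposed rule can be sketched in Python. This is only an illustration under stated assumptions: the representation tree is modelled as nested dicts, lists and strings, and `toy_parse` is a hypothetical stand-in parser (one `key: value` pair per line), not a real YAML processor. The names `expand_tabs_in_scalars`, `tabs_commute` and `toy_parse` are made up for this sketch.

```python
def expand_tabs_in_scalars(node, width):
    """Replace each tab with `width` spaces in every scalar (str) of a tree."""
    if isinstance(node, str):
        return node.replace("\t", " " * width)
    if isinstance(node, list):
        return [expand_tabs_in_scalars(item, width) for item in node]
    if isinstance(node, dict):
        return {
            expand_tabs_in_scalars(key, width): expand_tabs_in_scalars(value, width)
            for key, value in node.items()
        }
    return node


def tabs_commute(stream, parse, width=4):
    """Check the proposed rule for one stream and one parser."""
    # Process 1: replace tabs in the stream, then build the tree.
    left = parse(stream.replace("\t", " " * width))
    # Process 2: build the tree, then replace tabs inside scalar nodes.
    right = expand_tabs_in_scalars(parse(stream), width)
    return left == right


def toy_parse(stream, strip_chars=" \t"):
    """Hypothetical stand-in parser: one 'key: value' pair per line."""
    tree = {}
    for line in stream.splitlines():
        key, _, value = line.partition(":")
        tree[key.strip(strip_chars)] = value.strip(strip_chars)
    return tree


stream = "foo:\tbar\tbaz\n"
# A parser that treats tabs around a value as separation satisfies the rule.
print(tabs_commute(stream, toy_parse))
# A parser that strips only spaces keeps the leading tab as content and fails.
print(tabs_commute(stream, lambda s: toy_parse(s, " ")))
```

The second call fails the check because the leading tab becomes scalar content in one process and stripped separation space in the other, which is the kind of ambiguity the rule is meant to exclude.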
From: Oren Ben-K. <or...@be...> - 2006-06-21 20:05:31
On Wednesday 21 June 2006 12:47, Kirill Simonov wrote:
> The current rules for tabs in YAML are somewhat unclear to me. The
> spec says:
>
> ..spaces are used for indentation and separation between tokens. To
> maintain portability, tab characters must not be used in these cases,
> since different systems treat tabs differently...
>
> Although I understand why tab characters cannot be used to denote
> indentation, the reasoning behind the second part of the rule is not
> clear to me. I don't see any reason why tabs cannot be used to
> separate tokens.

Due to JSON compatibility we are changing the spec to allow tabs to separate tokens.

Have fun,

Oren Ben-Kiki
From: Kirill S. <xi...@ga...> - 2006-06-21 20:17:35
On Wed, Jun 21, 2006 at 01:05:23PM -0700, Oren Ben-Kiki wrote:
> On Wednesday 21 June 2006 12:47, Kirill Simonov wrote:
> > The current rules for tabs in YAML are somewhat unclear to me. The
> > spec says:
> >
> > ..spaces are used for indentation and separation between tokens. To
> > maintain portability, tab characters must not be used in these cases,
> > since different systems treat tabs differently...
> >
> > Although I understand why tab characters cannot be used to denote
> > indentation, the reasoning behind the second part of the rule is not
> > clear to me. I don't see any reason why tabs cannot be used to
> > separate tokens.
>
> Due to JSON compatibility we are changing the spec to allow tabs to
> separate tokens.

What about plain scalars? Why is foo<TAB> bar allowed while foo <TAB>bar or foo<TAB>bar aren't?

-- xi
From: Oren Ben-K. <or...@be...> - 2006-06-21 23:04:19
On Wednesday 21 June 2006 13:17, Kirill Simonov wrote:
> What about plain scalars?

Well, the consistent thing would be to allow TAB before/after (each line of) plain scalars (these are separation spaces after all, we always strip them), and preserve them inside (these are content characters, we try hard to preserve them). Each line would still need to be indented using spaces, of course. That is:

# . = space > = tab
foo:>aaa.>
.>bbb.>ccc.>

Would be the same as

foo: "aaa bbb \tccc"

Have fun,

Oren Ben-Kiki
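The handling Oren describes can be checked with a short Python sketch. It assumes the example consists of two lines (`foo:>aaa.>` followed by `.>bbb.>ccc.>`), which is what the stated result implies, and it models only the simple case: spaces and tabs are stripped around each line, inner ones are kept, and consecutive lines are joined by a single space. The helper names `decode` and `fold_plain_scalar` are hypothetical, and real YAML line folding has further rules (for example for blank lines) that are not covered here.

```python
def decode(text):
    """Turn the message's notation into real whitespace ('.' = space, '>' = tab)."""
    return text.replace(".", " ").replace(">", "\t")


def fold_plain_scalar(lines):
    """Strip spaces and tabs around each line, keep inner ones, join with one space."""
    return " ".join(line.strip(" \t") for line in lines)


# The part of each line that follows the 'foo:' indicator in the example.
lines = [decode(">aaa.>"), decode(".>bbb.>ccc.>")]
result = fold_plain_scalar(lines)
print(repr(result))  # 'aaa bbb \tccc'
```

The tab between `bbb` and `ccc` survives because it sits inside the scalar, while the tabs at the line edges are treated as separation and dropped.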
From: <in...@tt...> - 2006-06-21 22:07:46
On 21/06/06 13:05 -0700, Oren Ben-Kiki wrote:
> On Wednesday 21 June 2006 12:47, Kirill Simonov wrote:
> > The current rules for tabs in YAML are somewhat unclear to me. The
> > spec says:
> >
> > ..spaces are used for indentation and separation between tokens. To
> > maintain portability, tab characters must not be used in these cases,
> > since different systems treat tabs differently...
> >
> > Although I understand why tab characters cannot be used to denote
> > indentation, the reasoning behind the second part of the rule is not
> > clear to me. I don't see any reason why tabs cannot be used to
> > separate tokens.
>
> Due to JSON compatibility we are changing the spec to allow tabs to
> separate tokens.

You didn't answer as to #6. Should:

--- #6 - | >> >> multi line >> block scalar

be illegal or be [">>\n>> multi line\n>> block scalar\n"] as it is today?

I think "illegal" is ok because this is less ambiguous:

--- #6 - |2 >> >> multi line >> block scalar

Cheers,

Ingy
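Why the tabs in #6 end up as content can be illustrated with a small Python sketch of literal block scalar extraction with an explicit indentation of 2, as in the `|2` variant. The exact whitespace of the example is not recoverable from this copy of the thread, so the input lines below (two spaces, then a tab, then text) are an assumption, and `literal_block_content` is a hypothetical helper rather than any library's API. It also simplifies real YAML by rejecting lines shorter than the indentation.

```python
def literal_block_content(lines, indent):
    """Strip `indent` leading spaces from each line; everything after is content."""
    prefix = " " * indent
    content = []
    for line in lines:
        if not line.startswith(prefix):
            # A tab cannot stand in for indentation spaces.
            raise ValueError("line is not indented with %d spaces: %r" % (indent, line))
        content.append(line[indent:])
    return "\n".join(content) + "\n"


# Assumed layout of example #6: two indentation spaces, then a tab on each line.
lines = ["  \t", "  \tmulti line", "  \tblock scalar"]
value = literal_block_content(lines, 2)
print(repr(value))  # '\t\n\tmulti line\n\tblock scalar\n'
```

With the indentation stated explicitly, a reader can tell which leading characters are indentation and which are content, which is the ambiguity Ingy's `|2` variant avoids.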
From: Oren Ben-K. <or...@be...> - 2006-06-21 23:05:15
On Wednesday 21 June 2006 15:07, Ingy dot Net wrote:
> Should:
>
> --- #6
> - |
>
> >> multi line
> >> block scalar
>
> be illegal or be [">>\n>> multi line\n>> block scalar\n"] as it is
> today?

No reason I see to change today's rules... you can make it a YAML 1.2 issue if you want :-)

Have fun,

Oren Ben-Kiki
From: Ingy d. N. <in...@tt...> - 2006-06-21 23:10:18
On 21/06/06 16:05 -0700, Oren Ben-Kiki wrote:
> On Wednesday 21 June 2006 15:07, Ingy dot Net wrote:
> > Should:
> >
> > --- #6
> > - |
> >
> > >> multi line
> > >> block scalar
> >
> > be illegal or be [">>\n>> multi line\n>> block scalar\n"] as it is
> > today?
>
> No reason I see to change today's rules... you can make it a YAML 1.2
> issue if you want :-)

Good point there.

> Have fun,
>
> Oren Ben-Kiki
From: Andrey <ae...@go...> - 2006-07-01 07:28:55
Hi,

I am using PySyck for my project. One of the things I definitely need is dumping Unicode strings in UTF-8. When I pass some Unicode data to syck.dump(), it is encoded in UTF-8 as expected, but all the bytes greater than 127 get encoded with \xx. Is it ever possible to dump unescaped UTF-8 strings? I believe YAML spec allows it.

Andrey
From: Kirill S. <xi...@ga...> - 2006-07-02 06:42:10
Hi Andrey,

On Sat, Jul 01, 2006 at 02:28:39PM +0700, Andrey wrote:
> Hi,
>
> I am using PySyck for my project. One of the things I definitely need is
> dumping Unicode strings in UTF-8. When I pass some Unicode data to
> syck.dump(), it is encoded in UTF-8 as expected, but all the bytes greater
> than 127 get encoded with \xx. Is it ever possible to dump unescaped UTF-8
> strings? I believe YAML spec allows it.

No, Syck does not support this.

However if speed is not a constraint, you may switch to PyYAML, which allows it with

yaml.dump(..., allow_unicode=True)

-- xi
From: Andrey <ae...@go...> - 2006-07-02 06:49:42
Kirill Simonov wrote:
> No, Syck does not support this.
>
> However if speed is not a constraint, you may switch to PyYAML, which
> allows it with
>
> yaml.dump(..., allow_unicode=True)

Thanks, I've already discovered this nice feature. =) Your PyYAML rocks a lot, and I'll most likely switch to use it.

Andrey