From: Brian I. <in...@tt...> - 2003-08-05 00:45:30
|
Is there a default indentation if, on a block scalar, no explicit indentation is given, and none can be detected. _Why gives this example in his YamlInFiveMinutes tutorial: Concerning "Jim O'Connor": | You are receiving Jim O'Connor's mail for several reasons: - The nameplate on your mailbox still says his name. - He has told our postman that you screen his mail. - He is living in your ceiling. Since we can't detect the indentation on the first line, what do we do? I would bet it's a parse error, but maybe the default should be 1 column. Or maybe two. In the past we've argued about scanning forward, because of the lookahead problem. Maybe we could readdress this lookahead strictness. The lookahead problem exists because of the following assumption: In some implementations (like C possibly) it might not be efficient to read in an entire 50kb string into a growable buffer, just to analyze it. So the implementation might just read 2kb at a time into a static buffer and report the scalar chunk by chunk. If this is the case, the parser must know ahead of time what the indentation is. That's why we have the explicit indicator. If the indentation is not indicated by the first line of the scalar, then explicit indentation *must* be used. But frankly, that's just being over strict. All the implementations to date don't chunk scalars, and have stretchy buffers built right in. Also, most scalars are less than 2kb or 4kb or whatever. I'd say 99.999% or more. So why can't we just do the right thing? Examine the whole scalar, and make the line with the least indentation be the indicator. As for chunky parsers, they can just throw an exception when they've guessed wrong. They would have had to throw the exception anyway. Cheers, Brian |
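Brian's proposal above (examine the whole scalar and take the least-indented line as the indentation) can be sketched in a few lines. This is only an illustration, not code from any YAML parser: the function name `least_indent` is hypothetical, and ignoring blank or space-only lines is an assumption the message does not spell out. Note that it needs every line of the scalar before it can answer, which is exactly the lookahead cost under discussion.

```python
def least_indent(lines):
    """Return the smallest leading-space count among non-blank lines.

    `lines` holds the raw lines that follow the '|' header.
    Returns None when the scalar has no non-blank line at all.
    Assumption: blank and space-only lines do not take part in detection.
    """
    indents = [
        len(line) - len(line.lstrip(" "))
        for line in lines
        if line.strip(" ") != ""
    ]
    return min(indents) if indents else None


# Hypothetical layout: an empty first line, then text indented two columns.
scalar = ["", "  You are receiving", "  Jim O'Connor's mail", "    - The nameplate"]
print(least_indent(scalar))
```

A chunking parser could not run this as written; it would have to guess from the first chunk and raise an error later if a less-indented line turned up.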
From: Clark C. E. <cc...@cl...> - 2003-08-05 01:45:49
|
On Mon, Aug 04, 2003 at 05:45:11PM -0700, Brian Ingerson wrote: | Since we can't detect the indentation on the first line, what do we do? | I would bet it's a parse error, but maybe the default should be 1 | column. Or maybe two. Quotith the spec... 3.6.5 Literal A literal scalar is the simplest scalar form. No processing is performed on literal scalar characters aside from end of line normalization and stripping away the indentation. Indentation is detected from the first content line. Explicit indentation must be specified in case this yields the wrong result. So, this comes down to what it means by the "first content line", as this is somewhat unclear. I think it would mean "non-comment" line, but, we could update the spec to say "first line which is not an empty line or a comment". This would then fit with _why's usage. | The lookahead problem exists because of the following assumption: | | In some implementations (like C possibly) it might not be efficient | to read in an entire 50kb string into a growable buffer, just to | analyze it. So the implementation might just read 2kb at a time into | a static buffer and report the scalar chunk by chunk. And this is a good conservative assumption (think small devices). | But frankly, that's just being over strict. ... | So why can't we just do the right thing? Examine the whole scalar, and | make the line with the least indentation be the indicator. How about we just scan for the first non-blank line? Scanning the whole scalar for the line with the least indentation is, IMHO, not very clean and could even be dangerous (think a syntax error where someone accidentally un-indented 1/2 way through the scalar by typing "x" in vim on the way out of the editor). | As for chunky parsers, they can just throw an exception when they've | guessed wrong. They would have had to throw the exception anyway. I'd rather not have YAML compliance vary across parsers. Bings, Clark |
From: why t. l. s. <yam...@wh...> - 2003-08-05 03:47:59
|
Clark C. Evans (cc...@cl...) wrote: > > How about we just scan for the first non-blank line. Scannig the > whole scalar for the line with the least indentation is, IMHO, not > very clean and could even be dangerous (think a syntax error where > someone accidentlty un-indented 1/2 way through the scalar by > typing "x" in vim on the way out of the editor. > This is how Syck works. When I hit that pipe, I read in whitespace until I hit printables. It's just like any other nested content. Although, unlike most other nodes, I _do_ count newlines as I go. I figured these should be prefixed to the content. Is this true? Prefixed: | Content begins here, but actually opens with a couple of newlines. Mmm? _why |
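The approach _why describes (skip whitespace until the first printable character, counting newlines along the way and prefixing them to the content) might look roughly like the sketch below. This is not Syck's actual code; the function name `read_literal` and the line-based input are assumptions made for illustration, and no error checking is attempted for leading lines that contain more spaces than the content.

```python
def read_literal(lines):
    """Build a literal scalar value from the lines after the '|' header.

    Leading blank (or space-only) lines are counted and emitted as
    newlines in front of the content; indentation is taken from the
    first line that contains a printable character.
    """
    leading_newlines = 0
    for index, line in enumerate(lines):
        if line.strip(" ") == "":
            leading_newlines += 1
            continue
        indent = len(line) - len(line.lstrip(" "))
        body = [text[indent:] for text in lines[index:]]
        return "\n" * leading_newlines + "\n".join(body) + "\n"
    # Only blank lines were found: the value is just the counted newlines.
    return "\n" * leading_newlines


value = read_literal(["", "", "  Content begins here, but", "  actually opens with newlines."])
print(repr(value))
```

Only a counter is needed before the first content line, so this fits a streaming parser that reports the scalar chunk by chunk.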
From: Clark C. E. <cc...@cl...> - 2003-08-05 16:48:27
|
On Mon, Aug 04, 2003 at 09:45:29PM -0600, why the lucky stiff wrote: | Although, unlike most other nodes, I _do_ count newlines as I go. I | figured these should be prefixed to the content. Is this true? | | Prefixed: | | | | Content begins here, but | actually opens with a couple of | newlines. This actually isn't as cut and dry as it may seem. In the example below let '.' be a whitespace space character: Prefixed: | .... ....... ....Content.is.... ....here Specifying an unambiguous rule which would handle the case above could get quite ugly. I think the most "logical" interpretation for this would be "\n \n\nContent is \nhere\n" but I'm just not sure. Currently the spec is good that it rules this sort of ambiguity out... perhaps we could make a special case for... Prefixed: | ....Content is here Which could be interpreted as "\n\nContent is here\n". That is, we let a sequence of N starting new lines without any spaces grace a scalar. But even this is hard to specify, for example: Nested: ....Prefixed: | .... ....Content is here ick. this gets ugly fast, and is probably why we left out the case of leading spaces. The productions are already difficult enough without adding in this sort of magic. | > How about we just scan for the first non-blank line. Scannig the | > whole scalar for the line with the least indentation is, IMHO, not | > very clean and could even be dangerous (think a syntax error where | > someone accidentlty un-indented 1/2 way through the scalar by | > typing "x" in vim on the way out of the editor. | > | | This is how Syck works. When I hit that pipe, I read in whitespace | until I hit printables. It's just like any other nested content. How does your implementation handle the above "strange" cases? If it does so in a simple (clean) way then perhaps we could modify the spec to reflect this sort of behavior. But if it is a hairball, which, looking at the examples above, it probably is, then perhaps the spec is best as it is... 
leaving these cases of autodetection to be errors. Best, Clark |
From: Oren Ben-K. <or...@be...> - 2003-08-06 05:16:03
|
Clark C. Evans wrote: > This actually isn't as cut and dry as it may seem. In the > example below let '.' be a whitespace space character: > > Prefixed: | > .... > ....... > > ....Content.is.... > ....here > > Specifying an unambiguous rule which would handle the case > above could get quite ugly. > I think the most "logical" > interpretation for this would be "\n \n\nContent is \nhere\n" > but I'm just not sure. Nope, it is an error, because the first non-empty line contains leading spaces. It is all in the "more indented" and "less indented" concept. For example, the productions state: l-blk-empty-line-feed(n) ::= i-spaces(<=n) b-as-line-feed As long as the number of spaces is less than the indentation level, the line is considered "empty". Of course, the question is "what is the indentation level?". The parser needs to detect it. We already have a simple rule saying that the first non-empty line must not have any leading spaces. So if there's no explicit indentation, and a leading line contains only spaces - they must be all indentation spaces... This complicates the implementation a little bit; you need both a counter and a memory of the maximal number of spaces in the empty lines. Then, if it turns out the indentation is less than that maximum, the file is in error. For example ('.' stands for a space): ....block: | ...... ........Indented four spaces; above line is OK. => "\nIndented four spaces; above line is OK.\n" ....block: | ........ ......Indented two spaces; above line is in error. => Error. ....block: |2 ........ ......Above line is OK due to explicit indentation. => " \nAbove line is OK due to explicit indentation.\n" If you want to pinpoint the error, you need to also remember where the line with the maximal number of spaces was. That's still acceptable (you need three variables instead of one, big deal). 
The *real* problem is that if you want to generate a *list* of all the errors, you'd need to maintain an array with the number of spaces for each line... But that's really an overkill. This is a sick enough case that reporting just the "most offending" line, using a message such as "*some* empty leading lines are more indented than content" should be sufficient. > Currently the spec is good that it > rules this sort of ambiguity out... perhaps we could make > a special case for... > > Prefixed: | > > > ....Content is here > > Which could be interpreted as "\n\nContent is here\n". That is, > we let a sequence of N starting new lines without any spaces > grace a scalar. No need for a special case. Less indented => empty => newline. That's already in the rules. Think of it another way. I believe in the principle that providing an explicit indentation that is equal to the result of the detection of implicit indentation should not change the result. Seems reasonable, right? Now, for explicit indentation: Nested: ....Prefixed: |4 .... ....Content is here Means "\n\nContent is here". This means we only have two options about how to interpret: Nested: ....Prefixed: | .... ....Content is here Either it is in error (the spec today) or it means "\n\nContent is here". No third option. In both cases, Nested: ....Prefixed: | ...... ....Content is here Is an error. Seems simple enough... Have fun, Oren Ben-Kiki |
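The bookkeeping Oren describes (a running maximum over the leading space-only lines, plus the position of the line that set it) can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the spec or from any parser: the names `detect_block_indent` and `BlockIndentError` are hypothetical, and indentation is counted in absolute columns rather than relative to the parent node.

```python
class BlockIndentError(ValueError):
    """Raised when a leading empty line is more indented than the content."""


def detect_block_indent(lines):
    """Detect a block scalar's indentation from its first non-empty line.

    Tracks the maximal number of spaces seen in leading space-only lines
    and where that maximum occurred. If the first content line turns out
    to be less indented than that maximum, the input is in error.
    Returns None when no content line exists.
    """
    max_spaces = 0
    max_line = None
    for number, line in enumerate(lines):
        stripped = line.lstrip(" ")
        spaces = len(line) - len(stripped)
        if stripped == "":
            # Still in the leading empty lines: remember the widest one.
            if spaces > max_spaces:
                max_spaces, max_line = spaces, number
            continue
        if spaces < max_spaces:
            raise BlockIndentError(
                f"leading line {max_line} has {max_spaces} spaces, "
                f"but content is indented only {spaces}"
            )
        return spaces
    return None


# Oren's first example: a 6-space empty line, then content at column 8.
print(detect_block_indent(["      ", "        Indented four spaces; above line is OK."]))
```

Reporting only the most offending line, as suggested above, keeps the state to three variables; collecting every offending line would need a list of per-line space counts instead.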
From: Oren Ben-K. <or...@be...> - 2003-08-05 06:21:05
|
Clark C. Evans [mailto:cc...@cl...] wrote: > On Mon, Aug 04, 2003 at 05:45:11PM -0700, Brian Ingerson wrote: > | Since we can't detect the indentation on the first line, what do we > | do? I would bet it's a parse error, And so it is. > | but maybe the default > | should be 1 column. Or maybe two. > > Quotith the spec... > > > 3.6.5 Literal > > A literal scalar is the simplest scalar form. No processing is > performed on literal scalar characters aside from end of line > normalization and stripping away the indentation. Indentation > is detected from the first content line. Explicit indentation > must be specified in case this yields the wrong result. Also quoth the spec: 3.6.3 Explicit Indentation Typically the indentation level of a block scalar node is detected from its first content line. This detection fails when this first line is empty, contains a leading '#' character, or contains leading white space characters. In such cases YAML requires that the indentation level for the scalar node text content be given explicitly. This level is specified as the integer number of the additional indentation spaces used for the text content. So, Why's example requires explicit indentation. As it stands it is in error. > So, this comes down to what it means by the "first content line", as > this is somewhat unclear. I think it would mean "non-comment" line, > but, we could update the spec to say "first line which is not an > empty line or a comment". This would then fit with _why's usage. There can be no comment lines inside a block scalar (they may follow it). So there's no ambiguity about what "content line" means. Also, if one examines the productions: c-l-literal(n) ::= c-literal ==> that is, the '|' ns-ns-blk-modifiers? ==> none in this case s-b-trailing-comment ==> just the \n in this case l-l-literal-value(n)? ==> ok, let's see below... l-l-empty-trailing(n)? ( l-text-comment(n) l-comment(any)* )? 
l-l-literal-value(n) ::= l-l-literal-chunk(n)+ ==> let's see below again... l-l-literal-chunk(n) ::= l-blk-empty-line-feed(n)* ==> AHA! ( l-literal-text(n) | l-blk-empty-specific(n) ) So it is "clear" :-) that the empty line is content rather than comment in this case. Have fun, Oren Ben-Kiki |
From: Clark C. E. <cc...@cl...> - 2003-08-05 16:22:28
|
On Tue, Aug 05, 2003 at 08:57:06AM +0300, Oren Ben-Kiki wrote: | 3.6.3 Explicit Indentation | | Typically the indentation level of a block scalar node is detected | from its first content line. This detection fails when this first | line is empty, contains a leading '#' character, or contains | leading white space characters. Hmm. I wasn't expecting ", contains a leading '#' character", is this intended to prevent the following? this: | # this is an error commented: out YAML I'm not certain I like this because it makes it non-trivial to take a block of YAML (perhaps with a leading comment '#' ) and indent it as a block scalar. | In such cases YAML requires that the indentation level for the | scalar node text content be given explicitly. This level is | specified as the integer number of the additional indentation | spaces used for the text content. Yes, this is clear. My bad for missing this. | There can be no comment lines inside a block scalar (they may follow it). Right. And they should follow it at a different indentation. My bad for not remembering this decision (which is a very good one as it makes cases like this easier to digest). | l-l-literal-chunk(n) ::= | l-blk-empty-line-feed(n)* ==> AHA! | ( l-literal-text(n) | | l-blk-empty-specific(n) ) Right. The empty line is content. Ok. ... After your post, I now agree that the spec rules out Why's example. This brings up two issues: 1. Does '#' exception need to be in 3.6.3? I'm not clear why it is done this way. 2. Should we allow for the indentation to be set at the first non-empty line? This would probably be a good policy, and would not cause any "look-ahead" issues, as the number of blank lines could be stored in a counter. Thank you Oren. Best, Clark |
From: Oren Ben-K. <or...@be...> - 2003-08-06 04:59:41
|
Clark C. Evans wrote: > Hmm. I wasn't expecting ", contains a leading '#' > character", is this intended to prevent the following? > > this: | > # this is an error > commented: out YAML Yes. > I'm not certain I like this because it makes it non-trivial > to take a block of YAML (perhaps with a leading comment '#' ) > and indenting it as a block scalar. Hmmm. > 1. Does '#' exception need to be in 3.6.3? I'm not clear why > it is done this way. > > 2. Should we allow for the indentation to be set at the first > non-empty line? This would probably be a good policy, and > would not cause any "look-ahead" issues, as the number of > blank lines could be stored in a counter. Two good questions. We tended to "when in doubt, restrict" thinking we can always relax the rules "later". Is now "later"? I'm somewhat uncomfortable with #1. It somehow feels wrong to have a '#' not start a comment without explicit marker causing it to be treated as content. I can see the sense in it though (for indenting YAML text). I'll go either way here. I feel better about #2. As you point out it can easily be implemented by a counter, and there's no ambiguity involved at all (unlike the '#' case). Brian, what do you think? I can whip up a patched version of the spec in no time... But while we are at it, can we please announce it at least as a release candidate? We are very overdue for starting a formal freeze process. Have fun, Oren Ben-Kiki |
From: Clark C. E. <cc...@cl...> - 2003-08-06 18:12:32
|
On Wed, Aug 06, 2003 at 07:51:22AM +0300, Oren Ben-Kiki wrote: | > 1. Does '#' exception need to be in 3.6.3? I'm not clear why | > it is done this way. | | I'm somewhat uncomfortable with #1. It somehow feels wrong to have a '#' | not start a comment without explicit marker causing it to be treated as | content. I can see the sense in it though (for indenting YAML text). | I'll go either way here. Ahh. I think you are correct. This case is ambiguous: key: | # content or an empty scalar with a comment So, let's keep 3.6.3, but add the example above to the examples (if it isn't already there) so that we don't forget why this seemingly out-of-place exception is there. | > 2. Should we allow for the indentation to be set at the first | > non-empty line? This would probably be a good policy, and | > would not cause any "look-ahead" issues, as the number of | > blank lines could be stored in a counter. | | I feel better about #2. As you point out it can easily be implemented by | a counter, and there's no ambiguity involved at all (unlike the '#' case). Cool. Your other post seems like it matches the intent of _why's implementation and makes sense to me. It is a bit more to implement, but, till now we've not shied away from making the implementation tougher to have a more intuitive syntax. And, being able to have blank lines at the start of a literal scalar is quite intuitive. | I can whip up a patched version of the spec in no time... But while we | are at it, can we please announce it at least as a release candidate? We | are very overdue for starting a formal freeze process. Sure. And I'll work on that formal part like we agreed some 6 months ago. (*ducks for cover*) Best, Clark |
From: Brian I. <in...@tt...> - 2003-08-08 06:36:55
|
On 06/08/03 18:14 +0000, Clark C. Evans wrote: > On Wed, Aug 06, 2003 at 07:51:22AM +0300, Oren Ben-Kiki wrote: > | > 1. Does '#' exception need to be in 3.6.3? I'm not clear why > | > it is done this way. > | > | I'm somewhat uncomfortable with #1. It somehow feels wrong to have a '#' > | not start a comment without explicit marker causing it to be treated as > | content. I can see the sense in it though (for indenting YAML text). > | I'll go either way here. > > Ahh. I think you are correct. This case is ambiguous: > > key: | > # content or an empty scalar with a comment I don't see this as ambiguous. It is definitely content. > > So, let's keep 3.6.3, but add the example above to the > examples (if it isn't already there) so that we don't > forget why this seemingly out-of-place exception is there. > > | > 2. Should we allow for the indentation to be set at the first > | > non-empty line? This would probably be a good policy, and > | > would not cause any "look-ahead" issues, as the number of > | > blank lines could be stored in a counter. > | > | I feel better about #2. As you point out it can easily be implemented by > | a counter, and there's no ambiguity involved at all (unlike the '#' case). > > Cool. Your other post seems like it matches the intent of > _why's implementation and makes sense to me. It is a bit > more to implement, but, till now we've not shyed away from > making the implementation tougher to have a more intuitive > syntax. And, being able to have blank lines at the start > of a literal scalar is quite intuitive. I say, screw the lookahead and scan the whole scalar for the least indented line. That definitely does the most service to the user. For most implementations, this is a no brainer. For the C implementations on embedded devices, let them croak if the indentation can't be determined within the static buffer size. It's just a limitation of that parser. Just my opinion. 
I know I'm probably outvoted here, but at least try to convince me that your way is right. Cheers, Brian |
From: Oren Ben-K. <or...@be...> - 2003-08-08 07:37:47
|
Brian Ingerson [mailto:in...@tt...] wrote: > > Ahh. I think you are correct. This case is ambiguous: > > > > key: | > > # content or an empty scalar with a comment > > I don't see this as ambiguous. It is definitely content. I don't see anything "definite" about it. Consider: --- Key0: | # Thing (is this comment or content?) --- ### Section 1 Key1: value # This key... Key2: | # That key... Key3: | # This one requires # a long comment (or # is it content?) Key4: | # This one doesn't, but... ### Section 2 (is this comment or content?) Key5: | # Not to mention: # Content or comment? ### Section 3 (this keeps getting better...) --- I agree that people may tend to read '# Thing' as content, but I'm certain that they'll read all the other cases as comments. I don't see how we can draft a rule that decides one way or the other. The spec does the right thing - announce the above cases as ambiguous (== "in such a terrible style that it is impossible to figure it out"). As you pointed out, it is very important to define what is _illegal_ as well as what is legal... I really think this is justified here. And yes, I know, this means that if you want to throw in a chunk of YAML as a literal block, and it starts with a comment, you need to provide an explicit indentation. BUT, it is also true that if you throw in a chunk of YAML as a literal block, and it ends with significant empty lines, you need to provide the '+' modifier. For example: --- # Original block literal: |+ --- Quoted: |4+ # Original block literal: |+ --- So you can't use a simple '|' to quote YAML-in-YAML. To be safe you need '|4+'. No big deal, IMVHO. > I say, screw the lookahead and scan the whole scalar for the > least indented line. That definitely does the most service to > the user. For most implementations, this is a no brainer. For > the C implementations on embedded devices, let them croak if > the indentation can't be determined within the static buffer > size. It's just a limitation of that parser. +1. 
I just feel better knowing a truly streaming parser is possible (even if it is more complicated). It is OK for a parser to have (reasonable) implementation restrictions. At some level, all computers are finite state machines pretending to be Turing machines anyway. Implementation restrictions are inevitable. Have fun, Oren Ben-Kiki |
From: Clark C. E. <cc...@cl...> - 2003-08-08 22:07:28
|
On Fri, Aug 08, 2003 at 10:17:21AM +0300, Oren Ben-Kiki wrote: | I agree that people may tend to read '# Thing' as content, but I'm | certain that they'll read all the other cases as comments. I don't see | how we can draft a rule that decides one way or the other. The spec does | the right thing - announce the above cases as ambiguous (== "in such a | terrible style that it is impossible to figure it out"). As you pointed | out, it is very important to define what is _illegal_ as well as what is | legal... I really think this is justified here. Anyway we can force full-line comments to start in column 0? --- # this is a comment this: is content # with a comment # another comment this: | is a scalar value # perhaps with a comment??? and this continues the scalar... Hmm. I actually like being able to add comments in column 0. Thoughts, or is this just way off? Best, Clark |
From: Shane H. (IEEE) <sha...@ie...> - 2003-08-08 23:46:40
|
I like that Expat calls you back for comments, and I think it would be a good idea for YAML as well. So in that vein, perhaps the parser should read the scalar with the comments included, and have the returned object have a setting to determine whether the comments should be included. With this, there could be a scalar block setting to specify whether to exclude comments by default in the content? Or, perhaps make the comment leader itself a block setting, with a setting for no-comments/all data? As for column 0 comments, I really don't care for the look of them. I like my comments to flow at the same level as my content -- whether it be code or data. ;) Just some thoughts. Thanks -Shane Clark C. Evans wrote: > On Fri, Aug 08, 2003 at 10:17:21AM +0300, Oren Ben-Kiki wrote: > | I agree that people may tend to read '# Thing' as content, but I'm > | certain that they'll read all the other cases as comments. I don't see > | how we can draft a rule that decides one way or the other. The spec does > | the right thing - announce the above cases as ambiguous (== "in such a > | terrible style that it is impossible to figure it out"). As you pointed > | out, it is very important to define what is _illegal_ as well as what is > | legal... I really think this is justified here. > > Anyway we can force full-line comments to start in column 0? > > --- > # this is a comment > this: is content # with a comment > # another comment > this: | > is a scalar value > # perhaps with a comment??? > and this continues the scalar... > > Hmm. I actually like being able to add comments in column 0. > Thoughts, or is this just way off? > > Best, > > Clark |
From: Oren Ben-K. <or...@be...> - 2003-08-09 05:49:08
|
Shane Holloway (IEEE)wrote: > I like that Expat calls you back for comments, and I think it > would be a good idea for YAML as well. > > So in that vein, perhaps the parser should read the scalar > with the comments included... There's no way to put comments "in" a scalar, just before/after it. So a callback of the type you want is easily possible. Allowing comments _in_ a scalar messes things up considerably - it really isn't worth the bother. > As for column 0 comments, I really don't care for the look of > them. I like my comments to flow at the same level as my > content -- whether it be code or data. ;) +10! Restricting them to column 0 is too restrictive. Have fun, Oren Ben-Kiki |
From: Brian I. <in...@tt...> - 2003-08-09 06:51:54
|
On 08/08/03 10:17 +0300, Oren Ben-Kiki wrote: > Brian Ingerson [mailto:in...@tt...] wrote: > > > Ahh. I think you are correct. This case is ambiguous: > > > > > > key: | > > > # content or an empty scalar with a comment > > > > I don't see this as ambiguous. It is definitely content. To me the following cases are all obvious. Here's my rule: In the absence of an explicit indentation indicator, all lines following the line with the pipe '|', are *content* until a line is reached that is the same or less indented than the line with the pipe. Dead simple. > I don't see anything "definite" about it. Consider: > > --- > Key0: | > # Thing (is this comment or content?) > --- > ### Section 1 > > Key1: value # This key... > Key2: | # That key... > > Key3: | # This one requires > # a long comment (or > # is it content?) > > Key4: | # This one doesn't, but... > > ### Section 2 (is this comment or content?) > > Key5: | # Not to mention: > # Content or comment? > > ### Section 3 (this keeps getting better...) > --- > > I agree that people may tend to read '# Thing' as content, but I'm > certain that they'll read all the other cases as comments. I don't see > how we can draft a rule that decides one way or the other. The spec does > the right thing - announce the above cases as ambiguous (== "in such a > terrible style that it is impossible to figure it out"). As you pointed > out, it is very important to define what is _illegal_ as well as what is > legal... I really think this is justified here. > > And yes, I know, this means that if you want to throw in a chunk of YAML > as a literal block, and it starts with a comment, you need to provide an > explicit indentation. BUT, it is also true that if you throw in a chunk > of YAML as a literal block, and it ends with significant empty lines, > you need to provide the '+' modifier. 
For example: > > --- > # Original block > literal: |+ > > --- > Quoted: |4+ > # Original block > literal: |+ > > --- > > So you can't use a simple '|' to quote YAML-in-YAML. To be safe you need > '|4+'. No big deal, IMVHO. Agreed, although I think accidentally losing trailing lines rarely causes any problem, where messing up indentation is much worse. > > I say, screw the lookahead and scan the whole scalar for the > > least indented line. That definitely does the most service to > > the user. For most implementations, this is a no brainer. For > > the C implementations on embedded devices, let them croak if > > the indentation can't be determined within the static buffer > > size. It's just a limitation of that parser. > > +1. I just feel better knowing a truly streaming parser is possible > (even if it is more complicated). It is OK for a parser to have > (reasonable) implementation restrictions. At some level, all computers > are finite state machines pretending to be Turing machines anyway. > Implementation restrictions are inevitable. OK. Streaming is still entirely possible as long as the parser reports which quoting style indicators were used, and the receiver maintains these same quoting style indicators. Let me give you a very simple example of why this is necessary: 1) A scalar with an embedded null can only be represented with double quotes. 2) If a parser reports scalars in 2k chunks... 3) And a scalar has a null character several megabytes into the string 4) The receiver would have to puke if it had chosen the wrong emission style. 5) So the safe thing is for the receiver to emit using the original style. The parser API that Mike Orr and I laid out at: http://yaml.freepan.org/index.cgi?ParserEmitterApi reports style. BTW, comments could pretty easily be added to this API. I think comments should be reported along with whether they were on their own line or not. Cheers, Brian |
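The rule Brian states at the top of this message (without an explicit indicator, every line after the pipe is content until a line that is indented the same as or less than the line holding the pipe) can be sketched as below. The function names are hypothetical, and one detail is an assumption the message leaves open: blank or space-only lines are treated as content rather than as terminators, since otherwise an empty line inside a scalar would end it.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def block_content(lines, pipe_index):
    """Collect the content lines of the block scalar opened at pipe_index.

    Lines after the '|' line are content until a non-blank line appears
    that is indented the same as, or less than, the '|' line itself.
    Assumption: blank and space-only lines never terminate the scalar.
    """
    limit = indent_of(lines[pipe_index])
    content = []
    for line in lines[pipe_index + 1:]:
        if line.strip(" ") != "" and indent_of(line) <= limit:
            break
        content.append(line)
    return content


document = [
    "Key0: |",
    "  # Thing (is this comment or content?)",
    "Key1: value # This key...",
]
print(block_content(document, 0))
```

Under this rule the '#' line is content simply because it is more indented than `Key0`, which is why the rule needs no lookahead beyond the current line.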
From: Clark C. E. <cc...@cl...> - 2003-08-09 14:34:13
|
On Fri, Aug 08, 2003 at 11:51:34PM -0700, Brian Ingerson wrote: | On 08/08/03 10:17 +0300, Oren Ben-Kiki wrote: | > Brian Ingerson [mailto:in...@tt...] wrote: | > > > Ahh. I think you are correct. This case is ambiguous: | > > > key: | | > > > # content or an empty scalar with a comment | > > | > > I don't see this as ambiguous. It is definitely content. | | To me the following cases are all obvious. Here's my rule: | | In the abscence of an explicit indentation indicator, all lines | following the line with the pipe '|', are *content* until a | line is reached that is the same or less indented than the line | with the pipe. Dead simple. So, according to this rule... | > --- | > Key0: | | > # Thing (is this comment or content?) content | > --- | > ### Section 1 comment | > Key1: value # This key... comment | > Key2: | # That key... | > single key | > Key3: | # This one requires comment | > # a long comment (or | > # is it content? second two lines are content | > Key4: | # This one doesn't, but... comment | > | > ### Section 2 (is this comment or content?) content | > Key5: | # Not to mention: comment | > # Content or comment? | > | > ### Section 3 (this keeps getting better...) content Let me add another example: --- # comment one: | # more comment # content this: is content # comment two: # comment nested: | content # comment another: | # content # comment # comment Oren, Does this work? If so, I'm game. | > > I say, screw the lookahead and scan the whole scalar for the | > > least indented line. That definitely does the most service to | > > the user. For most implementations, this is a no brainer. For | > > the C implementations on embedded devices, let them croak if | > > the indentation can't be determined within the static buffer | > > size. It's just a limitation of that parser. | > | > +1. I just feel better knowing a truly streaming parser is possible | > (even if it is more complicated). 
It is OK for a parser to have | > (reasonable) implementation restrictions. At some level, all computers | > are finite state machines pretending to be Turing machines anyway. | > Implementation restrictions are inevitable. Oren, you are a bit vague here. Here is what I think Brian is proposing (please correct me if I'm wrong Brian): The specification requires you to have random access to the entire scalar value (so that you can detect the least indented line). In this case, streaming YAML parsers which do not do this (infinite buffer size) are not compliant; however, such parsers could provide a user-defined maximum scalar buffer size. IMHO, we have worked very hard up to this point to reduce lookahead to a very minimal amount. I would very much like to keep it this way. While Brian's suggestion may be good, I'd rather not change direction at this point in time. | poem: | | YAML by Brian Ingerson | There once was a language called YAML, | That appealed to both reptile and mammal, | Though neither would agree, | On how scalars should be, | So they fight on; the snake and the camel. In particular, the above example is currently malformed YAML and I'd rather keep it that way. | OK. Streaming is still entirely possible as long as the parser reports | which quoting style indicators were used, and the receiver maintains these | same quoting style indicators. Ick. The point of the built-in scalar styles is so that the parser can handle this stuff for you. Am I mis-understanding? | Let me give you a very simple example of why this is necessary: | | 1) A scalar with an embedded null can only be represented with | double quotes. | 2) If a parser reports scalars in 2k chunks... | 3) And a scalar has a null character several megabytes into | the string | 4) The receiver would have to puke if it had chosen the wrong | emission style. | 5) So the safe thing is for the receiver to emit using the | original style. I think you are mixing up the emitter and parser requirements. 
Certainly if the entire scalar is not provided to the emitter or if the style is not given, then the emitter must use the double quoted style, "just in case". However, I don't think this is related to the parser issue above. | http://yaml.freepan.org/index.cgi?ParserEmitterApi Yep. | BTW, comments could pretty easily be added to this API. I comments | should be reported with whether they were on their own line or not. Right. On Fri, Aug 08, 2003 at 05:45:57PM -0600, Shane Holloway (IEEE) wrote: | I like that Expat calls you back for comments, and I think it would be a | good idea for YAML as well. Yes, the YAML parser should report comments. | So in that vein, perhaps the parser should read the scalar with the | comments included, and have the returned object have a setting to determin | whether the comments should be included. With this, there could be a | scalar block setting to specify whether to exclude comments by default in | the content? Or, perhaps make the comment leader itself a block setting, | with a setting for no-comments/all data? See Oren's response. Mixing content and comments gets ugly quick. | As for column 0 comments, I really don't care for the look of them. I | like my comments to flow at the same level as my content -- whether it | be code or data. ;) Brian's rule effectively does this (only that it is column 0 with respect to the current indentation level). On Wed, Aug 06, 2003 at 08:15:34AM +0300, Oren Ben-Kiki wrote: | l-blk-empty-line-feed(n) ::= | i-spaces(<=n) b-as-line-feed | | As long as the number of spaces is less than the indentation level, the | line is considered "empty". Of course, the question is "what is the | indentation level?". The parser needs to detect it. We already have a | simple rule saying that the first non-empty line must not have any | leading spaces. So if there's no explicit indentation, and a leading | line contains only spaces - they must be all indentation spaces... Ok. 
So, in this model, as long as the number of spaces in each leading all-spaces line is less than the indentation of the first non-whitespace printable, then all is ok.

| This complicates the implementation a little bit; you need both a
| counter and a memory of the maximal number of spaces in the empty lines.
| Then, if it turns out the indentation is less than that maximum, the
| file is in error. For example ('.' stands for a space):
|
| ....block: |
| ......
| ........Indented four spaces; above line is OK.
|
| => "\nIndented four spaces; above line is OK.\n"
|
| ....block: |
| ........
| ......Indented two spaces; above line is in error.
|
| => Error.
|
| ....block: |2
| ........
| ......Above line is OK due to explicit indentation.
|
| => " \nAbove line is OK due to explicit indentation.\n"

....block: |
....
......
........
......
....
........Six leading lines

| If you want to pinpoint the error, you need to also remember where the
| line with the maximal number of spaces was. That's still acceptable (you
| need three variables instead of one, big deal).

_why, does this sound reasonable to you? It makes sense to me.

Best,

Clark |
From: Oren Ben-K. <or...@be...> - 2003-08-09 19:00:34
|
> Brian Ingerson [mailto:in...@tt...] wrote: > | To me the following cases are all obvious. Here's my rule: > | > | In the abscence of an explicit indentation indicator, all lines > | following the line with the pipe '|', are *content* until a > | line is reached that is the same or less indented than the line > | with the pipe. Dead simple. And for implicit indent: - The first non-empty line sets the indent level; - It is an error for an all-spaces line preceding this first non-empty line to be longer than this indent level; - It is an error for a line following it to be more indented than the line with the '|', but less indented than the first non empty line. Clark C. Evans [mailto:cc...@cl...] wrote: > Oren, Does this work? If so, I'm game. It works. It is even streaming (no need for unbound lookahead buffers in the parser). I'll go with it... People will have to get used to YAML being more line-oriented in its comments (setting up comments "on the right side" just doesn't work). Well, since YAML is very line oriented anyway, I guess that's no big deal. I spent this weekend working on my parser generator (coming along...) instead of mucking with the spec. If we are all agreed on the above I'll create a version with the above in it. > | > +1. I just feel better knowing a truly streaming parser > | > is possible > | > (even if it is more complicated). It is OK for a parser to have > | > (reasonable) implementation restrictions. At some level, all > | > computers re finite state machines pretending to be > | > Turing machines > | > anyway. Implementation restrictions are inevitable. > > Oren, you are a bit vague here. Yeah :-) > Here is what I think Brian > is proposing (please correct me if I'm wrong Brian): > > The specification requires you to have random access to the > entire scalar value (so that you can detect the least indented > line). 
In this case, streaming YAML parsers which do not do
> this (infinite buffer size) are not compliant; however, such
> parsers could provide a user-defined maximum scalar buffer size.

I don't like it. What I meant is that while it is _possible_ to write a fully streaming YAML parser (no infinite lookahead), it may be _inconvenient_ to write it that way; in which case it is OK to write a parser using some bounded lookahead as long as it is "reasonable".

The above indentation rule is on the edge of streaming; streaming works fine for valid input, but if there's an error, a streaming parser is limited to reporting some constant number of offending indentation lines instead of all of them (it is guaranteed to detect all of them, it just can't report where they were). This limitation is probably acceptable to anyone who is streaming large amounts of YAML and insists on efficiency. Actually, it is probably acceptable for everyone (who wants more than, say, 10 offending line numbers for such a case? "Too many errors" is perfectly acceptable).

> IMHO, we have worked very hard up to this point to reduce
> lookahead to a very minimal amount. I would very much like
> to keep it this way. While Brian's suggestion may be good,
> I'd rather not change direction at this point in time.

+1. Luckily, the above rule is compatible with streaming parsers (just barely :-). So we don't have to argue about this <grin>.

> | poem: |
> | YAML by Brian Ingerson
> | The once was a language called YAML,
> | That appealed to both reptile and mammal,
> | Though neither would agree,
> | On how scalars should be,
> | So they fight on; the snake and the camel.
>
> In particular, the above example is currently malformed YAML
> and I'd rather keep it that way.

+1.

> | Let me give you a very simple example of why this is necessary:
> |
> | 1) A scalar with an embedded null can only be represented with
> | double quotes.
> | 2) If a parser reports scalars in 2k chunks...
> | 3) And a scalar has a null character several megabytes into > | the string > | 4) The receiver would have to puke if it had chosen the wrong > | emission style. > | 5) So the safe thing is for the receiver to emit using the > | original style. > > I think you are mixing what the emitter and parser requirements. > Certainly if the entire scalar is not provided to the emitter > or if the style is not given, then the emitter must use the > double quoted style, "just in case". However, I don't think this is > related to the parser issue above. +1. Have fun, Oren Ben-Kiki |
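A minimal sketch of the implicit-indent rules discussed in this message, for readers who want to see them as code. It is not taken from any YAML implementation; the function name `read_literal_block` and its parameters are hypothetical, indentation is counted in absolute columns, and explicit indentation indicators and trailing-blank-line handling are left out. It keeps only a counter and a maximum, which is the streaming property being argued for.

```python
def read_literal_block(lines, parent_indent):
    """Collect content lines of a literal block with implicit indentation.

    `lines` are the lines following the line that carries the '|';
    `parent_indent` is the number of leading spaces on that '|' line.
    Both names are assumptions made for this sketch.
    """
    indent = None    # set by the first non-empty line
    max_blank = 0    # longest all-spaces line seen before `indent` is known
    content = []
    for number, line in enumerate(lines, start=1):
        text = line.lstrip(" ")
        spaces = len(line) - len(text)
        if text == "":
            if indent is None:
                max_blank = max(max_blank, spaces)
                content.append("")
            else:
                # Spaces beyond the indentation level count as content.
                content.append(line[indent:])
            continue
        if spaces <= parent_indent:
            break  # same or less indented than the '|' line: scalar ends
        if indent is None:
            indent = spaces
            if max_blank > indent:
                raise ValueError(
                    "line %d: an earlier all-spaces line is longer than "
                    "the detected indentation" % number)
        elif spaces < indent:
            raise ValueError(
                "line %d: less indented than the first non-empty line" % number)
        content.append(line[indent:])
    return content
```

Under these assumptions a line such as `  # content` inside the block is returned as content, while a `#` line at or left of the '|' line's indentation ends the scalar and is left for the caller to treat as a comment.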
From: Clark C. E. <cc...@cl...> - 2003-08-09 19:56:12
|
On Sat, Aug 09, 2003 at 09:00:00PM +0200, Oren Ben-Kiki wrote: | > Brian Ingerson [mailto:in...@tt...] wrote: | > | To me the following cases are all obvious. Here's my rule: | > | | > | In the abscence of an explicit indentation indicator, all lines | > | following the line with the pipe '|', are *content* until a | > | line is reached that is the same or less indented than the line | > | with the pipe. Dead simple. | | And for implicit indent: | - The first non-empty line sets the indent level; | - It is an error for an all-spaces line preceding this first non-empty | line to be longer than this indent level; | - It is an error for a line following it to be more indented than the | line with the '|', but less indented than the first non empty line. Ok. | It works. It is even streaming (no need for unbound lookahead buffers in | the parser). I'll go with it... People will have to get used to YAML | being more line-oriented in its comments (setting up comments "on the | right side" just doesn't work). Well, since YAML is very line oriented | anyway, I guess that's no big deal. YAML is already quite "line oriented", of course this doesn't affect comments which are at the end of a plain scalar # a comment this: is content # another comment and: | # yet another comment # content | I spent this weekend working on my parser generator (coming along...) | instead of mucking with the spec. If we are all agreed on the above I'll | create a version with the above in it. Well, it makes sense to me. Let us hear back from Brian. _why, do you have an opinion on this one? Best, Clark |
From: Brian I. <in...@tt...> - 2003-08-09 21:10:16
|
On 09/08/03 19:58 +0000, Clark C. Evans wrote: > On Sat, Aug 09, 2003 at 09:00:00PM +0200, Oren Ben-Kiki wrote: > | > Brian Ingerson [mailto:in...@tt...] wrote: > | > | To me the following cases are all obvious. Here's my rule: > | > | > | > | In the abscence of an explicit indentation indicator, all lines > | > | following the line with the pipe '|', are *content* until a > | > | line is reached that is the same or less indented than the line > | > | with the pipe. Dead simple. > | > | And for implicit indent: > | - The first non-empty line sets the indent level; > | - It is an error for an all-spaces line preceding this first non-empty > | line to be longer than this indent level; Not sure about this one. You think this is an error: foo: | ...... ..this.and ..that While I think it should mean: " \nthis and\nthat\n" Any problem with that? Oh, I get it. Lookahead. Right. > | - It is an error for a line following it to be more indented than the > | line with the '|', but less indented than the first non empty line. > > Ok. > > | It works. It is even streaming (no need for unbound lookahead buffers in > | the parser). I'll go with it... People will have to get used to YAML > | being more line-oriented in its comments (setting up comments "on the > | right side" just doesn't work). Well, since YAML is very line oriented > | anyway, I guess that's no big deal. > > YAML is already quite "line oriented", of course this doesn't > affect comments which are at the end of a plain scalar > > # a comment > this: is content # another comment > and: | # yet another comment > # content > > | I spent this weekend working on my parser generator (coming along...) > | instead of mucking with the spec. If we are all agreed on the above I'll > | create a version with the above in it. > > Well, it makes sense to me. Let us hear back from Brian. _why, > do you have an opinion on this one? I'll go with it. 
I didn't expect to get too far with my proposal for scanning entire scalars, but I thought it needed to be said. I guess it helped us refine the literal scalar a bit better. Cheers, Brian |
From: Oren Ben-K. <or...@be...> - 2003-08-09 21:19:54
|
Brian Ingerson [mailto:in...@tt...] wrote: > I'll go with it. I didn't expect to get too far with my > proposal for scanning entire scalars, but I thought it needed > to be said. I guess it helped us refine the literal scalar a > bit better. Great! I'll spec it out next week, then. Have fun, Oren Ben-Kiki |
From: Clark C. E. <cc...@cl...> - 2003-08-09 21:21:00
|
On Sat, Aug 09, 2003 at 02:09:55PM -0700, Brian Ingerson wrote: | Not sure about this one. You think this is an error: | | foo: | | ...... | ..this.and | ..that | | While I think it should mean: | | " \nthis and\nthat\n" | | Any problem with that? | Well, I'm not sure how you would generalize this: foo: | ...... . ..this.and ..that What would this one be? " \n\n\n this and\n that\n" | Oh, I get it. Lookahead. Right. Yes, Lookahead, but also some ambiguity. | I'll go with it. I didn't expect to get too far with my proposal for | scanning entire scalars, but I thought it needed to be said. I guess it | helped us refine the literal scalar a bit better. Your feedback for handling scalars starting with the comment indicator '#' was very helpful. The spec is quite icky in that respect... it is much better with this fix. Bings, Clark |
From: Clark C. E. <cc...@cl...> - 2003-08-08 22:05:09
|
On Thu, Aug 07, 2003 at 11:36:32PM -0700, Brian Ingerson wrote: | > key: | | > # content or an empty scalar with a comment | | I don't see this as ambiguous. It is definitely content. I suppose that it wouldn't be ambiguous if we didn't allow comments with only preceding whitespace. --- # this is therefore an error and: | # this is clearly content Yea? | > | > 2. Should we allow for the indentation to be set at the first | > | > non-empty line? This would probably be a good policy, and | > | > would not cause any "look-ahead" issues, as the number of | > | > blank lines could be stored in a counter. | | I say, screw the lookahead and scan the whole scalar for the least | indented line. That definitely does the most service to the user. I don't like this for a few reasons: 0. The above case, allowing blank lines to start the scalar, handles the 90% of the unusual use cases. In particular, the 10% could be errors, (see below). 1. It is possible that actual errors will not be caught. For example: --- key: | This scalar is indented four spaces and goes on and on for several lines, etc. ... # 100 lines later oops! ... The oops is actually a mistake by the user, and the YAML parser instead indents the whole scalar. 2. It requires random access to the entire scalar value, this really makes it infiesable to use YAML as a streaming processor in some (admittedly rare) cases. We are quite "streaming friendly" no point in giving this up, especially for a case that we already handle (just specify explicit indentation or, use only blank lines). For example, someone may want to use a base64 value, processed incrementally. | For most implementations, this is a no brainer. For the C implementations on | embedded devices, let them croak if the indentation can't be determined | within the static buffer size. It's just a limitation of that parser. Ick. I don't want YAML processing to be dependent upon which parser you have... no? 
I know that this is the current state of things, but I hope we can fix this.

| Just my opinion. I know I'm probably outvoted here, but at least try to
| convince me that your way is right.

How's that? BTW, thanks for challenging the first part. Indeed, there is another option which I didn't consider... It is so great to have you and Oren poking holes in stuff.

Best,

Clark |
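To make point 1 concrete, here is a small sketch comparing the two detection strategies on the "oops!" case. The helper names are hypothetical and the code is only an illustration of the argument, not any parser's behaviour; it assumes every line passed in already belongs to the scalar.

```python
def leading_spaces(line):
    """Number of leading space characters on a line."""
    return len(line) - len(line.lstrip(" "))


def strip_by_least_indent(lines):
    """Whole-scalar scan: the least indented non-blank line sets the indent."""
    nonblank = [l for l in lines if l.strip(" ")]
    indent = min(leading_spaces(l) for l in nonblank) if nonblank else 0
    return [l[indent:] for l in lines]


def strip_by_first_line(lines):
    """First non-blank line sets the indent; a later dedent is an error."""
    indent = None
    out = []
    for l in lines:
        if not l.strip(" "):
            out.append("")
            continue
        n = leading_spaces(l)
        if indent is None:
            indent = n
        elif n < indent:
            raise ValueError("line is less indented than the first content line")
        out.append(l[indent:])
    return out


block = [
    "    This scalar is indented four spaces",
    "    and goes on and on",
    "  oops!",
]
```

With the whole-scalar scan, the accidental dedent silently becomes the indentation level and every earlier line gains two leading spaces; with first-line detection the same input is reported as an error.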
From: Brian I. <in...@tt...> - 2003-08-09 07:21:36
|
On 08/08/03 22:07 +0000, Clark C. Evans wrote: > On Thu, Aug 07, 2003 at 11:36:32PM -0700, Brian Ingerson wrote: > | > key: | > | > # content or an empty scalar with a comment > | > | I don't see this as ambiguous. It is definitely content. > > I suppose that it wouldn't be ambiguous if we didn't allow > comments with only preceding whitespace. > > --- > # this is therefore an error > and: | > # this is clearly content > > Yea? > > | > | > 2. Should we allow for the indentation to be set at the first > | > | > non-empty line? This would probably be a good policy, and > | > | > would not cause any "look-ahead" issues, as the number of > | > | > blank lines could be stored in a counter. > | > | I say, screw the lookahead and scan the whole scalar for the least > | indented line. That definitely does the most service to the user. > > I don't like this for a few reasons: > > 0. The above case, allowing blank lines to start the scalar, > handles the 90% of the unusual use cases. In particular, > the 10% could be errors, (see below). Disagree. poem: | YAML by Brian Ingerson The once was a language called YAML, That appealed to both reptile and mammal, Though neither would agree, On how scalars should be, So they fight on; the snake and the camel. To show but one use case. > > 1. It is possible that actual errors will not be caught. > For example: > > --- > key: | > This scalar is indented four spaces and goes on and on > for several lines, etc. > ... # 100 lines later > oops! > ... > The oops is actually a mistake by the user, and the YAML > parser instead indents the whole scalar. C'mon! So what. Mistake in, mistake out. That's how computers work. This is just FUD. > 2. It requires random access to the entire scalar value, this > really makes it infiesable to use YAML as a streaming processor > in some (admittedly rare) cases. 
We are quite "streaming
> friendly" no point in giving this up, especially for a case that
> we already handle (just specify explicit indentation or, use
> only blank lines). For example, someone may want to use a
> base64 value, processed incrementally.

But base64 values can always be formatted without wacky indentation constraints. So that's not a valid use case.

I'm just saying that you can still have this streaming parser/filter/emitter set up, but you'll have to be able to determine the indentation in the first 2k or use an explicit indicator. I don't see what's so wrong about failing after 2k. In your method (like your example 1 above), you would just fail anyway.

For the other 99% of applications, reading in a large scalar is just no big deal.

Either way, we're making an optimization. Let's optimize in favor of DWIMity for the user.

Cheers,

Brian |
From: Clark C. E. <cc...@cl...> - 2003-08-09 18:49:56
|
On Sat, Aug 09, 2003 at 12:21:14AM -0700, Brian Ingerson wrote:
| > 0. The above case, allowing blank lines to start the scalar,
| > handles the 90% of the unusual use cases. In particular,
| > the 10% could be errors, (see below).
|
| Disagree.

Well, your example was quite contrived, albeit quite witty. ;) I *do* like your take on comments (unless holes can be poked into it).

| > 1. It is possible that actual errors will not be caught.
|
| C'mon! So what. Mistake in, mistake out. That's how computers work.

Perhaps. You are right, this isn't very motivating. ;)

| > 2. It requires random access to the entire scalar value, this
|
| But base64 values can always be formatted without wacky indentation
| constraints. So that's not a valid use case.

Right, I was being flippant. True, base64 values should be done as a folded scalar. Speaking of which, some of the current decisions we make here would also apply to folded scalars.

Ok. So perhaps all literal scalars should be handled as a single chunk then, requiring random access to the whole thing. I don't like it (more of a gut feeling), but I'd like to hear _why's implementation comments, and a clarification from Oren.

| I'm just saying that you can still have this streaming
| parser/filter/emitter set up, but you'll have to be able to determine the
| indentation in the first 2k or use an explicit indicator. I don't see
| what's so wrong about failing after 2k. In your method (like your
| example 1 above), you would just fail anyway.

I'd rather not have YAML fail in one parser and then succeed in another.

Best,

Clark |
From: Brian I. <in...@tt...> - 2003-08-09 21:13:30
|
On 09/08/03 18:52 +0000, Clark C. Evans wrote:
> On Sat, Aug 09, 2003 at 12:21:14AM -0700, Brian Ingerson wrote:
> | > 0. The above case, allowing blank lines to start the scalar,
> | > handles the 90% of the unusual use cases. In particular,
> | > the 10% could be errors, (see below).
> |
> | Disagree.
>
> Well, your example was quite contrived, albeit quite witty. ;)
> I *do* like your take on comments (unless holes can be poked
> into it).
>
> | > 1. It is possible that actual errors will not be caught.
> |
> | C'mon! So what. Mistake in, mistake out. That's how computers work.
>
> Perhaps. You are right, this isn't very motivating. ;)
>
> | > 2. It requires random access to the entire scalar value, this
> |
> | But base64 values can always be formatted without wacky indentation
> | constraints. So that's not a valid use case.
>
> Right, I was being flippant. True, base64 values should be done as a
> folded scalar. Speaking of which, some of the current decisions we
> make here would also apply to folded scalars.
>
> Ok. So perhaps all literal scalars should be handled as a single
> chunk then, requiring random access to the whole thing. I don't
> like it (more of a gut feeling), but I'd like to hear _why's
> implementation comments, and a clarification from Oren.
>
> | I'm just saying that you can still have this streaming
> | parser/filter/emitter set up, but you'll have to be able to determine the
> | indentation in the first 2k or use an explicit indicator. I don't see
> | what's so wrong about failing after 2k. In your method (like your
> | example 1 above), you would just fail anyway.
>
> I'd rather not have YAML fail in one parser and then succeed in another.

Maybe we need one of our good old irc summits. I'll try to hang out on #yaml a bit more. At least this week.

But for now I'm willing to forgo this random access request. The explicit indicator solves everything. I was just trying to avoid it.

Cheers,

Brian |