From: Clark C . E. <cc...@cl...> - 2002-06-17 21:25:38
|
Summary: >
  Oren, Brian and I met for a short while on IRC today and had a
  brief chat over the "simple scalar". Let me first give a bit of
  background explanation as to the issues grappled with.

Background:
  - >
    First, this scalar style satisfies the need for simple string
    values, such as enumerations, etc. It is not folded, escaped,
    or quoted. I'd like to note that originally this scalar kind
    was called "folded" and could span multiple lines. Later on we
    divided the scalar type into a one-line variety and a multi-line
    folded variety to reduce many complications.
  - >
    Second, with the introduction of a type system in YAML, we
    wanted a way to express common types in an implicit manner that
    didn't hurt readability. Integers, floating point values, and
    date/time are the largest use cases. We generalized from the
    core use cases above to many use cases by using a regular
    expression mechanism. Basically, the content of this scalar
    style is compared to each regular expression in the type
    library; if the regular expression matches, then it is typed
    accordingly. If none of the expressions match, then it is an
    implicit type error.
  - >
    Third, there are a few complications. The scalar style can be
    used as a key, and as such cannot contain ": " or it would
    signify the end of the key. Also, when used as a value, it
    cannot use the ": " due to the "- key: value" short-hand trick
    we have. Further, it can also be used in the context of inline
    mappings and values. In this case a value cannot contain
    '[', ']', '{', '}', ': ' and ', '.
  - >
    About two weeks ago, we loosened up things in the spec to allow
    in-line styles to span multiple lines. This is a change that we
    have yet to fully grapple with, as it was a complicating factor
    that we set aside a long time ago in the interest of simplicity;
    but now, after getting quite a bit of user feedback, we feel it
    is needed for general acceptance.
  - >
    Furthermore, about this time frame (or a bit thereafter) we
    tightened up the regular expression for "text" matching to
    include only word characters instead of matching any string
    starting with an alpha char. This is what prompted the current
    meeting. This restriction was passed on a short voice vote
    without enough consideration and we are very happy that the
    YAML community has questioned this decision.

Goals:
  - >
    We'd like to maintain an implicit type system so that it is
    easy to express and read integers, dates, null, and floating
    point values.
  - >
    Ideally, we'd like it open to allow other sorts of implicit
    types to be added in the future, for instance, a URL or a YPATH
    expression. Some of us feel that we could fix the implicit
    types at the above set and make all other types explicit.
  - >
    We'd like to keep the simple scalar simple for common one-line
    strings. This is especially important for enumerated values,
    etc.
  - >
    A few of us would like to use the simple scalar again for
    multi-line folded-like strings (although there is not universal
    agreement here), especially the usage for human-written
    paragraphs:

      example: Some would like, very much so, to be able to use
               this style for documentation (and other stuff...)
               that spans multiple lines in the original folded
               manner -- with full, access to punctuation!
  - >
    We'd like to keep it intuitive and easy to implement, although
    implementation can be harder if that is required to make it
    easier to use.

Note: >
  A few people are getting confused about what the simple scalar
  production allows and what the text regular expression is. In
  general, the scalar production applies to all simple scalars,
  having implied types (including text) or otherwise, while the
  text regular expression is applied to content allowable by the
  production to determine if it is a "text" according to the
  implicit rules. Note that restrictions to the regular expression
  do not apply if you simply use !text before the content...

Implicit Typing Options:
  - >
    The first option, current in the spec, is that the implicit
    expression only allows word characters and the space. This is
    deemed problematic since people want to use punctuation. Note
    that this keeps many implicit types (such as a URI) open for
    definition in the future.
  - >
    The second option is to further restrict the expression so that
    a space is not allowed. This solves the problem above by making
    it impossible to use the simple scalar for paragraphs without
    an explicit !text marker.
  - >
    The third option is to separate the implicit type detection
    from the simple scalar altogether and make another indicator
    out of it. The stand-alone ! was proposed for this purpose.
    Another proposal would use parentheses, so that implicitly
    typed things are always enclosed in parentheses;
    (mailto:cc...@cl...), for example, would be implicitly typed as
    a URI. In this extreme, lacking the indicator would mean that
    the type is text, so null values, dates, integers, etc. would
    all require an extra indicator, either parentheses or ! or
    something else.
  - >
    The fourth option is to roll back a bit. In this option we say
    that all simple scalars starting with an alpha character are
    implicitly typed as text. This means that mailto:cc...@cl...
    cannot be implicitly typed as a URI, and that '32 Walker Drive'
    will not be implicitly typed as a string. This option would
    allow for most paragraphs, but provide for the ability to
    introduce new implicit types in the future as long as they
    don't start with an alphabetic character. By convention we
    could use parentheses for implicit types which would otherwise
    start with an alphabetic character, such as a URI.
  - >
    The fifth option is to fix the implicit types, at least for
    this version. In this option anything not covered by an
    existing regular expression is a string. This option makes
    version numbers very important, as the type of a node can
    depend upon the version, and getting the version number wrong
    can cause information model problems that do not show up as
    parse errors.
  - >
    The sixth option is to eliminate implicit types altogether, and
    make everything a string.

Production Options:
  - >
    Have one simple scalar production which is used as both
    key/value and both nested within an in-line map/scalar and
    otherwise. This option prevents ", " from being used in top
    level scalars.
  - >
    Have two simple scalar productions. One for non-nested cases
    where ", " is useable; and another for the nested case where
    ", " isn't useable. Note that ": " is not useable in any
    circumstance.
  - >
    Go with 3/4 productions that reflect both necessary and
    sufficient restrictions but with added complexity.

Choices:
  - >
    Thus far, we've chosen to go with two simple scalar
    productions. We chose this since we'd like to be able to write
    paragraphs using the simple form and since four productions are
    more complicated without any clear value.
  - >
    Thus far, we've chosen to go with the fourth implicit typing
    option. There are several factors. First, this is how we had
    it, it works fairly well, and current data/parsers use this
    setup. The other options either didn't give us the ability to
    introduce other implicit types, such as currency, or were too
    restrictive on the text expression. In some ways it is
    reassuring to go back to a previous choice.

Thoughts: >
  After the meeting, I had a few questions/reservations about how
  simple scalars would be used in a multi-line (not in-line) key.
  I'm wondering if the third production option wouldn't be more
  prudent from a readability perspective; that is, restricting
  in-line keys to a single line instead of allowing them to be
  multi-line. Remember, we have the ? : syntax for this sort of
  thing... this would probably enhance readability; otherwise I can
  see the whole thing getting ugly.

Closing: >
  I hope that this accurately reflects the ideas discussed and
  informs the YAML user community about the factors involved and
  why the current solution emerged. Oren won't implement the spec
  changes till this next weekend... so you have till then to make a
  compelling case for your favorite option.
|
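The regular-expression mechanism described under Background, combined
with the fourth implicit typing option, could be sketched roughly as
follows. This is only an illustration: the type names, the patterns,
and the names TYPE_LIBRARY, ImplicitTypeError and implicit_type are
assumptions made for this sketch and are not taken from the YAML spec.

```python
import re

# Hypothetical type library: (type name, pattern) pairs, tried in order.
# The patterns are illustrative only, not the ones in the YAML spec.
TYPE_LIBRARY = [
    ("null", re.compile(r"~")),
    ("int", re.compile(r"[-+]?[0-9]+")),
    ("float", re.compile(r"[-+]?[0-9]+\.[0-9]+([eE][-+]?[0-9]+)?")),
    ("date", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")),
    # Fourth option: a simple scalar starting with an alpha char is text.
    ("text", re.compile(r"[A-Za-z].*")),
]


class ImplicitTypeError(ValueError):
    """Raised when no expression in the type library matches."""


def implicit_type(content: str) -> str:
    """Return the name of the first type whose pattern matches content."""
    for name, pattern in TYPE_LIBRARY:
        if pattern.fullmatch(content):
            return name
    raise ImplicitTypeError(f"no implicit type matches {content!r}")
```

Under these assumed patterns, a scalar such as '32 Walker Drive'
matches nothing and is reported as an implicit type error, which is
the behaviour the fourth option describes for scalars that do not
start with an alphabetic character.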
From: Clark C . E. <cc...@cl...> - 2002-06-17 22:33:13
|
In the previous e-mail I attempted to summarize the consensus of
the group (and I may not have done it right); in this response I'd
like to insert more of my personal convictions.

| A few of us would like to use the simple scalar again for multi-line
| folded-like strings (although there is not universal agreement here)
| especially the usage for human-written paragraphs:
|
|   example: Some would like, very much so, to be able to use
|            this style for documentation (and other stuff...)
|            that spans multiple lines in the original folded
|            manner -- with full, access to punctuation!

See my previous post on this subject. I think that this is a bad
idea, as good as it seems to look. It is making our implicit
decisions harder and raises the question of what we do about
multi-line keys.

I'm now of the opinion that simple scalars should be primarily used
for enumeration values, and for other cases, the quoted forms do a
great job.

| - >
|   The first option, current in the spec, is that the implicit expression
|   only allows word characters and the space. This is deemed problematic
|   since people want to use punctuation. Note that this keeps many implicit
|   types (such as a URI) open for definition in the future.

I agree that this kinda sucks. It is especially bad since allowing
spaces makes people want to use other characters.

| - >
|   The second option is to further restrict the expression so that a
|   space is not allowed. This solves the problem above by making it
|   impossible to use the simple scalar for paragraphs without an
|   explicit !text marker.

This is my primary choice; Neil's (Freudian?) suggestion. I think
it is the cleanest and easiest to explain; it also has precedent, as
CSV files require quotes if spaces are used within a value. It also
gives us the most flexibility for implicit types. As Brian says,
strings are king. However, strings have single quoted, double
quoted, folded, and block styles. I have no problem limiting the
implicit detection of the string data type to simple enumeration
values. The chief downside to this is that there are existing YAML
documents which must be converted. I say better now... be
restrictive. Also, this option is fixable later.

| - >
|   The third option is to separate the implicit type detection from
|   the simple scalar altogether and make another indicator out of it.
|   The stand-alone ! was proposed for this purpose. Another proposal
|   would use parentheses so that implicitly typed things are always
|   enclosed in parentheses; (mailto:cc...@cl...), for
|   example, would be implicitly typed as a URI. In this extreme,
|   lacking the indicator would mean that the type is text, so
|   null values, dates, integers, etc. would all require an extra
|   indicator, either parentheses or ! or something else.

I think that this kinda defeats the whole purpose of implicit
types. I think that the bang by itself is unclear or confusing. If
we weaken implicit typing this much, we should just drop the idea
and require explicit typing everywhere.

| - >
|   The fourth option is to roll back a bit. In this option we say
|   that all simple scalars starting with an alpha character are
|   implicitly typed as text. This means that mailto:cc...@cl...
|   cannot be implicitly typed as a URI, and that '32 Walker Drive'
|   will not be implicitly typed as a string. This option would allow
|   for most paragraphs, but provide for the ability to introduce new
|   implicit types in the future as long as they don't start with
|   an alphabetic character. By convention we could use parentheses
|   for implicit types which would otherwise start with an alphabetic
|   character, such as a URI.

This is the compromise that I agreed to. It gives us flexibility in
the future while not requiring a data conversion today. Basically,
it is the status quo of about 2-3 weeks ago.

| - >
|   The fifth option is to fix the implicit types; at least for this
|   version. In this option anything not covered by an existing
|   regular expression is a string. This option makes version numbers
|   very important as the type of a node can depend upon it and
|   getting the version number wrong can cause information model
|   problems that do not show up as parse errors.

This is quite a dangerous option; see Neil's post.

| - >
|   The sixth option is to eliminate implicit types altogether,
|   and make everything a string.

Or explicitly typed; this is better than #3, IMHO.

| Production Options:
|   - >
|     Have one simple scalar production which is used as
|     both key/value and both nested within an in-line map/scalar
|     and otherwise. This option prevents ", " from being used
|     in top level scalars.
|   - >
|     Have two simple scalar productions. One for non-nested
|     cases where ", " is useable; and another for the nested
|     case where ", " isn't useable. Note that ": " is not
|     useable in any circumstance.
|   - >
|     Go with 3/4 productions that reflect both necessary and
|     sufficient restrictions but with added complexity.

I agreed to #2 in the meeting before I thought about how multi-line
simple and quoted scalars are going to complicate mapping key
readability. This brought back all kinds of memories as to why we
didn't go down this road some 9-12 months ago. I'll still go with
#2; however, I'd like to withdraw multi-line simple and quoted
scalars.

To summarize my take:

I prefer Neil's proposal (accident that it may be) as it is the
cleanest and easiest to understand; it also gives us the most
options in the future.

I'm willing to compromise with the status quo of a few weeks past
(items starting with alpha are text).

I'd also strongly like to withdraw multi-line simple and quoted
scalars. These, I feel, are a mistake.

I'm even willing to withdraw in-line collections that span multiple
lines, although I think that this is tossing the baby out with the
bathwater.

I really am thrilled by the level of participation in the user
community on these issues!

Best,

Clark
|
From: Brian I. <in...@tt...> - 2002-06-17 23:27:07
|
On 17/06/02 18:42 -0400, Clark C . Evans wrote:
> In the previous e-mail I've attempted to summarize the
> consensus of the group (and I may not have done it right)
> in this response I'd like to insert more of my personal
> convictions.
>
> | A few of us would like to use the simple scalar again for multi-line
> | folded-like strings (although there is not universal agreement here)
> | especially the usage for human-written paragraphs:
> |
> |   example: Some would like, very much so, to be able to use
> |            this style for documentation (and other stuff...)
> |            that spans multiple lines in the original folded
> |            manner -- with full, access to punctuation!
>
> See my previous post on this subject. I think that this
> is a bad idea, as good as it seems to look. It is making
> our implicit decisions harder and begs the question about
> what we do about multi-line keys.
>
> I'm now of the opinion that simple scalars should be primarily
> used for enumeration values, and for other cases, the quoted
> forms do a great job.

I don't see much of a problem, Clark. Remember the simple principle
that when dealing with inlines, you first simply (conceptually) join
all of the lines within the current indentation level. After that
you simply parse as if it were all one big line.

The inline spanning multiple lines is little more than a line
continuation feature. Please show me some examples where this breaks
down.

Also remember that you can't use nesting indicators (? | >) within
inlines. So I think that you really do want to support scalars
spanning lines.

Cheers, Brian
|
From: Clark C . E. <cc...@cl...> - 2002-06-18 00:53:30
|
| I don't see much of a problem Clark. Remember the simple principle that when
| dealing with inlines, you first simply (conceptually) join all of the lines
| within the current indentation level. After that you simply parse as if it
| were all one big line.

I'm more taken aback by the added complexity. It allows for just a
few too many ways to do it... if you have a multi-line string, just
use folded or block, no? I also remember that we had some issues,
especially using keys, but I can't recall specifics at the moment.
Overall, are you sure the added complexity is worth it?

---
Hi Ho, Kermit The Frog here with yet another
Muppet News Flash: Today in the news the Cookie
Monster has given up cookies! Let's go right
to the source and ask the monster himself: Cookie
Monster, how is life without Cookies? "Ohh!
Cookies! Yum Yum : What would I do without Cookies!"

| The inline spanning multiple lines is little more than a line continuation
| feature. Please show me some examples where this breaks down.

I'm less concerned with it breaking down than with it adding a
bunch of unneeded complexity.

| Also remember that you can't use nesting indicators (? | >) within inlines.
| So I think that you really do want to support scalars spanning lines.

Why? If your in-lines are that big, you shouldn't be using inlines.
It's pretty simple. By not allowing multi-line scalars we enforce
good practice, making YAML easy to read. Otherwise we just make
1000's of ways people can do it.

I was initially proposing allowing in-line collections to span
multiple lines to _enhance_ readability; allowing quoted or simple
scalars to span multiple lines _reduces_ readability.

Clark
|
From: Brian I. <in...@tt...> - 2002-06-18 01:52:07
|
On 17/06/02 21:02 -0400, Clark C . Evans wrote:
> | I don't see much of a problem Clark. Remember the simple principle that when
> | dealing with inlines, you first simply (conceptually) join all of the lines
> | within the current indentation level. After that you simply parse as if it
> | were all one big line.
>
> I'm more taken aback by the added complexity. It just allows
> for just a few too many ways to do it... if you have a multi-line
> string, just use folded or block, no? I also remember that we
> had some issues, especially using keys, but I can't recall
> specifics at the moment. Overall, are you sure the added
> complexity is worth it?
>
> ---
> Hi Ho, Kermit The Frog here with yet another
> Muppet News Flash: Today in the news the Cookie
> Monster has given up cookies! Let's go right
> to the source and ask the monster himself: Cookie
> Monster, how is life without Cookies? "Ohh!
> Cookies! Yum Yum : What would I do without Cookies!"

Well, I don't think this parses under the day's proposed rules, but
it brings up a showstopper:

---
foo: bar
This is a simple key but I don't
know where it stops: I guess this is the value but
I'm in violation of continuation rules because
my indentation can't get less now.
and this is my third key, but how would
the computer know that?: baz
...

> | The inline spanning multiple lines is little more than a line continuation
> | feature. Please show me some examples where this breaks down.
>
> I'm less concerned with it breaking down than with it
> adding a bunch of unneeded complexity.
>
> | Also remember that you can't use nesting indicators (? | >) within inlines.
> | So I think that you really do want to support scalars spanning lines.
>
> Why? If your in-lines are that big, you shouldn't be using
> inlines. It's pretty simple. By not allowing for multi-line
> scalars we enforce good practice; making YAML easy to read.
> Otherwise we just make 1000's of ways people can do it.
>
> I was initially proposing allowing in-line collections to
> span multiple lines to _enhance_ readability; allowing
> for quoted or simple scalars to span multiple lines
> _reduces_ readability.

OK. Back to the drawing board. How about this as a compromise:

---
- >
  In inline collections we only split lines after a separating
  comma. (In the spirit of enhancing readability.) Unquoted strings
  are limited to Neil's single word. Single and double quoted
  strings may not (by definition) span lines. This gives Clark his
  intent.
  # note: by using Brian's (implicit) notation, we can handle
  # implicits that contain spaces in the above collections.
- >
  Key strings may not span multiple lines without using the '?'
  nesting indicator. Single line variations must be quoted if they
  contain ': '. (Data::Denter had this restriction)
- >
  Value strings will be pretty much unrestricted. Like today's
  proposals. (just start with alpha) We don't have to worry about
  ': ' or ', ' in unquoted. Just '# '. Steve and Ryan get their
  DWIM use case.
- >
  The only remaining issue is escaping multiline keys. For this we
  have the double quote and the \n to keep it on one line. Rarely
  pretty, but pretty rare. This respects Oren's wish to eschew the
  '||' '>>' notation.
- >
  There. Did I forget anybody? :)

Cheers, Brian
|
From: Brian I. <in...@tt...> - 2002-06-17 22:51:14
|
On 17/06/02 17:34 -0400, Clark C . Evans wrote:

Thanks Clark for the whole scoop. I'm going to give a short form of
it for the impatient :)

Here's the way that we are currently leaning:

1) Keep unadorned implicit types for str, int, float, date, time,
   timestamp and null.

2) Unquoted strings must begin with alpha or underscore. After that
   they can contain anything except ': '. If they are part of an
   inline collection then they can't contain ', ' either. That opens
   up the door for most phrases, sentences and paragraphs.

3) Proposed parentheses for forcing implicit, rather than '! '.

   ---
   - (true)                          # !bool
   - (http://www.yaml.org)           # !uri
   - (Mon Jun 17 15:33:30 PDT 2002)  # !timestamp

4) Considering replacing boolean .true with (true)

5) Passing on proposed '>>' and '||' indicators for now.

Cheers, Brian
|
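Rule 2 above is simple enough to write down as a check. The
following is a minimal sketch in Python; the function name
is_valid_unquoted and its in_inline_collection flag are hypothetical,
and "alpha" is assumed here to mean ASCII letters only.

```python
import string


def is_valid_unquoted(text: str, in_inline_collection: bool = False) -> bool:
    """Check a candidate unquoted string against rule 2.

    Assumption: "alpha" means ASCII letters; the thread does not say
    whether non-ASCII letters would count.
    """
    # Must begin with alpha or underscore.
    if not text or text[0] not in string.ascii_letters + "_":
        return False
    # ': ' would be read as the end of a key.
    if ": " in text:
        return False
    # Inside an inline collection, ', ' separates entries.
    if in_inline_collection and ", " in text:
        return False
    return True
```

For instance, "Some would like, very much so" passes as a top-level
value but fails with in_inline_collection=True, since ', ' would end
the entry there.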
From: Brian I. <in...@tt...> - 2002-06-17 23:10:55
|
On 17/06/02 15:51 -0700, Brian Ingerson wrote:
> On 17/06/02 17:34 -0400, Clark C . Evans wrote:
> Thanks Clark for the whole scoop. I'm going to give a short form of it for
> the impatient :)

Since Clark has responded to his own posting, so will I :P

>
> Here's the way that we are currently leaning:
>
> 1) Keep unadorned implicit types for str, int, float, date, time,
>    timestamp and null.

Clark really loves implicit dates/times. This is because they are
native in SQL, and Python will soon have core support for dates.
Clark thinks they rank higher than ints and floats.

From my perspective, dates and times as types are basically unneeded
in Perl. Actually, so are ints and floats. Perl supports them, but
casts them at will, on the fly, so you never really know what's a
string and what's a number.

> 2) Unquoted strings must begin with alpha or underscore. After that they
>    can contain anything except ': '. If they are part of an inline
>    collection then they can't contain ', ' either. That opens up the
>    door for most phrases, sentences and paragraphs.

I think this is a good start, but I'd really like to have strings be
able to start with numbers too. There are two ways to do this.

---
one:
    - 42
    - 3.14
    # Not an int or float, so it's a string
    - 123 Main St
    # all dates and times are forced
    - (2001-09-11)
two:
    # Even ints and floats would need to be forced. Otherwise, STRING!
    - (42)
    - (3.14)
    - 123 Main St
    - (2001-09-11)

BTW, just realized that '# ' would not be allowed in unquoted
strings.

> 3) Proposed parentheses for forcing implicit, rather than '! '.
>
>    ---
>    - (true)                          # !bool
>    - (http://www.yaml.org)           # !uri
>    - (Mon Jun 17 15:33:30 PDT 2002)  # !timestamp

I'd like to add the future numeric implicits as well:

---
- (555-867-5309)
- (192.168.1.10)
- (19.99 USD)

> 4) Considering replacing boolean .true with (true)
>
> 5) Passing on proposed '>>' and '||' indicators for now.

Ok for now.

Cheers, Brian
|
From: Clark C . E. <cc...@cl...> - 2002-06-17 23:47:05
|
| >
| > 1) Keep unadorned implicit types for str, int, float, date, time,
| >    timestamp and null.
|
| Clark really loves implicit dates/times. This is because they are native
| in SQL, and Python will soon have core support for dates. Clark thinks
| they rank higher than ints and floats.
|
| From my perspective, dates and times as types are basically unneeded in Perl.
| Actually, so are ints and floats. Perl supports them, but casts them at will,
| on the fly, so you never really know what's a string and a number.

Yes; in general, Perl is not very strictly typed. It tends to
convert between types nicely. This isn't true for many other
languages. And implicit types make things readable!

|
| > 2) Unquoted strings must begin with alpha or underscore. After that they
| >    can contain anything except ': '. If they are part of an inline
| >    collection then they can't contain ', ' either. That opens up the
| >    door for most phrases, sentences and paragraphs.
|
| I think this is a good start, but I'd really like to have strings be able to
| start with numbers too. There are two ways to do this.

Ok. Here is a straw-man. Divide characters into 4 classes: GOOD,
BAD, NUMBER, ALPHA. Let GOOD be "'`!&().,;/_- and BAD be
|\~@#$%^*+={}[]:<>$

A simple scalar is a TEXT implicit when it does not contain BAD and
either starts with an ALPHA or starts with a NUMBER and contains an
ALPHA. So, the parser would first scan for the existence of any of
the "illegal" characters. If found, then it would run the implicit
rules.

---
- True                  # String: starts with alpha
- 42                    # Integer; does not contain ALPHA
- 42 Walker Drive       # String; starts numeric but contains ALPHA
- 2.34                  # Floating; does not contain ALPHA
- 2002-02-01            # Date; does not contain ALPHA
- Dasy's Diner          # String
- Windon-Walker         # String
- http://www.yaml.org   # Undefined (URI) contains BAD(:)
- 19.99$USD             # Undefined (Currency) contains BAD($).
- 34e+23                # Floating contains BAD(+)
- 128.2.3.34            # Undefined (Tuple) doesn't contain ALPHA

Thoughts? This gives 95% of what the TEXT needs, but provides lots
of room for implicits.

Clark
|
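The TEXT test in the straw-man can be sketched in a few lines of
Python. This covers only the "is it TEXT?" decision, not the other
implicit rules. The function name is_text_implicit is hypothetical;
ALPHA and NUMBER are assumed to be ASCII letters and digits, and
characters outside the four classes (such as the space, which the
examples clearly allow) are assumed to be permitted.

```python
import string

# BAD characters as listed in the straw-man ('$' appears twice there).
BAD = set("|\\~@#$%^*+={}[]:<>")
ALPHA = set(string.ascii_letters)  # assumption: ASCII letters only
NUMBER = set(string.digits)        # assumption: ASCII digits only


def is_text_implicit(scalar: str) -> bool:
    """Return True if the simple scalar is a TEXT implicit."""
    # An empty scalar or one containing any BAD character is not TEXT.
    if not scalar or any(ch in BAD for ch in scalar):
        return False
    # Starts with ALPHA: always TEXT.
    if scalar[0] in ALPHA:
        return True
    # Starts with NUMBER: TEXT only if it also contains an ALPHA.
    return scalar[0] in NUMBER and any(ch in ALPHA for ch in scalar)
```

Scalars for which this returns False would be handed to the
remaining implicit rules (integer, floating, date, and so on) or left
undefined, as in the list above.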
From: Brian I. <in...@tt...> - 2002-06-18 14:01:49
|
On 17/06/02 19:55 -0400, Clark C . Evans wrote:
> | > 2) Unquoted strings must begin with alpha or underscore. After that they
> | >    can contain anything except ': '. If they are part of an inline
> | >    collection then they can't contain ', ' either. That opens up the
> | >    door for most phrases, sentences and paragraphs.
> |
> | I think this is a good start, but I'd really like to have strings be able to
> | start with numbers too. There are two ways to do this.
>
> Ok. Here is a straw-man. Divide characters into 4 classes,
> GOOD, BAD, NUMBER, ALPHA. Let GOOD be "'`!&().,;/_- and BAD
> be |\~@#$%^*+={}[]:<>$
>
> A simple scalar is a TEXT implicit when it does not contain BAD
> and either starts with an ALPHA or it starts with a NUMBER
> and contains an ALPHA. So, the parser would first scan for the
> existence of any of the "illegal" characters. If found, then it
> would run the implicit rules.
>
> Thoughts? This gives 95% of what the TEXT needs, but provides
> lots of room for implicits.

There is a much simpler, clearer, and 100% guaranteed
forward-compatible proposal on the table.

---
- If it looks like an int, it's an int.
- If it looks like a float, it's a float.
- If it begins with a word-char [_A-Za-z0-9], it's a string.
- If it is quoted, it's a string.
- If it is parenthesized, it is matched for a type.
- If it is a ~, it is a null.
- Otherwise, it is a parse error.

Apply these rules in order and you've got it nailed. And it leaves
all of our special characters available for the future. Parens give
the general user clarity as to what is not a string, without needing
to be up to date on the latest YAML implicit types.

Cheers, Brian
|
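Applied in order, the rules above might look something like the
following Python sketch. The int and float patterns, the FORCED
table of types matched inside parentheses, and the choice to raise
an error when parenthesized content matches no type are all
assumptions made for illustration; the message does not specify
them.

```python
import re

# Assumed shapes for "looks like an int" and "looks like a float".
INT = re.compile(r"[-+]?[0-9]+")
FLOAT = re.compile(r"[-+]?[0-9]+\.[0-9]+")

# Hypothetical types matched only inside parentheses.
FORCED = [
    ("date", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")),
    ("bool", re.compile(r"true|false")),
]


def classify(scalar: str) -> str:
    """Apply the rules in order and return a type name."""
    if INT.fullmatch(scalar):
        return "int"
    if FLOAT.fullmatch(scalar):
        return "float"
    if re.match(r"[_A-Za-z0-9]", scalar):
        return "str"
    if len(scalar) >= 2 and scalar[0] == scalar[-1] and scalar[0] in "'\"":
        return "str"
    if len(scalar) >= 2 and scalar[0] == "(" and scalar[-1] == ")":
        inner = scalar[1:-1]
        for name, pattern in FORCED:
            if pattern.fullmatch(inner):
                return name
        # Assumption: unmatched parenthesized content is an error.
        raise ValueError(f"no type matches {scalar!r}")
    if scalar == "~":
        return "null"
    raise ValueError(f"parse error: {scalar!r}")
```

Because the int and float checks run before the word-char check,
"42" stays an int while "123 Main St" falls through to a string,
which is the ordering the rules depend on.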