From: Clark C . E. <cc...@cl...> - 2001-07-16 01:45:42
Hello Brian. Summary...

1. This is good news, nice to see progress.
2. I'll be back on the C impl by the 22nd. The interfaces are solid,
   so coding won't be too tough.
3. See the new whitespace folding proposal below. In this proposal,
   new lines and leading/trailing whitespace are not significant.
4. I'd like to add binary values and leaves back into the core model,
   since this problem is solved (see earlier posts).

Details...

| While we wait for your C engine...

Very cool.

| I really want to start playing with this stuff *now*.

Absolutely...

| - Interface consists of 2 functions: serialize() and deserialize().
| - serialize() takes a list of hashes and returns a string
| - deserialize() takes a string and returns a list of hashes

Nice. The bulk of the "C" interface is really working on the
parser/printer interface. Comments on the interface by implementers
in other languages would be very cool.

| - Support Hashes, Lists, Scalars, Objects and Undef(Null).
| - BTW, why isn't "Null" in the data model currently?

It should be added. It is in the "C" API. Also, due to the
re-formulation of the "C" API and the discovery that binary values
can be round-tripped reliably, keys should be allowed to be binary
as well!

| - Can you explain why folding is needed again. Nobody I'm
|   working with seems to get it.

0. See a compromise proposal below.

1. Folding is simple. Consecutive whitespace in the YAML text is
   condensed into a single space (0x20). Other whitespace must be
   escaped, just as $@# and other significant characters must be
   escaped.

2. For most of the business use cases, whitespace formatting is not
   significant. An extra space or two, a new line, or a tab character
   is not important. If the distinctions *are* important, then they
   should be modeled, for example, like the HTML <P>aragraph marker.
   In my opinion, HTML's success was in no small part due to its
   whitespace folding properties.

3. For readability, it is often important to keep the YAML file
   word-wrapped by column 76.
   If whitespace is significant, then this becomes problematic or
   hard to define (understand). This is important when restructuring
   texts (cut/paste).

4. I strongly object to "layering" this. If it is layered, then there
   can be different interpretations, etc. Confusing, YUCK. Ask any XML
   guru how xml:space works. You will be greeted with a sigh or, more
   likely, a blank stare. Further, YAML is different from XML in that
   whitespace has markup significance, so it is helpful if it does
   not have application significance.

| Why would you ever want "lossy" data serialization?

It is not lossy. Not by any stretch of the imagination. Lossy means
that what you put in, you don't get back out. That is not true here.
If you use the API to send "A  B", then the extra space will have to
be escaped. This isn't lossy, it is a simple rule, much like :@#$
needing to be escaped.

| Maybe folding doesn't belong in the data model.
| it could just be a standard YAML tool, like MIME
| encoding. It would certainly simplify our syntax
| considerably.

I think I'd like to build the BASE64 encoding into the core as well,
using the [BASE64] mechanism per the proposal before coloring. Given
the ability to strongly support this at the API level, I'm in favor.

How the base64 encoding affects denter:

(a) if you have a unicode string, then the input is encoded
    as unicode,
(b) if you have a regular, 8-bit clean string, then...

    1. If the binary value is valid UTF-8 (fitting the char
       production), then the value can be saved as unicode.
    2. Otherwise, the value is Base64 encoded.

When loading, the non-base64 strings can be read into the unicode
scalar value. Otherwise, they should be loaded into the 8-bit clean
string. Regardless, from what I understand about the perl binding,
the unicode string is in UTF-8, so this can be treated as a binary
if required without data corruption.

...
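The save/load rule for binary values described above could be sketched roughly as follows. This is only an illustration in Python, not the "C" API or YAML.pm: the function names encode_scalar and decode_scalar and the "unicode"/"base64" tags are made up for the example, and the check against the char production is reduced to "decodes as UTF-8".

```python
import base64


def encode_scalar(value):
    """Pick a representation for a text (str) or 8-bit clean (bytes) value.

    Returns a (kind, text) pair; both the pair and the kind names are
    hypothetical and only serve this sketch.
    """
    if isinstance(value, str):
        # (a) a unicode string is saved as unicode
        return ("unicode", value)
    try:
        # (b) 1. valid UTF-8 can be saved as unicode
        # (assumption: "fits the char production" is not checked here)
        return ("unicode", value.decode("utf-8"))
    except UnicodeDecodeError:
        # (b) 2. otherwise the value is Base64 encoded
        return ("base64", base64.b64encode(value).decode("ascii"))


def decode_scalar(kind, text):
    """Load a value back: base64 becomes bytes, everything else stays text."""
    if kind == "base64":
        return base64.b64decode(text)
    return text
```

Note that a byte string that happens to be valid UTF-8 comes back as text in this sketch, so a caller who wants the original bytes has to re-encode it as UTF-8.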
On the other platforms, there will be a function asBinary() that,
when applied to a unicode string, will re-cast the string as its
UTF-8 encoding so that the binary value isn't corrupted. It will work
just fine... and I'd like to have binary values built into the core.

| - Changed the syntax for block mode. See below.

Very cool.

| - Don't support references yet.
| - Just to make it simple for everyone to implement.
|
| - Detect and terminate on circular (but not duplicate) refs.
| - Duplicate refs are merely serialized multiple times right now.

Ok.

| - Support "#classname" syntax for marking data as objects.
|
| - Use doublequote to remove ambiguity from certain "special" single
|   line values.

Ok.

| - Doublequotes do not fold whitespace in this impl.

Ok. Let us compromise here (with one I've been thinking about for a
while now). Within a double quoted string value, leading and trailing
whitespace is folded into a single space unless there is a trailing \

one: "This is a multi line scalar. Where these two spaces are preserved, but new lines are not preserved."
two: "If you want a new line.\n It must be explicitly given. Further if the line ends with a slash, it is contin\
ued without the intermediate space."

Good compromise? The primary thing this mechanism gives you is the
ability to re-format as...

one: "This is a multi line scalar. \
Where these two spaces are preserved, but new lines are not preserved."
two: "If you want a new line,\n It must be explicitly given. Further if the line ends with a slash, it is continued without the intermediate space."

| foo : %
| bar : @
| <<4
| The "4" above is the number of lines in the block.
| It is basically an emitter comment,
| not part of the
| data model.
| >>

Ok. In this case, is there a significant carriage return before
"The"? And, as we talked, this block does not have any lines with
leading whitespace?
| <<2-
| The minus sign above indicates that
| Line #2 doesn't have trailing newline
| >>

This line does have leading whitespace?

| baz : <<B5
| This text contains lines that
| might otherwise confuse the parser
| >>
| A>>
| See what I mean?
| B>>

Ok. But if you do this, I'd make it consistent and have it end with
B5>>, which would mean changing the above examples.

| <<2 : something
| This syntax could easily be used
| for multi-line keys
| >>
| <<2 : <<2
| a
| key
| >>
| and a
| value
| >>

Syntax error detected at the "and a", right?

| Clark : <<3
| You should note that
| this syntax does not suffer from the indentation
| ambiguity that you worried about on the phone :)
| >>

If we used the proposed whitespace handling rule above, then this is
equivalent to...

Clark : "You should note that\n this syntax does not suffer from the indentation\n ambiguity that you worried about on the phone ;)\n"

Is the compromise... ok? At least it allows me to re-format the
block so that it can be properly word-wrapped again if the map entry
above has to be pushed in another 20 columns, so that a 76 column
margin can be maintained (with exceptions of course).

| Oren: <<3
| One huge reason for using this syntax
| Is that it makes it easy to quote a YAML document
| as a single string in another YAML doc
| >>

1. I think this is nice, although I think whatever characters
   immediately follow the << should be used to match the EOS
   marker... <<EOS ..... EOS>> or something like that.

2. How do you handle a significant carriage return at the beginning
   of the block?

3. I very much like whitespace folding rules not being in effect
   for this mechanism. However, it is important for there to be a
   mechanism where whitespace folding isn't a problem.

So glad to see YAML moving forward by other parties!

Best,

Clark

----- End forwarded message -----
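For implementers, the double-quote compromise proposed in this message might be sketched like this in Python. The function name fold_quoted is made up, and several details are assumptions rather than anything settled on the list: whitespace just before a trailing backslash is kept, leading whitespace on the continuation line is dropped, a single-line value is left untouched, and \n is the only escape handled.

```python
def fold_quoted(raw):
    """Fold the text found between the double quotes of a quoted scalar.

    Hypothetical helper. Each line break, together with the whitespace
    around it, becomes one space, unless the line ends with a backslash,
    in which case the lines are joined with nothing in between.
    """
    lines = raw.split("\n")
    pieces = []
    for i, line in enumerate(lines):
        if i > 0:
            # whitespace that merely indents a continuation line is dropped
            line = line.lstrip(" \t")
        if i == len(lines) - 1:
            # text before the closing quote is kept exactly as written
            pieces.append(line)
        elif line.endswith("\\"):
            # trailing backslash: continue without an intermediate space
            pieces.append(line[:-1])
        else:
            pieces.append(line.rstrip(" \t") + " ")
    # assumption: only the \n escape is processed in this sketch
    return "".join(pieces).replace("\\n", "\n")
```

With this rule a long value can be re-wrapped at any column without changing what the application sees, which is the point of the compromise.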
From: Clark C . E. <cc...@cl...> - 2001-07-16 02:52:05
(advance apologies for the cc list... are you all on the mailing list?)

Everyone,

Thank you for your offer to implement YAML. Very cool. The C impl
will be forthcoming. It may seem overly complex if you look at it
now, but it is primarily developed for a filter architecture, where
reading/writing are just the end-points.

| Clark : <<3
| You should note that
| this syntax does not suffer from the indentation
| ambiguity that you worried about on the phone :)
| >>

My problem is the size of the read-ahead needed before the parser
can determine the column in which the data starts. To be consistent
(readable), it should start at least one column past the start of
the parent. And if you let this starting column vary, then you have
to read the entire scalar into memory before you can determine the
starting column.

From what I gather from your examples, you are proposing that the
starting column is the starting column of the parent? This does
solve the problem ... but I think it comes at a cost of readability?

I had a thought. Instead of the number describing how long the block
is (which may be subject to change if someone edits it...) why not
have the number describe how many spaces the block is indented from
its parent?

One: <<3
   This block is indented three spaces
   past the start of the parent. It
   can go for as many lines as needed,
   and now the parser only needs a buffer
   as large as the first line to proceed!
Two: <<1
   This may look similar, but each line
   here has two additional leading spaces!

What is cool about this proposal is that you don't need a terminator,
since the block terminates when the indentation goes under the
indicated column (parent start column + the number given).

What do you think? No need for fancy <<B stuff. In effect, it is an
invisible | line...

Clark
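A reader for the indentation-counted block proposed in this message could look roughly like the Python sketch below. The name read_block and its signature are made up for illustration, and it assumes the input is already split into lines, indentation uses spaces only, and a blank line ends the block.

```python
def read_block(lines, start, parent_col, n):
    """Collect a block scalar introduced by "<<n" on the parent's line.

    lines      -- the document, already split into lines
    start      -- index of the first line after the "key: <<n" line
    parent_col -- column at which the parent key starts
    n          -- the number given after "<<"

    Returns (text, index_of_first_line_after_the_block).
    """
    col = parent_col + n
    body = []
    i = start
    while i < len(lines):
        line = lines[i]
        indent = len(line) - len(line.lstrip(" "))
        if indent < col:
            # indentation went under the indicated column: block is over
            break
        # everything from the indicated column onwards is content,
        # including any additional leading spaces
        body.append(line[col:])
        i += 1
    return "\n".join(body), i
```

Because each line is judged on its own, the reader never needs to buffer more than the current line, which matches the read-ahead argument made above.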
From: Clark C . E. <cc...@cl...> - 2001-07-16 20:23:20
On Mon, Jul 16, 2001 at 10:53:28AM -0700, Brian Ingerson wrote:
| > One: <<3
| >    This block is indented three spaces
| >    past the start of the parent. It
| >    can go for as many lines as needed,
| >    and now the parser only needs a buffer
| >    as large as the first line to proceed!
| > Two: <<1
| >    This may look similar, but each line
| >    here has two additional leading spaces!
| >
| > What is cool about this proposal is that you don't
| > need a terminator since the block terminates when the
| > indentation goes under the indicated column (parent
| > start column + the number given).
| >
| > What do you think? No need for fancy <<B stuff.
| > in effect, it is an invisible | line...
|
| Oh. And how do you do lists?

List: @
  <<3
     This does not have leading whitespace.
  <<1
     This one has two leading spaces.

;) Clark
From: Neil K. <neilk@ActiveState.com> - 2001-07-16 22:03:04
Hi there. At Brian's request, I translated YAML.pm v0.11 to
Python. yaml.py does basic serialization and deserialization,
although it's still quite rough.

BTW, I am not the other Neil, although we both work at ActiveState.
You can use Neil W. / Neil K. to disambiguate.

I have skimmed the yaml-core mailing list. It seems that people
have different expectations of YAML, so let me state mine.

I am not interested in a universal data serialization format,
with enough features to be the superset of XML, RFC822,
jar and shar.

I think YAML is only interesting insofar as it is
simple to use, readable, and expressive.

Real world goals: I ought to be able to ask someone who knows
basic word processing to make a new YAML document from scratch,
just by showing them a similar YAML document. A raw printout of
a YAML document ought to be usable as an invoice, or a shipping
address and manifest.

--
Neil Kandalgaonkar, ActiveState
ASPN - ActiveState Programmer Network
http://ASPN.ActiveState.com/
From: Clark C . E. <cc...@cl...> - 2001-07-17 04:53:16
| Hi there. At Brian's request, I translated YAML.pm v0.11 to
| Python. yaml.py does basic serialization and deserialization,
| although it's still quite rough.

Neat.

| I have skimmed the yaml-core mailing list. It seems that people
| have different expectations of YAML, so let me state mine.
|
| I am not interested in a universal data serialization format,
| with enough features to be the superset of XML, RFC822,
| jar and shar.

This is a good point. My domain is largely business documents
(forms) which are usually constructed/saved to a relational
database. So, this is my primary concern. This includes two
sub-goals: (a) the file format should encourage forward-compatible
processes to enable more dynamic change, and (b) it should be
suitable for messaging.

My secondary concern is to have a filter architecture for processing
these documents sequentially rather than requiring random access.
I may put too much focus here, but in many of the problems that I've
encountered, this is something that I feel is still lacking from the
industry. In particular, business document filtering and routing.

My third concern is a solid information model that blends nicely
with various processing styles. In particular, it should have a low
"impedance mismatch" with native programming constructs, and ideally
with relational databases. The API should work equally well under
random access and under sequential-access limitations.

My fourth concern is readability. I sincerely hope that the result
is more readable than the other options, but I must admit... it
isn't my primary concern, although I do consider myself to rate
aesthetics highly.

...

That being said, I'm not looking for a "universal pattern" that will
be the key to the universe. My goals are driven by experiences in
the past, and by attempts to make my life less painful when dealing
with complex information processing requirements.
| I think YAML is only interesting insofar as it is
| simple to use, readable, and expressive.

Definitely on the same page here. There is a 90/10 rule in place
here. And perhaps sometimes I'm in violation... especially when it
comes to data typing.

| Real world goals: I ought to be able to ask someone who knows
| basic word processing to make a new YAML document from scratch,
| just by showing them a similar YAML document. A raw printout of
| a YAML document ought to be usable as an invoice, or a shipping
| address and manifest.

You'll be glad to hear that a friend of mine who is not a computer
junkie (does not know how to use Word all that well) was able to
take a few simple business documents as examples and, with a
plain-old text editor, produce other examples that were valid YAML.
At first he didn't quite grok the difference between map and list...
but after that concept was understood, he had no further problems.
I had him try the same trick with XML and he mangled the job very
badly... *evil grin*

I'm not doing YAML in a vacuum. I have a real-world product that I'm
developing and YAML is part of it. Right now I have domain-specific
parsers which read in the YAML for me. They only took 2 hours to
write and only take in a limited subset... but they do work very
well. And the software I use converses via YAML documents between
the various processes.

Best,

Clark
From: Jason D. <ja...@in...> - 2001-07-17 05:58:56
Hi, Neil.

> | I am not interested in a universal data serialization format,
> | with enough features to be the superset of XML, RFC822,
> | jar and shar.

I think we're on the same page. Do you think YAML as it stands today
meets these requirements?

> This is a good point. My domain is largely business
> documents (forms) which are usually constructed/saved
> to a relational database. So, this is my primary concern.
> This includes two sub-goals: (a) the file format should
> encourage forward-compatible processes to enable more
> dynamic change, and (b) it should be suitable for messaging.

Clark, you don't think that YAML can already do this? I think that
you did a fantastic job and have already met these goals!

> My secondary concern is to have a filter architecture
> for processing these documents sequentially rather than
> requiring random access. I may put too much focus here,
> but in many of the problems that I've encountered, this
> is something that I feel is still lacking from the industry.
> In particular, business document filtering and routing.

This, to me, is outside of the domain of YAML as a markup language.
I don't think that it's possible to come up with a processing model
that will satisfy everyone.

> My third concern is a solid information model that blends
> nicely with various processing styles. In particular, it
> should have a low "impedance mismatch" with native
> programming constructs, and ideally with relational
> databases. The API should work equally well for random
> access as well as sequential access limitations.

Again, I think that you've done a great job so far. It's not perfect
but nothing can be. By keeping it as simple as possible, though, it
will be easier to adapt YAML to new and unforeseen applications.

> | I think YAML is only interesting insofar as it is
> | simple to use, readable, and expressive.
>
> Definitely on the same page here. There is a 90/10
> rule in place here. And perhaps sometimes I'm in
> violation...
> especially when it comes to data typing.

I think that I'm in the minority here but I believe that YAML is
currently already flexible enough to express just about any concept
while still being simple enough for most people to grasp and work
with. Every time I read a message about color or typing or
formatting I get worried that YAML will get bogged down in
complexity.

Don't get me wrong--I think that the concept of color is great and
should be used but I don't believe that it needs to be mandated by
any spec. And I also don't believe that any more semantics need to
be layered onto YAML other than the simple scalars, lists, and maps
that it currently has.

And while I'm at it, I no longer believe that the API should be part
of the spec, either. Let's be realistic--one API will not fit
everybody's needs.

I'm not saying that I believe the spec to be perfect as it is but
it's damn close. I just wish that I could get out of XML/RDF land
for a little while to work more on YAML.

Jason.
From: Clark C . E. <cc...@cl...> - 2001-07-16 03:36:34
On Sun, Jul 15, 2001 at 02:34:43PM -0700, Brian Ingerson wrote:
| While we wait for your C engine, I have gone ahead and
| done a very lightweight Perl implementation of YAML.pm.

Thank you for holding off... your feedback is very helpful, and any
comments on the "C" API would be very welcome. In particular, the
"C" API is designed for a filter architecture, where loading/saving
is just the end points. The serialize() and deserialize() functions
should be fairly easy to layer on top, but it would be cool if you
could look at the API to verify that it would layer efficiently and
provide some feedback. By the 23rd I'm going full-time on
implementing.

A few more comments...

| - Detect and terminate on circular (but not duplicate) refs.
| - Duplicate refs are merely serialized multiple times right now.

I'm actually thinking of outlawing circular refs, no? They make a
bunch of the traversal algorithms far more complicated, since cycle
detection would have to become a standard part of every stage in a
processing pipeline.

| - Support "#classname" syntax for marking data as objects.

Two requests:

a) Can you change # to !, as the pound sign usually means comment,
   not class?
b) Can you limit the class mechanism only to maps and scalars?

Why not lists? First, let's explain why it works for a map. Second,
I'll explain why it works for a scalar. And then I'll explain how it
fails for a list...

It works for a map since one can always add a key/value pair and not
break a client. Thus,

  key: @
    one: Value

can (under most circumstances) be replaced with

  key: @
    one: Value
    !: Class

without breakage of clients. And, as we had suggested, we'd like the
following to be a shorthand for the above:

  key: !Class @
    one: Value

It works for a scalar in the YAML API due to the scalar value
acquisition rules. If a client is expecting a scalar value and
encounters a map, then the YAML parser interface will give it the
scalar value of the item having the key "=".
Thus, if one is expecting:

  key: Value

then they can be given...

  key: @
    =: Value
    !: Class

without noticing. Thus, one can "color" a scalar value without a
problem. And as we talked, Oren and I would like the following to be
a shorthand for the above construct.

  key: !Class Value

Thus, in this case, any process expecting a scalar will completely
ignore the class (as a color). This is our ideal behavior, and works
with comments (#) as well.

So... to answer the question, "Why not list?" The answer is that
there is not an easy way to convert a list into a map or a map into
a list.... that's all. Thus, coloring lists is not a possibility.
And since we'd like the class attribution to be a color, we can't
allow classes to be used on lists...

Hope this makes sense.

Best,

Clark
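The scalar value acquisition rule described in this message can be illustrated with a few lines of Python. This is only a sketch over plain dicts and strings, not the "C" API: the function names scalar_value and class_of are made up, and a parsed map is assumed to be a dict whose "=" and "!" keys carry the value and the class.

```python
def scalar_value(node):
    """Return the scalar a client asked for, ignoring any color.

    A plain scalar is returned as is. A map is asked for the value
    stored under its "=" key (hypothetical in-memory form of a colored
    scalar).
    """
    if isinstance(node, dict):
        if "=" not in node:
            raise TypeError("map has no '=' entry, so no scalar value")
        return node["="]
    return node


def class_of(node):
    """Return the class (the "!" color) of a node, or None if uncolored."""
    if isinstance(node, dict):
        return node.get("!")
    return None
```

A client that only calls scalar_value() sees "Value" for both the plain and the colored form, which is why the class can be added without breaking it.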
From: Clark C . E. <cc...@cl...> - 2001-07-16 20:50:53
On Mon, Jul 16, 2001 at 11:07:35AM -0700, Brian Ingerson wrote:
| Don't you mean '%'?

Yep, my bad.

| You can always represent a list object using color though:
|
| key : %
|   = : @
|     foo
|     bar
|   ! : Class

Oh yes. Thanks for pointing that out. I guess the limitation is more
or less a problem with the current "C" API... I should list this as
a bug and swat it.

Brian, thank you for your consideration on the other block proposal.
A particular difficulty with the indentation proposal put forth is
that it does not work well with our starting production (a sequence
of maps separated by a blank line).

one: Value
two: <<2
  This is the value which
  doesn't have leading whitespace.

three: Is this a new entry in the first map, or the
  next map in the sequence. I'm not sure, but I think
  we should probably fix the starting production
  (I never liked the extra space causing so much
  difference anyway) rather than changing the block
  indentation proposal.
four: Also... if this was taken, the << looks too
  much like the <<EOS thingy, so perhaps we should
  find another marker...
five: +3
   This is a block scalar, indented three
   spaces using the plus sign rather than <<
   so as to not imply an end marker like >>

Thoughts?

Clark
From: Clark C . E. <cc...@cl...> - 2001-07-16 08:31:27
On Mon, Jul 16, 2001 at 01:01:52AM -0700, Brian Ingerson wrote:
|
| The data starts on the next line. So there is no leading newline.
|

Right. It took me a spell to figure it out.

| I'm thinking of dropping the number thing altogether.

+1

| > | <<2 : something
| > | This syntax could easily be used
| > | for multi-line keys
| > | >>
| > | <<2 : <<2
| > | a
| > | key
| > | >>
| > | and a
| > | value
| > | >>
| >
| > Syntax error detected at the "and a", right?
|
| Nope. This is a multiline key followed by a multiline value. (Much the
| same way that Perl handles multiple here-docs.)

My internal parser is not properly configured. I think I'd rather
see indentation used to mark off the entire block. And I'm not sure
I like the idea of the terminator (see my counter proposal).

| > Is the compromise... ok? At least it allows me to
| > re-format the block so that it can be properly
| > word-wrapped again if the map entry above has to
| > be pushed in another 20 columns, so that a 76
| > column margin can be maintained (with exceptions
| > of course).
|
| OK. Here is our major difference in thinking. In Perl (and most other
| *scripting* languages) I'm going to want to serialize some large
| structure with a single command. All I care about is that it round-trips
| and that it's easy to read serialized. I'm not going to want to bother
| telling YAML to serialize this string one way and that string another. I
| want YAML to figure that out for me.

Exactly. We do not have a difference of opinion here. I think what
we consider as our average leaf is different.

| So while we can support the double quoted thingy above, I don't see
| it being used very often in the general case. So why go through all
| the trouble?

My general case must be very different from yours. In my general
case, long lines (254 on average) are stored without any carriage
returns. And when a carriage return is used, it is to indicate a new
paragraph.
The block format would force a ton of lines that scroll well past
the right margin. This isn't good. It's probably about equivalent to
escaping leading whitespace! ;)

| I'm leaning towards these emission defaults:

I'd like something a bit more like...

    space = 76 - current_depth
    if len(scalar) <= space:
        if is-simple(scalar):
            print simple(scalar)
        else:
            print quoted(scalar)
    else if max-line-length(scalar) <= space
            or lots-of-escaped-stuff(scalar):
        print block(scalar)
    else:
        print quoted(scalar)

| > | Oren: <<3
| > | One huge reason for using this syntax
| > | Is that it makes it easy to quote a YAML document
| > | as a single string in another YAML doc
| > | >>
| >
| > 1. I think this is nice, although I think whatever
| >    characters immediately follow the <<
| >    should be used to match the EOS marker...
| >
| >    <<EOS ..... EOS>> or something like that.
|
| I think the general case should be to use no marker if
| possible. I found the markers hard to read in large YAML docs.

Right.

|
| > 3. I very much like whitespace folding rules not
| >    being in effect for this mechanism. However,
| >    it is important for there to be a mechanism
| >    where whitespace folding isn't a problem.
|
| OK. For now let's go with your compromise, but make the default emission
| behavior be verbatim text using the << >> syntax for multi-lines.

Please consider otherwise. My major use cases have very long lines
that I need to word-wrap for readability.

| Also let's make leading and trailing whitespace significant if it comes
| between text and the double quote. So I can do the following:
|
| foo1 : " "
| foo2 : " qty cost total "

+1

| That way I can do nice clean single line values with
| significant whitespace.

Perfect. Only when you do a multi-line quoted value are the new line
and trailing/leading whitespace skipped.

Best,

Clark
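For anyone who wants to experiment with the emission heuristic suggested earlier in this message, here is one way it might be written in Python. The helper definitions are guesses: the thread does not say what counts as "simple" or as "lots of escaped stuff", so the pattern in SIMPLE and the 10% threshold below are placeholders, and the function returns a style name instead of printing.

```python
import re

# Placeholder notion of a "simple" scalar: words made of safe
# characters, separated by single spaces. The real rule is not given.
SIMPLE = re.compile(r"^[A-Za-z0-9_.,/-]+( [A-Za-z0-9_.,/-]+)*$")


def is_simple(scalar):
    return bool(SIMPLE.match(scalar))


def max_line_length(scalar):
    return max(len(line) for line in scalar.split("\n"))


def lots_of_escaped_stuff(scalar, threshold=0.1):
    # Placeholder: more than 10% of the characters would need escaping
    # inside a double quoted value.
    if not scalar:
        return False
    escaped = sum(1 for ch in scalar if ch in '\n\t"\\')
    return escaped / len(scalar) > threshold


def choose_style(scalar, current_depth, width=76):
    """Return "simple", "quoted" or "block" for one scalar."""
    space = width - current_depth
    if len(scalar) <= space:
        return "simple" if is_simple(scalar) else "quoted"
    if max_line_length(scalar) <= space or lots_of_escaped_stuff(scalar):
        return "block"
    return "quoted"
```

Under these placeholder rules, a long paragraph stored without carriage returns comes out as a quoted (and therefore re-wrappable) value, while text that already consists of short lines comes out as a block.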