From: Clark C. E. <cc...@cl...> - 2004-12-12 03:37:50
Hello all. In the last few months I've been working heavily with SQL results, CSV files, and other tabular data sets. I'd like YAML 1.2 (in another year or so) to address the succinct human presentation of these structures. This is just a brainstorm post to get any interested parties thinking about this problem, to see if we can find early consensus (or identify issues).

Previously (about a year ago?) Brian had proposed using a pipe-delimited format built on the literal scalar indicator, like...

  ---
  | name         | hr | ave
  |--------------+----+------
  | Mark McGwire | 65 | 0.278
  | Sammy Sosa   | 63 | 0.288
  ...

This particular structure would then be equivalent to example 2.4 in the specification,

  ---
  -
    name: Mark McGwire
    hr: 65
    avg: 0.278
  -
    name: Sammy Sosa
    hr: 63
    avg: 0.288
  ...

A few details about this I see as important:

 - This proposes an alternative, more succinct style for compactly presenting a sequence-of-mappings. It does not introduce a new fundamental "kind" or any such modification to the information model. It's syntax sugar.

 - The pipe delimiter | can be used by Excel as a field separator for loading; the leading | is used to create a column that holds the indentation (which is discarded). Excel compatibility is (perhaps unfortunately) very important.

 - Within a given "cell", normal flow structure can occur. Thus quotes can be used to escape characters (especially the | indicator), and aliases can occur as needed. I'm not sure how to anchor a given row (anchoring a cell is probably not needed), perhaps:

  ---
        | name         | hr | ave  | admires
        |--------------+----+------+--------
  &mark | Mark McGwire | 65 | 0.278|
        | Sammy Sosa   | 63 | 0.288| *mark
  ...

   Eww! That's ugly, but I need to be able to anchor a given row... I just don't know how it should look.

 - An optional "divider" line separates the mapping keys from the data values. This can be extended later on to provide nested groups (I call them "facets"), perhaps something like:

  ---
  | player       | statistic
  |              | hr | ave
  |--------------+----+------
  | Mark McGwire | 65 | 0.278
  | Sammy Sosa   | 63 | 0.288
  ...

   being mapped to...

  ---
  -
    player: Mark McGwire
    statistic:
      hr: 65
      avg: 0.278
  -
    player: Sammy Sosa
    statistic:
      hr: 63
      avg: 0.288
  ...

   This is extremely common in my data sets, and the presentation is straightforward.

Anyway... any thoughts? I suppose this would violate YAML 1.1 productions since it uses the | character in a very odd way. I've tried other characters though, and they don't work as nicely. Do we have any clean migration path?

Kind Regards,

Clark
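As an illustration of the mapping described in the post above, here is a minimal Python sketch that reads the pipe-delimited table and produces a sequence of mappings. The function name `parse_pipe_table` is hypothetical, and the sketch assumes the simplest reading of the proposal: no quoted cells, no anchors or aliases, no nested "facet" headers, and every value kept as a plain string. Keys are taken verbatim from the header row.

```python
def parse_pipe_table(text):
    """Turn a pipe-delimited table into a list of dicts (a sequence of mappings)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue                      # skip '---', '...' and blank lines
        if set(line) <= set("|-+"):
            continue                      # skip the optional divider line
        # Drop the leading pipe, then split the remaining cells on '|'.
        rows.append([cell.strip() for cell in line[1:].split("|")])
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


table = """\
---
| name         | hr | ave
|--------------+----+------
| Mark McGwire | 65 | 0.278
| Sammy Sosa   | 63 | 0.288
...
"""
records = parse_pipe_table(table)
```

A real implementation would also have to decide how a `|` inside a quoted cell is handled and how scalars such as `65` are typed; neither is settled in the post.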
From: Oren Ben-K. <or...@be...> - 2004-12-12 08:01:58
On Sunday 12 December 2004 05:37, Clark C. Evans wrote: > Previously (about a year ago?) Brian had proposed using a > pipe-delimited format the literal scalar indicator like... > > --- > | name | hr | ave > |--------------+----+------ > | Mark McGwire | 65 | 0.278 > | Sammy Sosa | 63 | 0.288 > ... This can be written as: --- - { name: Mark McGwire, hr: 65, ave: 0.278 } - { name: Sammy Sosa, hr: 63, ave: 0.288 } ... Which is not as compact but is arguably more readable. It isn't Compatible with Excel, of course, but I seriously doubt you can achieve compatibility anyway: > - Within a given "cell", normal flow structure can occur. > Thus quotes can be used to escape characters, (especially > the | indicator), and aliases can occur as needed. I'm not > sure how to anchor a given row (anchoring a cell is probably > not needed), perhaps: All this seems incompatible with Excel. It would choke on nested {} or [], insist that each cell be contained in a single line, will misinterpret \t in either '\t' or "\t" (or both), will consider '&anchor' before the row as content (or an error), and so on. In contrast, it is trivial to write a csv2yaml and yaml2csv that will take _any_ CSV file and convert it to YAML, and _any_ YAML file containing a sequence of same-structure mappings and convert it to a CSV. A yaml2csv/csv2yaml could do tricks like having "anchor" columns in the csv, allowing for row (and even cell) anchors. It could extract just a piece of the YAML to convert into the csv (or embed the CSV into a certain location inside a YAMNL file). And so on. > Anyway... any thoughts? I suppose this would violate YAML 1.1 > productions since it uses the | character in a very odd way. I've > tried other character though, and they don't work as nicely. > Do we have any clean migration path? 
If for some reason we decided CSVs are needed, the right way to support them would be using a CSV style (I'm using @ because its a reserved indicator; this allows specifying the separator character) --- csv: @, name,hr,ave Mark McGwire,65,0.278 Sammy Sosa,63,0.288 The details of the CSV format would be _exactly_ the way CSVs behave in practice in spreadsheets (this requires a bit of investigation of course). Of course, you won't get any anchors, or tags, and escaping would be very limited. It seems like this won't be good enough for your use case (having row anchors). Well, that's CSV for you :-) I think all this is uncalled for. It is rather un-yaml-ish. It brings to mind the "here document" flow style (I'm using ` here, again just because it is reserved): --- { foo: `EOF how much indentation here? EOF , bar: baz } --- foo: bar: baz: `EOF what about indentation here? EOF ... In both cases you have the same issues: either you indent the content, so it isn't really a CSV/"here document", or you don't indent it - breaking the yaml block structure, having problems with content starting with '%', '---' or '...', and so on. In both cases we prefered simply not to support the problematic style. FWIW, I think a "here document" style is much more useful than a CSV style: it could serve as the equivalent of the literal style for people who prefer flow collection styles. Also, allowing `"EOF" vs. `EOF would enable using escape sequences in a "here document", making it truly powerful... Still, it just doesn't mesh well with YAML indentation. Have fun, Oren Ben-Kiki |
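To make the csv2yaml idea from this message concrete, the following Python sketch converts CSV text into a YAML sequence of flow mappings, one row per line. It is only an assumption about what such a tool might do: the function name `csv_to_yaml` and its `indent` parameter are hypothetical, and values are emitted as-is, so any cell containing YAML indicators (commas, colons, braces, quotes) would still need quoting.

```python
import csv
import io


def csv_to_yaml(csv_text, indent=0):
    """Render CSV text as a YAML sequence of flow mappings, one row per line."""
    reader = csv.DictReader(io.StringIO(csv_text))   # first row supplies the keys
    pad = " " * indent
    lines = []
    for row in reader:
        pairs = ", ".join(f"{key}: {value}" for key, value in row.items())
        lines.append(f"{pad}- {{ {pairs} }}")
    return "\n".join(lines)


sample = "name,hr,ave\nMark McGwire,65,0.278\nSammy Sosa,63,0.288\n"
yaml_text = csv_to_yaml(sample)
```

The reverse direction (yaml2csv) needs a YAML loader and a check that every mapping has the same keys, so it is left out of this sketch.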
From: Clark C. E. <cc...@cl...> - 2004-12-13 06:38:03
Oren, Thanks for humoring the discussion. I guess there are two particular needs that I'm mixing/addressing: 1. A very nice "presentation" tabular form (that would not be great for editing since columns wouldn't line up). 2. The ability to directly include delimited tables, that are indented when put into YAML, this is helpful since a bulk of the data files used in a biomedical lab (where I'm working) are CSV files. I think this applies to all sorts of other fields... tables are just so common. Your take on these items is that we need special import/export functions, such as: yaml2csv <yamlfile> <ypath> [ -o <csvfile> ] Takes a YAML file, plus a ypath expression and extracts the "leaf" (which is a sequence of mappings) and formats it as a CSV file to stdout (or optionally an output file). csv2yaml <yamlfile> <ypath> [ -i <csvfile> ] Takes a CSV file (or stdin) and injects it into a YAML file as a sequence of mappings at the location specified by the ypath expression. Not a bad idea, but let me play a bit more... On Sun, Dec 12, 2004 at 10:01:51AM +0200, Oren Ben-Kiki wrote: | > --- | > | name | hr | ave | > |--------------+----+------ | > | Mark McGwire | 65 | 0.278 | > | Sammy Sosa | 63 | 0.288 | > ... | | --- | - { name: Mark McGwire, hr: 65, ave: 0.278 } | - { name: Sammy Sosa, hr: 63, ave: 0.288 } | ... | | Which is not as compact but is arguably more readable. Not really. Here is a more real-life example (but in reality it has lots and lots of columns). --- @| | Dysmorphology | Physical Examination | Acondroplasia | Marfan Syndrome | ... | Height | Weight | ... | | | | | | | Not Evaluated | Positive | ... | 63 | 130 | ... ... thousands of rows ... | Positive | Not Evaluated | ... | 42 | 80 | ... ... This just _isn't_ nice to read when you put it in current YAML styles. | It isn't Compatible with Excel Well, 99% of the time, the import is easy of this table... you simply specify that | is the delimiter. 
In the domain I'm working in, | never appears in the real data. Indentation isn't an issue since Excel loads it into the first column, which can be easily ignored. That said... it requires an extra step through the import Wizard -- so you are correct. | > - Within a given "cell", normal flow structure can occur. | > Thus quotes can be used to escape characters, (especially | > the | indicator), and aliases can occur as needed. | | All this seems incompatible with Excel. It would choke on nested {} or | [], insist that each cell be contained in a single line, will | misinterpret \t in either '\t' or "\t" (or both), will consider | '&anchor' before the row as content (or an error), and so on. And that's just OK, these are rare cases, and it's easy to explain to a lab technician that *xxx in a cell represents an reference to another object (such as mother or father). | If for some reason we decided CSVs are needed, the right way to support | them would be using a CSV style (I'm using @ because its a reserved | indicator; this allows specifying the separator character) | | --- | csv: @, | name,hr,ave | Mark McGwire,65,0.278 | Sammy Sosa,63,0.288 Right. This perhaps is the better approach. Use @ to signify "tabular data" followed by the delimiter (or perhaps no character to signify a 'here' document). --- @, , Dysmorphology , Physical Examination , Acondroplasia , Marfan Syndrome , ... , Height , Weight , ... , , , ... , , , ... , Not Evaluated , Positive , ... , 63 , 130 , ... ... thousands of rows ... , Positive , Not Evaluated , ... , 42 , 80 , ... ... I like it. | The details of the CSV format would be _exactly_ the way CSVs behave in | practice in spreadsheets (this requires a bit of investigation of | course). Yes, fortunately, we are somewhat close already. If we allowed for double "" to be the same as a single " it would be even closer. Is the following currently valid YAML? 
key: "one "" two" One interpretation of this is two strings, that are automatically concatinated. If this is the current interpretation, then I guess this potential compatibility is dead-on-arrival. | Of course, you won't get any anchors, or tags, and escaping | would be very limited. It seems like this won't be good enough for your | use case (having row anchors). Well, that's CSV for you :-) Its not that CSV (or Excel) has to understand these items, as much as they just preserve them intact. By using the very first column for row-level indicators and leading whitespace this could potentially work well. It's easy in Excel to ignore the first column. | I think all this is uncalled for. It is rather un-yaml-ish. Now _this_ is the best argument against it! ;) | It brings to | mind the "here document" flow style (I'm using ` here, again just | because it is reserved): Right. | --- | { foo: `EOF | how much | indentation | here? | EOF | , bar: baz | } | --- | foo: | bar: | baz: `EOF | what about | indentation | here? | EOF | ... The latter is similar to our block scalar, and that's why we don't need a HERE document. This issue is different, unlike a HERE document, there isn't a good readable alternative in current YAML for writing large tabular data sets (modeled as a sequence of mappings). | In both cases you have the same issues: either you indent the content, | so it isn't really a CSV/"here document", or you don't indent it - | breaking the yaml block structure, having problems with content | starting with '%', '---' or '...', and so on. Right. | In both cases we prefered simply not to support the problematic style. | FWIW, I think a "here document" style is much more useful than a CSV | style: it could serve as the equivalent of the literal style for people | who prefer flow collection styles. Also, allowing `"EOF" vs. `EOF would | enable using escape sequences in a "here document", making it truly | powerful... 
Still, it just doesn't mesh well with YAML indentation. Right -- but once again, I'm not proposing to break indentation. ;) Cheers, Clark |
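The doubled-quote convention mentioned in this message can be checked with Python's standard `csv` module, which by default treats `""` inside a quoted field as one literal `"`. The sketch below parses such a field and then rewrites it with the backslash escape that YAML's double-quoted style uses. The helper names are hypothetical, and the escaping helper handles only backslashes and double quotes, not control characters.

```python
import csv
import io


def parse_csv_line(line):
    """Parse a single CSV record with the csv module's default dialect."""
    return next(csv.reader(io.StringIO(line)))


def to_yaml_double_quoted(value):
    """Quote a string YAML-style, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


fields = parse_csv_line('key,"one "" two"')
quoted = to_yaml_double_quoted(fields[1])
```

Here `fields[1]` is the string `one " two`, and `quoted` is the text `"one \" two"`, which shows the gap between the two escaping conventions.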
From: trans. (T. Onoma) <tra...@ru...> - 2004-12-13 13:28:11
On Monday 13 December 2004 01:38 am, Clark C. Evans wrote:
|  --- @,
|    , Dysmorphology                         , Physical Examination
|    , Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...
|    ,               ,                 , ... ,        ,        , ...
|    , Not Evaluated , Positive        , ... ,  63    , 130    , ...
|            ... thousands of rows ...
|    , Positive      , Not Evaluated   , ... ,  42    ,  80    , ...

Perhaps there's a better way to handle multi-column rows, but this already
seems doable:

---
- [ Dysmorphology                             , Physical Examination      ]
- [ [ Acondroplasia , Marfan Syndrome , ... ] , [ Height , Weight , ... ] ]
- [ [               ,                 , ... ] , [        ,        , ... ] ]
- [ [ Not Evaluated , Positive        , ... ] , [  63    , 130    , ... ] ]
#         ... thousands of rows ...
- [ [ Positive      , Not Evaluated   , ... ] , [  42    ,  80    , ... ] ]


T.
From: Tim H. <tim...@co...> - 2004-12-15 09:10:36
Clark C. Evans wrote:

[CHOP]

>Right. This perhaps is the better approach. Use @ to signify
>"tabular data" followed by the delimiter (or perhaps no character
>to signify a 'here' document).
>

(If you did this, wouldn't you want a way to indicate space-delimited documents? No character might be natural for that, since block scalars already take care of the here-document case. And what about tab-delimited, would that be @\t?)

> --- @,
>   , Dysmorphology , Physical Examination
>   , Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...
>   , , , ... , , , ...
>   , Not Evaluated , Positive , ... , 63 , 130 , ...
>   ... thousands of rows ...
>   , Positive , Not Evaluated , ... , 42 , 80 , ...
> ...
>
>I like it.
>

[CHOP]

Why all the leading commas? And what's the advantage of the above over:

---
- [Dysmorphology , Physical Examination]
- [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...]
- [Not Evaluated , Positive , ... , 63 , 130 , ...]
... thousands of rows ...
- [Positive , Not Evaluated , ... , 42 , 80 , ...]
...

Just ease of import / export to CSV? And is the second equivalent to the first? In other words, how would this CSV thingy map onto the YAML model?

[CHOP]

>The latter is similar to our block scalar, and that's why we don't
>need a HERE document. This issue is different, unlike a HERE
>document, there isn't a good readable alternative in current YAML
>for writing large tabular data sets (modeled as a sequence of
>mappings).
>

Hmmm. I don't think CSV tends to map all that well to a sequence of mappings. Always using the first line as a header seems fragile. I'd represent it as a sequence of sequences and let the app sort it out.

-tim
From: Clark C. E. <cc...@cl...> - 2004-12-13 13:22:52
On Mon, Dec 13, 2004 at 04:51:19AM -0700, Tim Hochberg wrote: | > --- @, | > , Dysmorphology ,, ... , Physical Examination ,, ... | > , Acondroplasia , Marfan Syndrome , ... , Height , Weight , ... | > , , , ... , , , ... | > , Not Evaluated , Positive , ... , 63 , 130 , ... | > ... thousands of rows ... | > , Positive , Not Evaluated , ... , 42 , 80 , ... | > ... | > | | Whay all the leading commas? So that leading indentation is handled cleanly, you can delete the top and bottom of the YAML document, exposing only the tabular leaf. This region could then be imported cleanly into Excel, where the leading spaces (no matter how far indented) all end up in the first column which can be ignored. And what's the advantage to the above over: | | --- | - [Dysmorphology , Physical Examination] | - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] | - [Not Evaluated , Positive , ... , 63 , 130 , ...] | ... thousands of rows ... | - [Positive , Not Evaluated , ... , 42 , 80 , ...] | ... | | Just ease of import / export to CSV? Partly, but that's not the primary reason. | And is the second equivalent to the first? In other words how would | this CSV thingy map onto the YAML model? That's the reason. This would be syntax sugar for... --- - Dysmorphology: Acondroplasia: Not Evaluated Marfan Syndrome: Positive Physical Examination: Height: 63 Weight: 130 #... other columns ... # ... thousands of rows ... - Dysmorphology: Acondroplasia: Positive Marfan Syndrome: Not Evaluated ... other columns ... Physical Examination: Height: 42 Weight: 80 ... | >The latter is similar to our block scalar, and that's why we don't | >need a HERE document. This issue is different, unlike a HERE | >document, there isn't a good readable alternative in current YAML | >for writing large tabular data sets (modeled as a sequence of | >mappings). | | Hmmm. I don't think CSV tends to map all that well to a sequence of | mappings. Always using the first line as a header seems fragile. 
I'd | represent it as a sequence of sequences and let the app sort it out.

But that's not the common use case. The common case is to use the first row as mapping keys, i.e. column headers. A less common, but also very useful, pattern is to allow a set of column headers to provide for nested mappings, like the example above. In this case, you need a blank row between the header and the content.

It's not that I really _need_ to represent CSV; it's that I'd like a more succinct way to write sequences of mappings, where each item in the sequence has the same mapping keys. The nested key thingy above is just icing (and perhaps icing that is too confusing). That this structure could be modeled as looking like CSV is nice, but quite beside the point. The primary problem is having a clear representation so I don't have to duplicate all of the column headers again and again and again. It's syntax sugar.

Best,

Clark

--
Clark C. Evans
Prometheus Research, LLC.
http://www.prometheusresearch.com/
office: +1.203.777.2550
mobile: +1.203.444.0557
Prometheus Research: Transforming Data Into Knowledge
 - Research Exchange Database
 - Survey & Assessment Technologies
 - Software Tools for Researchers
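The nested-header pattern described in this message can be sketched in Python as follows. This is one possible reading, not a settled rule: it assumes a blank cell in the group row means "same group as the nearest non-blank cell to its left", and that a blank cell in the key row means the column is a plain (un-nested) key. The function name `rows_to_nested` is hypothetical, and cells are assumed to be already split into lists of strings.

```python
def rows_to_nested(groups, keys, rows):
    """Build a list of nested dicts from a group row, a key row, and data rows."""
    # Fill blank group cells from the nearest non-blank cell to the left.
    filled, current = [], ""
    for group in groups:
        current = group or current
        filled.append(current)

    records = []
    for row in rows:
        record = {}
        for group, key, value in zip(filled, keys, row):
            if key:
                record.setdefault(group, {})[key] = value   # nested key
            else:
                record[group] = value                        # plain column
        records.append(record)
    return records


groups = ["Dysmorphology", "", "Physical Examination", ""]
keys = ["Acondroplasia", "Marfan Syndrome", "Height", "Weight"]
rows = [
    ["Not Evaluated", "Positive", "63", "130"],
    ["Positive", "Not Evaluated", "42", "80"],
]
nested = rows_to_nested(groups, keys, rows)
```

With these inputs, each record has two top-level keys, each holding a mapping of its own columns, which is the sequence-of-mappings shape the message describes.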
From: Oren Ben-K. <or...@be...> - 2004-12-13 18:23:27
On Monday 13 December 2004 08:38, Clark C. Evans wrote: > Thanks for humoring the discussion. Its just the curse again. I've checked-in a reasonable CVS version of the draft, and even (gasp) fired up VI on the C parser code. Obviously the universe was forced to generate another "let's add stuff to the spec" thread :-) > I guess there are two > particular needs that I'm mixing/addressing: > > 1. A very nice "presentation" tabular form (that would not > be great for editing since columns wouldn't line up). As Tim pointed out, this: - [Dysmorphology , , Physical Examination ] - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] - [Not Evaluated , Positive , ... , 63 , 130 , ...] Works fine for that. The extra '- [' at the start and the ']' at the end are not a problem. Also, they let you wrap lines which is _great_ for presentation (and can't be done in a CSV). So, just for presentation, we don't need anything we don't already have. Of course, if you want a tabular syntax to express a sequence of mappings, well, that's a whole separate issue. It seems to me that defining exactly how CSV (or CSV-like) column headers get converted to nested mapping keys would be a huge PITA. > 2. The ability to directly include delimited tables, that > are indented when put into YAML, this is helpful since > a bulk of the data files used in a biomedical lab (where > I'm working) are CSV files. I think this applies to all > sorts of other fields... tables are just so common. > > Your take on these items is that we need special import/export > functions, Quite. > Not a bad idea, but let me play a bit more... It seems that you always need some import/export code, no matter what you do: > Well, 99% of the time, the import is easy of this table... > ... > That said... it requires an extra step through the > import Wizard -- so you are correct. ;-) > | > - Within a given "cell", normal flow structure can occur... > | > | All this seems incompatible with Excel... 
> > And that's just OK, these are rare cases.. My point was it wouldn't be a "normal flow structure"; it would be more like the restrictions we have on simple plain mapping keys. > If we allowed for > double "" to be the same as a single " it would be even closer. Is > the following currently valid YAML? > > key: "one "" two" No. It is an error. We could make it mean 'one " two', I suppose, but all our " escape sequences are based on \, it would stand out like a sore thumb. > | I think all this is uncalled for. It is rather un-yaml-ish. > > Now _this_ is the best argument against it! ;) Here's another: YAGNI. Nobody "really" needs it. What GUI people would want is two menu entries: "export to yaml" and "import from yaml" (or maybe even YAML as an OLE data source, or some other such nastiness). Also, they do not save their files as "stuff.csv", they use "stuff.xls". So, if you want to get them a separate tool, it should be "xls2yaml" rather than "csv2yaml", and probabbly be GUI-based rather than a command line. BTW, none of these should be very difficult; Microsoft Office has many faults, but it is very scriptable/extendible if you know how. CLI people, being CLI people, would have no problem using csv2yaml and yaml2csv, and in fact would probably prefer it that way :-) There's an in-between type of folk - "smart editor" users. VI users can invoke the yaml2csv/csv2yaml from their editor (type ":.!csv2yaml -i 12 foo.csv<enter>" and get "foo.csv" converted to YAML, indented 12 spaces, and inserted at the current cursor position). Emacs users can also do something like that, but they'll probably prefer to write a YAML parser in lisp that will import data from any SQL source and pretty-print it in their file, using the Ctl-Alt-Y Hyper-Meta-I key :-) I don't see that the set of users who need to cut&paste text in notepad from a csv to YAML is that large... It just doesn't seem to justify the effort (and un-yaml-ness) required. 
Besides, we got to leave something to YAML 2.0, right? :-) Have fun, Oren Ben-Kiki |
From: Clark C. E. <cc...@cl...> - 2004-12-14 03:20:26
On Mon, Dec 13, 2004 at 08:23:14PM +0200, Oren Ben-Kiki wrote:
| I've checked-in a reasonable CVS version of the draft

Which repository? SF.NET?

| > 1. A very nice "presentation" tabular form (that would not
| >    be great for editing since columns wouldn't line up).
|
| As Tim pointed out, this:
|
| - [Dysmorphology , , Physical Examination ]
| - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...]
| - [Not Evaluated , Positive , ... , 63 , 130 , ...]

Not the same... this is a sequence of sequences. I'm looking for a nice syntax for sequences of _mappings_ ;)

| Of course, if you want a tabular syntax to express a sequence of
| mappings, well, that's a whole separate issue.

Yes, that's the issue -- sequences of mappings, aka lists of objects, are just an eyesore when the set of keys is heavily repeating. Something like what I originally posted would be nice...

> ---
> | name         | hr | ave
> |--------------+----+------
> | Mark McGwire | 65 | 0.278
> | Sammy Sosa   | 63 | 0.288
> ...
>
> This particular structure would then be equivalent to
> example 2.4 in the specification,
>
> ---
> -
>   name: Mark McGwire
>   hr: 65
>   avg: 0.278
> -
>   name: Sammy Sosa
>   hr: 63
>   avg: 0.288
> ...

...

| It seems to me that
| defining exactly how CSV (or CSV-like) column headers get converted to
| nested mapping keys would be a huge PITA.

I'm really not that concerned about CSV compatibility - really!

| What GUI people would want is two menu entries: "export to yaml" and
| "import from yaml"
...
| CLI people, being CLI people, would have no problem using csv2yaml and
| yaml2csv, and in fact would probably prefer it that way :-)
...
| I don't see that the set of users who need to cut&paste text in notepad
| from a csv to YAML is that large... It just doesn't seem to justify the
| effort (and un-yaml-ness) required.

Right. I'm sold, CSV-in-YAML is a _bad_ idea. However, I still have a ton of YAML data that is tabular (and modeled in my programming languages as a sequence of mappings). I'd like a nice clean way to express this, and it'd be nice if it allowed for a "pivot table"-like format...

> ---
> | player       | statistic
> |              | hr | ave
> |--------------+----+------
> | Mark McGwire | 65 | 0.278
> | Sammy Sosa   | 63 | 0.288
> ...
>
> being mapped to...
>
> ---
> -
>   player: Mark McGwire
>   statistic:
>     hr: 65
>     avg: 0.278
> -
>   player: Sammy Sosa
>   statistic:
>     hr: 63
>     avg: 0.288
> ...

Now that would make a lot of my YAML data _super_ nice to read! But that said, if we have a nice syntax we can add this later... that's good 'enuff for now.

Best,

Clark
From: trans. (T. Onoma) <tra...@ru...> - 2004-12-14 03:59:15
On Monday 13 December 2004 10:20 pm, Clark C. Evans wrote: | On Mon, Dec 13, 2004 at 08:23:14PM +0200, Oren Ben-Kiki wrote: | | I've checked-in a reasonable CVS version of the draft | | Which repository? SF.NET? | | | > 1. A very nice "presentation" tabular form (that would not | | > be great for editing since columns wouldn't line up). | | As _Tom_ pointed out, this: | | | | - [Dysmorphology , , Physical Examination ] | | - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] | | - [Not Evaluated , Positive , ... , 63 , 130 , ...] | | Not the same... this is a sequences of sequences. I'm looking | for a nice syntax for sequences of _mappings_ ;) It should be a simple matter of tagging. --- !ClarksMedTable - [Dysmorphology , , Physical Examination ] - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] - [Not Evaluated , Positive , ... , 63 , 130 , ...] Your application has every right to transform this into a sequence of mappings. T. |
From: David H. <dav...@bl...> - 2004-12-14 04:13:06
trans. (T. Onoma) wrote: > On Monday 13 December 2004 10:20 pm, Clark C. Evans wrote: > | | > | | - [Dysmorphology , , Physical Examination ] > | | - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] > | | - [Not Evaluated , Positive , ... , 63 , 130 , ...] > | > | Not the same... this is a sequences of sequences. I'm looking > | for a nice syntax for sequences of _mappings_ ;) > > It should be a simple matter of tagging. > > --- !ClarksMedTable > - [Dysmorphology , , Physical Examination ] > - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] > - [Not Evaluated , Positive , ... , 63 , 130 , ...] > > Your application has every right to transform this into a sequence of > mappings. This can't be manipulated generically by application-independent code that groks tables. How about this: --- !!table fields: - [Dysmorphology , , Physical Examination ] - [Acondroplasia , Marfan Syndrome , Height , Weight ] rows: - [Not Evaluated , Positive , 63 , 130 ] - # ... thousands of rows ... - [Positive , Not Evaluated , 42 , 80 ] -- David Hopwood <dav...@bl...> |
From: Oren Ben-K. <or...@be...> - 2004-12-14 07:45:32
On Tuesday 14 December 2004 05:20, Clark C. Evans wrote: > | I've checked-in a reasonable CVS version of the draft > > Which repository? SF.NET? yaml.org:/home/cvs > Right. I'm sold, CVS-in-YAML is a _bad_ idea. However, I still > have a ton of YAML data that is tabular (and modeled in my > programming languages as a sequence of mappings). I'd like a > nice clean way to express this, and it'd be nice if it allowed > for "pivot table" like format... The closest you can get today is: --- - { player: Mark McGwire, statistics: { hr: 65, ave: 0.278 } } - { player: Sammy Sosa , statistics: { hr: 63, ave: 0.288 } } ... It is much less compact (folding lines helps a bit). On the other hand, if you have 27 fields and you are down at the 100th record, you don't have to count columns to know which field is which. Excel allows you to freeze the header line so you only scroll the content, but you can't do that in an editor. I suppose you could repeat the headers line in a comment... So, you make a good case for it, but I'm not quite convinced yet. > But that said, if we have a nice syntax we can add this later... > that's good 'enuff for now. I agree. Let's "table" the issue :-) Side note: Onoma wrote: > It should be a simple matter of tagging. > > --- !ClarksMedTable > - [Dysmorphology , , Physical Examination ] > - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...] > - [Not Evaluated , Positive , ... , 63 , 130 , ...] And David correctly replied: > This can't be manipulated generically by application-independent code > that groks tables. True, and it is actually trickier than that. We have intentionally set a restriction that the "kind" of a tag is fixed for all time. That is, if a "!foo" tag is a sequence in the syntax, it is also a sequence in the YAML representation (the native data type can be any cave drawing, of course). The reason is, as you said, to allow YAML data to be manipulated by generic tools. 
So, if you do something like: > --- !!table > fields: > - [Dysmorphology , , Physical Examination ] > - [Acondroplasia , Marfan Syndrome , Height , Weight ] > rows: > - [Not Evaluated , Positive , 63 , 130 ] > - # ... thousands of rows ... > - [Positive , Not Evaluated , 42 , 80 ] A generic tool will use the ypath "/rows[0][0]" to get to "Not Evaluated". Which means you can NOT say that the above is "equivalent" to: --- - Dysmorphology: Acondroplasia: Not Evaluated Marfan Syndrome: Positive Physical Examination: Height: 63 Weight: 130 ... Because there a generic tool will need the ypath "[0]/Dysmorphology/Acondroplasia" to get to the same value. So, the above two are completely different beasts as far as YAML is concerned. You could argue that a generic tool should know about the common "!!table" type, and would therefore allow the second ypath format for the first document. But what's allowed for common types is also allowed for application specific types... So this really messes up things - ypaths become dependant on the set of recognized tags, you get "multiple views" of the same graph, etc. It essentially makes "generic YAML tools" impossible. Bottom line, we decided "tag implies kind" and accepted that we exclude some things like expressing sparse sequences as mappings (!!seq {2: foo, 1934: bar}) - or tricks like your pivot table. That said, you _could_ define a tag ("!!pivot") that would behave like you want - as long as you accept that this is _not_ alternate syntax for "!!seq of !!map", it is a different type. So, if you load a "!!pivot" to your application, you get something that is different from a vanilla sequence-of-mappings. Specifically, in a "!!pivot", you must ensure all mappings have the same set of keys; adding a new one is a global operation that affects all mappings, not a local one done to each one separately. 
Also, a "!!pivot" is always presented the way you have shown above, so all ypaths accessing a "!!pivot" will always do so in terms of "/row[n][m]" rather than in terms of "[n]/mth-key". It boils down to YAML.pm not being able to load a "!!pivot" to a simple Perl array-of-hashes.

Thus using "!!pivot" vs. "!!seq of !!map" becomes not a mere syntactical issue; it becomes a schema-level representation issue. Picking one over the other should depend on the way you perceive your data to be constructed rather than the way you'd like it to be presented. If you are writing a spreadsheet application, "!!pivot" might be a good match. If you are writing an SQL-based application, it would probably be a bad match. Even though the data is the exact same data!

Now, this is NOT a good thing. You'd like to have one way to represent the data, and many ways to present it. We can't avoid this sort of thing completely. Is a timestamp a scalar or a mapping with year, month, date etc.? How about a float - should it be a mapping with sign, mantissa, exponent? How about a complex - should it be a mapping with real and imaginary keys? There are _tons_ of choices of this type. We do try to minimize them, though. So we did not include a "!!sparse" tag in the repository, and IMO should not include a "!!pivot" tag there either.

What makes Clark's "pivot table" style concept attractive is that it defuses the dilemma (for the "!!pivot" case, anyway). You can represent the data as a sequence of mappings, the way it "should" be, and still present it any way you like - including in a tabular form.

All the above said, I still think using a sequence of flow mappings is good enough for now :-)

Have fun,

Oren Ben-Kiki
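The point about paths in this message can be illustrated with a small Python sketch. It models the two documents as plain Python structures and uses lists of keys and indexes as a stand-in for ypath expressions; this is not real ypath syntax, and the `walk` helper is hypothetical. The same value sits behind two different paths, which is why a generic tool cannot treat the two shapes as interchangeable.

```python
def walk(node, path):
    """Follow a list of mapping keys / sequence indexes down from the root."""
    for step in path:
        node = node[step]
    return node


# Shape 1: a table-like mapping with 'fields' and 'rows'.
table_doc = {
    "fields": [
        ["Dysmorphology", "", "Physical Examination", ""],
        ["Acondroplasia", "Marfan Syndrome", "Height", "Weight"],
    ],
    "rows": [
        ["Not Evaluated", "Positive", 63, 130],
        ["Positive", "Not Evaluated", 42, 80],
    ],
}

# Shape 2: a plain sequence of (nested) mappings.
mapping_doc = [
    {
        "Dysmorphology": {"Acondroplasia": "Not Evaluated", "Marfan Syndrome": "Positive"},
        "Physical Examination": {"Height": 63, "Weight": 130},
    },
]

from_table = walk(table_doc, ["rows", 0, 0])
from_mappings = walk(mapping_doc, [0, "Dysmorphology", "Acondroplasia"])
```

Both lookups reach the same cell, but neither path works on the other document: `["rows", 0, 0]` fails on the sequence of mappings, and `[0, "Dysmorphology", "Acondroplasia"]` fails on the table.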
From: trans. (T. Onoma) <tra...@ru...> - 2004-12-14 11:41:48
|
Throwing out some thoughts...

On Tuesday 14 December 2004 02:45 am, Oren Ben-Kiki wrote:
| The closest you can get today is:
|
| ---
| - { player: Mark McGwire, statistics: { hr: 65, ave: 0.278 } }
| - { player: Sammy Sosa  , statistics: { hr: 63, ave: 0.288 } }
| ...
|
| It is much less compact (folding lines helps a bit). On the other hand,
| if you have 27 fields and you are down at the 100th record, you don't
| have to count columns to know which field is which. Excel allows you to
| freeze the header line so you only scroll the content, but you can't do
| that in an editor. I suppose you could repeat the headers line in a
| comment...
|
| So, you make a good case for it, but I'm not quite convinced yet.

I was just thinking about how one might save space in representing the
syntax in a program. For instance in Ruby:

  template = "- { player: %s, statistics: { hr: %i, ave: %.3f } }\n"
  s = "---\n"
  s << template % [ 'Mark McGwire', 65, 0.278 ]
  s << template % [ 'Sammy Sosa', 63, 0.288 ]
  # ...

I wonder if YAML itself could incorporate some type of template
mechanism.

| > But that said, if we have a nice syntax we can add this later...
| > that's good 'enuff for now.
|
| I agree. Let's "table" the issue :-)
|
| Side note:
|
| Onoma wrote:
| > It should be a simple matter of tagging.
| >
| > --- !ClarksMedTable
| > - [Dysmorphology , , Physical Examination ]
| > - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...]
| > - [Not Evaluated , Positive , ... , 63 , 130 , ...]
|
| And David correctly replied:
| > This can't be manipulated generically by application-independent code
| > that groks tables.
|
| True, and it is actually trickier than that. We have intentionally set a
| restriction that the "kind" of a tag is fixed for all time. That is, if
| a "!foo" tag is a sequence in the syntax, it is also a sequence in the
| YAML representation (the native data type can be any cave drawing, of
| course).
|
| The reason is, as you said, to allow YAML data to be manipulated by
| generic tools. So, if you do something like:
|
| > --- !!table
| > fields:
| >   - [Dysmorphology , , Physical Examination ]
| >   - [Acondroplasia , Marfan Syndrome , Height , Weight ]
| > rows:
| >   - [Not Evaluated , Positive , 63 , 130 ]
| >   - # ... thousands of rows ...
| >   - [Positive , Not Evaluated , 42 , 80 ]
|
| A generic tool will use the ypath "/rows[0][0]" to get to "Not
| Evaluated". Which means you can NOT say that the above is "equivalent"
| to:
|
| ---
| - Dysmorphology:
|     Acondroplasia: Not Evaluated
|     Marfan Syndrome: Positive
|   Physical Examination:
|     Height: 63
|     Weight: 130
| ...
|
| Because there a generic tool will need the ypath
| "[0]/Dysmorphology/Acondroplasia" to get to the same value. So, the
| above two are completely different beasts as far as YAML is concerned.
|
| You could argue that a generic tool should know about the common
| "!!table" type, and would therefore allow the second ypath format for
| the first document. But what's allowed for common types is also allowed
| for application specific types... So this really messes up things -
| ypaths become dependent on the set of recognized tags, you get
| "multiple views" of the same graph, etc. It essentially makes "generic
| YAML tools" impossible.

It does? Wouldn't the generic "views" still work, even if there were
special "pseudo-views" added-on?

| Bottom line, we decided "tag implies kind" and accepted that we exclude
| some things like expressing sparse sequences as mappings (!!seq {2:
| foo, 1934: bar}) - or tricks like your pivot table.
|
| That said, you _could_ define a tag ("!!pivot") that would behave like
| you want - as long as you accept that this is _not_ alternate syntax
| for "!!seq of !!map", it is a different type. So, if you load a
| "!!pivot" to your application, you get something that is different from
| a vanilla sequence-of-mappings.
| Specifically, in a "!!pivot", you must
| ensure all mappings have the same set of keys; adding a new one is a
| global operation that affects all mappings, not a local one done to
| each one separately. Also, a "!!pivot" is always presented the way you
| have shown above, so all ypaths accessing a "!!pivot" will always do so
| in terms of "/row[n][m]" rather than in terms of "[n]/mth-key". It
| boils down to YAML.pm not being able to load a "!!pivot" to a simple
| Perl array-of-hashes.

A table is not a sequence-of-mappings --it is a different type. If one
uses a table as the means of definition then the globalness you mention
is necessarily inherent to the type. Table data is a sequence of
sequences. That one might want it to be a sequence of mappings is
therefore strictly a transformation. But perhaps all that is really
wanted is a way to access it _as if_ it were a sequence of mappings?

| Thus using "!!pivot" vs. "!!seq of !!map" becomes not a mere syntactical
| issue, it becomes a schema-level representation issue. Picking one over
| the other should depend on the way you perceive your data to be
| constructed rather than the way you'd like it to be presented. If you
| are writing a spreadsheet application, "!!pivot" might be a good match.
| If you are writing an SQL-based application, it would probably be a bad
| match. Even though the data is the exact same data!

Right. But what if I want to use it both ways?

| Now, this is NOT a good thing. You'd like to have one way to represent
| the data, and many ways to present it.

Or, one way to represent it and many ways to access it.

| We can't avoid this sort of
| thing completely. Is a timestamp a scalar or a mapping with year,
| month, date etc.? How about a float - should it be a mapping with sign,
| mantissa, exponent? How about a complex - should it be a mapping with
| real and imaginary keys? There are _tons_ of choices of this type. We
| do try to minimize them, though.
| So we did not include a "!!sparse" tag
| in the repository, and IMO should not include a "!!pivot" tag there
| either.

So could a ypath be programmable? Such that, for instance, "[n]/mth-key"
might be translated into "/row[n][m]", depending on the tag?

| What makes Clark's "pivot table" style concept attractive is it defuses
| the dilemma (for the "!!pivot" case, anyway). You can represent the
| data as a sequence of mappings, the way it "should" be, and still
| present it any way you like - including in a tabular form.

| All the above said, I still think using a sequence of flow mappings is
| good enough for now :-)

Seems to me there really is a difference between one of these
flow-mappings and a Record.

I think YAML has a tricky time because it tries to be an actual data
structure rather than just represent it. But it gets tied up in certain
places, like ordered mappings and these table records and what not.
That's where XML seems to have an advantage. It has its own data
structure of course, but it is only relevant to the XML-as-XML, the data
can still be anything it needs to be to the application.

But I guess that's the thing, YAML ain't markup language --it's data
language. Which seems a bit harder to tame.

T.
|
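The Ruby template idea from this message can be sketched so that the generated text actually parses. This is only an illustration under two assumptions that go beyond the message itself: each record is written as a flow mapping (the form Oren showed), and the average uses `%.3f`, since an integer conversion would drop the fraction. The result is checked by loading it back with Ruby's standard `yaml` library.

```ruby
require 'yaml'

# One format string per record; the field names live in the template
# instead of being repeated by hand for every row.
template = "- { player: %s, statistics: { hr: %i, ave: %.3f } }\n"

doc = +"---\n"
doc << template % ['Mark McGwire', 65, 0.278]
doc << template % ['Sammy Sosa', 63, 0.288]

# Load the generated text back to confirm it is a sequence of mappings.
players = YAML.load(doc)

puts doc
puts players.first['statistics']['hr']
```

The field names are still repeated in the emitted document; the template only saves typing in the program that produces it.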
From: trans. (T. Onoma) <tra...@ru...> - 2004-12-14 12:06:17
|
On Tuesday 14 December 2004 06:40 am, trans. (T. Onoma) wrote:
| | All the above said, I still think using a sequence of flow mappings is
| | good enough for now :-)
|
| Seems to me there really is a difference between one of these flow-mappings
| and a Record.

But that's not to say that a flow mapping can't be good enough ;)

T.
|
From: Oren Ben-K. <or...@be...> - 2004-12-14 19:05:32
|
On Tuesday 14 December 2004 13:40, trans. (T. Onoma) wrote:
> I was just thinking about how one might save space in representing
> the syntax in a program. For instance in Ruby:
>
>   template = "- { player: %s, statistics: { hr: %i, ave: %.3f } }\n"
>   s = "---\n"
>   s << template % [ 'Mark McGwire', 65, 0.278 ]
>   s << template % [ 'Sammy Sosa', 63, 0.288 ]
>   # ...
>
> I wonder if YAML itself could incorporate some type of template
> mechanism.

Interesting notion, but unless someone comes up with a really neat
mechanism I fear it would be too much of a monster. I'm also much less
worried about repeating the field names in the file. They hardly affect
the compressed size (if disk space or communication bandwidth is the
issue); they are trivial to cut & paste (so having to type a lot isn't
an issue); and personally I think they are more readable (you don't
have to refer to a headers row a few hundred lines above to figure out
what field #13 means).

> | ... So this really messes up
> | things - ypaths become dependent on the set of recognized tags, you
> | get "multiple views" of the same graph, etc. It essentially makes
> | "generic YAML tools" impossible.
>
> It does? Wouldn't the generic "views" still work, even if there were
> special "pseudo-views" added-on?

Not unless you agreed never to use the "pseudo-views" in your generic
tool, which defeats the purpose. As for allowing a "generic" tool to
support several simultaneous conflicting views of the same graph based
on the set of recognized tags - that's a real mess.

> A table is not a sequence-of-mappings --it is a different type.

In YAML, it's either a sequence-of-sequence or a sequence-of-mappings.
We already have a very nice syntax for sequence-of-sequence (as Tim and
David pointed out). What Clark wanted was a "very nice syntax" for
sequence-of-mapping. Call it a "pivot" or a "spreadsheet". I agree
"table" is a bad name.

> | Now, this is NOT a good thing.
> | You'd like to have one way to
> | represent the data, and many ways to present it.
>
> Or, one way to represent it and many ways to access it.

True, and this is actually the case anyway (there's more than one ypath
expression to reach any node). But that doesn't solve Clark's problem.

> So could a ypath be programmable? Such that, for instance,
> "[n]/mth-key" might be translated into "/row[n][m]", depending on the
> tag?

Again, this means that ypath would be different for tags you
"recognize" and tags you don't know about. Which means I won't be able
to apply a ypath to a file without first loading some "schema" defining
the set of recognized tags and how to handle them. This gets us back to
the dark ages of SGML where you couldn't touch a document without first
parsing a DTD. We _really_ want YAML documents to be something you can
manipulate _without_ knowing the tags.

> I think YAML has a tricky time because it tries to be an actual data
> structure rather than just represent it.

"True, but mostly irrelevant". Saying you "merely represent" data, when
you go deeply enough into it, ends up defining an "actual data
structure". It isn't like you have a real choice in the matter.

> But it gets tied up in
> certain places, like ordered mappings and these table records and
> what not.

That's also true. Trying to be an "actual data structure" is hard. But
it is worth it.

> That's where XML seems to have an advantage.

This is where XML has a _disadvantage_. Because XML is "just a syntax",
with the information model bolted on top as an afterthought, you have
to worry about two things: the mismatch between your data structure and
XML's structure (which can be very significant), and also what
different XML tools will make of it (because they each have a slightly
different way to interpret the same XML).
Example: The YAML spec is in docbook and I had to insert workarounds
for line breaks in certain places because the docbook-to-fo and
docbook-to-html convertors treat them differently. Note, this is the
classical use of XML (a structured document with markup) and both
convertors are part of the same application - and they _still_ read the
same XML differently! In YAML this would have never happened, even if
the convertors were written by different developers on different sides
of the world who never spoke to each other, because we aimed at an
"actual data structure". There is exactly one "semantic" for each YAML
file.

It is most definitely worth the effort.

Have fun,

Oren Ben-Kiki
|
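The objection to a "programmable" ypath in this message can be illustrated with a small Ruby sketch. Everything here is hypothetical: `TAG_VIEWS`, `lookup`, and the single-header-row "!!pivot" layout are invented for illustration and do not correspond to any ypath implementation. The point is only that a record-style lookup on pivot data works when the tool carries an entry for the tag and fails when it does not.

```ruby
# Hypothetical registry: tag => translation of a record-style lookup
# (row n, key) into an access on that tag's own layout.
TAG_VIEWS = {
  '!!pivot' => ->(node, n, key) {
    m = node.fetch('fields').index(key)
    node.fetch('rows')[n][m]
  }
}.freeze

# Record-style lookup, roughly "[n]/key". Recognized tags are
# translated; anything else is treated as a sequence of mappings.
def lookup(node, tag, n, key)
  view = TAG_VIEWS[tag]
  view ? view.call(node, n, key) : node[n][key]
end

pivot = {
  'fields' => %w[name hr avg],
  'rows' => [['Mark McGwire', 65, 0.278], ['Sammy Sosa', 63, 0.288]]
}
plain = [{ 'name' => 'Mark McGwire', 'hr' => 65, 'avg' => 0.278 }]

puts lookup(pivot, '!!pivot', 1, 'hr')
puts lookup(plain, '!!seq', 0, 'hr')

# The same data under a tag the tool does not recognize: the
# record-style path no longer applies.
begin
  lookup(pivot, '!unknown', 1, 'hr')
rescue NoMethodError
  puts 'record-style path fails without the tag entry'
end
```

Whether a given path resolves therefore depends on what is in the registry, which is the schema dependency the message argues against.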
From: David H. <dav...@bl...> - 2004-12-14 19:45:14
|
Oren Ben-Kiki wrote:
> Onoma wrote:
>
>>It should be a simple matter of tagging.
>>
>>--- !ClarksMedTable
>> - [Dysmorphology , , Physical Examination ]
>> - [Acondroplasia , Marfan Syndrome , ... , Height , Weight , ...]
>> - [Not Evaluated , Positive , ... , 63 , 130 , ...]
>
>
> And David correctly replied:
>
>>This can't be manipulated generically by application-independent code
>>that groks tables.
>
> True, and it is actually trickier than that. We have intentionally set a
> restriction that the "kind" of a tag is fixed for all time. That is, if
> a "!foo" tag is a sequence in the syntax, it is also a sequence in the
> YAML representation (the native data type can be any cave drawing, of
> course).
>
> The reason is, as you said, to allow YAML data to be manipulated by
> generic tools. So, if you do something like:
>
>>--- !!table
>> fields:
>>   - [Dysmorphology , , Physical Examination ]
>>   - [Acondroplasia , Marfan Syndrome , Height , Weight ]
>> rows:
>>   - [Not Evaluated , Positive , 63 , 130 ]
>>   - # ... thousands of rows ...
>>   - [Positive , Not Evaluated , 42 , 80 ]
>
>
> A generic tool will use the ypath "/rows[0][0]" to get to "Not
> Evaluated". Which means you can NOT say that the above is "equivalent"
> to:
>
> ---
> - Dysmorphology:
>     Acondroplasia: Not Evaluated
>     Marfan Syndrome: Positive
>   Physical Examination:
>     Height: 63
>     Weight: 130
> ...

Indeed it is not equivalent. However, a generic tool can *convert* it
from one representation to the other, without knowing anything about
the application.

> That said, you _could_ define a tag ("!!pivot") that would behave like
> you want - as long as you accept that this is _not_ alternate syntax
> for "!!seq of !!map", it is a different type.

Absolutely, it is a different type. The application would request a
conversion. I don't think this is as much of a problem as your post
suggests.

--
David Hopwood <dav...@bl...>
|
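The conversion David describes can be sketched in a few lines of Ruby. This is a minimal illustration, not a proposal for how a YAML tool should do it: `table_to_records` is a hypothetical helper, the table is assumed to be already loaded into plain hashes and arrays, and an empty or missing cell in the first header row is assumed to mean "same group as the cell to its left".

```ruby
# Hypothetical loaded form of the "fields"/"rows" document; the tag is
# ignored and the empty header cell is written as nil.
table = {
  'fields' => [
    ['Dysmorphology', nil, 'Physical Examination'],
    ['Acondroplasia', 'Marfan Syndrome', 'Height', 'Weight']
  ],
  'rows' => [
    ['Not Evaluated', 'Positive', 63, 130],
    ['Positive', 'Not Evaluated', 42, 80]
  ]
}

# Convert a table with a group row and a name row into a sequence of
# nested mappings. Uses only the table's own structure.
def table_to_records(table)
  groups, names = table.fetch('fields')
  current = nil
  # Assumption: a nil or missing group cell repeats the previous group.
  group_for = names.each_index.map { |i| current = groups[i] || current }

  table.fetch('rows').map do |row|
    row.each_with_index.with_object({}) do |(value, i), record|
      (record[group_for[i]] ||= {})[names[i]] = value
    end
  end
end

records = table_to_records(table)
puts records.first.inspect
```

The helper needs no application knowledge beyond the header convention, but the result is a new structure: the converted records and the original table are still reached through different paths.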