From: Clark C . E. <cc...@cl...> - 2002-04-15 01:29:08
|
Ok. In the last week we seem to have two open issues:

1. The spec does not require a BOM for UTF-8 and it seems
   that this is industry practice so that legacy encodings
   can be handled. Also, without requiring UTF-8, it is
   a bit harder to do a #ENCODING:ISO8859-1 at a later date,
   for example.

   I suggest we amend the specification to allow strict ASCII
   (only 7 bit characters) without a BOM and require all streams
   which use Unicode to start with the UTF BOM. This should have
   little to no impact on existing YAML users since there aren't
   any Unicode parsers yet...

2. We have a potentially large lookahead for the series/key
   short-hand. I suggest limiting this case to only allow
   for in-line string (without anchors and type family).
   The impact is that the following become illegal:

   ---
   - this: is legal
   - this: [ is, also, legal]
   - this: !!remains &001 legal
   ---
   - !!this is: illegal
   - &so is: this
   - [and,this,is]: illegal too

If we all agree, I can patch up the spec in the next few days.

Best,

Clark
|
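For reference, the encoding rule proposed in item 1 can be sketched as a small check. This is only an illustration, not spec text: `classify_stream` is a hypothetical helper, and reading "the UTF BOM" as the UTF-8 or UTF-16 byte order marks is an assumption about what the proposal intends.

```python
import codecs

# Byte order marks accepted by this sketch (assumption: UTF-8 and UTF-16 only).
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def classify_stream(data: bytes) -> str:
    """Classify a byte stream under the proposed rule (hypothetical helper)."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    # No BOM: only strict 7-bit ASCII is allowed.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    raise ValueError("8-bit data without a BOM is not allowed by this rule")
```

Under this rule a stream with 8-bit bytes and no BOM is rejected outright, which is what would leave room for a later #ENCODING directive.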
From: Neil W. <neilw@ActiveState.com> - 2002-04-15 01:50:03
|
Clark C . Evans [14/04/02 21:34 -0400]:
> Ok. In the last week we seem to have two open issues:
>
> 1. The spec does not require a BOM for UTF-8 and it seems
>    that this is industry practice so that legacy encodings
>    can be handled. Also, without requiring UTF-8, it is
>    a bit harder to do a #ENCODING:ISO8859-1 at a later date,
>    for example.

Seems fine to me.

>    I suggest we amend the specification to allow strict ASCII
>    (only 7 bit characters) without a BOM and require all streams
>    which use Unicode to start with the UTF BOM. This should have
>    little to no impact on existing YAML users since there aren't
>    any Unicode parsers yet...

Actually, the Java parser Rolf sent out probably handles Unicode. I'm not
sure whether System.in reads Unicode characters by default, though, so maybe
not. His character ranges do include Unicode characters.

> 2. We have a potentially large lookahead for the series/key
>    short-hand. I suggest limiting this case to only allow
>    for in-line string (without anchors and type family).
>    The impact is that the following become illegal:
>
>    ---
>    - this: is legal
>    - this: [ is, also, legal]
>    - this: !!remains &001 legal
>    ---
>    - !!this is: illegal
>    - &so is: this
>    - [and,this,is]: illegal too

---
- ? what: about
  : this
  then: this
- then: this?

Later,
Neil
|
From: Clark C . E. <cc...@cl...> - 2002-04-15 02:10:03
|
The proposal is to patch production 93, keyed_in_series, so that
    nested_keyed_entry
is replaced with
    inline_leaf
    keyed_entry_separator
    value_node(>n)

The following would then be illegal since it uses
the "?" indicator (and there isn't a matching production).

On Sun, Apr 14, 2002 at 06:49:51PM -0700, Neil Watkiss wrote:
| ---
| - ? ERROR OCCURS HERE
|     what: about
|   : this
|   then: this
| - then: this?

|
From: Neil W. <neilw@ActiveState.com> - 2002-04-15 02:21:32
|
Clark C . Evans [14/04/02 22:15 -0400]:
> The proposal is to patch production 93, keyed_in_series, so that
>     nested_keyed_entry
> is replaced with
>     inline_leaf
>     keyed_entry_separator
>     value_node(>n)
>
> The following would then be illegal since it uses
> the "?" indicator (and there isn't a matching production).

No problem. Just being clear.

Later,
Neil
|
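The practical effect of the restriction is that a parser no longer needs to scan ahead to decide whether a series entry uses the key short-hand; a look at how the entry starts is enough. A rough sketch of that check follows. It is a heuristic over the examples in this thread, not production 93: `shorthand_key_is_legal` is a hypothetical name, and including `{` next to `[` is an assumption made by analogy.

```python
# Indicators that may not start a short-hand key under the proposal:
# type family (!), anchor (&), flow collections ([ and, by assumption, {),
# and the explicit key indicator (?).
FORBIDDEN_KEY_STARTS = ("!", "&", "[", "{", "?")


def shorthand_key_is_legal(entry: str) -> bool:
    """Check the text that follows '- ' in a series entry using key short-hand."""
    key = entry.lstrip()
    return not key.startswith(FORBIDDEN_KEY_STARTS)


# The examples from the thread.
LEGAL = ["this: is legal", "this: [ is, also, legal]", "this: !!remains &001 legal"]
ILLEGAL = ["!!this is: illegal", "&so is: this", "[and,this,is]: illegal too",
           "? what: about"]
```

Note that only the key is restricted; the value after the separator may still carry a type family, an anchor, or a flow collection.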
From: Rolf V. <rol...@he...> - 2002-04-15 07:42:54
|
Neil Watkiss wrote:
> Actually, the Java parser Rolf sent out probably handles Unicode. I'm not
> sure whether System.in reads Unicode characters by default, though, so maybe
> not. His character ranges do include Unicode characters.

In fact the byte to char conversion is not part of the code I submitted.
That can be handled by the standard Java API. So, the characters I process
are Unicode, since that is their native representation in Java. I'm not sure
what happens with characters above 65535, but I'll let Sun take care of
that. :-)

Rolf.

[ By the way, I hate to send duplicate mails: are you all on the list or
should I cc ? ]
|
From: Brian I. <in...@tt...> - 2002-04-16 02:06:17
|
On 14/04/02 21:34 -0400, Clark C . Evans wrote:
> Ok. In the last week we seem to have two open issues:
> 2. We have a potentially large lookahead for the series/key
>    short-hand. I suggest limiting this case to only allow
>    for in-line string (without anchors and type family).
>    The impact is that the following become illegal:

This also has an interesting impact on emitters. The following is illegal:

---
- &Anchor foo: bar
- *Anchor
...

The emitter has two options. Walk the entire document ahead of time or doctor
up the stream when it sees a duplicate reference. I'd probably do the latter
because that's how I currently deal with anchor emission.

The basic emission fact is that you can't start writing a stream until you've
seen the entire document. *Unless* you anchor every single node, which might
actually be useful for some streaming applications where a human never sees
the YAML.

Just stating the (perhaps not so) obvious, Brian
|
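The first option, walking the whole document before emitting anything, might look roughly like the sketch below. It is an illustration in Python rather than anyone's actual emitter: `find_shared_nodes` is a hypothetical name, only dicts and lists are treated as nodes, and leaves and mapping keys are ignored.

```python
def find_shared_nodes(root):
    """Return the ids of containers that are reachable more than once from root."""
    seen = set()     # ids of containers already visited
    shared = set()   # ids of containers that would need an anchor
    stack = [root]
    while stack:
        node = stack.pop()
        if not isinstance(node, (dict, list)):
            continue  # leaves are skipped in this sketch
        if id(node) in seen:
            shared.add(id(node))  # second sighting: anchor required
            continue
        seen.add(id(node))
        children = node.values() if isinstance(node, dict) else node
        stack.extend(children)
    return shared
```

With the set of shared ids known up front, the emitter can write the anchor on the first occurrence and never has to go back and patch the stream.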
From: Clark C . E. <cc...@cl...> - 2002-04-16 02:20:34
|
On Mon, Apr 15, 2002 at 07:06:11PM -0700, Brian Ingerson wrote:
| The basic emission fact is that you can't start writing a stream until you've
| seen the entire document. *Unless* you anchor every single node, which might
| actually be useful for some streaming applications where a human never sees
| the YAML.

Or, unless your programming language gives you reference counts
of each object. In this case you may accidentally anchor a few
objects that don't need to be anchored, but at least you don't
miss any.

;) Clark
|
From: Neil W. <neilw@ActiveState.com> - 2002-04-16 02:25:05
|
Brian Ingerson [15/04/02 19:06 -0700]:
> This also has an interesting impact on emitters. The following is illegal:
>
> ---
> - &Anchor foo: bar
> - *Anchor
> ...
>
> The emitter has two options. Walk the entire document ahead of time or doctor
> up the stream when it sees a duplicate reference. I'd probably do the latter
> because that's how I currently deal with anchor emission.
>
> The basic emission fact is that you can't start writing a stream until you've
> seen the entire document. *Unless* you anchor every single node, which might
> actually be useful for some streaming applications where a human never sees
> the YAML.

I was thinking the "emitter" would be a two-part API, with libyaml
implementing the "emitter proper" part of it (far right):

-------------      -------------------------------      -----------
| In-memory | ---> | Application-specific dumper | ---> | Emitter |
-------------      -------------------------------      -----------

My theory is that only Perl knows how to walk Perl data structures; the same
is true for every other case, too. So the Dumper will be the one which has to
futz with the anchors, make sure there are no duplicate keys, etc. The
emitter interface will be (remarkably) matched to the parser interface:

   void emit_doc_header(...)
   void emit_keyed(...)
   void emit_series(...)
   void emit_leaf(...)
   void branch_close(...)
   ...

You get the idea. The anchors and other node properties will be fed in
through the various emit_* functions, which will format them according to
YAML. But it's the Dumper's job to get the semantics correct.

So in libyaml's case, I would do it "the former" way, not the latter. That's
because it really implements a stream -- you can't unstream something. How do
other serializers do this: Data::Dumper, cPickle, etc?

Later,
Neil
|
From: Clark C . E. <cc...@cl...> - 2002-04-16 02:49:43
|
On Mon, Apr 15, 2002 at 07:24:54PM -0700, Neil Watkiss wrote:
| I was thinking the "emitter" would be a two-part API, with libyaml
| implementing the "emitter proper" part of it (far right):
|
| -------------      -------------------------------      -----------
| | In-memory | ---> | Application-specific dumper | ---> | Emitter |
| -------------      -------------------------------      -----------
|
| My theory is that only Perl knows how to walk Perl data structures; the same
| is true for every other case, too. So the Dumper will be the one which has to
| futz with the anchors, make sure there are no duplicate keys, etc. The
| emitter interface will be (remarkably) matched to the parser interface:
|
|    void emit_doc_header(...)
|    void emit_keyed(...)
|    void emit_series(...)
|    void emit_leaf(...)
|    void branch_close(...)

Actually the emitter should be "exactly" the parser interface.

;) Clark
|
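To make the dumper/emitter split concrete, here is a minimal sketch in Python. The method names are the ones listed above; everything else is assumed for illustration, since the thread elides the argument lists: the `anchor` and `value` parameters, the `RecordingEmitter` class (which only records calls instead of formatting YAML), and the `dump` walker. Anchors and cycles are deliberately left out.

```python
class RecordingEmitter:
    """Stand-in for the emitter proper: records each call as an event tuple."""

    def __init__(self):
        self.events = []

    def emit_doc_header(self):
        self.events.append(("doc_header",))

    def emit_keyed(self, anchor=None):
        self.events.append(("keyed", anchor))

    def emit_series(self, anchor=None):
        self.events.append(("series", anchor))

    def emit_leaf(self, value, anchor=None):
        self.events.append(("leaf", value, anchor))

    def branch_close(self):
        self.events.append(("close",))


def dump(node, emitter):
    """Application-specific dumper for plain Python dicts, lists and scalars."""
    if isinstance(node, dict):
        emitter.emit_keyed()
        for key, value in node.items():
            emitter.emit_leaf(str(key))
            dump(value, emitter)
        emitter.branch_close()
    elif isinstance(node, list):
        emitter.emit_series()
        for item in node:
            dump(item, emitter)
        emitter.branch_close()
    else:
        emitter.emit_leaf(str(node))


emitter = RecordingEmitter()
emitter.emit_doc_header()
dump({"answers": ["yes", "no"]}, emitter)
```

The recorded event list has the same shape a parser would produce while reading the equivalent document, which is the symmetry being argued for here.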
From: Brian I. <in...@tt...> - 2002-04-16 03:52:53
|
On 15/04/02 19:24 -0700, Neil Watkiss wrote:
> Brian Ingerson [15/04/02 19:06 -0700]:
> So in libyaml's case, I would do it "the former" way, not the latter. That's
> because it really implements a stream -- you can't unstream something. How do
> other serializers do this: Data::Dumper, cPickle, etc?

Data::Dumper can always refer back because it has the "root anchor", $VAR1.
This is not without its edge cases. Data::Dumper does not prewalk the tree,
but Sarathy has said that it probably should.

Time to be philosophical:

"No serializer is perfect". The closest you can come to pure data accuracy is
to dump/restore core memory. This ends up being far from perfect in human
readability.

YAML hits a sweet-spot between human readability and data accuracy. It highly
values both, but is perfect in neither.

Some of this accuracy lies outside the spec, IMO. For instance, YAML.pm does
not preserve aliases between leaves. That's because I think that humans do
not like serializations that look like this:

---
answers:
    - &001 yes
    - *001
    - &002 no
    - *001
    - *002
    - *002
    - *001
...

I prefer:

---
answers:
    - yes
    - yes
    - no
    - yes
    - no
    - no
    - yes
...

You may say I'm breaking the rules, but I'll disagree. I'm merely choosing
not to preserve a probably uninteresting data relationship to improve
usability. I could easily add a Purity option. But I wouldn't make it the
default.

There are plenty of other attributes that YAML won't readily preserve. Like
the fact that a leaf was READONLY. I could dump leaves as (perl internal) SV
structs in a special YAML class, but I don't think the masses would
appreciate it.

Just rambling, Brian
|
From: Clark C . E. <cc...@cl...> - 2002-04-16 04:02:10
|
On Mon, Apr 15, 2002 at 08:52:38PM -0700, Brian Ingerson wrote:
| YAML hits a sweet-spot between human readability and data accuracy.
| It highly values both, but is perfect in neither.

I prefer...

| ---
| answers:
|     - &001 yes
|     - *001 # yes
|     - &002 no
|     - *001 # yes
|     - *002 # no
|     - *002 # no
|     - *001 # yes
| ...

This is one of the items still on my wish list.

Best,

Clark
|
From: Brian I. <in...@tt...> - 2002-04-16 04:17:14
|
On 16/04/02 00:07 -0400, Clark C . Evans wrote:
> On Mon, Apr 15, 2002 at 08:52:38PM -0700, Brian Ingerson wrote:
> | YAML hits a sweet-spot between human readability and data accuracy.
> | It highly values both, but is perfect in neither.
>
> I prefer...
>
> | ---
> | answers:
> |     - &001 yes
> |     - *001 # yes
> |     - &002 no
> |     - *001 # yes
> |     - *002 # no
> |     - *002 # no
> |     - *001 # yes
> | ...
>
> This is one of the items still on my wish list.

Unfortunately, there's no good way to scale it to every case without creating
a big mess. And it's ugly anyway, at least for the general user.

Cheers, Brian
|
From: Steve H. <sh...@ha...> - 2002-04-16 09:01:24
|
----- Original Message -----
From: Brian Ingerson <in...@tt...>
> Some of this accuracy lies outside the spec, IMO. For instance, YAML.pm does
> not preserve aliases between leaves. That's because I think that humans do
> not like serializations that look like this:
>
> ---
> answers:
>     - &001 yes
>     - *001
>     - &002 no
> ...
> I prefer:
> ---
> answers:
>     - yes
>     - yes
>     - no

I use the YAML emitter probably 50 times a day at work to debug data structures.
Human readability is a big deal to me. So far the reference syntax has
generally hindered readability more than helped it.

It would be nice to have an option to suppress aliases for debugging.

Every now and then the aliases are helpful, because they remind me what parts of
my data structure are shallow copies of each other. I could see big uses for
that in Python, where objects get passed as references by default. I've been
bitten by reference-passing in Python.
|
From: Brian I. <in...@tt...> - 2002-04-16 18:19:47
|
On 16/04/02 04:55 -0400, Steve Howell wrote:
> ----- Original Message -----
> From: Brian Ingerson <in...@tt...>
> > Some of this accuracy lies outside the spec, IMO. For instance, YAML.pm does
> > not preserve aliases between leaves. That's because I think that humans do
> > not like serializations that look like this:
> >
> > ---
> > answers:
> >     - &001 yes
> >     - *001
> >     - &002 no
> > ...
> > I prefer:
> > ---
> > answers:
> >     - yes
> >     - yes
> >     - no
>
> I use the YAML emitter probably 50 times a day at work to debug data structures.
> Human readability is a big deal to me. So far the reference syntax has
> generally hindered readability more than helped it.
>
> It would be nice to have an option to suppress aliases for debugging.
>
> Every now and then the aliases are helpful, because they remind me what parts of
> my data structure are shallow copies of each other. I could see big uses for
> that in Python, where objects get passed as references by default. I've been
> bitten by reference-passing in Python.

Well the main reason for aliases in the first place is to deal with recursive
data structures. I crashed a guy's machine with my first version of
Data::Denter. You obviously can't reprint the structure when it's recursive.

All of the data structures that we (me and Steve) deal with are not
recursive. It would be expensive, but I suppose we could add an option to
print nonrecursive duplicates.

Cheers, Brian
|
From: Steve H. <sh...@ha...> - 2002-04-16 09:12:49
|
----- Original Message -----
From: Brian Ingerson <in...@tt...>
> On 16/04/02 00:07 -0400, Clark C . Evans wrote:
> > I prefer...
> >
> > | ---
> > | answers:
> > |     - &001 yes
> > |     - *001 # yes
> > |     - &002 no
> > |     - *001 # yes
> > |     - *002 # no
> > |     - *002 # no
> > |     - *001 # yes
> > | ...
> >
> > This is one of the items still on my wish list.
>
> Unfortunately, there's no good way to scale it to every case without creating
> a big mess. And it's ugly anyway, at least for the general user.
>

I agree with Brian. I'd rather have an option to suppress the alias altogether
for leaves.

When you're aliasing a hash, comments might be more helpful. Suppose this is
your data:

Name: Bert
Address:
    Street: Sesame St.
    City: Hollywood
    State: CA
Name: Ernie
Address:
    Street: Sesame St.
    City: Hollywood
    State: CA

Here is what I'd like:

Name: Bert
Address: &Address01
    Street: Sesame St.
    City: Hollywood
    State: CA
Name: Ernie
Address: *Address01 # line 2 (up 5)

The alias would have the hash key in it, so that it's easier to spot when
you're visually scanning the YAML. It would also have the line number so you
could just jump to the anchor from within the editor.

Actually, if I were truly using YAML just for debugging, this is what I'd want:

Name: Bert
Address: &DEEP_Address_001
    Street: Sesame St.
    City: Hollywood
    State: CA
Name: Ernie
Address: *DEEP_Address_001 # line 2 (up 5)
    Street: Sesame St.
    City: Hollywood
    State: CA

In other words, I'd show what's aliased, but repeat the data anyway, and I'd
make it easy to find the original reference.
|
From: Steve H. <sh...@ha...> - 2002-04-16 23:59:33
|
----- Original Message -----
From: Brian Ingerson <in...@tt...>
> Well the main reason for aliases in the first place is to deal with recursive
> data structures. I crashed a guy's machine with my first version of
> Data::Denter. You obviously can't reprint the structure when it's recursive.
>
> All of the data structures that we (me and Steve) deal with are not
> recursive. It would be expensive, but I suppose we could add an option to
> print nonrecursive duplicates.
>

How about this for a loop detection algorithm?

sub emit {
    my ($ptr, $parent) = @_;
    my $ptr = ...;

    if ($FORCE_REF{$ptr}) {
        emit_as_ref($ptr);
        return;
    }

    $FORCE_REF{$ptr} = 1;
    emit_normal($ptr);
    $FORCE_REF{$ptr} = $ALWAYS_REF{$ptr};
}


sub emit_as_ref {
    my ($ptr) = @_;

    $ALWAYS_REF{$ptr} = 1;
    # ...
}
|
From: Steve H. <sh...@ha...> - 2002-04-17 02:58:38
|
----- Original Message -----
From: Steve Howell <sh...@ha...>
>
> How about this for a loop detection algorithm?
>
> sub emit {
>     my ($ptr, $parent) = @_;
>     my $ptr = ...;
>
>     if ($FORCE_REF{$ptr}) {
>         emit_as_ref($ptr);
>         return;
>     }
>
>     $FORCE_REF{$ptr} = 1;
>     emit_normal($ptr);
>     $FORCE_REF{$ptr} = $ALWAYS_REF{$ptr};
> }
>
>
> sub emit_as_ref {
>     my ($ptr) = @_;
>
>     $ALWAYS_REF{$ptr} = 1;
>     # ...
> }
>

The idea here is that we're only detecting loops, so we only need to set the
FORCE_REF flag while we're still emitting a node.

I could even see doing this loop detection algorithm before we do the emitting
pass, so that we don't have to recalculate anchor offsets every time we detect
an alias.

In general, doing loop detection first would avoid this code:

    for my $id (@{$o->{node_ids}}) {
        if ($found) {
            $o->{id2offset}{$id} += length($anchor) + 2;
        }
        [...]

Doing loop detection first might also allow some other presentations of
recursive data structures, even if they weren't strictly YAML. For example, we
might show all the aliased sections after the top-level document, sort of like
how complex types work in our XML schema documents at work.

On the other hand, we'd be duplicating a lot of walker code.

None of these are pressing issues; just throwing out ideas. I can't emphasize
enough how useful YAML has been for debugging. I'm wondering if it makes sense
to position YAML as a debugging tool and add some features that can help with
cyclical dependency detection, deep/shallow copy detection, etc.
|
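The "loop detection as a separate first pass" idea can be sketched in Python as follows. This is a loose translation of the algorithm above, not the YAML.pm code: `find_loop_targets` is a hypothetical name, `in_progress` plays the role of FORCE_REF (set only while a node is being walked), the returned set plays the role of ALWAYS_REF, and only dicts and lists count as nodes.

```python
def find_loop_targets(root):
    """Return the ids of containers that are referenced from inside themselves."""
    in_progress = set()   # nodes currently being walked (FORCE_REF in the Perl)
    loop_targets = set()  # nodes that must be emitted as references (ALWAYS_REF)

    def visit(node):
        if not isinstance(node, (dict, list)):
            return  # leaves cannot form loops
        if id(node) in in_progress:
            loop_targets.add(id(node))  # we are inside this node already: a loop
            return
        in_progress.add(id(node))
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            visit(child)
        in_progress.discard(id(node))  # done with this node; flag is cleared

    visit(root)
    return loop_targets
```

Because the flag is cleared once a node is finished, a structure that is merely shared (like the Bert and Ernie address) is walked twice and is not reported, so it could be printed in full both times. The cost is that heavily shared data gets walked repeatedly, and Python's recursion limit bounds the nesting depth.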