From: Oren Ben-K. <or...@ri...> - 2002-05-10 20:05:49
|
Clark C . Evans wrote:
> | To me, the point of alias/anchors is to preserve the
> | structure of the object being dumped. They're not
> | just a convenience for saving storage. They're
> | addresses (what self-important languages call
> | references).
>
> Agreed.

Agreed, but...

This is a sticky issue because it deals with the "identity" of scalars. This is terribly abstract when dealing with integers, so let's talk about strings.

In C, a "string" is just a null-terminated char array. Hence it can be modified *in place*. This makes saying that two variables "hold the same string" a meaningful statement; if you change the string "in place" through one of them, accessing the second variable would reflect the changes.

If one goes even further back in history, the story goes that using old FORTRAN compilers, if you passed by reference the value "3" to a function, and this function changed the value of the parameter to "4", then from that point on every "3" in the program became a "4". That is, integers there acted exactly like strings do in C.

This, mercifully, isn't the case in modern languages. Java, Python, and Perl all treat integers and strings in the same way, as immutable *values* which can't be changed in place. In such languages, saying that two variables "hold the same string" is as meaningful (meaningless :-) as saying they hold "the same integer".

There are good reasons to take this approach. There is a big difference between "value based classes/objects" and "identity based classes/objects". Alas, most languages haven't reached the point where this distinction is fully appreciated and supported by the language itself. It is interesting to note that C# does make this distinction. Sort of. In a horribly confusing way which would probably give it a bad reputation for years to come. But anyway...

The question is, what is YAML's position on this. Are YAML nodes identity-based (read: require aliases structure to be preserved), value based (read: allow aliases structure to change), or what?

The answer is, at this point in time, that we haven't given it the thought it deserves. The issue certainly needs to be clarified. I'll offer some thoughts here.

- It is obvious that we want numbers to be value based. I don't think that FORTRAN programs from the 50s are relevant to YAML. Besides, I expect that anyone writing code relying on this behavior had not survived long in the profession :-)

- It makes a lot of sense to have strings be value based as well. I realize this would make life difficult for serializing C/C++ structures that rely on the same string being *modifiable* via different graph paths. I'll go out on a limb here and say that such programs should be considered our modern equivalent to the aforementioned FORTRAN programs. I don't see that YAML should be bent out of shape because of them.

- It makes a lot of sense to have any scalar type be value based as well. Our current examples of other scalar types are: date/time and binary data. I feel that all the arguments pertaining to numbers and strings also hold for date/time and binary data (consider that the internal representation of date/time and binary data is extremely close to numbers and strings; likewise the operations done on them).

  In fact I'd argue that almost *by definition*, being a "scalar" means being "value based". Actually I'd like to avoid an actual argument here; this is one of those issues we could spend a year debating. Let's just say I "strongly feel that this should be the case" in general, and specifically in YAML.

- We want to be able to write long scalars - specifically, strings (and binary data) - only once in a YAML file. The only mechanism we have for that is aliases. I don't think there is a reasonable alternative here.

- We need some mechanism to safely specify that a certain scalar *is* modifiable via two separate paths. Such a mechanism has some possible usefulness even for numbers, has a bit more acceptable uses for strings (I suppose), and may be very useful for scalar types we haven't envisioned at this time. By allowing such a mechanism we can allow the rare cases of "shared modifiable scalar" to be represented.

Regardless of how you feel about the mostly philosophical issues I raised above, it seems to me we have only three practical options.

1. We just "keep it simple":

   - Say that an alias means sharing, pure and simple;
   - And give up on emitters being able to insert aliases for "equal scalars" unless they are also "shared" in the graph model.
   - This is elegant to specify and implement.

2. We use "!ptr" for scalars:

   - An alias only means sharing for a collection node.
   - Emitters can insert aliases for all "equal scalars" if they see fit.
   - Emitters must emit "!ptr" for scalar nodes "shared" in the graph model and alias *them*.
   - This is less elegant (aliases mean one thing for collections and another for scalars).
   - Most (but by no means all) languages make this distinction along the same lines, though this doesn't always map to YAML (e.g., C does, but in C a string is a collection).
   - Note this approach matches our "compatibility" requirement.

   Example:

       - &1 a long string that isn't "shared"
       - *1
       - &2 !ptr =: a long string that is
       - *2

   One may claim that a pointer to an integer/string/whatever isn't the same as an actual integer/string/whatever. *That's exactly the point*. Consider that in C/C++ etc. - where most of the trouble is - this distinction is explicit in the data structure:

       struct {
           some_type not_shared_with_anyone;
           some_type *may_be_shared_with_anyone;
       } xyzzy;

   In C# there's a similar semantic distinction which has no accompanying syntactical distinction (there's a special room in the hot place below for the guy who thought *that* up). So C# users may be somewhat surprised at the proper YAML representation of their data types. Thank you, the Sirius Cybernetic Corporation.

At any rate, back to YAML.

3. We use "!ptr" *always*.

   - *No* alias implies sharing.
   - *All* sharing is done via "!ptr". Even for collections.

I see no way around choosing between the above three options. I must say it is a hard choice.

- Keeping things pure means giving up on the ability to avoid duplicating long strings. Question: How useful is this ability in practice?

- Using "!ptr" for scalars means unifying "collection" with "identity" and "scalar" with "value". Question: How often is this unification broken in existing languages/systems?

- Using "!ptr" for all sharing means requiring an explicit syntax for sharing collections. Question: Is the additional explicit syntax acceptable?

I don't know. Thoughts?

Have fun,

Oren Ben-Kiki |
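To make the difference between options 1 and 2 concrete, here is a minimal Python sketch of the decision an emitter would face. It is only an illustration under stated assumptions: `Scalar` and `alias_targets` are hypothetical names, not part of YAML or of any YAML library, and Python lists stand in for collection nodes.

```python
from dataclasses import dataclass


@dataclass
class Scalar:
    """Hypothetical scalar node; two Scalars with equal values compare equal."""
    value: str


def alias_targets(nodes, scalars_by_value):
    """For each node, return the index of the earlier node it would alias, or None.

    Collections (plain lists here) are always matched by identity.
    Scalars are matched by identity when scalars_by_value is False (option 1)
    and by equality when it is True (option 2).
    """
    targets = []
    for i, node in enumerate(nodes):
        target = None
        for j in range(i):
            earlier = nodes[j]
            both_scalars = isinstance(node, Scalar) and isinstance(earlier, Scalar)
            if both_scalars and scalars_by_value:
                same = node == earlier   # value comparison
            else:
                same = node is earlier   # identity comparison
            if same:
                target = j
                break
        targets.append(target)
    return targets


shared = Scalar("a long string")
equal_copy = Scalar("a long string")
nodes = [shared, equal_copy, shared]

option_1 = alias_targets(nodes, scalars_by_value=False)  # only the truly shared node is aliased
option_2 = alias_targets(nodes, scalars_by_value=True)   # the equal copy is aliased as well
```

Under option 2 the alias on `equal_copy` no longer says anything about sharing, which is why a separate marker such as "!ptr" would be needed for the cases where sharing matters.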
From: Oren Ben-K. <or...@ri...> - 2002-05-11 14:05:36
|
Clark C . Evans wrote:
> There are two levels of YAML; the first is the raw
> model without any type families. At this level, I
> think that aliases should always be preserved.

We've actually defined 4 levels... I think it is better to define it in the following way:

- In all models other than the NATIVE model (that is, in all YAML models), aliases are significant, must be preserved, etc.

- The act of loading a YAML model into a NATIVE model, or dumping a NATIVE model to a YAML model, is free to preserve or not to preserve aliases, according to the semantics of each specific type family.

This harms YNY to some extent (aliases may not be preserved). In general, YNY round-trip can only preserve NATIVE information, and in some cases this is a *subset* of the information in the GENERIC model.

I think we should add a property for type families:

identity
    node-based: equal nodes have different identities unless one is an alias to another. Hence aliases structure always survive YNY.
    value-based: equal nodes can be considered to be identical even if one is not an alias of another. Hence aliases structure may not survive YNY.

All our scalar type families have value-based identity. All our collection type families have node-based identity. That doesn't prevent others from defining their type families differently.

Something along the above lines needs to be worked into the spec. The models section is a real PITA to get right... and I couldn't find the time to even *start* on it this weekend (so far, anyway... maybe tonight). Well, at least now we know *what* we want to say there.

Have fun,

Oren Ben-Kiki |
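A rough sketch of what a per-type-family "identity" property could mean for a loader, written in Python. Everything here is an assumption made for illustration: the `IDENTITY` table, the `Node` class and `to_native` are hypothetical and do not come from the spec or from any YAML implementation.

```python
# Hypothetical table: type family -> identity kind, following the proposal that
# scalar families are value-based and collection families are node-based.
IDENTITY = {"str": "value", "int": "value", "seq": "node"}


class Node:
    """Hypothetical generic-model node: a type family plus its content."""

    def __init__(self, family, content):
        self.family = family
        self.content = content


def to_native(node, memo=None):
    """Load a generic-model node into native Python data.

    For node-based families, a node reached twice (an alias) yields the same
    native object both times. For value-based families the loader makes no
    promise about sharing; it simply returns the value.
    """
    memo = {} if memo is None else memo
    node_based = IDENTITY[node.family] == "node"
    if node_based and id(node) in memo:
        return memo[id(node)]
    if node.family == "seq":
        native = []
        memo[id(node)] = native  # register before recursing so aliases resolve
        native.extend(to_native(child, memo) for child in node.content)
        return native
    return node.content


inner = Node("seq", [Node("int", 1)])
text = Node("str", "one")
root = Node("seq", [inner, inner, text, text])  # second inner/text act as aliases
native = to_native(root)
```

Here `native[0]` and `native[1]` are the same list object, so the alias structure of the collection survives loading, while for the two strings only their equality is guaranteed.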
From: Clark C . E. <cc...@cl...> - 2002-05-11 17:08:12
|
| - In all models other than the NATIVE model (that is, in all YAML models),
|   aliases are significant, must be preserved, etc.
|
| - The act of loading a YAML model into a NATIVE model, or dumping a NATIVE
|   model to a YAML model, is free to preserve or not to preserve aliases,
|   according to the semantics of each specific type family.

Nice. In each model we "lose" information; in the native model, maintaining references is not something which can be guaranteed. However, in the graph/generic model we do require aliases to be round-tripped; and it is the generic model where our transforms operate. Good deal. I think this is a good rationalization.

| identity
|   node-based: equal nodes have different identities unless one is an alias
|     to another. Hence aliases structure always survive YNY.
|   value-based: equal nodes can be considered to be identical even if one
|     is not an alias of another. Hence aliases structure may not survive YNY.

Ok.

| All our scalar type families have value-based identity. All our collection
| type families have node-based identity. That doesn't prevent others from
| defining their type families differently.

This may be the general rule, but there are exceptions: !binary scalars are node-based, and !ptr mappings are value-based. ;)

| Something along the above lines needs to be worked into the spec. The models
| section is a real PITA to get right... and I couldn't find the time to even
| *start* on it this weekend (so far, anyway... maybe tonight). Well, at least
| now we know *what* we want to say there.

Yep. The current model took me approx. 40-60 hours to write. I went through many many many private revisions before the current explanation was arrived at; and it still has problems. ;)

Best,

Clark |
From: Oren Ben-K. <or...@ri...> - 2002-05-17 14:57:49
|
Andrew Kurn wrote:
> On the topic of 'identity' of YAML nodes:
>
> As so often before, much of what you say
> sounds a little weird to me . . .

:-)

> Oren:
> > The question is, what is YAML's position on this. Are YAML nodes
> > identity-based (read: require aliases structure to be preserved),
> > value based (read: allow aliases structure to change), or what?
>
> It seems obvious to me that it's not up to YAML. If identity
> is allowed at all, it must be possible to make anything identical.
> It's the user's discretion.

Identity isn't something to "allow". It just "is" in the sense that native data structures behave in certain ways. I think we agree it isn't up to YAML... we agreed that it is up to the native model to do whatever it needs/pleases, and hence that YAML models (graph/tree/syntax) must always preserve alias structure (otherwise they would be robbing the native model of its ability to do whatever it needs/pleases).

In practical terms this means that the choice of whether alias==sharing is left to the specific dumper/loader implementation for the specific type family. This implementation may do anything it needs/pleases, and YAML guarantees that whatever was dumped, the same thing would be loaded. I think that answers your question:

> I think you have to let the user decide how to treat
> pointers. It's true that you might offer him some
> policies that you think are likely to be useful most
> of the time, but you must allow him to choose anything
> at all in some cases.

Quite. The dumper can choose to use aliases, explicit pointers, whatever it wants. I'm not worried about "automatic dumper/loader". To dump/load a native structure, you need to "know" it.

Have fun,

Oren Ben-Kiki |
From: Brian I. <in...@tt...> - 2002-05-10 22:13:15
|
On 10/05/02 16:07 -0400, Oren Ben-Kiki wrote:
> Clark C . Evans wrote:
> syntactical distinction (there's a special room in the hot place below for
> the guy who thought *that* up).

LOL!

> I don't know. Thoughts?

Well, as the pragmatist of the group, here's my take.

YAML.pm currently does not preserve the relationship of scalars in the native model when dumping it. It does honor aliases between scalars when loading them. This is the most useful, in the general case, IMVHO. (BTW, you *can* preserve the relationship using hard references, which YAML.pm dumps as !ptr.)

However, soon I will offer the nondefault option to dump all scalars with the same pointer value as aliases of the first. I will also have an option to "mark" specific scalars as "pure". Perhaps an option to mark all scalars longer than N, etc...

This seems like a great compromise. From Y->N we must honor the aliases. From N->Y we should be *able* to preserve the relationships. But making it the default is the wrong choice from where I stand.

Cheers,

Brian |
From: Clark C . E. <cc...@cl...> - 2002-05-11 04:14:21
|
| > syntactical distinction (there's a special room in the hot place below for
| > the guy who thought *that* up).
| LOL!

Yea, this one made my day. Thanks Oren. ;)

There are two levels of YAML; the first is the raw model without any type families. At this level, I think that aliases should always be preserved.

At the next level is where type families come into play; and my suggestion was at this level. We define specific behaviors for the "core" YAML type families: namely, aliases to string, integer, float, date, and time scalars are not significant "as far as these type families are concerned". Thus,

    [ &1 "one", *1 ]

is equivalent to

    [ "one", "one" ]

as far as the string type family is concerned, but they are not equivalent in the YAML graph model.

This is not an equivalence within any of our models; I think the models themselves should be pure. This is an equivalence within the type families we define. If a user agrees with these equivalences, then they can use the built-in type families; otherwise they can make their own type families with whatever semantic constraints they require.

Yes; we could attempt to put some rules about this at a lower level, but this creates some arbitrary rules which someone may not want; with this suggestion the rules are only limited to specific type families.

Best,

Clark |
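The two kinds of equivalence described above can be sketched in Python as two different comparison functions. This is only an illustration: `Str`, `identity_pattern`, `family_equal` and `graph_equal` are hypothetical names, and object identity in Python stands in for "is an alias of" in the graph model.

```python
class Str:
    """Hypothetical string-family node in the graph model."""

    def __init__(self, text):
        self.text = text


def identity_pattern(seq):
    """Map each node to the index of its first occurrence by identity.

    [0, 0] means the second entry is an alias of the first;
    [0, 1] means the two entries are distinct nodes.
    """
    first_seen = {}
    return [first_seen.setdefault(id(node), i) for i, node in enumerate(seq)]


def family_equal(a, b):
    """Equivalence as far as the string type family is concerned: values only."""
    return [node.text for node in a] == [node.text for node in b]


def graph_equal(a, b):
    """Equivalence in the graph model: same values and same alias structure."""
    return family_equal(a, b) and identity_pattern(a) == identity_pattern(b)


one = Str("one")
aliased = [one, one]               # stands for [ &1 "one", *1 ]
plain = [Str("one"), Str("one")]   # stands for [ "one", "one" ]

same_for_family = family_equal(aliased, plain)   # True
same_in_graph = graph_equal(aliased, plain)      # False
```

A user-defined type family with stricter semantics would, in this picture, simply use the graph-level comparison for its own nodes.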