From: Oren Ben-K. <or...@ri...> - 2001-12-08 12:13:44
We are back to "YAML is like an onion, not like a cake" issues. Both have layers, but... I realized you and I read a slightly different meaning into the current wording. You expect the generic information model to be realized in concrete programming terms, as a separate programming layer (libyaml, probably). I've thought of it as a set of constraints on the concrete information model implementation. In programming terms, you think of it as a concrete class to wrap; I think of it as an interface to implement.

I understand why you'd want a concrete generic implementation. It would be ever so nice to be able to implement YPATH, YSIGN etc. using just a single, universal implementation which is totally independent of the schema, data types etc. used in the document... Unfortunately this is an impossible goal. Even the lowly integer (10, 012, 0xA) defeats you, never mind the harder cases. I think my view is the more practical one: with a little bit of care it lets us do everything which can be done, without trapping ourselves in impossible constraints. I'll try to formalize this a bit.

Let's use digital signatures as a test case, because it focuses nicely on information model issues. The way I see it, there are only two concrete models. One is the serialized model. Every YAML document is a concrete realization of this model. We both agree on this, and that not all the information contained in this model is "significant". Note that digitally signing a serialized YAML document, including comments, folding etc., isn't a bad idea; in fact it may make perfect sense in certain scenarios. The fact that some of the information is "insignificant" to the application which will act upon the document does not mean this information is "insignificant" from, say, a legal point of view.

At any rate, the second model is the native one. This isn't a great name...
The native model is the result of parsing/loading/reading the file into "working memory" (which may be an on-disk SQL database, but you get the drift). In short, it is what a YAML application is expected to actually work with. One design goal for YAML which I feel we should strongly adhere to is that the concrete realization of this second model uses native data structures rather than generic DOM-like objects.

This, however, poses a problem. On the one hand, we have a strong need to define exactly what information is significant in this model. The trivial way to do that is to define a set of DOM-like interfaces (which is what XML has done). But we will be using a maze of twisty little native data structures, all different; there's no way we can enforce/require a common set of interfaces. The solution is to define a set of *constraints* on any concrete in-memory data structure used to represent YAML data. You could think of this set of constraints as an "interface" which the concrete realization must "implement". Hence I think of it as "the abstract data model", and hence there's no such thing as "an API for the abstract data model", just as there's no such thing as an instance of an interface class.

Under this view, let's revisit our layered processing model:

    Serialized model -> Parser -> Abstract model -> Loader -> Native model

It is obvious that what passes between the parser and the loader can't be something you can work with. The reason to separate the parser from the loader is to separate the task of extracting structured, uninterpreted information from the file (the parser) from the task of converting it into appropriate native structures (the loader).

Now, about transfer types/methods, type families etc. Each node has a serial representation; after parsing/loading it has a native one. What is allowed in the serial representation is known as "the syntax", and we've got that one covered pretty well.
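To make the separation concrete, here is a minimal sketch in Python. All names here (`RawNode`, `Loader` and so on) are hypothetical illustrations, not a proposed API: the point is only that the parser hands over structured but *uninterpreted* data, and the loader - a collection of transfer method implementations - converts it into native objects.

```python
from typing import Any, Callable, Dict

class RawNode:
    """What the parser hands to the loader: structured but
    *uninterpreted* information extracted from the file."""
    def __init__(self, transfer_method: str, raw: str):
        self.transfer_method = transfer_method
        self.raw = raw

class Loader:
    """'The loader' is just a collection of transfer method
    implementations; the API between parser and loader is
    arbitrary and private to the two of them."""
    def __init__(self, methods: Dict[str, Callable[[str], Any]]):
        self.methods = methods

    def load(self, node: RawNode) -> Any:
        # Dispatch the uninterpreted text to the matching
        # transfer method implementation.
        return self.methods[node.transfer_method](node.raw)

loader = Loader({"str": str, "int": lambda raw: int(raw, 0)})
assert loader.load(RawNode("str", "hello")) == "hello"
assert loader.load(RawNode("int", "0xA")) == 10
```

Note that nothing here resembles a DOM: the native objects that come out are plain strings and integers, and the abstract model exists only as the constraints the transfer methods are required to satisfy.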
What is allowed in the native representation *can't* be fully specified, because we'll be using all these different implementations; hence we can only specify *constraints* on it, known as "abstract types". These define things like equality, identity etc. Again, there's no instance of an "abstract type" which isn't also an instance of some concrete "native type".

Once the parser has extracted the data from the serialized representation, someone has to convert it into *some* native type. This is the task of the loader, and the piece of code which does that for a particular node is an *implementation* of a certain "transfer method". A transfer method is *defined* in terms of "input: node syntax (style, format); output: abstract type". An *implementation* of a transfer method takes data in a format specified by a certain parser (an arbitrary API - character buffers or whatever) and converts it into a concrete native object, where the native data type of that object satisfies the requirements given by the abstract type. One can think of the "loader" as "the collection of transfer method implementations".

Again, the API between the parser and "the loader" is completely arbitrary and not something anyone other than a parser or loader implementor needs to know about. One could in theory think of standard APIs between the parser and the loader, to allow the same transfer method implementation to work in different parsers, but that's a whole separate issue and IMVHO extremely premature to even mention, so forget I said it. Certainly no application would *ever* be working at this level.

As mentioned in the spec, when a parser sees an unknown transfer method it has the option to invoke a "catch-all" transfer method known as "encoding".
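As a sketch of the definition/implementation distinction (hypothetical code, not part of any spec): the *definition* of an "int" transfer method is "input: integer syntax in any accepted format; output: the abstract integer type". One possible *implementation* maps raw text to a native Python int, and on output must pick a single one of the accepted formats.

```python
def load_int(raw: str) -> int:
    """One implementation of a hypothetical 'int' transfer method:
    three formats, one abstract value - '10', '012' and '0xA'
    are all the number ten."""
    if raw.lower().startswith("0x"):
        return int(raw, 16)   # hexadecimal
    if raw.startswith("0") and len(raw) > 1:
        return int(raw, 8)    # octal
    return int(raw, 10)       # decimal

def dump_int(value: int) -> str:
    """On output, the method chooses amongst the formats it
    accepts; here we arbitrarily pick plain decimal."""
    return str(value)

assert load_int("10") == load_int("012") == load_int("0xA") == 10
assert dump_int(load_int("0xA")) == "10"   # format is not round-tripped
```

The last line is the round-trip point in miniature: what survives is the abstract value ten, not the particular format it was written in.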
The "encoding" method builds on the native data types which are required by YAML (plus the magical structured keys) to provide a native data object which preserves the significant information (actually, it preserves *too much*, but that's the best one can do). On output this native data object would be unharmed, and the application can even do *very limited* processing on its content.

Now, back to signatures. It is obvious signing *can't* work in the abstract model; an implementation will always use *some* concrete native model. However, we want different implementations of the signature utility, running on different native models, to give the same result. How can we do that? Obviously by restricting the implementations to rely only on properties of the data which are fully specified in the abstract model. It is similar to coding against an interface class rather than a concrete implementation, to ensure that your code works no matter the implementation. In our case, however, we can't define an actual interface class which you can code against (because the different implementations use LISP, Python, SQL, Perl, C++ and who knows what else). In an ideal world we could have used some IDL and asked everyone to code against that. However, this isn't an ideal world, and a Perl programmer would just code using Perl arrays and hashes.

So, back to the signature utility. It has another problem: for an unknown transfer method, it *doesn't know* what properties of the data are "fully specified" by the abstract model and what aren't. For example, if it doesn't know what integers are, then there is no way it can tell that '10', '012' and '0xA' all have the property of being the number ten. This means that the signature utility must play it safe: it takes all the (encoded) data and uses that. So if it doesn't know what integers are, '10', '012' and '0xA' would give different signatures - all of which would be different from the signature a utility which does know about integers would give to the number ten.
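A toy illustration of this "play it safe" rule (hypothetical code; a real YSIGN would be far more involved): a signer that knows the integer transfer method signs the abstract value, while one that doesn't falls back to signing the encoded method/text pair - and so sees three different values where there is really one.

```python
import hashlib

def parse_int(raw: str) -> int:
    # Hypothetical integer transfer method: three formats, one value.
    if raw.lower().startswith("0x"):
        return int(raw, 16)
    if raw.startswith("0") and len(raw) > 1:
        return int(raw, 8)
    return int(raw, 10)

def signature(transfer_method: str, raw: str, known=None) -> str:
    """Sign a scalar node. If the transfer method is known, sign a
    canonical form of the abstract value; otherwise play it safe
    and sign the encoded (method, raw text) pair."""
    known = known or {}
    if transfer_method in known:
        canonical = repr(known[transfer_method](raw))
    else:
        canonical = transfer_method + ":" + raw
    return hashlib.sha256(canonical.encode()).hexdigest()

aware = {"int": parse_int}
# A signer that knows about integers sees one value...
assert (signature("int", "10", aware)
        == signature("int", "012", aware)
        == signature("int", "0xA", aware))
# ...one that does not sees three different ones.
assert len({signature("int", "10"),
            signature("int", "012"),
            signature("int", "0xA")}) == 3
```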
Is that acceptable? For digital signatures, probably not. So, if I were writing a signature utility, I'd reject every unknown transfer method. But in other cases it makes perfect sense. Take the simple task of creating a map object. One of your keys is in an unknown transfer method. You don't know whether it is equal to another key, so you assume it isn't. Hence '10', '012' and '0xA' would be accepted as different keys in an application which doesn't know what integers are. It would know that '10' and '10' are equal, because if you have the same transfer method and the same uninterpreted data, you know you'd get equal native objects - a transfer method must be deterministic. An application which does know about integers would recognize '10', '012' and '0xA' as duplicates. That's perfectly acceptable.

Take YPATH. Again, if a YPATH implementation knows the transfer method, it can match '10' with '0xA'. If it doesn't, it can't. This sucks. Really sucks. But no wording, twisting, re-layering etc. would help you avoid this hard fact. The best you can do is require every YPATH implementation to be aware of a minimal set of transfer methods.

Final point: round trip. It follows from the above that the particular format ('10' vs. '012'), style (block vs. escaped) and *transfer method* ('base64' vs. 'base16') are *NOT* round tripped. What is round tripped is (1) the abstract type of the value, and (2) the value itself (under the equality operation defined by the abstract type). That's all. Encoding is a simple way to allow every application to satisfy this when the transfer method is unknown, by round-tripping both the transfer method and the format.

Summary:

- Three models:
  - Serialization:
    - Concrete (files).
    - Syntax (node style).
  - Native:
    - Concrete (objects).
    - Classes (native data types).
  - Abstract:
    - Constraints on native.
    - Mathematical (schemas, etc.).
- Transfer method:
  - Definition:
    - Input: uninterpreted data (node style, value format).
    - Output: abstract type.
  - Implementation:
    - Input: parser-specific API (uninterpreted data).
    - Output: native object.
- The native data type is a property of the native model.
- The transfer method is a property of the serialization model.
- The same transfer method may accept several formats for the same value.
  - On output it chooses amongst them.
- Several transfer methods may target the same abstract type.
  - On output the dumper chooses amongst them.
- The same abstract type may be realized by several native types.
  - The transfer method implementation chooses amongst them.
- If you don't know the transfer method, you *are restricted*, but you can round-trip.

Well, this was a long post... but I hope it conveys my views on the issue, and that I convinced you it is "the right thing to do".

Have fun,

Oren Ben-Kiki