From: Clark C. E. <cc...@cl...> - 2002-09-16 04:57:36
|
Machine Processing and Human Presentation

YAML information can be viewed in two primary ways -- as data structures for automated processing ("machine") and as a sequence of nicely formatted characters representing this data ("human"). A YAML processor is a tool for converting information between these views. This section examines the components of YAML information through the union and intersection of these perspectives.

                      +---------------------------------------------------+
                      |                MACHINE PROCESSING                 |
                      |                                                   |
+---------------------|..............................+                    |
| HUMAN PRESENTATION  | HUMAN/MACHINE OVERLAP        :                    |
|                     |                              :                    |
|                     | YAML is a graph structure    : Adds data type     |
|                     | composed of sequences,       : information to     |
|                     | mappings and scalar nodes.   : each node.         |
|                     |                              :                    |
| Adds formatting     | Scalars are an ordered       : Scalars default    |
| information to      | set of Unicode characters    : to a string.       |
| each node.          |                              :                    |
|                     | Sequences are an ordered     : Sequences are      |
| Adds key order to   | set of nodes.                : often an array.    |
| mapping nodes for   |                              :                    |
| better reading.     | Mappings are an unordered    : Mappings are       |
|                     | set of key/value pairs       : usually loaded     |
| Comments can be     | where duplicate keys are     : as hashtables or   |
| used for helpful    | not allowed.                 : class objects.     |
| human-oriented      |                              :                    |
| details.            +---------------------------------------------------+
|                                                    |
+----------------------------------------------------+

When moving information between these two domains, elements common to both domains should always be maintained. That is, sequences, mappings, and scalars should survive conversion. However, the machine processing representation of this information may not be able to maintain the presentation components, such as key order, comments, and style. In a similar manner, for readable text, it is frequently desirable to omit data typing information, which is often obvious to the human reader and not needed.
This is especially true if the information is created by hand; expecting humans to bother with data typing detail is optimistic.

There are two general approaches for "round-tripping" information between these two domains: preservation and using schemas.

At a cost in ugliness, data type properties required for processing can be preserved in the YAML text file by using an explicit syntax. In a similar manner, presentation components such as key order and style can be preserved in the processing domain using shadows, decorators, or other approaches. Shadowing puts the presentation aspects of a node in a lookup table keyed by the native object's memory address. Decorators wrap a native object to hold key order, style, or other properties and forward all requests to the underlying object.

The general problem with preserving is that humans may forget to add the appropriate type identifiers, or a machine process may add, modify, delete, or move nodes in such a manner that key order, style, and other preserved information becomes out of date. Certainly this can be solved by making the automated process aware of the human domain node properties, but this may complicate things and unnecessarily tie human presentation concerns to the processing requirements. Therefore, while the above are good solutions, they require each domain to hold information needed only by the other.

An alternative round-tripping solution is to keep ordering, style, and typing information in a shared external schema. In this way, when converting from a human presentation to the machine processing domain, the schema can be used to provide typing information which may be absent from the presentation-oriented text. Likewise, when converting from a native machine representation, the schema can be used to restore human presentation properties.
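The schema approach can be sketched with plain Python types. This is only an illustration under stated assumptions: the schema layout (`key_order`, `types`), the helper names `to_machine` and `to_presentation`, and the sample values are all hypothetical, and no real "yschema" tool or YAML library is implied. The key names follow the invoice example mentioned later in this thread.

```python
# Hypothetical shared schema: key order serves presentation, types serve processing.
INVOICE_SCHEMA = {
    "key_order": ["invoice", "date", "bill-to"],
    "types": {"invoice": int, "date": str, "bill-to": str},
}


def to_machine(presented, schema):
    """Human -> machine: apply schema types to the untyped text scalars."""
    types = schema["types"]
    return {key: types.get(key, str)(text) for key, text in presented.items()}


def to_presentation(native, schema):
    """Machine -> human: emit (key, text) pairs in schema order.

    Keys the schema does not mention are appended in sorted order.
    """
    order = schema["key_order"]
    known = [key for key in order if key in native]
    extra = sorted(key for key in native if key not in order)
    return [(key, str(native[key])) for key in known + extra]


# Text as a person might type it: no type tags, keys in arbitrary order.
# The values are made up for this sketch.
typed_by_hand = {"date": "2002-09-16", "bill-to": "Example Customer", "invoice": "1001"}

native = to_machine(typed_by_hand, INVOICE_SCHEMA)      # "invoice" becomes an int
pretty = to_presentation(native, INVOICE_SCHEMA)        # keys back in schema order

for key, text in pretty:
    print(f"{key}: {text}")
```

Neither the hand-typed text nor the native dict carries the other domain's properties here; both depend on the external schema, which is the trade-off of this approach.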
Since both domains share the same core model of mappings, sequences, and scalars, a schema can leverage this common information to determine proper placement of missing properties. The downside to a schema is that it requires external information, so that neither representation is self-contained.

YAML's Core Model

An essential aspect of YAML is that it has a core model of mappings, sequences, and scalars. This allows for a generic tool set including such items as: a path-based node selection tool ("ypath"), a schema tool used for filling in presentation or processing specific properties or verifying that a YAML text or internal structure has a particular shape ("yschema"), or a transformation language for converting one graph structure to another ("cyatl").

This core model of mappings, sequences, and scalars was chosen for three reasons.

First, they are common data structures found in most modern scripting languages -- or, if they are not built-in, most languages have the ability to simulate them quite easily. For example, in Python mappings are the dict type, sequences are the list type, and scalars are the string type. Likewise, in Perl the mapping is the hash type and the sequence is the array type. The YAML mapping does not have ordered keys and does not allow duplicates, as most of these native data types have the same restrictions. Further, in most languages a class object can be viewed through YAML rose-colored glasses as a typed mapping with each of the class's properties treated as a key. Likewise, lists of objects can be viewed as a sequence of typed mappings.

The second reason is that mappings and sequences are functional, that is, they are an example of a mathematical function. Although this is an academic argument, at the core of mathematics is the concept of a function, and there is not much one can do in mathematics without it. For example, most of algebra and category theory, the basis of modern type theory, rests firmly upon a functional abstraction.
In mathematics, the domain (keys) of a function is not ordered and does not admit duplicate values. Thus, YAML has this same restriction, which happily coincides with the restrictions found in concrete programmatic constructs such as hashtables and arrays.

The third reason is pragmatic. YAML had to pick some set of structures for its primitives. If there are too many primitives, YAML seems complex and it complicates tool building. If there are too few primitives, then YAML lacks expressive power. Mappings and sequences were chosen because they combine very well to make other structures. An alternative choice (one chosen by XML) is a list of pairs. While this may seem simpler because it is one primitive, it is actually quite a complicated option. It causes YAML programs and derivative standards (such as YAML schema) to deal with the complexities of pairings when they just want a sequence, or to deal with ordering and duplicates when they just want a mapping. In short, mappings and sequences are far more common than a list of pairs, and a list of pairs is easily constructed as a sequence of mappings.

YAML's Semantic Restrictions

Since YAML information has simultaneously a machine processing representation and a human presentation image, there is a risk that properties specific to human presentation bleed over into the machine processing domain. While it may seem obvious to some that using comments, key order, style, and other aspects of the presentation domain to drive processing is a bad idea, this may not be clear to all.

As described above, YAML's core model has a mapping which does not use key order and forbids duplicates. Unfortunately, these constraints cannot be enforced by a sequential-access parser, which will not only report key ordering but also allow duplicates. Thus, there is significant risk that a new YAML user may not grok this subtlety and create an application which uses key order not just for presentation, but also to drive application logic.
This situation may be worsened if a node wrapper approach is used to preserve presentation key order.

This may not be a problem if the application is isolated and the author of the program completely controls the access to the created files. However, a core adoption driver of human-readable serialization formats is that they can be read and used in ways which the original author never intended. As such, it is often the case that a YAML data file may be used by programs which the original author never intended; and this indeed is a good thing.

Therefore, it is recommended that a YAML distribution be clear about this restriction and encourage application writers to respect it even though the parser itself won't be able to enforce it. For example, it could call a shadow mechanism "presentation", or a wrapper object could export the key order labeled perhaps "presentation_order". Ideally, a schema-based approach could be used to help maintain key order instead of relying upon preservation techniques.

This restriction on key order is based squarely on intent. A litmus test to ask is: "If an intermediate processor messed up the key order, could my application fix it so that the user is not distressed by the presentation?" For example, consider the following YAML:

    --- # Debate Winners
    Dave Lee Ross: Grover High School
    Ed Roadhouse: Lasher High School
    Mark Smith: Andover High School

If it is interpreted to be a way to find the high school for a particular person, then the application is a proper use of YAML. This intent would pass the litmus test: if an intermediate processor scrambled the key order, the pretty printer could sort the keys alphabetically without loss of information.

However, if the items above are meant to encode the winners of a debate contest, where Dave is 1st place and Ed is 2nd, then there is a problem. This can be seen with a user request to ask who took 3rd place; a "ypath" tool would have no way to answer this question.
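The litmus test can be made concrete with a small sketch, assuming the mapping is loaded as a Python dict. The `scramble` helper is a hypothetical stand-in for an intermediate processor that does not keep key order; it is not part of any YAML tool. The last part of the sketch shows the same data held as a list of records, where position is part of the data itself.

```python
# Mapping form: person -> school, using the names from the example.
winners_by_name = {
    "Dave Lee Ross": "Grover High School",
    "Ed Roadhouse": "Lasher High School",
    "Mark Smith": "Andover High School",
}


def scramble(mapping):
    """Hypothetical intermediate processor that emits keys in another order."""
    return dict(sorted(mapping.items(), reverse=True))


scrambled = scramble(winners_by_name)

# Intent 1: look up a person's school. Key order is irrelevant, so the
# scrambled copy is the same mapping and can be re-sorted for display.
same_mapping = scrambled == winners_by_name          # dict equality ignores order
school = scrambled["Ed Roadhouse"]
pretty_keys = sorted(scrambled)                      # alphabetical for the reader

# Intent 2: read "3rd place" off the key order. This silently changes
# once the keys have been reordered.
third_before = list(winners_by_name)[2]
third_after = list(scrambled)[2]

# Sequence form: the ranking is carried by the data, not by presentation.
winners_ranked = [
    {"name": "Dave Lee Ross", "school": "Grover High School"},
    {"name": "Ed Roadhouse", "school": "Lasher High School"},
    {"name": "Mark Smith", "school": "Andover High School"},
]
third_place = winners_ranked[2]["name"]

print(same_mapping, school)
print(third_before, "vs", third_after)
print(third_place)
```

The lookup survives reordering while the position-based answer does not, which is why a ranking belongs in a sequence.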
In this case, the formulation above would be better represented:

    --- # Debate Winners
    - name: Dave Lee Ross
      school: Grover High School
    - name: Ed Roadhouse
      school: Lasher High School
    - name: Mark Smith
      school: Andover High School

Therefore, it is seen that this is very subtle: if the intent is to have order significant for processing reasons (not just for presentation), then the sequence construct should be used. However, this does not stop keys from being ordered for presentation purposes; say, alphabetically for the human reader or according to a particular fixed key sequence such as our invoice example. |
From: Brian I. <in...@tt...> - 2002-09-16 05:53:57
|
On 16/09/02 04:59 +0000, Clark C. Evans wrote:
>
> Machine Processing and Human Presentation

For those of you who think that email is a poor medium for a 200+ line essay, I have recreated Clark's message at:

    http://wiki.yaml.org/yamlwiki/MachineProcessingAndHumanPresentation

I would ask you to put your comments there. I've been informed that Mr. Evans has purchased a web browser, and will be more than able to see your thoughts.

If you like the formatting, well, thanks. I spent a good 90 seconds on it.

Cheers, Brian |
From: Tom S. <tra...@tr...> - 2002-09-16 13:26:33
|
On Sun, 2002-09-15 at 22:59, Clark C. Evans wrote:
> Since YAML information has simultaneously a machine processing
> representation and a human presentation image there is a risk of
> properties specific to human presentation to bleed over into the
> machine processing domain. While it may seem obvious to some that
> using comments, key order, style, and other aspects of the presentation
> domain to drive processing is a bad idea; this may not be clear to all.

i have come to the conclusion that it is a mistake to consider order as something distinct from type. certain types entail order (e.g. !omap), others do not. separating the two rests on the false premise that order is only useful to human use cases. but order can also be useful to the machine. the notation of using '-' when order is required is a good one in that it preserves compatibility with platforms that do not support ordered types -- an ordered collection will remain intact even though not strictly treated as a single thing, but rather as a collection of collections. all of this holds for duplicates as well.

> As described above, YAML's core model has a mapping which does
> not use key order and forbids duplicates. Unfortunately, these
> constraints cannot be enforced by a sequential-access parser which
> will not only report key ordering but also allow duplicates. Thus,
> there is significant risk that a new YAML user may not grok this
> subtlety and create application which uses key order not just for
> presentation, but also to drive application logic. This situation
> may be worsened if a node wrapper approach is used to preserve
> presentation key order.

well, what can you do? either deny access to the serial level (in effect deprecating the serial model), or make the graph model deal with it (which obviously is not desired). otherwise you're going to be playing a "please don't do that" game, which has a tendency not to work.

>
> --- # Debate Winners
> - name: Dave Lee Ross
>   school: Grover High School
> - name: Ed Roadhouse
>   school: Lasher High School
> - name: Mark Smith
>   school: Andover High School
>
> Therefore, it is seen that this is very subtle: if the intent
> is to have order significant for processing reasons (not just
> for presentation) then the sequence construct should be used.
> However, this does not stop keys from being ordered for presentation
> purposes; say alphabetically for the human reader or according
> to a particular fixed key sequence such as our invoice example.
>

now this last point is very interesting. wouldn't a schema still be able to designate an order?

--
tom sawyer, aka transami
tra...@tr... |
From: Clark C. E. <cc...@cl...> - 2002-09-16 16:09:19
|
On Mon, Sep 16, 2002 at 07:37:44AM -0600, Tom Sawyer wrote:
| > As described above, YAML's core model has a mapping which does
| > not use key order and forbids duplicates. Unfortunately, these
| > constraints cannot be enforced by a sequential-access parser which
| > will not only report key ordering but also allow duplicates. Thus,
| > there is significant risk that a new YAML user may not grok this
| > subtlety and create application which uses key order not just for
| > presentation, but also to drive application logic. This situation
| > may be worsened if a node wrapper approach is used to preserve
| > presentation key order.
|
| well, what can you do?

This situation isn't just limited to key order. For example,

    ---
    - &ONE one
    - *ONE

Someone could wish to round-trip the anchor name used, as it is part of the human presentation -- even though it isn't used by the machine model. Now, if someone did something like...

    ---
    - &ONE one
    - |
      if ONE = 'one' then ...

Here, the intent is clearly to use the anchor more or less as a key. There is no way to _stop_ someone at the parser level from doing this. The answer depends upon which perspective you take: If YAML is *just* a syntax, everything is fair game. If YAML is a mechanism for converting between a human presentation syntax and a machine native representation, the above usage undermines the purpose of YAML and thus, although the _syntax_ may be valid, it breaks the *model* put forth by YAML.

At this time, the anchor is _not_ part of the graph model, and thus any tool which supports this usage would be doing YAML a disservice, much like any tool that allowed for TABS to appear in the syntax (an extension) without a #TAB declaration would be doing YAML a disservice.

| either deny access to the serial level (in effect
| deprecating the serial model).

You can't do this, as pretty printers are a good thing. There are times when you want to deal with YAML as a syntax. And thus, the syntax/serial model.
| Or make the graph model deal with it (which obviously is
| not desired). Otherwise you're going to be playing a
| "please don't do that" game, which has a tendency not to work.

I have faith. If we explain our goals clearly and the rationale behind them, *most* of the people out there will follow. As a result, those who follow the rules will be better off with them because the schema language will be cleaner and easier to understand. And those who don't follow the rules won't be able to use shared utilities. We have to have *some* assumptions as to how YAML will be used. This isn't putting on a straitjacket, it's just setting a clear scope. If your application doesn't fit the scope... we probably won't be much good to you anyway.

| > Therefore, it is seen that this is very subtle: if the intent
| > is to have order significant for processing reasons (not just
| > for presentation) then the sequence construct should be used.
| > However, this does not stop keys from being ordered for presentation
| > purposes; say alphabetically for the human reader or according
| > to a particular fixed key sequence such as our invoice example.
|
| now this last point is very interesting. wouldn't a schema still be able
| to designate an order?

In our invoice example, yes. The schema could designate that the "invoice" key comes first, followed by the "date", and then the "bill-to", etc. In this way, any invoice can be fixed by sending it through a pretty-printer that re-organizes the keys so that they are nice. The pretty printer could also specify scalar styles to use, add extra comments to help the reader understand, etc.

In the debate example, however, you couldn't use the schema to re-order without putting into the schema the particular results of this exact debate. And if you do this, you don't need this document cuz the schema has all the information required.

I hope this helps. In short, I see key order as one of many possible presentation properties which could bleed over.
No way around this. Making key order significant in the core/machine model just complicates things for 95% of the cases in order to make a few presentation cases "nicer". It's just not a very good trade.

Best,

Clark |
From: <ir...@ms...> - 2002-09-16 17:18:42
|
On Mon, Sep 16, 2002 at 04:59:52AM +0000, Clark/Oren wrote:
> --- # Debate Winners
> - Dave Lee Ross : Grover High School
> - Ed Roadhouse : Lasher High School
> - Mark Smith : Andover High School

Kudos to Clark for identifying the distinction between ordering for presentation convenience and ordering to represent an essential aspect of the data (i.e., the data would be destroyed without it). Any time the word "ranking" comes up, it's clearly a sequence and should be represented as such.

The case for a pretty printer that preserves the order for all mappings is different; it's a presentational issue that can be met at the serial level.

One use case: I work in the publishing industry. When proofreading, or when a large number of changes is necessary, it's very common to print out the data and mark it up offline. This helps cut down the number of hours you're staring at a monitor, and also lets you see the big picture at once (not just one screenful). After markup, you enter the changes all at once in some editor. It's important for the order in the editor to be the same as on the printout, otherwise the clerk would spend 3X as much time finding the items to change, and may update the wrong one since it doesn't have the context around it.

Requiring sequences to preserve the order for a pretty printer would essentially make it necessary to code all mappings as sequences.

Of course, you can just print the .yml file itself, but a pretty printer would be able to make font and color enhancements the raw file doesn't have.

--
-Mike (Iron) Orr, ir...@ms... (if mail problems: ms...@oz...)
http://iron.cx/ English * Esperanto * Russkiy * Deutsch * Espan~ol |