From: Oren Ben-K. <or...@ri...> - 2002-03-08 18:53:58
Rolf Veen has asked why use '/' for separating path segments. Good
question. I did it "for historical reasons" - people are just used to it
for "paths". At any rate, the choice of characters is less important than
the basic approach.

Clark C . Evans wrote:
> Hmm. I think we could use [] as the existence test (like XPATH).

I think RegExps already have a syntax for that. One reason I think we
should base YPATH on regexps - just tailoring the matching operators to
graph paths - is that many issues were already hashed to death in regexps.
The existence test is one. I don't have the perl book handy at home; I
suspect that looking at the regexp section there would be very educational.

> I was thinking that /xxx would only match on keys and
> not on values. If you want to match on values, I think
> using the exists operator would be ideal.

There is a difference between matching the value and matching the key that
has that value...

> # /two matches "world"
> ---
> one: hello
> two: world
> ...

If so, what matches "two"? I agree there's an ambiguity... perhaps we
should write something like

    /=12 matches
    --- 12

to disambiguate leaf matching from key matching. Hmmm.

> | /transfer(int)
>
> Close. This would be /[transfer("int")] since transfers
> belong in the predicate.

The way I see it, the whole thing is a predicate... I don't see why I need
to make the distinction. In "/key", "key" is a predicate which happens to
match any !str-typed node with the value "key". See below...

> | Question #2: Does
> |
> | /transfer(ip)
> | match
> | --- 1.2.3.4
>
> I think so. If the engine/parser doesn't have the "ip"
> method registered, then transfer("ip") will be an error.

That's not that nice... I may still want to be able to match on nodes with
an explicit "!ip" transfer, even if the engine doesn't recognize it.
Perhaps we need two "transfer()" operators, one for explicit and one for
implicit? This is getting sticky, fast...

> No wildcard transfer methods.

I suppose it makes some sense, except for one possibly very useful use
case - suppose I want to match any HTML element, it would be very nice to
be able to write:

    /transfer("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd^")

Matching any transfer by prefix. Again, that's pretty straightforward for
explicit transfers, but is hell for implicit ones. The swamp is getting
deeper...

> | Question #4: Given we have series/sequences, should we add regexp
> | operators >int, >=int, <int, <=int, !=int?
>
> Yes, but in the predicate []

Predicates again... See below.

> | Question #5: Is this set ordered? Think of it... it is trickier
> | than it seems at a first glance.
>
> I was thinking the result of a YPath expression is an
> unordered stack of nodes, that I'll call a context.
>
> As for ordering... I think we can add some
> sort of "ORDER BY". Having the nodes ordered implicitly
> causes a lot of work by the xpath processor that is often
> unnecessary in XML (and probably impossible in our case).

This seems like an area we'll be spending quite a bit of time hammering out
the details of. Shudder.

> | Question #6: How about prefixes? Do we allow this:
> |
> | /transfer("http://company.tld/whatever^type")/transfer("^another-type")
>
> This could work.

I hope so. Imagine the pain matching HTML elements otherwise. In fact, even
*with* it.

> | So, sticking with absolute paths, it seems as though parent() is well
> | defined.

IMPORTANT: In this context parent() is well defined for general graphs!!!

> XPATH has the notion of "axis", thus each path segment
> is (axis '::')? node-expr -- I'm not sure if this fits us.

Neither am I... the syntax is interesting however. Food for thought.

> Is it possible to use ".." for parent? and "." for
> current node? I know both of these are regex... however.

Right. Also, they are hardly enough.

> | In a sequence, we need to specify next()/prev() entry in a series:
> | ...
> | Likewise we can define 'before()' and 'after()'.
>
> Hmm. Unlike XML our tree is not ordered (although it
> can be ordered by the keys). This requires some thought.

Sequences *are* ordered even in the graph model!

> | Question #7: Using any of these function() operators (except for
> | transfer()) means that the path is no longer "simple". This has
> | implications in terms of how easily it can be implemented, whether
> | it can be used in a streaming application, etc. I think we should
> | formally define random-access vs. streaming paths.
>
> Right.

Yes. Sigh. Another complication.

> | Question #8: There are many functions which can be cast in the form
> | <direction>(<distance>)...
>
> Hmm.

Couldn't phrase it better myself :-)

> | Question #9: Do we do it the Perl way ("there's more than one way")
> | or the Python way ("there's the right way")...
>
> I think ".." is fine, "..." and "...." tip the scale.

I guess we'll have to see how it all fits together before we get into this
sort of detail.

> | Question #10: obviously the above provides a natural syntax for them.

[Relative paths - not starting with '/']

> | However, in a relative path, going up above the starting point isn't
> | well-defined in the graph model.

This is the only case where working in the graph model gives you trouble,
and only in the limited case where you go up "above" your starting point.

> | So... how do we handle this? Define two
> | classes of relative paths, one which are safe in a graph model and
> | ones that aren't?
>
> The result of a YPath evaluation is an unordered set of contexts
> (a context is a stack of nodes, aka path).
>
> So, to evaluate a "relative" path, one must pass in a context,
> not just a node. Note that any node is a context with itself
> as the "top" node.

Hmmm. So this way you could, "logically" anyway, evaluate a YPATH piece by
piece - each step giving the "context" of matching nodes to the next step.
I see potential difficulties with this, mainly that it would raise the
expectation that this "context" would be a manipulatable piece of any
implementation (e.g., being able to give a context as an argument in the
YPATH API). Did this happen with XPATH? Is it truly necessary?

> | Question #11: How do we handle !include?
>
> Good question. #include is non-trivial (see XML Include)
>
> | Hmmm. YPATH isn't that simple after all, it seems.
> Neither is !include :-)
>
> Nope. YPATH is a 6 month project. And it's best to define it
> as we implement it so that we can always play with it...

Seems that way. We do need to agree on the basics. We seem to have two
alternative approaches as to the "infrastructure" used.

You focus on the incremental nature of YPATH - working segment by segment,
using context as the glue between consecutive steps, and employing patterns
as a slave mechanism applied separately at each step.

I wanted to tackle the whole monster in one fell swoop by saying "YPATH is
a kind of regexp". In this view '/' is just a regexp operator, and
*everything* is a pattern. To formalize a bit: each <graph, starting node>
pair defines an (enumerable) set of paths (walks through the graph)
starting at that point. The YPATH regexp pattern is a boolean function
which says "match" or "doesn't match" to each of these paths. Each such
path ends at some point ("existence" is trivially handled by going
down-then-up, etc.). Thus the function defines a set of nodes - those that
are at the end of paths that "match". This is the result of applying the
YPATH to the graph (and starting point).

Now, regexps are a thoroughly investigated field... both in theory and
practice (e.g., there's a well-known syntax for specifying regexps we can
build on instead of inventing new ways to do existence tests, ORs, ANDs,
partial matches; there are excellent libraries implementing various types
of regular expressions which we may be able to build on; and so on).

I'm well aware that XPATH does it your way :-) and I'll admit to being
somewhat naive about the implications of using regexps vs. "context
refinements" as the basic mechanism. I suspect the two are equivalent at
some level - though the syntax would probably end up being different...

I think the issue deserves some discussion, before we delve into the
nitty-gritty of whether we say /.../ or /down(>0)/ or whatever.

Thoughts?

Have fun,

Oren Ben-Kiki
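
The "YPATH is a kind of regexp" formalization in the message above can be tried out in a few lines. What follows is only a rough sketch under stated assumptions, not anything proposed in the thread: the document is an acyclic nest of Python dicts and lists (a general graph with cycles would have infinitely many walks), each walk is rendered as a '/'-joined string of keys, and Python's `re` module stands in for the regexp engine. The names `walks` and `ypath_as_regexp` are made up for illustration.

```python
import re

def walks(node, path=""):
    """Yield (path_string, end_node) for every downward walk from the start node."""
    yield path, node
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walks(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walks(child, f"{path}/{index}")

def ypath_as_regexp(pattern, start):
    """Return the end nodes of all walks whose whole path string matches `pattern`."""
    compiled = re.compile(pattern)
    return [end for path, end in walks(start) if compiled.fullmatch(path)]

# The two-key document quoted in the message.
doc = {"one": "hello", "two": "world"}
print(ypath_as_regexp("/two", doc))   # the node at the end of the walk "/two"
print(ypath_as_regexp("/.*", doc))    # every node below the start node
```

In this reading the pattern is a boolean function over walks and the result is the set of end nodes, as the message describes. Keys that themselves contain '/' would need escaping, which the sketch ignores.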
From: Oren Ben-K. <or...@ri...> - 2002-03-09 11:12:28
Clark C . Evans wrote:
> Also, from a regex perspective, the slash '/' is traditionally
> used as a top level separator, s/this/that/g

That's not a part of regexp... just the way to wrap it.

> | I suspect that
> | looking at the regexp section there would be very educational.
>
> As much as I'd love to do that, I think that it would
> be a big research project.

OK. It boils down to: "do we extend regexps or wrap them in an additional
level". Wrapping them (the XPATH way) is probably easier. Extending them is
more elegant but much more work. XPATH has shown the "wrapping" approach
works, so I guess we could go with that... though I still think the syntax
for "existence" should not collide with the regexp syntax :-)

> | There is a difference between matching the value and matching the key
> | that has that value...
> | > # /two matches "world"
> | > ---
> | > one: hello
> | > two: world
> | > ...
> |
> | If so, what matches "two"? I agree there's an ambiguity...
> | perhaps we should write something like
>
> I'm not interested in matching "two".

Well, I am. For example:

    people:
        Oren Ben-Kiki: ...
        Clark Evans: ...

I want to match the *key* ending with 'Ben-Kiki' (all my family members'
names). I'd have expected "/people/.* Ben-Kiki" to do that... I don't see
how we can simply say "this can't be done". I think this is just another
TBD... there are quite a few of these anyway.

> | I think the issue deserves some discussion, before we delve into the
> | nitty-gritty of whether we say /.../ or /down(>0)/ or whatever.
>
> Yep. Unfortunately, I don't have a ton of time. Could we just
> focus on the parser for a while? YPATH is going to be hard
> to work out without an active implementation...

I only did this posting as a probe to see how complex things are. They
turned out to be rather complex... so let's delay working out the details.
I agree: Parser implementations first.

Have fun,

Oren Ben-Kiki
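
For readers who want to see the "wrapping" approach applied to the key-matching example above, here is a minimal sketch. It assumes a path is split on '/' and each segment is an ordinary regexp matched against the keys of one map level. `match_segments` and `select` are hypothetical names, and the "..." values are the placeholders from the message, not real data.

```python
import re

def match_segments(node, segments):
    """Yield (key_path, value) pairs; each segment is a regexp applied to one level of keys."""
    if not segments:
        yield (), node
        return
    if not isinstance(node, dict):
        return  # only maps have keys to match against in this sketch
    head, rest = segments[0], segments[1:]
    for key, child in node.items():
        if re.fullmatch(head, str(key)):
            for keys, value in match_segments(child, rest):
                yield (key,) + keys, value

def select(doc, path):
    """Split an absolute path on '/' and match it segment by segment."""
    return list(match_segments(doc, path.strip("/").split("/")))

doc = {"people": {"Oren Ben-Kiki": "...", "Clark Evans": "..."}}
for keys, value in select(doc, "/people/.* Ben-Kiki"):
    print(keys, value)
```

Because the matched key path is returned alongside the value, the caller can get at the key as well as the leaf. A segment regexp that itself contains '/' would break the naive split, which is one small instance of the syntax collisions being discussed.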
From: Clark C . E. <cc...@cl...> - 2002-03-09 15:23:15
On Sat, Mar 09, 2002 at 06:12:22AM -0500, Oren Ben-Kiki wrote:
| OK. It boils down to: "do we extend regexps or wrap them in an additional
| level". Wrapping them (the XPATH way) is probably easier. Extending them is
| more elegant but much more work. XPATH has shown the "wrapping" approach
| works, so I guess we could go with that... though I still think the syntax
| for "existence" should not collide with the regexp syntax :-)

XPath uses [] to wrap predicate filters. Basically, for each path segment
you have the following components:

  - Optional parenthesis and union operator (one|the-other)
  - an Axis such as child,parent,descendant,attribute,namespace,etc.
  - a wildcard (*), namespace, or qualified name (namespace+local_part)
  - one or more predicates, which are essentially a composite "filter".
    The predicate is an expression which is cast as a boolean:

      - A subordinate path expression is TRUE if it contains at least
        one node, two node sets are "equal" if they have a non-empty
        intersection (a nasty hack).
      - A mathematical operation is true if it matches the index of
        the current node (*shudder*), this allows [3] to double as
        an index operation. IMHO, this "neat" trick is a bad idea.
      - String comparison is as expected, when a node is compared
        with a string, the node's string value is used.
      - many other items...
      - Since you can have "sub-paths" within a predicate, you have
        to have a delimiter to let you know that you are at the end
        of the expression, hence [a][b][c] is used instead of
        [a and b and c].

| people:
|     Oren Ben-Kiki: ...
|     Clark Evans: ...
|
| I want to match the *key* ending with 'Ben-Kiki' (all my family members'
| names). I'd have expected "/people/.* Ben-Kiki" to do that... I don't see
| how we can simply say "this can't be done".

I can buy that. So, we need to allow regular expressions on key matches as
well as value matches. Ok.

| I only did this posting as a probe to see how complex things are. They
| turned out to be rather complex... so let's delay working out the details.
| I agree: Parser implementations first.

XPath is very complicated, but also very powerful. You can do all kinds of
stuff with it and it's probably important that YAML has an equivalent
mechanism that is just as powerful. That said, I think we have the
following conflicts if we wish to extend regular expressions with a tree
traversal operator.

Symbol   Regex meaning                XPath meaning
-------------------------------------------------------------------------
/        <none>                       Separates path segments
[]       Character class              Predicate (filters the path segment)
?+       quantifiers                  <none>
(|)      grouping and alternation     grouping and alternation
$^       matches first and last       <none>
{}       {min,max} in some engines    <none>
.        matches any character        represents current node
..       matches any two char         represents parent node
:        <none>                       Used to separate namespace from name
*        universal quantifier         matches any node name

So, the biggest conflict seems to be the use of [] and the period (. ..)
If you think about it... this is very hopeful.

Now, from my experience you really need a "predicate" mechanism or the
power of the tree expression is very weak; the predicate is used to compute
stuff that decides if a node is included or not, but this computed stuff is
not included in the result. Unfortunately, since we can't use indentation,
we need a begin/end marker. Clearly [] is out since it is used for
character classes and () is out since it is used for grouping. On the
standard keyboard, this leaves <> and {}. <> is bad since it is used for
XML and requires escaping in XML/HTML, furthermore it is used for
mathematical operations. This leaves {}.

Conclusion: We can drop the {min,max} operator from the regular expression
syntax and use this for predicates. Note that {min,max} could be replaced
with { current-node() > min && current-node() < max }. Note that stuff in
the parenthesis can be non-regular expression, function like syntax
providing for a nice dual-syntax mechanism.

Ok.

The other problem is the "." period. Since the period is so critical
within regular expressions it seems that we wouldn't want to overload this.
Therefore, we need some way to handle ".." and "." which are just
abbreviations for the "parent" and "self" axes. Thus, we really just need a
way to do axes. I was thinking that we could use "Lotus 123" style function
calls for this and perhaps for other escape operations within the regex,
for example, @parent() or self. Then we could have @@ mean "parent" and @
without the parenthesis mean "self". Thus, one could write

    @/@@/sibling

to navigate up one and then down into a sibling.

Ok. What do you think?

Summary:

  1. Use regular expressions everywhere, except the brackets {},
  2. Use brackets {} for predicates
  3. Use @func() for function calls, and @ by itself for "self" and
     @@ for parent.

Just about everything should fit in just dandy...

Best,

Clark

--
Clark C. Evans                   Axista, Inc.
http://www.axista.com            800.926.5525
XCOLLA Collaborative Project Management Software
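
As a way of checking that the "@" / "@@" abbreviations hang together, here is a small sketch of evaluating a path such as @/@@/sibling. It is an illustration only: the context is assumed to be a tuple of nodes from the root down to the current node, the plain steps are exact key lookups rather than regular expressions, and the function name `navigate` and the sample document are invented here.

```python
def navigate(context, path):
    """Apply '@' (self), '@@' (parent) and plain key steps to a context.

    A context is a tuple of nodes from the root down to the current node.
    Returns the new context, or None if a step cannot be taken.
    """
    for step in path.split("/"):
        if step == "@":
            continue                      # self: stay on the current node
        if step == "@@":
            if len(context) < 2:
                return None               # nothing above the top of the context
            context = context[:-1]        # parent: pop the current node
        else:
            current = context[-1]
            if not isinstance(current, dict) or step not in current:
                return None
            context = context + (current[step],)
    return context

# Hypothetical document: two entries under the same parent map.
doc = {"child": {"leaf": 1}, "sibling": {"leaf": 2}}
start = (doc, doc["child"])               # we are standing on doc["child"]
result = navigate(start, "@/@@/sibling")
print(result[-1])                         # the node reached by the path
```

The `None` returned for "@@" at the top of the context is the "going up above the starting point" case raised earlier in the thread.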
From: Oren Ben-K. <or...@ri...> - 2002-03-12 11:32:21
Rolf Veen [mailto:rol...@he...] wrote:
> I just happen to have a structure that represents a sequence
> of 2-tuples <k,v> where the domain of k is not a set, it is a bag.
> Let's call this structure listmap, for example.

Bags are valid data types. I've worked with languages providing bags
myself.

> You are saying that only maps and lists are commonly found. But,
> maps and lists are *not* primitive types in many languages. On the
> other hand, a bag is just as generic as a map or a list. Nothing
> draws a (border) line between maps/lists and bags.

Here I disagree. The list/array/sequence is a basic data type in almost
every programming language invented by man. The hash table/map is a close
second; if it isn't supported by the language itself (e.g., in C), it is
almost always provided by the standard library, or by a "very popular" one.
Bags, while a valid data type, are not even provided by a wide range of
languages, except for perhaps by esoteric libraries. This is especially
true for scripting languages. Examples abound: JavaScript. Perl. Python.
Java. C++. Also, in my experience, this data type is nice for being
complete from a CS point of view, but is very rarely used in practice.

Now, YAML *can* encode bags or anything else:

    a set: !!set
        - member
        - another member
    a bag: !!bag
        - member
        - duplicate
        - duplicate
    a listmap: !!listmap
        - duplicate key: value
        - duplicate key: value

The syntax isn't offensive to the eye, isn't overly verbose, and does what
you want: It will load the data directly into your favorite set/bag/listmap
native data structure. Sure, a YTL engine (for example) would unnecessarily
preserve the order of the entries. So what? A YTL engine would also
unnecessarily preserve the case of the letters in:

    address:
        country code: !!country IL

So what? Generic YAML implementations will *always* preserve "unnecessary"
information, unless we want them to be fully aware of each and every data
type in existence.

> > YAML does _not_ have a document model.
>
> When you load a YAML file into memory, you will have a possibly
> nested structure of maps and lists. An in-memory image.
>
> And what is a DOM ? A nested structure of objects.

Every DOM is a nested structure of objects. Not every nested structure of
objects is a DOM. A DOM for the XYZZY format is a nested structure of
objects of classes specific to the XYZZY format, which implement the
specific data model defined by that format.

YAML doesn't have a DOM. It uses the native data structures of the host
language. If the host language's data model is incompatible with YAML,
you'll require a DOM. Luckily, such languages are rare, except for the
important category of fully static languages such as C/C++. Even in such
languages YAML can avoid using a DOM if you load a schema-specific file
into application-specific native objects.

XML, in comparison, has a data model which is incompatible with the native
data structures of almost every language I know of. Hence it *always*
requires an XML-specific DOM.

This is a *big* difference in practical terms.

> I'm claiming that a listmap is as basic as a map or a list. My
> underlying information model includes it, so I do want a file
> format that supports it.

As shown above, YAML does support it. With no overhead I might add - in
your application it would load directly to your listmap class (or
whatever), *without going through intermediate lists/maps*. That's the
whole point of "having no DOM".

> You are free to draw a frontier
> between maps/lists and bags, but you should admit that it
> is a design choice, an artificial frontier. I'm incapable of
> seeing any strong foundation that supports it.

I've no problem with the above statement. The choice is a design choice,
and is artificial. All engineering choices are like that. It is exactly on
par with making it easy to write base 8, 10 and 16 integers, and harder to
write base 12 ones.

> I'm trying to fit my needs into an existing standardization
> effort such as YAML, and adapt to it as much as possible, but
> this one is a real issue for me. And if I thought that it was
> only a personal need, then I wouldn't have mentioned it.

I hope that the syntax I proposed above solves your problem.

Have fun,

Oren Ben-Kiki
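
To make the set/bag/listmap encoding above concrete, here is a hedged Python sketch of where such data could end up natively. It makes two assumptions that go beyond the message: the three lists below stand for what a generic, tag-unaware loader would hand back, and `collections.Counter` plus a list of pairs are taken as reasonable native stand-ins for a bag and a listmap. A tag-aware loader of the kind described would build these objects directly rather than converting afterwards; `to_listmap` is a made-up helper.

```python
from collections import Counter

# Generic in-memory form of the three examples (original tags noted alongside).
as_set = ["member", "another member"]                                   # !!set
as_bag = ["member", "duplicate", "duplicate"]                           # !!bag
as_listmap = [{"duplicate key": "value"}, {"duplicate key": "value"}]   # !!listmap

def to_listmap(entries):
    """Flatten a list of single-pair maps into an ordered list of (key, value) pairs."""
    return [(key, value) for entry in entries for key, value in entry.items()]

native_set = set(as_set)               # order and duplicates dropped
native_bag = Counter(as_bag)           # order dropped, multiplicity kept
native_listmap = to_listmap(as_listmap)  # order and duplicate keys both kept

print(native_bag["duplicate"])
print(native_listmap)
```

The conversion step is exactly the "unnecessarily preserved" information being thrown away: the generic form keeps the order, and the application type decides whether it matters.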
From: Rolf V. <rol...@he...> - 2002-03-12 13:24:39
Oren Ben-Kiki wrote:
> Bags, while a valid data type, are not even provided by a wide range
> of languages, except for perhaps by esoteric libraries. This is
> especially true for scripting languages. Examples abound:
> JavaScript. Perl. Python. Java. C++. Also, in my experience, this
> data type is nice for being complete from a CS point of view, but
> is very rarely used in practice.

True. Still, a bag is on the same complexity level as a map or list.
The same for what I called a listmap.

    A map:     <s,v>+
    A list:    <'-',v>+
    A listmap: <b,v>+

where s is a set element, b a bag element and v a value.

> Now, YAML *can* encode bags or anything else:

I know this.

> XML, in comparison, has a data model which is incompatible with the
> native data structures of almost every language I know of. Hence it
> *always* requires an XML-specific DOM.

What I described as a listmap can hold XML. That is the nice thing.
And it isn't a DOM.

> That's the whole point of "having no DOM".

This discussion is about what are considered native data types. My
position is that bags are as native as lists: 'rare' and 'native'
are from different domains.

> I hope that the syntax I proposed above solves your problem.

Yes, but not in the cleanest way. My intended use of YAML is not
data serialization but data interchange and configuration, and
one important aspect is user interaction. A user may not understand
why he needs to write:

    invoice:
        - product: a
        - product: b

instead of:

    invoice:
        product: a
        product: b

If you explain to him that it is because we only use maps and lists
to hold the data, he may laugh.

So I propose to reserve a directive such as #BAG:true, while
defaulting to the current behavior.

Interesting discussion, in any case :-).

Regards.

Rolf.
From: Clark C . E. <cc...@cl...> - 2002-03-12 13:45:09
On Tue, Mar 12, 2002 at 02:20:13PM +0100, Rolf Veen wrote:
| True. Still, a bag is on the same complexity level as a map or list.
| The same for what I called a listmap.

Lists and maps are special. They are both types of "functions",
where the domain of a list is limited to whole numbers. Functions are
_the_ cornerstone of mathematics, after integers, they are the first
thing one studies in 7th grade algebra class and almost the entire
first semester of college-level real analysis class. Bags and
listmaps are nice, and they certainly have their representations in
mathematics; however, they are not the primitive constructs of
formal mathematics, functions (maps/lists) are.

| > XML, in comparison, has a data model which is incompatible with the
| > native data structures of almost every language I know of. Hence it
| > *always* requires an XML-specific DOM.
|
| What I described as a listmap can hold XML. That is the nice thing.
| And it isn't a DOM.

ML, Python, Perl, VisualBasic, Javascript, Java, and C# do not have
a native "out-of-the-box" listmap structure; save XML's DOM.

| This discussion is about what are considered native data types. My
| position is that bags are as native as lists: 'rare' and 'native'
| are from different domains.

ML, Python, Perl, Visual Basic, Javascript do not have bags but they
do have maps and lists. In those languages bags are often simulated
using a list where you disregard the order.

| Yes, but not in the cleanest way. My intended use of YAML is not
| data serialization but data interchange and configuration, and
| one important aspect is user interaction. A user may not understand
| why he needs to write:
|
|     invoice:
|         - product: a
|         - product: b
|
| instead of:
|
|     invoice:
|         product: a
|         product: b

This isn't fair. 99% of users will never construct YAML on their
own and will just be modifying existing YAML. In which case, if
they don't see any duplicate keys, they probably won't make any;
if they do, they get slapped by the parser!

YAML is not a user interface, it is a clean serialization language.
Clean in two ways: first, it has a nice syntax, but just as important,
it uses the language's native data structures. A listmap is just not
native in most modern programming languages.

| If you explain to him that it is because we only use maps and lists
| to hold the data, he may laugh.

If you tell her that she must indent children, or use a colon
instead of a dash she may laugh as well. Every language has its
constraints.

| So I propose to reserve a directive such as #BAG:true, while
| defaulting to the current behavior.

*icky* If it is a bag, they can use a list structure and just
ignore the order provided. No directive needed.

;) Clark

--
Clark C. Evans                   Axista, Inc.
http://www.axista.com            800.926.5525
XCOLLA Collaborative Project Management Software
From: Rolf V. <rol...@he...> - 2002-03-12 15:18:31
Clark C . Evans wrote:
> Lists and maps are special. They are both types of "functions",
> where the domain of a list is limited to whole numbers. Functions are
> _the_ cornerstone of mathematics, after integers, they are the first
> thing one studies in 7th grade algebra class and almost the entire
> first semester of college-level real analysis class. Bags and
> listmaps are nice, and they certainly have their representations in
> mathematics; however, they are not the primitive constructs of
> formal mathematics, functions (maps/lists) are.

Functions are a special class of binary relations. If you define
a format for binary relations in general, you will never have the
chance for such a discussion in the future. And that at the cost of
nothing: you already have the syntax.

I do insist, for the last time, to open the specification to generic
binary relations. It can only bring more users to YAML, never less.

:-)

Rolf.
From: Clark C . E. <cc...@cl...> - 2002-03-12 16:32:25
On Tue, Mar 12, 2002 at 04:14:15PM +0100, Rolf Veen wrote:
| Functions are a special class of binary relations. If you define
| a format for binary relations in general, you will never have the
| chance for such a discussion in the future. And that at the cost of
| nothing: you already have the syntax.

I understand your perspective, but just because we can do it does not
mean we should. A few factors weigh in heavily:

  1) Most programming languages have built-in list and map
     data structures. Some languages (C,C++,ML) may only have "static"
     maps (aka structs or records), but with a fixed schema this is
     good enough, and often those languages have a well known hash
     construct. If a language doesn't support maps or lists, then
     support is usually easy to find.

  2) While your statement is true, the bulk of mathematics (90% or
     more of real analysis theorems) requires functions. The same
     with most computer algorithms, they require functions (and
     usually only functions) to operate. Thus, by sticking with
     functions we don't have to deal with cases where the data is
     non-functional, and thus a great many results are immediately
     available. For instance, one could generate a composition
     function over yaml texts that is transitive. Relations in
     general do not have these properties. Thus, by restricting our
     information to functions we make it much easier to define
     operations on YAML information in a rigorous manner.

  3) As a practical example of #2, since YAML containers are
     functional, any YPATH such as /x/3/y/z will return at most one
     node. Thus, all of the logic to differentiate between "sets" of
     nodes and nodes is not needed for cases like this. In
     practice... this is huge. Play with the DOM some and you will
     see all kinds of quirks and hacks to get data which "could be a
     list but isn't" to work in a functional way.

  4) Functions (maps,lists) are the 80/20 rule. At least 80% of the
     time it is exactly what you want, and in the cases that it's not
     (far less than 20%), you can simulate what you need very easily.
     And unlike relations, maps/lists are very powerful abstractions
     due to the rich mathematical properties that emerge from the
     functional constraints.

  5) If we use a listmap structure, then this will require a DOM
     (third-party data object) in most languages, like XML. IMHO,
     this is a show-stopper. Alternatively, one could use an
     alternating map/list structure. However, such a structure is
     not very usable by your average programmer; thus a custom
     construct would be needed.

  6) We debated long and hard as to how we can make the syntax less
     painful for those people who want a namedlist structure, and
     the "- key : value" syntax construct is the result. This was
     the work of a lot of compromises and concessions on many sides
     of the aisle. Thus, for listmaps.

  7) Modeling relations is one step closer to the "namedlist", which
     is a list of ordered pairs (key,val). This is the fundamental
     XML construct and is not supported by any language natively.
     It's also a pain to work with programmatically. This structure
     works great for mixed-content.

  8) If you want mixed-content, use XML. XML excels at mixed-content.
     We are trying to provide a simple serialization for native data
     constructs; not a document (aka mixed content) markup language.

| I do insist, for the last time, to open the specification to generic
| binary relations. It can only bring more users to YAML, never less.

I totally disagree. Doing this will complicate everything and provide
very little benefit. Just because the syntax requires no modification
to support this doesn't mean that the API and dependent libraries are
not significantly impacted. And IMHO, this impact just isn't worth it.

That said, I hope for your continued support in YAML; this just
happens to be one of the cornerstones that I think we need to
stick by.
We are not making this decision blindly; it is the result
of much (at least two years of) discussion on SML-DEV.

Best,

Clark
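
Point 3 in the message above (a simple path over functional containers yields at most one node) is easy to see in code. The following is a sketch under assumptions, not part of any YPATH design: documents are nested Python dicts and lists, map keys are strings, list steps are written as decimal indexes, and `resolve` and `lookup_relation` are hypothetical names.

```python
def resolve(doc, path):
    """Follow a simple path such as /x/3/y/z; return the single node reached, or None."""
    stripped = path.strip("/")
    if not stripped:
        return doc                         # "/" names the starting node itself
    node = doc
    for step in stripped.split("/"):
        if isinstance(node, dict) and step in node:
            node = node[step]              # a map sends each key to one value
        elif isinstance(node, list) and step.isdigit() and int(step) < len(node):
            node = node[int(step)]         # a list sends each index to one value
        else:
            return None                    # the path does not exist
    return node

def lookup_relation(pairs, key):
    """For contrast: in a general relation (listmap) one key may yield many values."""
    return [value for k, value in pairs if k == key]

doc = {"x": [0, 1, 2, {"y": {"z": "found"}}]}
print(resolve(doc, "/x/3/y/z"))
print(lookup_relation([("product", "a"), ("product", "b")], "product"))
```

`resolve` can return a plain node because every step is a function application, whereas `lookup_relation` has to return a collection. That difference is the set-versus-node bookkeeping the message wants to avoid.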
From: Rolf V. <rol...@he...> - 2002-03-12 17:04:46
Clark C . Evans wrote:
> by restricting our information to functions we make it
> much easier to define operations on YAML information in a
> rigorous manner.

This is a good reason, indeed.

> That said, I hope for your continued support in YAML; this just
> happens to be one of the cornerstones that I think we need to
> stick by. We are not making this decision blindly; it is the result
> of much (at least two years of) discussion on SML-DEV.

Ok, parking the item and going back to the Java parser.

:-)

Rolf.
From: Clark C . E. <cc...@cl...> - 2002-03-08 20:58:02
On Fri, Mar 08, 2002 at 01:53:47PM -0500, Oren Ben-Kiki wrote:
| Rolf Veen has asked why use '/' for separating path segments. Good question.
| I did it "for historical reasons" - people are just used to it for "paths".
| At any rate, the choice of characters is less important than the basic
| approach.

Also, from a regex perspective, the slash '/' is traditionally
used as a top level separator, s/this/that/g

| > Hmm. I think we could use [] as the existence test (like XPATH).
|
| I think RegExps already have a syntax for that. One reason I think we should
| base YPATH on regexps - just tailor the matching operators with graph paths
| - is that many issues were already hashed to death in regexps. Existence
| test is one. I don't have the perl book handy at home; I suspect that
| looking at the regexp section there would be very educational.

As much as I'd love to do that, I think that it would be a big research
project. I'd rather have two libraries: the first is a simple incremental
path syntax, and the second is regular expressions, where we can use an
existing library like the wonderful perl compatible regular expression
library (pcre). The other thing is that regular expressions are very poor
with arithmetic operations and if we want YAML to be used for business
(accounting, bookkeeping, timekeeping) we want to have a full suite of
mathematical operators (preferably using an existing library).

Thus, I'd like to distinguish between a few "sub-parts" of YPATH:

  1. The path segment approach, graph oriented navigation,
     primarily based on key values.

  2. Regular expressions, primarily applied on the *contents* of
     a particular leaf node. Most map keys will be fixed by a
     schema, thus using regular expressions on them makes little
     sense... however, built-in regex support for leaves is cool.

  3. Full support for boolean and arithmetic operations.

To this end, I was thinking the path segment approach could largely
borrow from XPath.
IMHO, XPath is one of the big successes of the W3C and we'd do best to
emulate it rather than invent our own. For what it does, XPath is amazing.
Furthermore, many people know it. Now... that said, the only real "big"
blemish in XPath is the set-equivalence operator, which is denoted = and is
true when two node sets have one or more overlapping nodes. This is not
intuitive; the operator is good, but using the '=' indicator is a poor
choice at best.

XPath distinguishes between (1) and (3) by using the predicate notation [].
IMHO, this will work wonderfully. When I get time (in a week or so), I'll
map out the architecture of an XPath processor... and how it could be
modified for YPath.

| > I was thinking that /xxx would only match on keys and
| > not on values. If you want to match on values, I think
| > using the exists operator would be ideal.
|
| There is a difference between matching the value and matching the key that
| has that value...

Absolutely. Keys will be rather "static" and we don't need regular
expressions for them. Values, on the other hand, are exactly what pcre is
made for! Since they are leaf values... we can use pcre directly without
any modification.

| > # /two matches "world"
| > ---
| > one: hello
| > two: world
| > ...
|
| If so, what matches "two"? I agree there's an ambiguity...
| perhaps we should write something like

I'm not interested in matching "two". Most of the time I want the value...
that said, I think "two" would be somehow stuffed into the node's
_context_. How context is structured will have to be explored.

| > Close. This would be /[transfer("int")] since transfers
| > belong in the predicate.
|
| The way I see it, the whole thing is a predicate... I don't see why I need
| to make the distinction. In "/key", "key" is a predicate which happens to
| match any !str-typed node with the value "key". See below...

The path expression returns a node-set. The predicate returns a boolean
value. They are different.
You want to separate the path engine from the predicate engine or
it'll be a mess...

| Hmmm. So this way you could, "logically" anyway, evaluate a YPATH piece by
| piece - each step giving the "context" of matching nodes to the next step.

Right. It is recursive descent. You actually need generators to
implement this efficiently when you add the union operator into
the mix.

| I see potential difficulties with this, mainly that it would raise the
| expectation that this "context" would be a manipulatable piece of any
| implementation (e.g., being able to give a context as an argument in the
| YAPTH API). Did this happen with XPATH? Is it truly necessary?

A context is always necessary; it is just usually an opaque handle.

| You focus on the incremental nature of YPATH - working segment by segment,
| using context as the glue between consecutive steps, and employing patterns
| as a slave mechanism applied separately at each step.

Right. This has a simple and relatively efficient implementation
without requiring a schema.

| I wanted to tackle the whole monster in one fell swoop by saying "YPATH is a
| kind of regexp". In this view '/' is just a regexp operator, and
| *everything* is a pattern. To formalize a bit: each <graph, starting node>
| define an (enumerable) set of paths (walks through the graph) starting at
| that point. The YPATH regexp pattern is a boolean function which says
| "match" or "doesn't match" to each of these paths. Each such path ends at
| some point ("existance" is trivially handled by going down-then-up, etc.).
| Thus the function defines a set of nodes - these that are at the end of
| paths that "match". This is the result of applying the YPATH to the graph
| (and starting point).

I don't think the two views are mutually exclusive. I know I can
implement the second view though.... and I have experience doing
it (with XPath).

| Now, regexps are a thoroughly investigated field... both in theory and
| practice (e.g., there's a well-known syntax for specifying regexps we can
| build on instead of inventing new ways to do existance tests, ORs, ANDs,
| partial matches; there are excellent libraries implementing various types of
| regular expressions which we may be able to build on; and so on).

Right. And I'd rather just use regexps on leaf values and not make
a research project... although research projects are fun.

| I'm well aware that XPATH does it your way :-) and I'll admit to being
| somewhat naive about the implications of using regexps vs. "context
| refinements" as the basic mechanism. I suspect the two are equivalent at
| some level - though the syntax would probably end up being different...

Right. Perhaps a different syntax, perhaps not. There *is* a
difference between node trees that are part of the predicate and
those that are not. In the former, you are only testing for
existence; in the latter you are returning a node set. There are
also many other differences between the two which are
ever-so-subtle.

| I think the issue deserves some discussion, before we delve into the
| nitty-gritty of whether we say /.../ or /down(>0)/ or whatever.

Yep. Unfortunately, I don't have a ton of time. Could we just focus
on the parser for a while? YPATH is going to be hard to work out
without an active implementation...

Best,

Clark |
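As an illustration of the design Clark describes above (this is not code from the thread), here is a minimal Python sketch of segment-by-segment evaluation with generators. It assumes the document is already loaded as nested dicts and lists. The names `step`, `where`, `union` and `select` are hypothetical, and the one-argument predicate is a simplification: it is a boolean test on a node, kept separate from the path steps that produce the node set.

```python
import re
from typing import Any, Callable, Iterable, Iterator, Optional

Node = Any
Predicate = Callable[[Node], bool]


def step(nodes: Iterable[Node], key: str) -> Iterator[Node]:
    """One path segment: for each context node, yield the value under `key`."""
    for node in nodes:
        if isinstance(node, dict) and key in node:
            yield node[key]
        elif isinstance(node, list):
            # Assumption: a segment applied to a sequence is applied to each entry.
            yield from step(node, key)


def where(nodes: Iterable[Node], pred: Predicate) -> Iterator[Node]:
    """Predicate stage: a boolean test that filters the node set."""
    for node in nodes:
        if pred(node):
            yield node


def union(*branches: Iterable[Node]) -> Iterator[Node]:
    """Union of several path results, produced lazily, branch by branch."""
    for branch in branches:
        yield from branch


def select(root: Node, path: str, pred: Optional[Predicate] = None) -> Iterator[Node]:
    """Evaluate a '/'-separated key path; each step feeds its context to the next."""
    nodes: Iterable[Node] = [root]
    for key in path.strip("/").split("/"):
        if key:
            nodes = step(nodes, key)
    return where(nodes, pred) if pred is not None else iter(nodes)


def starts_with_w(value: Node) -> bool:
    """Example leaf predicate: a plain regular expression applied to a leaf value."""
    return isinstance(value, str) and re.match(r"w", value) is not None


doc = {"one": "hello", "two": "world", "items": [{"name": "ab"}, {"name": "cd"}]}
```

With these definitions, `list(select(doc, "/two"))` yields the value stored under the key `two`, which is the "/two matches world" behaviour discussed in the thread, and `union(select(doc, "/one"), select(doc, "/two"))` chains two node sets without building either one in memory first. This sketch does not remove duplicates from a union; a real engine would need a notion of node identity for that.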
From: Rolf V. <rol...@he...> - 2002-03-11 17:39:54
|
Oren Ben-Kiki wrote:

>>Hmm. Unlike XML our tree is not ordered (although it
>>can be ordered by the keys). This requires some thought.
>>
>
> Sequences *are* ordered even in the graph model!

As I see it, the fact that key:value pairs are not ordered is
just a definition in the spec, not reflected in the productions,
i.e., a semantic statement (currently) on top of the productions.
Think of the following:

    invoice:
        product: a product
        product: another product

That is compatible with the current productions. If you have
hashmaps in mind, OK, then you can only represent unordered,
non-duplicate keys, but that is not always the case. Not mine.

Why do I need this heretical approach? Several reasons, but
import/export from/to XML is one of them; similarity with a
document model is another.

The YAML file format can hold an ordered graph. Why limit it?
Let the schema do that.

Well, this is just a proposal. I may be missing side effects...
or just be plain wrong :-).

Regards.
Rolf. |
From: Clark C . E. <cc...@cl...> - 2002-03-11 18:22:51
|
On Mon, Mar 11, 2002 at 06:35:42PM +0100, Rolf Veen wrote:
| invoice:
|     product: a product
|     product: another product
|
| That is compatible with the current productions. If you have
| hashmaps in mind, ok, then you can only represent undordered non
| duplicate keys, but that is not allways the case. Not mine.

YAML is as much an information model as it is a syntax; just
because the syntax may seem to allow it doesn't mean that it
is valid YAML. YAML has these restrictions since it mirrors the
map/list structure of most programming languages. If you
remove these constraints, then this isomorphism no longer holds,
and much of the value YAML provides is gone. So, in short,
duplicate keys are illegal YAML and must stay that way.

| Why do I need this heretical approach ? Several reasons, but
| import/export from/to XML is one of them, similarity with a
| document model is another.

YAML does _not_ have a document model. YAML's model is not
compatible with XML since XML's model is not compatible with
the map/list structure of most programming languages. If we
did follow this, then we'd need a DOM. But since we _don't_
allow duplicate keys and other XMLisms, we don't need a DOM.
And using native data structures is at least 1/2 of YAML;
getting a nicer syntax is just a side-effect, IMHO.

| The YAML file format can hold an ordered graph. Why limit it ?
| Let the schema do that.

Because the YAML information model (native map/list structures
found in most programming languages) can't.

Thanks!

Clark |
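To make the isomorphism argument above concrete, here is a small Python sketch (an illustration, not code from the thread and not the API of any YAML library). It assumes a Python `dict` stands in for the "native map" and that a parser hands over key/value pairs in document order. The function name `to_native_map` and its `strict` flag are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple


def to_native_map(pairs: Iterable[Tuple[Hashable, Any]], strict: bool = True) -> Dict[Hashable, Any]:
    """Load key/value pairs into a native dict.

    With strict=True a repeated key is rejected, which matches the position
    that duplicate keys are illegal. With strict=False the later value
    silently replaces the earlier one, because a dict cannot hold both.
    """
    result: Dict[Hashable, Any] = {}
    for key, value in pairs:
        if strict and key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


# The two 'product' entries from the invoice example, as parsed pairs.
invoice_pairs = [("product", "a product"), ("product", "another product")]
```

Calling `to_native_map(invoice_pairs, strict=False)` keeps only one of the two products, so the loaded structure no longer round-trips to the original text; calling it with the default `strict=True` raises `ValueError` instead of losing data.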
From: Rolf V. <rol...@he...> - 2002-03-12 10:18:10
|
Clark C . Evans wrote:

> YAML is as much an information model as it is a syntax, just
> beacuse the syntax may seem to allow it doesn't mean that it
> is valid YAML. YAML has these restrictions since it mirrors
> map/list structure of most programming languages. If you
> remove these constraints, then this isomorphism no longer holds,
> and much of the value YAML provides is gone. So, in short,
> duplicate keys are illegal YAML and must stay that way.

I just happen to have a structure that represents a sequence
of 2-tuples <k,v> where the domain of k is not a set, it is a
bag. Let's call this structure listmap, for example. You are
saying that only maps and lists are commonly found. But maps
and lists are *not* primitive types in many languages. On the
other hand, a bag is just as generic as a map or a list. Nothing
draws a (border) line between maps/lists and bags.

> YAML does _not_ have a document model.

When you load a YAML file into memory, you will have a possibly
nested structure of maps and lists. An in-memory image. And what
is a DOM? A nested structure of objects. An in-memory image.
The objects of a DOM are more complex, OK. But, for me, both are
in-memory images of a possibly nested structure. Yes, YAML allows
for many documents. Still not a vital difference. On one side you
have ints and strings, and on the other a DOM. The YAML map/list
convention is somewhere in between.

Thus:

if:
    set      ::= unordered, no duplicates.
    bag      ::= unordered, duplicates allowed.
    sequence ::= ordered, duplicates allowed.

then:
    list     ::= sequence of 1-tuples <v>.
    map      ::= set of 2-tuples <k,v> where the domain of k is a set.

now:
    listmap  ::= sequence of 2-tuples <k,v> where the domain of k is a bag.

I'm claiming that a listmap is as basic as a map or a list. My
underlying information model includes it, so I do want a file
format that supports it. You are free to draw a frontier between
maps/lists and bags, but you should admit that it is a design
choice, an artificial frontier.
I'm incapable of seeing any strong foundation that supports it.

I'm trying to fit my needs into an existing standardization effort
such as YAML, and adapt to it as much as possible, but this one
is a real issue for me. And if I thought that it was only a
personal need, then I wouldn't have mentioned it.

Regards.
Rolf. |
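The listmap defined in this message can be sketched in Python as a plain list of 2-tuples. This is only an illustration under that assumption, not code from the thread; the alias `ListMap` and the helpers `values_for` and `keys_form_a_set` are hypothetical names.

```python
from typing import Any, Hashable, List, Tuple

# listmap ::= sequence of 2-tuples <k,v> where the domain of k is a bag.
ListMap = List[Tuple[Hashable, Any]]


def values_for(listmap: ListMap, key: Hashable) -> List[Any]:
    """Return every value stored under `key`, in document order."""
    return [value for k, value in listmap if k == key]


def keys_form_a_set(listmap: ListMap) -> bool:
    """True when no key repeats, i.e. the pairs would also fit in a map."""
    keys = [k for k, _ in listmap]
    return len(keys) == len(set(keys))


# The invoice example from earlier in the thread, kept as ordered pairs.
invoice: ListMap = [("product", "a product"), ("product", "another product")]
```

Here `values_for(invoice, "product")` returns both products in their original order, and `keys_form_a_set(invoice)` is false, which is the case a plain map cannot represent. A listmap whose keys do form a set can be turned into a map without dropping any entry, although the map model then no longer records the order.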