On Tue, Jun 05, 2001 at 05:38:10PM -0700, Jason Diamond wrote:
| The next layer up in the stack is what Microsoft ...
| The same layering occurs in Java...
| So what I'm trying to get down to is this: IYamlReader (and
| the associated YamlTextReader) are what I'd like to see as
| the bottom of the YAML stack. Since it's a forward-only,
| streaming interface, it's not only easy to implement but
| it's efficient....
| Ugh, sorry for going on for so long. Do you think that this kind of model
| would suit what you had envisioned for YAML?...
Ok. There are three issues here. The first is a hierarchical
iterator vs. a hierarchy of iterators. The second is all
about layering. And the third is about push vs. pull
interfaces and building a sequential processing pipeline.
Before I get started, let me preface by saying that I'm very
well versed in the DOM/SAX/MSXML APIs and have put a good
deal of thought (and prior implementations) into this. Also,
thus far, my experience has shown two types of "stream"
interfaces, push and pull. And Oren has pointed out their
connection to the visitor and iterator design
patterns respectively.
Issue #1: Hierarchical Iterator vs Hierarchy of Iterators
~~~~~~~~~
There are two approaches to representing a hierarchical
stream. You can either use a hierarchy of iterators,
where each iterator *only* covers a list of nodes for
a given container (map/list). Or you can use a single
iterator with some sort of "begin/end" or "depth" notion.
Please compare and contrast (the members marked "*" are
the ones that differ):
    public interface IYamlHierarchicalIterator
    {
        YamlNodeType NodeType { get; }
        string Key { get; }
        int Read(...);
      * int Depth { get; }
      * bool Read();
      * void Skip();
    }
with
    public interface IYamlFlatIterator
    {
        YamlNodeType NodeType { get; }
        string Key { get; }
        int Read(...);
      * bool Next();
      * IYamlFlatIterator FirstChild();
      * IYamlFlatIterator ParentIterator();
    }
I strongly feel that the latter is much more
"intuitive" and easier to read. Since the
iterator is flat, it only moves over a
single container, whereas the former is
hierarchical and the user must constantly
watch for the "depth" to change.
In the latter, I can have *different* iterators
over different sub-trees, giving the underlying
implementation a great deal of flexibility.
In the former, this flexibility would imply
using the Envelope or Adapter pattern; thus it
comes at a higher cost.
In the latter the user doesn't have to think
about "skipping" nodes. If the user doesn't
want them, they don't ask for them and just
call Next() -- it is positive logic which
is easy to understand.
One may say that the latter is harder to
implement, perhaps, but not by much. The only
"additional" complexity on the parser end is
keeping the parser state on a stack. This will,
actually, make the parser implementation easier
to write... try it.
The speed difference is negligible. In fact, for
common use cases where the user of the interface
requires the stack... it is faster, since
the user need not try to set up their own
structures and copy the keys...
The memory difference is not much. And given
modern processors... negligible. The only
thing that needs to be kept is a stack of
(key, anchor, parent-pointer) tuples. This
is trivial memory usage. Not even worth
the mental effort in the big scheme of things.
In fact... you probably have to keep this
stack *anyway* for your parsing! Why not
expose it!
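
As a rough illustration of how little that stack holds, here is a minimal Python sketch. The names `Frame` and `ancestor_keys` are hypothetical and not part of any proposed API; the only thing taken from the text above is the (key, anchor, parent-pointer) tuple.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    """One entry of the ancestor stack (hypothetical name)."""
    key: Optional[str]         # key of this container within its parent
    anchor: Optional[str]      # YAML anchor, if the node has one
    parent: Optional["Frame"]  # parent-pointer; None at the document root


def ancestor_keys(frame: Optional[Frame]) -> list:
    """Walk the parent pointers and return the keys from the root down."""
    keys = []
    while frame is not None:
        keys.append(frame.key)
        frame = frame.parent
    keys.reverse()
    return keys


# A parser would push one frame each time it enters a map or list.
root = Frame(key=None, anchor=None, parent=None)
invoice = Frame(key="invoice", anchor="inv1", parent=root)
items = Frame(key="items", anchor=None, parent=invoice)

# An iterator that exposes its current frame hands the user the path.
print(ancestor_keys(items))
```

Each frame is three references, so the cost grows with nesting depth rather than with document size.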
The "Read" method and "FirstChild" are exclusive.
In the latter, the Read() method is defined to return:
  (a) the characters if the node is a scalar,
  (b) the Read() of the first item if the
      node is a list,
  (c) the Read() of the map item having
      a key "=" if the node is a map.
This behavior is very clean and allows for
"substitutability" as defined in earlier e-mails.
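
The three rules can be sketched over plain Python values, with `str` standing in for a scalar, `list` for a list and `dict` for a map. The function name `read` and the choice to return `None` for an empty list or a map without an "=" key are assumptions; the text above does not say what happens in those cases.

```python
def read(node):
    """Return the text that Read() would yield for `node` (a sketch)."""
    if isinstance(node, list):
        # (b) a list reads as its first item
        return read(node[0]) if node else None  # empty list: assumption
    if isinstance(node, dict):
        # (c) a map reads as the item stored under the key "="
        return read(node["="]) if "=" in node else None  # no "=": assumption
    # (a) a scalar reads as its characters
    return str(node)


# The same caller keeps working as a value grows from scalar to map.
print(read("12.50"))
print(read({"=": "12.50", "currency": "USD"}))
print(read(["12.50", "13.00"]))
```

All three calls yield the same text, which is the point of the rule: a scalar can later be replaced by a list or a map without breaking code that only calls Read().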
You mentioned that one advantage of the single
hierarchical iterator is that you can process a whole
document in a single loop. True enough. However, I
have yet to see a single case (with reasonable
complexity) where a single loop works well.
In short... there is a difference. The
hierarchy of iterators may seem less efficient,
but the cost is negligible. And from a usability
perspective... the hierarchy of iterators wins
hands down...
Even if you don't buy the layering argument
to follow, I hope you consider this. It isn't
the "norm"... however, the norm isn't necessarily
the 80/20 rule. The 80/20 rule says that you
want to keep "some" part of the stream in
memory, but only 20% of it (the ancestor stack).
Issue #2: Layering
~~~~~~~~~
I believe that you are introducing layering where
it is not required. Certainly layering is often
good for keeping complexity in check; however, too
much layering just adds "unneeded"
interfaces, code, function calls, etc.
The Reader API should be a sequential interface
to YAML data. Where this information comes from,
via text file or visiting an object node in a
tree or rummaging over native maps and lists
in memory, should be largely irrelevant. By
calling your interface "Reader", I think you
get off on the wrong foot, and limit yourself
only to the case of reading from a text stream.
This is an unnecessary, and indeed harmful,
limitation. What is required is a pull-based
sequential access interface!
First, if this interface is too optimized for
one medium, then every medium should have an
interface. By this logic, we need at least
three interfaces -- a TextReader, a NodeReader,
and a NativeObjectReader. Ok. So I'm getting
stupid here. But the point is clear: if we have
more than one interface, then we *multiply* the
code. Each thing we need will require a way to
do it with each interface, or adapters will be
required. We are already going to need two
adapters, push to pull and pull to push, based
on real-life concerns. Why make up extra ones?
Second, given the two interfaces compared above,
the Hierarchical vs the Flat iterator, I bet you'd
say that the Hierarchical is "lower level" than
the Flat iterator. I argue that since I can
implement *either* in terms of the other, they
are equivalent and on the same semantic level.
Why? Fundamentally, they are both forward-only
sequential access interfaces.
We should be looking for an interface that will
work equally well over all types of data sources.
In short, the Reader should be one concrete
implementation of the Iterator interface.
Issue #3: Push vs Pull
~~~~~~~~~
I'm glad you understand the need for a pull
interface. This is great. I hope you understand
the need for a push interface as well, right?
The printer (emitter) should be using the Visitor
interface. Is this clear?
Anyway, the only real point to be made here
was the "Event" class in the API that I
was putting forward. Its sole purpose
is to put all of those "common" things
that both the Iterator and Visitor interfaces
have into a single interface for re-use.
It may look complicated, but it really helps
when mixing the iterator and the visitor
interfaces.
This Event class also provides a nice place to
"stuff" away an optional node pointer... in case
the iterator or visitor is looping over an
in-memory structure with random access.
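
One way to picture the shared Event and a pull-to-push adapter is the Python sketch below. Every name in it (`Event` and its fields, `iterate`, `Printer`, `pump`) is an assumption made for illustration, not the API that was put forward; it also covers only a single map level to stay short.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Event:
    """What both interfaces report about a node (field names assumed)."""
    node_type: str               # "scalar", "map" or "list"
    key: Optional[str] = None
    value: Optional[str] = None  # text, for scalars only
    node: Any = None             # optional pointer to an in-memory node


def iterate(data: dict) -> Iterator[Event]:
    """Pull side: the caller asks for one event at a time."""
    for key, value in data.items():
        if isinstance(value, dict):
            yield Event("map", key, node=value)
        elif isinstance(value, list):
            yield Event("list", key, node=value)
        else:
            yield Event("scalar", key, str(value), node=value)


class Printer:
    """Push side: a visitor that is handed events, as an emitter would be."""

    def __init__(self):
        self.lines = []

    def visit(self, event: Event) -> None:
        if event.node_type == "scalar":
            text = event.value
        else:
            text = "<" + event.node_type + ">"
        self.lines.append(f"{event.key}: {text}")


def pump(events: Iterator[Event], visitor) -> None:
    """Pull-to-push adapter: drain an iterator into a visitor."""
    for event in events:
        visitor.visit(event)


printer = Printer()
pump(iterate({"name": "Clark", "tags": ["a", "b"], "size": 3}), printer)
print("\n".join(printer.lines))
```

Because both sides speak in terms of the same Event, the adapter is a plain loop. The opposite direction (push to pull) needs buffering or a coroutine and is not shown here.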
| I typed this up this morning when I got to work. But then
| I saw your First and Next methods and that they might be
| very handy. But I think it would still be prudent to keep
| a plain Read method which, when it reaches the end
| of a map or scalar, continues on to the next node at
| a higher depth. This makes it possible to process an
| entire YAML document with a single loop:
|
| while (yamlReader.Read())
| {
| // do stuff with yamlReader...
| }
I hope the above addresses this. From my experience,
the above use case is minor.... and it can easily
be handled by rather flat recursion.
    void handle(IYamlIterator reader)
    {
        while (reader.Read())
        {
            // to recurse into a container, call handle()
            // on its child iterator (e.g. FirstChild())
        }
    }
All in all, you'll find it is not much more code!
In fact, you can *strip* all of the code having to
deal with managing depth and let the program stack
do it for you. The depth will always be pretty
small, so it's no real big deal.
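
A runnable sketch of that recursion, in Python over native dicts and lists, might look like the following. `FlatIterator` is a hypothetical stand-in for the flat iterator interface, with `next`, `first_child` and `parent_iterator` mirroring Next(), FirstChild() and ParentIterator(). Two details are assumptions: a fresh iterator starts *before* its first item, and `is_container` is a helper added for the example.

```python
class FlatIterator:
    """Iterates over the items of ONE container (a dict or a list)."""

    def __init__(self, container, parent=None):
        if isinstance(container, dict):
            self._items = list(container.items())
        else:
            self._items = [(None, item) for item in container]
        self._pos = -1  # assumption: positioned before the first item
        self._parent = parent

    def next(self):
        """Advance to the next sibling; False once the container is done."""
        self._pos += 1
        return self._pos < len(self._items)

    @property
    def key(self):
        return self._items[self._pos][0]

    @property
    def value(self):
        return self._items[self._pos][1]

    def is_container(self):
        return isinstance(self.value, (dict, list))

    def first_child(self):
        """A new iterator over the current node's children."""
        return FlatIterator(self.value, parent=self)

    def parent_iterator(self):
        return self._parent


def handle(it, depth=0, out=None):
    """Collect an indented outline; the call stack keeps track of depth."""
    out = [] if out is None else out
    while it.next():
        label = "-" if it.key is None else f"{it.key}:"
        if it.is_container():
            out.append("  " * depth + label)
            handle(it.first_child(), depth + 1, out)
        else:
            out.append("  " * depth + f"{label} {it.value}")
    return out


doc = {
    "invoice": 34843,
    "bill-to": {"given": "Chris", "family": "Dumars"},
    "items": ["bolt", "nut"],
}
print("\n".join(handle(FlatIterator(doc))))
```

Skipping a subtree needs no special call: a caller that never asks for `first_child()` simply moves on to the next sibling.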
| When processing XML documents using XmlReader, I often have an outer loop
| like so which then delegates to other methods with their own "inner" loops.
| The trick is making sure that those other methods stop Read()ing when
| they're supposed to (by watching out for EndTag nodes with the correct
| name). A ReadNext would have made that much easier!
Yep! This is the *primary* use case... which motivates
the "nested iterator".
| By the way, the pretty printer that I wrote last
| night actually consists of just this one loop with
| a call to a simple YamlTextWriter. I defined an
| overloaded method on the IYamlWriter interface to
| accept an IYamlReader. The reader gets queried for
| the current node type, name, etc, and then outputs it.
I'd like to hear about your IYamlWriter interface!
| Again, sorry for babbling. I can't wait to hear your thoughts.
No problem. I babble too! I hope my thoughts aren't too
negative. I'm still "playing" a bit... as you can see.
But, really, I'd like, if at all possible, to have only *two*
very simple and very, very usable interfaces exposed to
users. I'd rather not have the "parser" and the "iterator"
interfaces be separate.
Anyway... thank you *so* much for your feedback.
Best,
Clark
P.S. Oren and Brian, you are welcome to jump in here
and argue for/against what I'm saying. Three or
Four guts are better than two...
P.P.S. Do I still get to keep the dictator hat, or
must I give it up to the implementers now?
(Although I *will* be implementing a C
version... unless someone does it first
and does it very well! *evil grin* )