Custom tree model

2010-03-27
2012-10-08
1 2 > >> (Page 1 of 2)
  • Vladimir Nesterovsky

    Hello Mr Kay!

    I would like to commit an experiment: to implement a new tree model for the
    saxon (a variant of linked tree).

    My question is if it's possible to plug such tree model externally? I've seen
    that I can specify it through an enumeration only.

    The idea behind of this experiment is to separate node parent reference from
    all other node data, and as result to make subtree copy very cheap.

    I've described one reasoning behind of this approach in
    http://www.nesterovsky-
    bros.com/weblog/2010/03/22/InlineFunctionsInXslt21.aspx

    Thanks
    Vladimir Nesterovsky

     
  • Michael Kay

    Michael Kay - 2010-03-27

    Seems a good idea. Saxon tries a bit of this with the class
    net.sf.saxon.om.VirtualCopy which represents a virtual copy of a node; the
    tricky part is that the copied node and all its descendants need to have a
    different identity from the original, that is, A is B must return false when B
    is a copy of A. In Saxon's implementation, the VirtualCopy must be a complete
    tree (i.e. rooted at a parentless node), which severely limits the number of
    places it can be used; but I've always thought it would be possible to extend
    the idea so that two distinct trees (in terms of node-identity) can share
    storage for a common subtree.

    Some of this would be easier if Saxon accessed nodes via a stateful
    XPathNavigator object rather than a NodeInfo object, but that would be a very
    big internal change.

    It's true that Configuration.setTreeModel() takes an enumeration, which means
    you can't set a different tree model as the system default. But at the level
    of an individual transformation you can use Controller.setModel(TreeModel),
    which can be your own implementation of the TreeModel interface: the main task
    (apart from implementing NodeInfo, of course) is to implement a subclass of
    Builder to build your tree from parser events.

     
  • Michael Kay

    Michael Kay - 2010-03-27

    Interesting blog post, by the way. I share your concerns that using higher-
    order functions to plug the limitations in XDM's data structuring facilities
    will lead to poor usability. I've been hoping, however, that we could use HOF
    as the formal underpinning, and then provide more usable facilities as a
    "syntactic sugar" layer on top. But there's definitely a question about
    whether richer data structures should or should not be modelled as XDM nodes,
    and it's interesting to see you addressing that question. It's related to
    questions about user-defined axes, I think. One possible route is to extend
    XDM so the nodes can participate in an arbitrary graph, using arbitrary axes
    for the inter-node relationships, and then to treat the model of an XML tree
    as a special case of this in which the axes are the ones that are currently
    defined in 2.0. Don't expect such radical thinking to be acceptable to the W3C
    WGs any time soon; but it's worth exploring at a research level.

     
  • Vladimir Nesterovsky

    ... can participate in an arbitrary graph ...

    The following theorem should probably be proved, but it goes like this:
    staying within a functional language one cannot model a cyclic graph.

    Thus functional languages including xslt and xquery are limited to trees
    (forest).
    This means that XDM rather closely approximates all kinds of graphs available.

     
  • Vladimir Nesterovsky

    Formally, XDM does not allow "diamond" like structures: A refers to B and to
    C, B and C refer to D, but this can be modeled.

     
  • Vladimir Nesterovsky

    I'm trying noe to figure out how NodeInfo.copy() should work for this tree
    model.
    To understand interactions I'm looking at ElementCreator.constructElement()
    method.

    What I see is that:
    a SequenceReceiver or its wrapper is created to collect result;
    nested elements are added through SequenceReceiver.append(), which calls
    NodeInfo.copy() for a pipe of nested recievers;
    the most nested reciever (Builder) is hidden (not available in
    NodeInfo.copy());
    Builder could allowed me to compose parent child nodes efficiently.

    Am I right in that the efficient external implementation of desired tree model
    is not possible?

     
  • Vladimir Nesterovsky

    Things could work if there were Reciever.append() method, which might elminate
    need of NodeInfo.copy().

     
  • Michael Kay

    Michael Kay - 2010-03-28

    I think you are right in concluding that the Builder is not able to detect
    that the events being sent to it represent a copy of a node in some source
    document.

    On the other hand, if it's "your" node that is being copied (i.e. an instance
    of NodeInfo that uses your implementation of the NodeInfo interface), then
    your implementation of the NodeInfo.copy() method can behave differently. You
    can probably detect that the ultimate destination of the copy is your Builder
    class by following the pipeline of intermediate ProxyReceiver objects using
    getUnderlyingReceiver(). You could also consider exploiting the rather
    peculiar mechanism of the "CopyInformee" which is set in the
    PipelineConfiguration to be notified of xsl:copy-of instructions - though I
    agree this might not be general enough.

    It's entirely possible, however, that your ideas can't be implemented without
    changing the Saxon source code.

     
  • Vladimir Nesterovsky

    I've found that SequenceOutputter used in ElementCreator does not expose
    underlying reciever.
    Also, at some cases SequenceOutputter is wrapped into Validator, which
    probably does not allow to pass item through.

    If one could pass a node to a reciever as a whole rather than a sequence of
    events, then different kinds of recievers could process it and pass a node
    further.

     
  • Vladimir Nesterovsky

    I've implemented described tree model!
    It's called ComposableTree.

    I was not able to create external classes, so I've introduced a small set of
    changes to saxon.
    I've declared an interface:

    public interface NodeReceiver extends Receiver
    {
    public boolean canRecieve(NodeInfo node);
    public void node(NodeInfo node);
    }

    and have implemented it over:
    ComplexContentOutputter,ProxyReceiver, SequenceWrapper, SequenceWriter,
    SequenceWriter.

    Its goal is to push node to the tree builder.

    A tree builder for this tree model is called ComposableTreeBuilder.
    Classes are derived by refactoring linked tree. Definitely, better
    implementation is possible.

    My shallow tests show that approach is working.
    I can observe sharing of data among trees.

    This approach works when code passes through ElementCreator, but does not work
    when Copy is in action (see Copy.processLeavingTail()), as it decomposes node
    into a sequence of tree events.

    Extensive tests are required to verify implementation and to measure its
    performance.
    At this point I hardly can proceed, as I have no large test base.

    I can share the code if you're interested in the verification.

    Thanks.
    Vladimir Nesterovsky

     
  • Michael Kay

    Michael Kay - 2010-04-06

    Your node() method seems quite similar to the append() method in
    SequenceReceiver. Probably a better solution is to merge Receiver and
    SequenceReceiver, and then do what the pull pipeline does: you can wrap a
    Decomposer around any EventIterator to decompose nodes/trees into individual
    events, if the target can't do that for itself. Or one could inherit from an
    abstract base class than implements the append() method in terms of the other
    methods.

    How have you solved the problems of node identity mentioned in comment #2?

     
  • Vladimir Nesterovsky

    Your node() method seems quite similar to the append() method in
    SequenceReceiver.

    You're right.
    The only reason I did not merge them is that I wanted minimal side effects.
    In fact SequenceWriter looks now like this:

    public abstract class SequenceWriter
    extends SequenceReceiver
    implements NodeReceiver
    {
    ...

    public boolean canRecieve(NodeInfo node)
    throws XPathException
    {
    if (outputter instanceof NodeReceiver)
    {
    NodeReceiver nodeReciever = (NodeReceiver)outputter;

    return nodeReciever.canRecieve(node);
    }

    return false;
    }

    public void node(NodeInfo node)
    throws XPathException
    {
    NodeReceiver nodeReciever = (NodeReceiver)outputter;

    nodeReciever.node(node);
    }

    public void append(Item item, int locationId, int copyNamespaces) throws
    XPathException {
    ...
    else {
    NodeInfo node = (NodeInfo)item;

    if (canRecieve(node))
    {
    node(node);
    }
    else
    {
    node.copy(outputter, NodeInfo.ALL_NAMESPACES, true, locationId);
    }
    ...
    }
    }
    }

    }

    How have you solved the problems of node identity mentioned in comment #2?

    Previously node identity were base on sequence number (in most cases).
    In this tree model, I think, we cannot rely on this.
    Thus identity related functions are changed as following:

    public void generateId(FastStringBuffer buffer)
    {
    parent.generateId(buffer);
    buffer.append(NODE_LETTER);
    buffer.append(Integer.toString(index));
    }

    public boolean isSameNodeInfo(NodeInfo other) {
    // default implementation: differs for attribute and namespace nodes
    if (other instanceof NodeImpl)
    {
    NodeImpl otherNode = (NodeImpl)other;

    return this.getNodeData() == otherNode.getNodeData();
    }

    return false;
    }

    public final int compareOrder(NodeInfo other)
    {
    if (other instanceof NamespaceIterator.NamespaceNodeImpl)
    {
    return 0 - other.compareOrder(this);
    }

    // Nodes added by XQuery Update do not have sequence numbers
    return Navigator.compareOrder(this, ((NodeImpl)other));
    }

     
  • Michael Kay

    Michael Kay - 2010-04-06

    I can't see how this works. If the same Java object is used to represent nodes
    in two different trees, then the identity-dependent methods must depend not
    only on the properties of this Java object, but on how it was reached. For the
    VirtualCopy implementation this is done by using a transient Java object (the
    VirtualCopy) which contains both a reference to the "shared" node holding the
    real data, and a documentNumber that distinguishes which tree it is part of. I
    would expect to see you doing something similar.

    Test case:

    <xsl:variable name="x" as="element()">
    <e/>
    </xsl:variable>
    <xsl:variable name="y" as="element()">
    <xsl:sequence select="$x"/>
    </xsl:variable>
    <xsl:value-of select="generate-id($x) = generate-id($y/e)"/>

    should return "false".

     
  • Vladimir Nesterovsky

    I see already that isSameNodeInfo() is not correct.
    I must to check that parens are the same also.

     
  • Vladimir Nesterovsky

    public boolean isSameNodeInfo(NodeInfo other) {
    // default implementation: differs for attribute and namespace nodes
    if (other instanceof NodeImpl)
    {
    NodeImpl otherNode = (NodeImpl)other;

    if (this.getNodeData() != otherNode.getNodeData())
    {
    return false;
    }

    if (parent == null)
    {
    return otherNode.parent == null;
    }

    if (otherNode.parent != null)
    {
    return parent.isSameNodeInfo(otherNode.parent);
    }
    }

    return false;
    }

    as for generate-id() I think that above mentioned implementation addresses the
    question, as value depends on generate-id() of parent.

     
  • Vladimir Nesterovsky

    For this code:
    <xsl:variable name="x" as="element()">
    <e/>
    </xsl:variable>
    <xsl:variable name="y" as="element()">
    <xsl:sequence select="$x"/>
    </xsl:variable>

    <xsl:message select="generate-id($x), generate-id($y/e)"/>

    I can see this output:
    d3e0 d4e0e0

     
  • Michael Kay

    Michael Kay - 2010-04-06

    But I thought the idea was that a node could have more than one parent: or
    more strictly, that a single Java object could represent two different nodes
    with different parents?

     
  • Vladimir Nesterovsky

    NodeImpl stores
    protected ParentNodeImpl parent;
    protected int index;
    and has method:
    protected abstract NodeData getNodeData();

    NodeData defines:
    public abstract NodeImpl getNodeInfo(ParentNodeImpl parent, int index);

    Descendants of NodeImpl define their NodeData:

    abstract class ParentNodeImpl extends NodeImpl {
    protected static abstract class ParentNodeData extends NodeData
    {
    public Object children = null;
    ...
    }

    public class ElementImpl extends ParentNodeImpl implements NamespaceResolver {
    protected static class ElementData extends ParentNodeData
    {
    protected int nameCode;
    protected int typeCode;
    protected AttributeCollection attributeList; // this excludes namespace
    attributes
    protected int namespaceList = null; // list of namespace codes
    ...
    }

    protected ElementData data;

    protected ParentNodeData getNodeData()
    {
    return data;
    }

    NodeData is shared.
    This tree model looks as variation of linked tree but on the other side it is
    similar to node wrappers.

     
  • Michael Kay

    Michael Kay - 2010-04-06

    OK, thanks, I think I get the idea.

     
  • Vladimir Nesterovsky

    Hello,

    Continuing with tree models I have a somewhat different question.
    Looking at NamePool I think your wanted to make some agressive optimization,
    but I don't really get the idea.

    Can you please explain?

    I've stated my thoughts at http://www.nesterovsky-
    bros.com/weblog/2010/04/09/NamePool.aspx

     
  • Michael Kay

    Michael Kay - 2010-04-11

    Your equals() test for comparing two qualified names is MUCH more expensive
    than comparing two integers. To put this in context, Saxon's loop for scanning
    the TinyTree to evaluate //foo:bar contains little more than three integer
    comparisons and an integer increment (for nodes that don't match).

    There are problems with the NamePool because of synchronization - but your
    proposal doesn't solve that. You would still need to synchronize when adding
    names to the pool (and perhaps even when finding them.)

     
  • Vladimir Nesterovsky

    Your equals() test for comparing two qualified names is MUCH more expensive
    than comparing two integers.

    The use pattern goes like this:

    1. get cached QualifiedName for triple (here one should care about contention):
      getQualifiedName(prefix, localName, namespace)
      getQualifiedName(key)

    2. Use (store) cached QualifiedName instance.

    3. To get parts use getPrefix(), getLocalName(), getNamespace(), getExtendedName() that are primitive getters and do not require synchronization.

    4. To test two qualified names for equality one compares two instance references: name1 == name2

    5. To test two expanded names (localname, and namespace) one compares two instance references:
      name1.getExtendedName() == name2.getExtendedName()

    qualified names is MUCH more expensive than comparing two integers

    I guess that reference comparisions are as fast as integer comparisions.

    There are problems with the NamePool because of synchronization - but your
    proposal doesn't solve that.
    You would still need to synchronize when adding names to the pool (and
    perhaps even when finding them.)

    There is only contention point during caching of a new value.
    In the sample I've used concurrent map with no explicit synchronization
    (http://www.nesterovsky-
    bros.com/download/saxon/NameCache.java.txt).

     
  • Michael Kay

    Michael Kay - 2010-04-11

    I don't see what benefits such a redesign would bring. It would involve
    storing object references in the TinyTree for node name and type annotation in
    place of integers, which would increase the size from 19 bytes per node to 27
    bytes per node (or more, because spare bits in the type annotation field are
    used for other things). There would be more synchronization overhead (this is
    a tricky one. Saxon doesn't currently synchronize on getting a value from the
    pool. This isn't provably safe, but it seems to work reliably. But this
    wouldn't be safe with implementations of ConcurrentHashMap, where a writer can
    expand the map dynamically, causing concurrent readers to fail.

    I do need to make improvements to the NamePool to reduce contention for high-
    throughput applications, but this seems to be a step backwards.

     
  • Vladimir Nesterovsky

    It would involve storing object references in the TinyTree for node name

    TinyTree stores names as
    int nameCode;
    but
    QualifiedName names;
    would occupy exactly the same space on 32 bits (and x2 on 64bit).

    There would be more synchronization overhead

    My point is that, there won't be more synchronization (probably less),
    as operations to get name components from qualified name are primitive
    getters,
    so they do not need to access pool at all.

     
  • Vladimir Nesterovsky

    TinyTree for node name and type annotation in place of integers,
    which would increase the size from 19 bytes per node to 27 bytes per node

    Regardless of what's written above, to reduce memory footprint you can:

    1. define NodeMetadata:

    NodeMetadata
    node type;
    qualified name;
    type annotation;

    1. Cache it.
    2. NodeInfo to refer to it.

    For TinyTree this would replace:
    public byte nodeKind;
    protected int nameCode;
    protected int typeCodeArray;
    with
    NodeMetadata metadatas;

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.