Re: [Vtd-xml-users] Random Access Proposal (take 2)

Hi Jimmy!

Well, the objective is to hold objects in Java data structures for fast 
access, so I opted for a simple class with minimal computation 
requirements. You can see that even equals() and hashCode() are 
optimized. It holds a single node because that's the basic building 
block. Everything else should be done with container classes, I think: 
let's not reinvent the wheel and pretend to do it better.
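
To make the idea concrete, here is a minimal sketch of a single-node 
holder used with a standard container. It is not the class from the 
patch: the name NodePosition, the int[] cursor state and the cached hash 
are all assumptions made for illustration, and nothing here touches 
VTDNav.

```java
import java.util.Arrays;

/** Hypothetical holder for one saved node position (illustration only). */
final class NodePosition {
    // Private copy of the cursor state, e.g. depth plus one index per level.
    private final int[] ctx;
    // The array never changes, so the hash is computed once up front.
    private final int hash;

    NodePosition(int[] cursorState) {
        this.ctx = cursorState.clone();
        this.hash = Arrays.hashCode(this.ctx);
    }

    /** Returns a copy so callers cannot modify the stored position. */
    int[] toCursorState() {
        return ctx.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NodePosition)) return false;
        NodePosition other = (NodePosition) o;
        // Cheap hash comparison first, full array comparison only if needed.
        return hash == other.hash && Arrays.equals(ctx, other.ctx);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}
```

With value equality in place, a plain HashMap<NodePosition, String> or 
HashSet<NodePosition> can index saved positions without any custom 
container.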

All this brings VTD some much-needed DOM-like functionality. I can now 
use VTD as I used DOM before. In fact that's what I've been doing for a 
few months now, with my own patched ximpleware-1.6, and I'm now sharing 
this functionality in a more polished way with you all.

The code is basically the same as for push and pop, that's right. The 
wheel was there, worked just fine, I took it and reused it in my own way 
:-) I gave a simple example because all real code I've produced using 
this heavily is both confidential and proprietary... so I can't really 
show that, but believe me, the speed improvements are huge sometimes.

Memory is cheap and effectively unlimited for most realistic scenarios 
where this library might be used; processor time is what's limited. I 
for one have several GB available, and try to optimize for speed. 
Sometimes I even disable GC and periodically kill VMs, since that gives 
considerable overall speed gains in some code I developed. Still, a 
SimpleContext can be reused, and that's good advice if you care about 
both space and GC.

I actually made that method you mentioned to recover the position from a 
single integer. It was kinda slow, to say the least, even with rather 
optimized search. That was my first try: been there, done that :-) I 
dumped the code somewhere as it was useless in practice.
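
To illustrate why recovering a position from a single integer costs CPU 
time, here is a toy sketch. It is neither the code I dumped nor VTDNav's 
internals: it assumes a hypothetical table startsByDepth holding, for 
each depth, the sorted token indices at which elements of that depth 
start, and it assumes the target's depth is already known. The class and 
method names are invented for the example.

```java
import java.util.Arrays;

/** Toy model of rebuilding a full node path from one token index. */
final class PositionRecovery {

    /** Index of the last entry <= target in a sorted array, or -1 if none. */
    static int floorIndex(int[] sorted, int target) {
        int pos = Arrays.binarySearch(sorted, target);
        // When absent, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /**
     * Rebuilds the ancestor chain of the element that starts at `token`
     * and sits at `depth`. One binary search is needed per level, which
     * is the work a fully expanded context does not have to repeat.
     */
    static int[] recover(int[][] startsByDepth, int depth, int token) {
        int[] path = new int[depth + 1];
        for (int d = 0; d <= depth; d++) {
            int i = floorIndex(startsByDepth[d], token);
            if (i < 0) {
                throw new IllegalArgumentException("no element at depth " + d);
            }
            // The enclosing element at depth d is the last one that
            // started at or before the target token.
            path[d] = startsByDepth[d][i];
        }
        return path;
    }
}
```

Every call repeats one search per level, so the cost grows with both 
document size and nesting depth, whereas a saved full context is simply 
copied back.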

I think SimpleContext, as it is implemented in the last mail I sent, is 
the best option for random access as I see it, or at least is strongly 
pointing in the right direction.

Quite frankly I think all 3 options are excellent, and all should be 
part of ximpleware-2.1. They are different tools, each with a set of 
critical advantages over the others:

- pop/push
- NodeRecorder
- SimpleContext

Concerning VTDNav.setPosition(int nodeNumber), I think it's useless in 
practice, since it's way too slow for heavy usage. But it should 
eventually be there, just in case, perhaps with some performance 
warnings.

I hope this helps improve your wonderful tool, Jimmy. VTD can really 
deliver, if you can make it a bit more open and flexible. You can in 
fact aspire to kill DOM in the future. Perhaps one day an entire DOM 
API could be emulated and implemented on top of VTD, dynamically 
creating the required objects. For now this is only a dream, of course, 
but who knows?

Just my 2 euro-cents :-)

Jimmy Zhang wrote:
> Rodrigo, I went over your emails on this thread again and came up 
> with a few questions...
>
> 1. In one of the emails, you attached a class capable of storing a 
> single node position..
> I am wondering why store just one? why not more?
>
> 2.  In the code below, vn.setCtxFromNav and vn.setNavFromCtx seem to me
> equivalent to push() and pop
>
> // Or another size you like, but null works just fine
> myContext = new SimpleContext(null);
> while(ap.iterate()){
>    vn.setCtxFromNav(myContext);
>    // do something messy
>    vn.setNavFromCtx(myContext);
> }
>
>
> As to NodeRecorder, it is designed to be instantiated once and hold 
> many nodes...
> not to be instantiated 10 times for 10 nodes...
>
> As to why NodeRecorder doesn't use the node format as in push pop and 
> setCtx as in your patch... the main motivation is to conserve memory, 
> push pop and setCtx all use a fully expanded node representation which 
> can be quite big for a complex document
>
> NodeRecorder's internal format is more compact, but the 
> representation is variable in length...
>
> The most compact node representation is to just use the index value 
> which is always
> 32-bit per node... VTDNav can add a method that "recovers" the node 
> position from a
> single index value of the node...
>
> All in all, the above three options are typical trade-offs between 
> memory and computation
>
> * Pop push use the fully expanded node representation, which is 
> constant in length and doesn't require any extra computation
>
> * Node recorder's internal node representation is compacted a bit, but 
> is variable in length,
> and therefore can be accessed only sequentially
>
> * Using a single integer to represent a node is the most compact, but 
> requires some CPU cycles
> to "recover" the node position...
>
> I would like to know what you think of those options... and will take 
> the discussion forward from that point on...