From: Mark S. <ma...@Sc...> - 2007-03-01 03:51:48
Rodrigo Cunha wrote:
> Nah... given what i understand about the internal structure making it
> the new document root is not trivial, i think.
>
> But you could get away with:
>
> - A bookmark/position saving system, aka the (in)famous SimpleContext
>   class?...
>
> - A function evaluating:
>
>   boolean SimpleContext.isSunOf(Simplecontext ctx); //this should be
>   simple and efficient to implement

You would know better than me. I have a high-level requirement to be able to jump to a point and do XPath on it. It doesn't have to be the new document root; I just thought that might be easier. If your SimpleContext class will help me get there then I'm all for it.

Cheers.

--
http://www.ScheduleWorld.com/
Free Google Calendar synchronization with Outlook, Evolution, cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support.
From: Tatu S. <cow...@ya...> - 2007-03-01 04:27:23
--- Rodrigo Cunha <rn...@gm...> wrote:
> Nah... given what i understand about the internal
> structure making it
> the new document root is not trivial, i think.

I guess there are two different cases. One is using the node as the context node (which seems trivially easy), where it doesn't limit XPath from going up the hierarchy. The other would essentially force a sub-tree to be a new "sub-document" (so that XPath traversal would never go up past the sub-tree root).

The latter would be only a little bit more complicated, unless I'm missing something. I mean, the XPath processing code uses int indexes for traversing the different axes, and it should be able to limit its visibility to the sub-tree, which is just an int range (node-start-offset to next-sibling-offset-minus-1). That's one benefit of using consecutive int indexes: defining sub-trees is rather simple.

-+ Tatu +-
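A rough sketch of the int-range idea described above, for readers who want to see it spelled out. It assumes the caller already knows the index of the sub-tree root and the index of whatever node follows the sub-tree; the class name SubtreeRange and both parameters are hypothetical and are not part of VTD-XML.

```java
/** Hypothetical helper, not part of VTD-XML: a sub-tree as a closed range of node indexes. */
final class SubtreeRange {
    final int first; // index of the sub-tree root
    final int last;  // index of the last node inside the sub-tree

    /**
     * @param rootIndex        index of the sub-tree root
     * @param nextSiblingIndex index of the node that follows the sub-tree
     *                         (assumed to be supplied by the caller)
     */
    SubtreeRange(int rootIndex, int nextSiblingIndex) {
        if (rootIndex < 0 || nextSiblingIndex <= rootIndex) {
            throw new IllegalArgumentException("empty or negative range");
        }
        this.first = rootIndex;
        this.last = nextSiblingIndex - 1;
    }

    /** True if the node at the given index lies inside this sub-tree. */
    boolean contains(int index) {
        return index >= first && index <= last;
    }

    /** True if the other range is this sub-tree or is nested inside it. */
    boolean contains(SubtreeRange other) {
        return other.first >= first && other.last <= last;
    }
}
```

With consecutive indexes, a containment test of this kind is two comparisons, which is also roughly what a "is this bookmark inside that one" check would need. A node without a next sibling would need its end offset derived some other way (for example from an ancestor's next sibling or the end of the document); that case is left out here.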
From: Jimmy Z. <cra...@co...> - 2007-03-01 04:29:25
I need to give that some thought...

----- Original Message -----
From: "Rodrigo Cunha" <rn...@gm...>
To: "Mark Swanson" <ma...@Sc...>
Cc: <vtd...@li...>
Sent: Wednesday, February 28, 2007 7:20 PM
Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)

> Nah... given what i understand about the internal structure making it
> the new document root is not trivial, i think.
>
> But you could get away with:
>
> - A bookmark/position saving system, aka the (in)famous SimpleContext
>   class?...
>
> - A function evaluating:
>
>   boolean SimpleContext.isSunOf(Simplecontext ctx); //this should be
>   simple and efficient to implement
>
> Regards,
>
> Rodrigo
>
> Mark Swanson wrote:
>> Rodrigo Cunha wrote:
>>> For a SimpleContext, or something similar:
>>>
>>> The comparison operator could just compare the arrays containing the
>>> internal state of Navigator.
>>>
>>> The hashcode would be the XOR of all elements in the array, that
>>> implements java hashcode and equals in a compatible way with the
>>> assumed rules in java libraries.
>>>
>>> Jimmy, I think my "SimpleContext" or whatever you want to call it, if
>>> properly implemented and perhaps with a few extra functions could
>>> solve all the problems of random access.
>>>
>>> I think we just want something with the functionality of push() and
>>> pop() but that can be extracted and kept outside, memorizing the
>>> equivalent of a single stack position, and perhaps with a bit more
>>> functionality.
>>
>> It will be interesting to see how much / little info we can get away
>> with saving and still be able to start an XPath expression from an
>> arbitrary point.
>>
>> Important: it may make the implementation easier if we make the
>> arbitrary starting point the new document root - just for xpath
>> evaluation purposes. I think this is perfect, actually. I don't want
>> anything except for the node and its children - that's why I'm
>> explicitly pointing there in the first place.
>>
>> F.E.
>> (forgive the illegal simplified syntax..)
>>
>> aaa
>> bbb
>> ccc <- Index saved, new root for xpath eval.
>> eee
>> fff
>> /ccc
>> dd
>> ...
>>
>> I'd want to say something like this:
>>
>> vtdNav.toElement(cccIndex); // reset cursor to ccc
>> ap.selectXPath("/ccc/*")
>>
>> Cheers.
>>
>
> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share
> your opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Vtd-xml-users mailing list
> Vtd...@li...
> https://lists.sourceforge.net/lists/listinfo/vtd-xml-users
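The quoted suggestion (compare the arrays holding the navigator state, and use the XOR of the elements as the hash code) can be sketched roughly as follows. This is not the SimpleContext class from the patch, which is not reproduced in this thread; ContextSnapshot and its int[] field are stand-in names, and the contents of the array are an assumption.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a saved navigator position; not the patch's SimpleContext. */
final class ContextSnapshot {
    private final int[] state; // copy of whatever ints describe the cursor position

    ContextSnapshot(int[] navigatorState) {
        // Defensive copy so later cursor movement cannot change the snapshot.
        this.state = navigatorState.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ContextSnapshot)) return false;
        // Two snapshots are equal when every element of the state matches.
        return Arrays.equals(state, ((ContextSnapshot) o).state);
    }

    @Override
    public int hashCode() {
        // XOR of all elements: equal arrays always give equal hashes,
        // which is what the equals/hashCode contract requires.
        int h = 0;
        for (int v : state) {
            h ^= v;
        }
        return h;
    }
}
```

One design note: XOR ignores element order, so two different positions whose arrays are permutations of each other collide. That is still correct for HashMap use, only potentially slower; Arrays.hashCode(state) would be an order-sensitive alternative.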
From: Mark S. <ma...@Sc...> - 2007-03-01 02:46:32
Tatu Saloranta wrote:
> --- Mark Swanson <ma...@Sc...> wrote:
>
> ...
>> It would be most helpful to me if I could index arbitrary element
>> indexes and start an XPath query from one of these indexes. I would
>> cache these indexes in a Map with key: some unique ID, value: some
>> sort of vtd-xml node index.
>
> Given that VTD-XML indices are, well, ints, would this
> be anything more than a kind of Map<String,int>? (or,
> a stack thereof).
> That seems like a simple thing to build even outside
> of VTD-XML itself?
>
> Just curious,

Good point. If Jimmy can rig it so I can just execute arbitrary XPath expressions starting from a specific node, this might be too easy. I'm a little worried about starting an XPath evaluation from a specific spot, though. Hopefully Jimmy has some insight on that.

Cheers.
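A minimal sketch of the "Map<String,int>" idea discussed above. Java generics cannot hold a primitive int, so the map below stores boxed Integer values. BookmarkTable, remember and lookup are hypothetical names, and the step of moving a navigator back to a stored index is deliberately left out, since the thread has not settled how that would work.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical bookmark cache: unique ID -> node index. Not part of VTD-XML. */
final class BookmarkTable {
    private final Map<String, Integer> indexById = new HashMap<>();

    /** Stores (or overwrites) the node index remembered under the given ID. */
    void remember(String id, int nodeIndex) {
        indexById.put(id, nodeIndex);
    }

    /** Returns the stored index, or an empty result if the ID was never stored. */
    OptionalInt lookup(String id) {
        Integer idx = indexById.get(id);
        return idx == null ? OptionalInt.empty() : OptionalInt.of(idx);
    }
}
```

Usage would look something like table.remember("order-42", someIndex) while scanning the document, then table.lookup("order-42") later to get the index from which an XPath evaluation should start.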
From: Jimmy Z. <cra...@co...> - 2007-02-28 18:51:33
Rodrigo,

Can you explain the hash table a bit more? I am especially interested in when it would be helpful to you. I would also like to know whether you think there are alternatives.

From my perspective, offering something like what you described is definitely what we strive to achieve. However, when we design or enhance an API, everything we add can potentially be a double-edged sword. Take NodeRecorder as an example: it offers some degree of random access, but if not used properly it will consume a lot of memory.

Also, I have been a bit busy lately with some other stuff. Let me get to some of your old emails in the next few days and get back to you...

----- Original Message -----
From: "Mark Swanson" <ma...@Sc...>
To: <vtd...@li...>
Sent: Wednesday, February 28, 2007 8:11 AM
Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)

> Rodrigo Cunha wrote:
>> I understand NodeRecorder was not intended to be kept in large numbers,
>> but I think that should exactly be the idea of a random access API: a
>> lightweight way of keeping a bunch of bookmarks in the datastructure the
>> programmer wants, not in the structure we want, or something...
>>
>> Your API is nice for somewhat serial processing, not for true random
>> access, using pre-build hash tables, for example, or trees, or whatever.
>> I could build a wrapper around NodeRecorder implementing a simpler API,
>> but that would be really clumsy.
>>
>> My API, while incomplete, is much more simple, and flexible also... it's
>> also rather light. I would like to learn about other opinions on the
>> subject, since we are probably both too used to our way of doing things
>> to be impartial.
>
> It would be most helpful to me if I could index arbitrary element
> indexes and start an XPath query from one of these indexes. I would
> cache these indexes in a Map with key: some unique ID, value: some sort
> of vtd-xml node index.
>
> For most of the applications I use XML for, this would be the only way
> to get acceptable performance. Ultimately, without this I would not be
> able to consider using vtd-xml for these apps and I would be forced to
> use an xml - Object mapping tool.
>
> I've been using and helping maintain/fix a number of XML - Object
> mapping tools over the years. It's been an interesting area of study for
> me. Please free me from the insufferable weight of those chains :-)
>
> Cheers.
From: Jimmy Z. <cra...@co...> - 2007-02-28 18:58:38
Mark, I believe the first thing you described, which is to index a node position from which an XPath evaluation can start, is already possible; a custom hack should be available...

As to the node ID and hash map part, what do you think about using the index of a VTD record? It is always unique within a given XML document.
From: Mark S. <ma...@Sc...> - 2007-03-01 02:44:24
Jimmy Zhang wrote:
> Mark, I believe the first thing your described, which is to index
> a node position from which an XPath eval can start, is already possible
> a custom hack should available...
> As to node id and hashmap part, what do you think about the index
> of a VTD record, it is always unique within a given XML doc,what
> do you think of that?

I think using the VTD record index is perfect.

Requirement: I want to use different XPaths.

Cheers.
From: Jimmy Z. <cra...@co...> - 2007-02-28 22:14:40
Mark, when you save a node position, then later want to start the xpath from that node position, are you talking about the same xpath (one which you use to locate the node in the first place) or a different one?

Jimmy
From: Rodrigo C. <rn...@gm...> - 2007-02-27 19:40:01
Attachments:
VTDNav.java
SimpleContext.java
Hi!

Here are the ximpleware_2.0 files I've changed to make my very simple random navigation API available.

A very simple and useless example of usage:

    SimpleContext myContext = new SimpleContext(null); // Or another size you like, but null works just fine
    while (ap.iterate()) {
        vn.setCtxFromNav(myContext);  // save the current cursor position
        // do something messy
        vn.setNavFromCtx(myContext);  // restore the saved cursor position
    }

Of course, a nicer example might include keeping large numbers of SimpleContexts in hash tables, trees, etc.

Jimmy Zhang wrote:
> The latest benchmark reports (on Version 2.0) is now live
> at
> http://vtd-xml.sf.net/benchmark1.html
>
> The corresponding benchmark code also was uploaded
> to the sourceforge at
>
> http://sourceforge.net/project/showfiles.php?group_id=110612
From: Jimmy Z. <cra...@co...> - 2007-02-21 04:20:00
The C, C# and Java versions, both light and full, are now released. In a few days the benchmark code will also be released; at that time, new benchmark reports will come out.

If you downloaded 2.0 yesterday, I advise you to do it again, as today there is an addition to the example directory containing code on how to use NodeRecorder. For those using the C version, there is a parseFile method that will make the coding a lot easier.

There will also be articles coming out soon concerning the XPath design and uses, as well as the new indexing feature.

----- Original Message -----
From: "Rodrigo Cunha" <rn...@gm...>
Cc: <vtd...@li...>
Sent: Tuesday, February 20, 2007 12:10 PM
Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)

> Ok, so I see a new NodeRecorder.
>
> I didn't see the internals of NodeRecorder yet, but I presume it's
> lightweight, so I can instantiate a few thousands without major trouble
> and keep them in my internal structures, right?
>
> I think you should introduce two new methods into NodeRecorder:
>
> VTDNav NodeRecorder.getNav();
>
> int NodeRecorder.getPositionsCount();
>
> Thanks,
>
> Rodrigo
>
> Jimmy Zhang wrote:
>> the source forge shell service is down, the document for 2.0 is at
>> http://www.ximpleware.com/doc/
>> ----- Original Message ----- From: "Rodrigo Cunha" <rn...@gm...>
>> To: <vtd...@li...>
>> Sent: Thursday, February 15, 2007 3:19 AM
>> Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)
>>
>>> Well, just some ideas concerning what I think should be the nature of a
>>> "context":
>>>
>>> - As light as possible to generate, manipulate and access (so just use a
>>>   simple context with minimum clutter).
>>> - Comparable.
>>> - Hashable efficiently (good and fast dispersion function).
>>> - Possible to associate with VTDNav (so contains a pointer to VTDNav).
>>> - Usable in another VTDNav (that's a tricky one, and unsafe, but makes
>>>   sense if you have various equal VTDNavs and a RMI-based system, so it
>>>   should be possible despite perhaps including dire warnings in the
>>>   documentation).
>>>
>>> Jimmy Zhang wrote:
>>>
>>> Yes, will try, but then again, there will always be a 2.1 :)
From: Jimmy Z. <cra...@co...> - 2007-03-02 21:20:23
Rodrigo, I went over your emails on this thread again and came up with a few questions...

1. In one of the emails, you attached a class capable of storing a single node position. I am wondering why store just one? Why not more?

2. In the code below, vn.setCtxFromNav and vn.setNavFromCtx seem to me equivalent to push() and pop():

    myContext = new SimpleContext(null); //Or another size you like, but null works just fine
    while(ap.iterate()){
        vn.setCtxFromNav(myContext);
        // do something messy
        vn.setNavFromCtx(myContext);
    }

As to NodeRecorder, it is designed to be instantiated once and hold many nodes, not to be instantiated 10 times for 10 nodes.

As to why NodeRecorder doesn't use the same node format as push, pop and setCtx in your patch: the main motivation is to conserve memory. Push, pop and setCtx all use the fully expanded node representation, which can be quite big for a complex document.

NodeRecorder's internal format is more compact, but the representation is variable in length.

The most compact node representation is to just use the index value, which is always 32 bits per node. VTDNav can add a method that "recovers" the node position from a single index value of the node.

All in all, the above three options are typical trade-offs between memory and computation:

* Push and pop use the fully expanded node representation, which is constant in length and doesn't require any extra computation.

* NodeRecorder's internal node representation is compacted a bit, but is variable in length, and therefore can be accessed only sequentially.

* Using a single integer to represent a node is the most compact, but requires some CPU cycles to "recover" the node position.

I would like to know what you think of those options, and will take the discussion forward from that point on...
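To illustrate why a variable-length representation "can be accessed only sequentially", here is a small sketch. It is an illustration only and does not reflect NodeRecorder's real internal format, which is not described in this thread; PackedRecords and its [length, values...] layout are assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not NodeRecorder's real internal format. */
final class PackedRecords {
    private int[] buf = new int[16];
    private int size = 0;  // number of ints used in buf
    private int count = 0; // number of records stored

    /** Appends one record laid out as [length, v0, v1, ...]. */
    void append(int... values) {
        int needed = size + 1 + values.length;
        if (needed > buf.length) {
            buf = Arrays.copyOf(buf, Math.max(needed, buf.length * 2));
        }
        buf[size++] = values.length;
        System.arraycopy(values, 0, buf, size, values.length);
        size += values.length;
        count++;
    }

    /** Reaching record k means stepping over the k records stored before it. */
    int[] get(int k) {
        if (k < 0 || k >= count) {
            throw new IndexOutOfBoundsException("record " + k);
        }
        int pos = 0;
        for (int i = 0; i < k; i++) {
            pos += 1 + buf[pos]; // skip the length word plus the payload
        }
        return Arrays.copyOfRange(buf, pos + 1, pos + 1 + buf[pos]);
    }

    int count() {
        return count;
    }
}
```

Because each record's start depends on the lengths of all earlier records, get(k) costs O(k). A fixed-length layout, or a single int per node, makes the k-th entry addressable directly; that is the memory versus computation trade-off the three options above describe.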
From: Rodrigo C. <rn...@gm...> - 2007-03-03 00:46:57
Hi Jimmy!

Well, the objective is to hold objects in Java data structures for fast access, so I opted for a simple class with minimal computation requirements. You can see that even equals() and hashCode() are optimized. It holds a single node because that's the basic building block. All else should be done with container classes, I think: let's not reinvent the wheel and pretend to do it better.

All this brings VTD a much needed DOM-like functionality. I can now use VTD as I used DOM before. In fact that's what I've been doing for a few months now, with my own patched ximpleware-1.6, and I'm now sharing this functionality in a more polished way with you all.

The code is basically the same as for push and pop, that's right. The wheel was there and worked just fine; I took it and reused it in my own way :-) I gave a simple example because all the real code I've produced using this heavily is both confidential and proprietary, so I can't really show it. But believe me, the speed improvements are huge sometimes.

Memory is cheap and unlimited for most realistic scenarios where this library might be used; processor time is limited. I for one have several GB available, and try to optimize for speed. Sometimes I even disable GC and periodically kill VMs, since that gives considerable overall speed gains in some code I developed. Still, a SimpleContext can be reused, and that's good advice if you care about both space and GC.

I actually wrote the method you mentioned to recover the position from a single integer. It was kinda slow, to say the least, even with a rather optimized search. That was my first try: been there, done that :-) I dumped the code somewhere as it was useless in practice.

I think SimpleContext, as implemented in the last mail I sent, is the best option for random access as I see it, or at least is strongly pointing in the right direction.

Quite frankly, I think all 3 options are excellent, and all should be part of ximpleware-2.1. They are all excellent and different tools, each with a set of critical advantages over the others:

- pop/push
- NodeRecorder
- SimpleContext

Concerning VTDNav.setPosition(int nodeNumber), I think it's useless, since it's way too slow for heavy usage. But it eventually should be there, just in case, perhaps with some performance warnings.

I hope this helps improve your wonderful tool, Jimmy. VTD can really deliver, if you can make it a bit more open and flexible. You can in fact aspire to kill DOM in the future. Perhaps an entire DOM API could one day be emulated and implemented on top of VTD, dynamically creating the required objects. For now this is only a dream, of course, but who knows?

Just my 2 euro-cents :-)
From: Jimmy Z. <cra...@co...> - 2007-03-03 21:00:41
VTD-XML certainly still has a lot of growing up left...

One old observation from the early days of VTD-XML: memory usage has strong performance implications as well. DOM's excessive memory usage directly contributes to its slow performance, while VTD-XML's memory strategy is largely responsible for its parsing performance. So VTD's mentality has always been to reduce memory usage whenever possible; that will, one way or another, lead to better performance.

As to setPosition(int i)'s performance, did your implementation directly manipulate the Location Cache tables? I will take a stab at it to see how fast it can get.

Concerning the equality comparison of two node objects, I think that the only thing to check is the currentIndex value of the cursor location, assuming the same VTDNav instance. The hashCode computation and equality can basically use that value.
From: Rodrigo C. <rn...@gm...> - 2007-03-04 16:30:14
|
Hi!

My implementation of setPosition did a binary search for nodes, I think... I don't really remember well. The performance was dismal, so the implementation was probably wrong... oops! An efficient implementation would be great, although I think that, depending on the situation, using SimpleContext could still be better.

Concerning hashCode and equals, yep, you're right... but I was thinking that perhaps in the future a pointer to the related VTDNav could be kept in SimpleContext, so that nodes from multiple documents could be mixed in the same data structure... but yeah, I think we could change it, even though the new implementation would only be consistent within the context of a single VTDNav. That specifically should be put in the documentation, of course. OK, let's keep it simple, since a user can always derive a class from SimpleContext that implements the multiple-VTDNav functionality. Basically you're 99% right, with some caveats that should perhaps be documented, so let's do it your way :-)

Memory usage is not always critical, sometimes performance can be gained from memory, sometimes that's a false argument. In this case an option should be given between fast position recall and small memory usage. In fact they are both very small compared with DOM.

My bet is that VTD can be better than DOM, even API-wise, and still faster than SAX, if we care to enhance the API.

Jimmy Zhang wrote:
> VTD-XML certainly still has a lot of growing up left...
> One old observation from early days of VTD-XML:
> Memory usage has strong performance implications as well...
> DOM's excessive memory usage directly contributes to its slow
> performance, VTD-XML's memory strategy is largely responsible for its
> parsing performance...so VTD's mentality has always been to: reduce
> memory usage whenever possible, that will, one way
> or another, lead to better performance...
> as for setPosition(int i)'s performance, did your implementation directly
> manipulate Location Cache tables? I will take a stab at it to see
> how fast it can get
> Concerning the equality comparison of two node objects, I think that
> the only condition to check is the currentIndex value of the
> cursor location, assuming the same VTDNav instance...
> hashCode computation and equality can basically use that value... |
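The equality rule discussed in this exchange (compare only the cursor's currentIndex, assuming both values come from the same VTDNav instance) could look roughly like the sketch below. `NodeKey` and `NodeKeyDemo` are hypothetical names used only for illustration; they are not classes from VTD-XML or from Rodrigo's patch, and the index is passed in as a plain int rather than read from a navigator.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookmark key; not part of VTD-XML. */
final class NodeKey {
    // Token index of the cursor at the moment the key was created.
    private final int currentIndex;

    NodeKey(int currentIndex) {
        this.currentIndex = currentIndex;
    }

    int index() {
        return currentIndex;
    }

    @Override
    public boolean equals(Object other) {
        // Only meaningful when both keys were taken from the same navigator:
        // two different documents can reuse the same index values.
        return other instanceof NodeKey
                && ((NodeKey) other).currentIndex == currentIndex;
    }

    @Override
    public int hashCode() {
        // The index is already a well-distributed int, so use it directly.
        return currentIndex;
    }
}

class NodeKeyDemo {
    public static void main(String[] args) {
        Map<NodeKey, String> labels = new HashMap<>();
        labels.put(new NodeKey(42), "invoice");
        // A second key built from the same index finds the same entry.
        System.out.println(labels.get(new NodeKey(42))); // prints "invoice"
    }
}
```

As noted in the thread, mixing nodes from several documents in one map would need an extra field identifying the navigator, which this sketch deliberately leaves out.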
From: Mark S. <ma...@Sc...> - 2007-03-04 18:35:47
|
> Memory usage is not allways critical, sometimes performance can be > gained from memory, sometimes that's a false argument. In this case an Just my 2cents: significant performance increases can be gained by structuring your algorithms to keep your memory accesses within a column of memory as much as possible. DRAM cycle time penalties are huge - unless you can keep the data in the CPU's cache. This is one of those tricks that column databases like K and Vertica use to stomp traditional databases with for n-dimensional queries. I actually wrote an n-dimensional column-based database for a company a number of years ago. Cheers. -- http://www.ScheduleWorld.com/ Free Google Calendar synchronization with Outlook, Evolution, cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support. |
From: Jimmy Z. <cra...@co...> - 2007-03-04 19:04:29
|
Totally! DRAM access time is 100~200 cycles; packing everything together will make cache misses far less frequent and boost both parsing and navigation performance!

----- Original Message -----
From: "Mark Swanson" <ma...@Sc...>
To: "Rodrigo Cunha" <rn...@gm...>
Cc: "Jimmy Zhang" <cra...@co...>; <vtd...@li...>
Sent: Sunday, March 04, 2007 10:35 AM
Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)

>> Memory usage is not always critical, sometimes performance can be gained
>> from memory, sometimes that's a false argument. In this case an
>
> Just my 2cents: significant performance increases can be gained by
> structuring your algorithms to keep your memory accesses within a column
> of memory as much as possible. DRAM cycle time penalties are huge - unless
> you can keep the data in the CPU's cache.
> This is one of those tricks that column databases like K and Vertica use
> to stomp traditional databases with for n-dimensional queries. I actually
> wrote an n-dimensional column-based database for a company a number of
> years ago.
>
> Cheers. |
From: Rodrigo C. <rn...@gm...> - 2007-03-05 03:00:00
|
Yes, all that is correct, and that's why I told you we should have choices. The API should provide choices and flexibility. Decent programmers will use the right tool, not the wrong tool. Documentation should also help whenever possible. BTW Mark, does my API solve your random-access problem?... Jimmy Zhang wrote: > Totally! DRAM access time is 100~200 cycles, packing everything together > will make cache miss far less often and boost both parsing and > navigation perfomrnace! > ----- Original Message ----- From: "Mark Swanson" > <ma...@Sc...> > To: "Rodrigo Cunha" <rn...@gm...> > Cc: "Jimmy Zhang" <cra...@co...>; > <vtd...@li...> > Sent: Sunday, March 04, 2007 10:35 AM > Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2) > > >>> Memory usage is not allways critical, sometimes performance can be >>> gained from memory, sometimes that's a false argument. In this case an >> >> Just my 2cents: significant performance increases can be gained by >> structuring your algorithms to keep your memory accesses within a >> column of memory as much as possible. DRAM cycle time penalties are >> huge - unless you can keep the data in the CPU's cache. >> This is one of those tricks that column databases like K and Vertica >> use to stomp traditional databases with for n-dimensional queries. I >> actually wrote an n-dimensional column-based database for a company a >> number of years ago. >> >> Cheers. >> >> -- >> http://www.ScheduleWorld.com/ >> Free Google Calendar synchronization with Outlook, Evolution, >> cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, >> Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! >> WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support. >> > > > |
From: Mark S. <ma...@Sc...> - 2007-03-05 03:45:09
|
Rodrigo Cunha wrote: > Yes, all that is correct, and that's why I told you we should have > choices. The API should provide choices and flexibility. Decent > programmers will use the right tool, not the wrong tool. Documentation > should also help whenever possible. > > BTW Mark, does my API solve your random-access problem?... It looks like it will. I'm in scramble / crunch time atm with the new Thunderbird/Lightning extension and the best I can do is participate via a few emails. Sorry I can't do more with this fascinating stuff atm. Cheers. -- http://www.ScheduleWorld.com/ Free Google Calendar synchronization with Outlook, Evolution, cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support. |
From: Tatu S. <cow...@ya...> - 2007-03-05 05:45:29
|
--- Rodrigo Cunha <rn...@gm...> wrote: > Yes, all that is correct, and that's why I told you > we should have > choices. The API should provide choices and > flexibility. Decent > programmers will use the right tool, not the wrong > tool. Documentation > should also help whenever possible. Actually, sometimes choice is good; oftentimes too many (and specifically, irrelevant) choices just confuse, and few developers use them. Although I used to think most things should be configurable, I have started to question that -- unless _measured_ performance impact is significant, it may be best to just choose reasonable defaults. In the end, vast majority of users/devs just use whatever tools default to: and those who do not, generally prefer limited set of actually relevant things to configure. Or at the very least, make things configurable when they are requested to be configurable, and not try to speculate wildly about what might be useful. Anyway, just my opinions based on other projects, -+ Tatu +- |
From: Rodrigo C. <rn...@gm...> - 2007-03-05 11:49:24
|
I think we are already drifting away from the initial context, so let's do a reset :-)

Let's get back to basics: the API, as it exists now, is unusable for what I want. And I don't want that much either, just to be able to keep nodes in my own data structures in a way that is easy and efficient to access. I actually solved the problem back in ximpleware-1.6 and shared the patch then; I'm sharing it again now, since it seems the problem keeps coming up for others as well.

I'm not saying this is the correct implementation, I'm saying this is the correct API, or close to it. The implementation is not that bad either, but perhaps it can be improved. As it is, it's working just fine, and it is so obvious that I'm still puzzled as to why it wasn't part of ximpleware since version 1.0.

Remember, the objective is "being able to keep a bunch of nodes in data structures in an easy and efficient way to access", with no fuss attached. Just something like "give me a bookmark for node x" and "take this bookmark and go back there". |
From: Jimmy Z. <cra...@co...> - 2007-03-05 18:43:36
|
Rodrigo, in one of the early emails you mentioned that your app sometimes keeps thousands of those nodes (or contexts). How do you tell them apart? There has got to be some data structure to store that context info, right?

What if it is implemented such that:

1. those contexts/nodes are stored in a linear buffer, which is more space efficient than allocating individual objects
2. you can address those nodes by using an integer, sort of like a VTD record...

Would you consider the design outlined above useful to your app?

Jimmy

----- Original Message -----
From: "Rodrigo Cunha" <rn...@gm...>
To: <vtd...@li...>
Sent: Monday, March 05, 2007 3:49 AM
Subject: Re: [Vtd-xml-users] Random Access Proposal (take 2)

> I think we are already getting out of the initial context, so let's do
> reset :-)
>
> Let's get back to basics: the API, as it exists now, is unusable for
> what I want. And I don't want that much either, just to be able to keep
> nodes in my own data structures in an easy and efficient way to access.
> I actually solved the problem since ximpleware-1.6, I shared the patch
> then, I'm sharing it now again, since it seems the problem is recurrent
> with others also.
>
> I'm not saying this is the correct implementation, I'm saying this is
> the correct API, or close to that. The implementation is not that bad
> either, but perhaps can be improved. As it is it's working just fine,
> and is so obvious I'm still puzzled at why wasn't it part of ximpleware
> since version 1.0.
>
> Remember, the objective is "being able to keep a bunch of nodes in data
> structures in an easy and efficient way to access", with no fuss
> attached. Just something like "give me node x bookmark", "take this
> bookmark and go back there". |
From: Rodrigo C. <rn...@gm...> - 2007-03-05 23:39:32
|
I keep them in a HashMap, for example, or in a TreeMap, etc... rarely in a simple list. The key is generally a string that I would otherwise need to extract from the node in more or less convoluted ways during a sequential search. The node itself contains a lot more info that I only want to retrieve later if it's needed, or else I would cache the info itself :-D

If instead of keeping a context I can keep a simple integer and then tell a VTDNav "hey you, take this integer you told me to keep and go to the node you bookmarked", I would say it's OK, provided the "go to the node" operation is fast.

So, you're suggesting an API that would work like this:

    RandomNodeRecorder xpto = new RandomNodeRecorder(navigator);
    // xpto is the bookmark keeper organized in a way Jimmy likes :-)
    int mark = xpto.keepPos();
    /* do some stuff here */
    boolean ok = xpto.fetchPos(mark); // back to the bookmarked node
    xpto.del(mark); // don't need the mark any longer

I still fail to understand why a context shouldn't be kept outside the structures you seem to like :-)

Memory is cheap. For example, if I keep a hash of NEs, and each NE occupies a few KB itself, it's irrelevant whether I'm going to use a few more bytes for each NE.

I'm not suggesting one should keep large structures containing every single node in the document, OK? But random access to a cached node must be fast. I emphasize: fast random access to cached nodes.

As far as I understand, the SimpleContext structure grows by 4 bytes for each depth level, so a deeper node consumes more space, right? So a really deep node, let's say at level 10, will consume 40 extra bytes plus the base consumption... that's 48 bytes, quite small, unless the node is small and irrelevant.

--
Rodrigo

Jimmy Zhang wrote:
> Rodrigo, in one of the early emails you mentioned that
> your app sometimes keeps thousands of those nodes (or context),
> how do you tell them apart? There got to be some data
> structures to store those context info, right?
> What if it is implemented such that
> 1. those contexts/nodes are stored in a linear buffer, which
> is more space efficient than allocating individual objects
> 2. You can address those nodes by using an integer, sort
> like a VTD record...
> would you consider the design outlined above can be useful
> to your app?
>
> Jimmy |
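To make the "linear buffer addressed by an integer" idea more concrete, here is a minimal sketch of one way it might be laid out. It is not Jimmy's design or any VTD-XML class: `PackedBookmarks` is a hypothetical name, the method names merely echo the keepPos/fetchPos/del calls from the message above, and the navigator's cursor is modelled as a plain fixed-width `int[]` context that the caller passes in.

```java
import java.util.Arrays;

/** Hypothetical bookmark store; the cursor state is modelled as a fixed-width int[]. */
final class PackedBookmarks {
    private final int slotWidth;          // ints per saved context
    private int[] buffer;                 // all saved contexts, stored back to back
    private int[] freeSlots = new int[8]; // handles released by del(), ready for reuse
    private int freeCount = 0;
    private int used = 0;                 // number of slots handed out so far

    PackedBookmarks(int slotWidth) {
        if (slotWidth <= 0) {
            throw new IllegalArgumentException("slotWidth must be positive");
        }
        this.slotWidth = slotWidth;
        this.buffer = new int[slotWidth * 16];
    }

    /** Copies the given context into the buffer and returns an integer handle. */
    int keepPos(int[] context) {
        if (context.length != slotWidth) {
            throw new IllegalArgumentException("context must have " + slotWidth + " ints");
        }
        int slot;
        if (freeCount > 0) {
            slot = freeSlots[--freeCount];            // reuse a released slot
        } else {
            slot = used++;
            if ((slot + 1) * slotWidth > buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2);
            }
        }
        System.arraycopy(context, 0, buffer, slot * slotWidth, slotWidth);
        return slot;
    }

    /** Copies the saved context for the handle back into the caller's array. */
    void fetchPos(int mark, int[] context) {
        System.arraycopy(buffer, mark * slotWidth, context, 0, slotWidth);
    }

    /** Releases a handle; this sketch does not guard against releasing it twice. */
    void del(int mark) {
        if (freeCount == freeSlots.length) {
            freeSlots = Arrays.copyOf(freeSlots, freeSlots.length * 2);
        }
        freeSlots[freeCount++] = mark;
    }
}
```

Both saving and restoring are a single array copy, so recall stays constant-time, while the per-bookmark cost drops to the raw ints plus one integer handle held by the caller (for example as a value in a HashMap). The price is a fixed slot width, which wastes space for shallow nodes.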
From: Mark S. <ma...@Sc...> - 2007-03-06 03:04:19
|
Rodrigo Cunha wrote:
> I keep them in a HashMap, for example, or in a TreeMap, etc... rarely on
> a simple list.
> The key is generally a string I would need to get in more or less
> convoluted ways from the node during a sequential search. The node
> itself contains a lot more info I only want to retrieve in the future if
> it's needed, or else I would cache the info itself :-D

Just an FYI: I have cases where the key is an Integer, and cases where it's a string.

> If instead of keeping a context I can keep a simple integer and then
> order a VTDNav "hey you, get this integer you told me to keep and go to
> node you bookmarked" I would say it's ok, if the operation "get to the
> node" is fast.
>
> So, you're suggesting an API that would work like this:
>
> RandomNodeRecorder xpto = new RandomNodeRecorder(navigator);
> // xpto is the bookmark keeper organized in a way Jimmy likes :-)
> int mark = xpto.keepPos();
> /* do some stuff here */
> boolean ok = xpto.fetchPos(mark); // back to the bookmarked node
> xpto.del(mark); // don't need the mark any longer
>
> I still fail to understand why shouldn't a context be kept outside the
> structures you seem to like :-)

Well, I'd be interested in knowing the time/space trade-offs for both. For one specific case, I could have an int as the key, and an int as the mark/vtd-node. Both ints could be native ints with fastutil. Maybe the CPU overhead is much smaller with SimpleContext though... it would be nice to see what Jimmy has in mind (the details).

> Memory is cheap, and for example, if I keep a hash of NEs, and each NE
> occupies a few KB itself, it's irrelevant if I'm going to use a few more
> bytes for each NE.
>
> I'm not suggesting one should keep large structures containing any
> single node in the document, ok? But the random access to a cached node
> must be fast. I emphasize: fast random access to cached nodes.

+1

> As far as I understand the SimpleContext structure grows 4 bytes for
> each depth level, so a deeper node consumes more space, right? So a
> really deep node, let's say, at level 10, will consume 40 extra bytes,
> plus the base consumption... that's 48 bytes, quite small, unless the
> node is small and irrelevant.

It is small, but I'm looking at about 6k indexes to cache per document, and as many documents cached as possible. Over-guessing at 100 bytes per SimpleContext (total) would mean 600KB of SimpleContext objects per document. I understand my use cases deal with larger than normal documents, but that just means I have so much more to gain from random access.

Cheers. |
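The figures traded in this exchange can be checked with a few lines of arithmetic. The sketch below only restates numbers from the thread: 4 bytes per depth level, an 8-byte base (inferred from "40 extra bytes ... that's 48 bytes"), about 6,000 bookmarks per document, and a deliberately generous 100 bytes per SimpleContext. `BookmarkMemoryEstimate` is a hypothetical helper, and JVM object headers and map overhead are not modelled.

```java
/** Hypothetical helper that restates the memory figures quoted in the thread. */
final class BookmarkMemoryEstimate {
    /** Payload size of one context: an assumed 8-byte base plus 4 bytes per depth level. */
    static long payloadBytes(int depth) {
        return 8L + 4L * depth;
    }

    /** Total size of a set of bookmarks at a given per-bookmark cost. */
    static long totalBytes(long bookmarks, long bytesPerBookmark) {
        return bookmarks * bytesPerBookmark;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes(10));      // 48: a node at depth 10
        System.out.println(totalBytes(6000, 100)); // 600000: full contexts, over-guessed
        System.out.println(totalBytes(6000, 4));   // 24000: one 32-bit index per node
    }
}
```

The gap between the last two lines (roughly 600 KB versus 24 KB per document) is the space side of the trade-off; the time side depends on how quickly a bare index can be turned back into a cursor position, which the thread leaves open.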
From: Jimmy Z. <cra...@co...> - 2007-03-06 04:45:01
|
> Just a FYI: I have cases where the key is an Integer, and cases where > it's a string. By Integer is it a java class? or just a primitive data type? Maybe I can modify Rodrigo's class and put it into CVS so you guys can use immediately... however, I can't guarantee that it will be included in the next release... Would that work? > >> If instead of keeping a context I can keep a simple integer and then >> order a VTDNav "hey you, get this integer you told me to keep and go to >> node you bookmarked" I would say it's ok, if the operation "get to the >> node" is fast. >> >> So, you're suggesting an API that would work like this: >> >> RandomNodeRecorder xpto = new RandomNodeRecorder(navigator); >> // xpto is the bookmark keeper organized in a way Jimmy likes :-) >> int mark = xpto.keepPos(); >> /* do some stuff here */ >> boolean xpto.fetchPos(mark); // back to the bookmarked node >> xpto.del(mark); // don't need the mark any longer >> >> I still fail to understand why shoudn't a context be kept outside the >> structures you seem to like :-) > > Well, I'd be interested in knowing the time/space trade offs for both. > For one specific case, I could have an int as the key, and an int as the > mark/vtd-node. Both ints could be native ints with fastutil. Maybe the CPU > overhead is much smaller with SimpleContext though... it would be nice to > see what Jimmy has in mind (the details). > >> Memory is cheap, and for example, if I keep a hash of NEs, and each NE >> occupies a few KB itself, it's irrelevant if I'm gona use a few more >> bytes for each NE. >> >> I'm not suggesting one should keep large structures containing any single >> node in the document, ok? But the random access to a cached node must be >> fast. I emphasize: fast random access to cached nodes. > > +1 > >> As far as I understand the SimpleContext structure grows 4 bytes for each >> depth level, so a deeper node consumes more space, right? 
So a really >> deep node, let's say, at level 10, will consume 40 extra bytes, plus the >> base consumption... that's 48 bytes, quite small, unless the node is >> small and irrelevant. > > It is small, but I'm looking at about 6k indexes to cache per document, > and as many documents cached as possible. Over-guessing at 100 bytes per > SimpleContext (total) would mean 600KB of SimpleContext objects per > document. I understand my use cases deal with larger than normal > documents, but that just means I have so much more to gain from random > access. > > Cheers. > > -- > http://www.ScheduleWorld.com/ > Free Google Calendar synchronization with Outlook, Evolution, > cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, > Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! > WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support. > |
From: Mark S. <ma...@Sc...> - 2007-03-06 15:44:23
|
Jimmy Zhang wrote: > >> Just a FYI: I have cases where the key is an Integer, and cases where >> it's a string. > > By Integer is it a java class? or just a primitive data type? Maybe I can > modify Rodrigo's class and put it into CVS so you guys can use > immediately... > however, I can't guarantee that it will be included in the next release... > Would that work? Oh, I always use native ints and fastutil wherever possible. Just a thought: I use autojar on my code to build a tiny fastutil jar that just has the code I need. You could do the same thing to get excellent native collections instead of writing your own. I see you already wrote your own, but in case you need more.. Fastutil uses the LGPL. Cheers. -- http://www.ScheduleWorld.com/ Free Google Calendar synchronization with Outlook, Evolution, cell phones, BlackBerry, PalmOS, Exchange, Mozilla, Thunderbird, Pocket PC/Windows Mobile. Also sync tasks, notes and contacts! WebDAV, vfreebusy, RSS, LDAP, iCalendar, iTIP, iMIP support. |