From: Jimmy Z. <cra...@co...> - 2007-03-07 19:16:34
Fernando,

It is interesting that you have substituted byte[] with IByteBuffer... Since there is a level of indirection, the slight slowdown should be expected...

I would certainly be interested in your approach to the issue, and feel free to send me the code...

Cheers,
Jimmy

----- Original Message -----
From: Fernando Gonzalez
To: Jimmy Zhang
Sent: Wednesday, March 07, 2007 7:49 AM
Subject: Re: [Vtd-xml-users] Storing parsing info

Hi Jimmy,

While writing the following I realized that it may be quite complicated to understand, since you don't know exactly what changes I have made. My tests are not thorough either, so maybe the best option is to submit a technical description of the changes, their pros and cons, the code, and that kind of thing.

I have been testing the XPath performance problem and it seems to be a classloader issue. As you can see in the following log, the slowest XPath evaluation is the first one, no matter how the parsing information is obtained.

    391 ms->Load XML
    2125 ms->Parse XML
    31 ms->Evaluate XPath
    0 ms->Evaluate XPath
    0 ms->Evaluate XPath
    453 ms->Store parse info
    0 ms->Clear parse info
    313 ms->Read Parse info
    0 ms->Evaluate XPath
    0 ms->Evaluate XPath

I have been working on something more. I have made some changes to VTD and I have succeeded in the following.

1) The byte[] of the XML file is accessed through an interface (IByteBuffer).

2) When I use the UniByteBuffer implementation I get slightly slower results at parsing:

    391 ms->Load XML
    2109 ms->Parse XML (vs 1890 ms I obtained accessing the byte[] buffer directly)
    0.172 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    0.094 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    0.078 ms->Evaluate XPath

3) When I use an implementation that loads chunks as they are needed I get much slower results in parsing the file, but I get the same results evaluating an XPath expression.
The advantage of this approach is that there is no need to load the whole XML file into memory, so I have obtained the following results:

    25406 ms->Parse XML
    406 ms->Store parse info
    0.156 ms->Evaluate XPath
    0.093 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    0.093 ms->Evaluate XPath
    0.094 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    500 ms->Read Parse info
    0.235 ms->Evaluate XPath
    0.094 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    0.094 ms->Evaluate XPath
    0.078 ms->Evaluate XPath
    0.094 ms->Evaluate XPath

The great thing about these results is that the XML file was 100 MB and I ran the program with the -Xmx64m JVM option (just enough to store the 30 MB of parsing info and the 16 MB buffer).

Well, as I said before, I can send you a technical description of the changes, their pros and cons, and the code.

cheers,
Fernando

On 3/7/07, Fernando Gonzalez <fer...@gm...> wrote:

Hi Jimmy,

Thanks for your response.

I think I'm using version 2.0, since I have tested the "VTDGen.writeIndex" method. I looked for another solution because I cannot remove the original XML file, so I would have to store the XML twice: the original XML file and the file with the XML, VTD and LCs created by "VTDGen.writeIndex". As I'm dealing with really big XML files, that's a drawback.

Yes, you're right, I have added code. Just three or four lines. If you're interested I can explain my solution thoroughly. About the XPath performance, I think that's a classloader issue. I will check that and report the results.

greetings,
Fernando

On 3/6/07, Jimmy Zhang <cra...@co...> wrote:

Hey Fernando,

Thanks for the email.. I am glad VTD-XML is helpful.

My question: Which version are you using?

If you are currently using 2.0, it contains the indexing feature that might accomplish just what is described in your email.

Your solution is to separate the XML from the VTD and LC, which I think you must have added code to do...
VTD+XML (as in version 2.0) packages the XML, VTD and LCs into a single file... which should also work.

The only suspicious part is that the XPath performance dropped in your case... which shouldn't happen.

Buffer reuse is useful if your app instantiates a VTDGen to sequentially process many incoming XML documents... If you deal with only one XML doc, buffer reuse won't make a big difference.

I think you might be interested in first investigating the persistence feature in 2.0, and there is a directory under code examples...

Cheers,
Jimmy

----- Original Message -----
From: Fernando Gonzalez
To: vtd...@li...
Sent: Tuesday, March 06, 2007 1:23 AM
Subject: [Vtd-xml-users] Storing parsing info

Hello,

First of all I would like to congratulate you on your project; I really think it's great.

Second, I want to use the Java VTD-XML to do a certain task, and I have succeeded, but I don't know whether I have done it the right way or whether there is a better one. Can you give me some advice?

I want to evaluate some XPath expressions on a lot of files of this size and larger, so memory efficiency is critical. The first idea that comes to mind is to have a VTDGen object for each XML file, but this solution leads to having all the XMLs loaded in memory in the "protected byte[] XMLDoc;" attribute of the VTDGen class. So each time I have to evaluate an XPath expression on an XML file I have to read the XML file, parse it, evaluate the XPath and set the VTDGen object to null so that the garbage collector frees the memory.

I have obtained these results reading a big XML file (~100 MB):

    360 ms reading file
    1890 ms parsing file
    32 ms evaluating a XPath expression
    93 ms showing results
    total = 2375 milliseconds

Where the second step ("parsing file") means:

    VTDGen vg = new VTDGen();
    vg.setDoc(b);
    vg.parse(true);

To speed up the process I have stored the parsing information in a file.
After that I can read the XML file and the parsing information file, evaluate the XPath expression and close everything again in a shorter time:

    344 ms reading the file
    422 ms reading parsing information
    125 ms evaluating a XPath expression
    93 ms showing results
    total = 984

I think the result is good enough, but maybe there's a better solution than mine. I have stored the parsing info by serializing the whole VTDGen object except the XMLDoc attribute. Then I retrieve the object from disk and set the XMLDoc attribute, this way:

    ObjectInputStream ois = new ObjectInputStream(new FileInputStream(PARSING_INFO));
    vg = (MyVTDGen) ois.readObject();
    ois.close();
    FileInputStream fis2 = new FileInputStream(TEST_XML);
    byte[] b2 = new byte[(int) f.length()];
    fis2.read(b2);
    vg.setXML(b2); //This method only sets the XMLDoc attribute

Is this solution good? Is there a better one? Can "Buffer reuse" solve my problem?

best regards,
Fernando

_______________________________________________
Vtd-xml-users mailing list
Vtd...@li...
https://lists.sourceforge.net/lists/listinfo/vtd-xml-users
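
The store-and-reload idea from the original message in this thread (serialize the parser state without the document bytes, then reattach the bytes after loading) can be sketched in plain Java. This is only a minimal sketch under assumptions, not VTD-XML code: ParseInfoDemo, ParseInfo, tokens, store and load are hypothetical names invented for illustration, and the long[] array merely stands in for real parse records. It marks the document field as transient so that Java serialization skips it, and it uses Files.readAllBytes to reload the document, since a single InputStream.read(byte[]) call is not guaranteed to fill the whole array.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical demo class; none of these names come from VTD-XML.
class ParseInfoDemo {

    // Stand-in for a parser object that holds both the document and its index.
    static class ParseInfo implements Serializable {
        private static final long serialVersionUID = 1L;

        final long[] tokens;      // stand-in for the parse records
        transient byte[] xmlDoc;  // transient: skipped by Java serialization

        ParseInfo(byte[] xmlDoc, long[] tokens) {
            this.xmlDoc = xmlDoc;
            this.tokens = tokens;
        }
    }

    // Writes only the non-transient state (the index) to indexFile.
    static void store(ParseInfo info, Path indexFile) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(indexFile)))) {
            out.writeObject(info);
        }
    }

    // Reads the index back, then reattaches the document bytes from xmlFile.
    static ParseInfo load(Path indexFile, Path xmlFile)
            throws IOException, ClassNotFoundException {
        ParseInfo info;
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(indexFile)))) {
            info = (ParseInfo) in.readObject();
        }
        // readAllBytes returns the complete file contents in one array.
        info.xmlDoc = Files.readAllBytes(xmlFile);
        return info;
    }

    public static void main(String[] args) throws Exception {
        Path xmlFile = Files.createTempFile("demo", ".xml");
        Path indexFile = Files.createTempFile("demo", ".idx");
        try {
            byte[] doc = "<a><b/></a>".getBytes(StandardCharsets.UTF_8);
            Files.write(xmlFile, doc);

            store(new ParseInfo(doc, new long[] {1L, 2L, 3L}), indexFile);
            ParseInfo restored = load(indexFile, xmlFile);

            System.out.println(Arrays.toString(restored.tokens));
            System.out.println(new String(restored.xmlDoc, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(xmlFile);
            Files.deleteIfExists(indexFile);
        }
    }
}
```

Keeping the document field transient means the index file contains only the parse records, so the XML is not stored twice on disk. Note that Files.readAllBytes still loads the whole document into memory, so this sketch does not address the chunked-loading variant discussed later in the thread.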