From: Din S. <di...@ya...> - 2006-07-31 12:23:50
Here is my requirement: I need to split really big XML files (1 GB plus) into smaller files. I am in the process of evaluating different approaches:

1. Use VTD-XML: parse and split.
2. Use the Perl XML::Twig split function.
3. Write my own parser in Perl on top of XML::Parser, which uses expat.
4. Use libxml2.

I am not sure if this is the right place to post this question, but I would like to know the best approach to get the job done effectively, and the pros/cons and limitations of my proposed solutions.
From: Jimmy Z. <cra...@co...> - 2006-07-31 15:59:45
I think VTD-XML should have a couple of distinct advantages for splitting XML, performance probably being the biggest one. Currently VTD-XML's file size support is 2 GB, and you need to have enough memory to hold the document in memory. I haven't tried the other approaches, but they seem to be SAX based, and may be slower and less flexible (SAX is forward only). Let me know if there are any questions... you are welcome to share your experience with us.
From: Tatu S. <cow...@ya...> - 2006-07-31 18:53:39
To me, this does sound like you would be better off using a streaming approach (SAX, StAX, or XmlPull, or the .NET equivalent of the last two; StAX and XmlPull are Java things). I don't know if there are Perl-based streaming equivalents, but I think expat and libxml2 have streaming SAX interfaces (or similar).

There doesn't seem to be much need for random access, nor any need to keep portions in memory. Streaming approaches have no problem with files of any size (certainly no problems with 1 GB), and for splitting I personally do not think VTD-XML would be faster than the alternatives. This is because all the content has to be accessed -- VTD-XML is fastest when you need to access as little data as possible.

-+ Tatu +-
From: Jimmy Z. <cra...@co...> - 2006-07-31 19:24:08
Well, the problem with the streaming approach is that you will need to parse and then reserialize, both CPU intensive. With VTD-XML this becomes a lot more efficient, but you need to load the document in memory first, so there has to be enough memory available. On the other hand, using a streaming API like SAX or Pull you will need to read in the document piecewise anyway, so overall I think VTD-XML should win quite significantly.

My view of VTD-XML is that it is just like DOM: you can jump back and forth as often as you want, yet it parses a lot faster than DOM.
From: Tatu S. <cow...@ya...> - 2006-07-31 20:16:01
--- Jimmy Zhang <cra...@co...> wrote:
> with VTD-XML this becomes a lot more efficient, but you need to load
> the document in memory first

For small files, perhaps, but I would expect that splitting a _1 gig file_ will be much, much slower with VTD-XML. Why? Because of the huge memory allocations and the non-locality of the content. Having to read it all into memory first, and then traversing it a second time, will not be as efficient as doing it in chunks the way streaming parsers do.

> using a streaming API like SAX or Pull you will need to read in the
> document piecewise anyway, so overall I think VTD-XML should win

Sure. But reading (and parsing) piece by piece, not as one huge memory-consuming chunk, will actually be faster due to caching issues.

Maybe I should write a simple test case to demonstrate that. I could start with the simple tests I have for just parsing and accessing all the information needed. Now, for simple token indexing, VTD-XML seems to be up to twice as fast as SAX parsers, at least for small to medium-sized files -- that is, assuming the data is not used for anything. Accessing the data, for example to reconstruct another tree model, seems to bring the speeds down to roughly equivalent levels in my basic tests.

-+ Tatu +-
From: Din S. <di...@ya...> - 2006-08-01 03:21:02
Well, I only need to split the document; I don't need to go back to the parsed document, and I don't need DOM-like functionality. Will VTD-XML still be better in this scenario?

Secondly, since the entire document needs to be loaded into memory: the whole reason I am splitting is that I am getting an "Out of Memory" error. Won't I get the same error when I am using VTD-XML? Then it kind of defeats the purpose. Correct me if I am wrong in this interpretation, as I have never used VTD.
From: Jimmy Z. <cra...@co...> - 2006-08-01 06:41:34
Usually the out-of-memory error happens when you parse the file into a DOM tree. Assuming that you have enough memory to hold the document in memory, VTD-XML should compare very favorably against SAX or Pull in terms of coding effort and performance, even if you don't need to go back to the parsed document and don't care about DOM-like functionality.
From: Tatu S. <cow...@ya...> - 2006-08-01 18:34:42
--- Din Sush <di...@ya...> wrote:
> Well, I only need to split the document; I don't need to go back to the
> parsed document, and I don't need DOM-like functionality.
> Will VTD-XML still be better in this scenario?

I would suggest that, if you do have the time, you investigate both VTD-XML and a StAX implementation (such as http://woodstox.codehaus.org). My feeling is that it all comes down to which API you feel more comfortable with, or perhaps whether you have to use an XML-compliant, standards-based solution or not. Both can perform well enough, assuming you are not limited by VTD-XML's main memory requirements. StAX memory usage is not linear with document length, so there are no practical input size limitations.

If you do end up trying both approaches, it would be very nice to get the performance numbers, since this would be an actual real-world use case instead of a benchmark. Plus, if the code is simple enough, perhaps it could become a benchmark for these types of operations?

> Secondly, since the entire document needs to be loaded into memory ...
> won't I get the same "Out of Memory" error when I am using VTD-XML?

You are correct here. While the limit is much higher than with, say, DOM (2x or perhaps 3x), there is a limit.

-+ Tatu +-
From: Din S. <di...@ya...> - 2006-08-02 05:21:28
Hi,

I have explored Woodstox also. The issue I am having with StAX parsers is that I need to build the complete XML, whereas with VTD or SAX I can get the fragment and write it to a file. Example:

  <persons>
    <person>
      <name>name1</name>
      <age>22</age>
      <address>address</address>
    </person>
    <person>
      .
      .
    </person>
    .
    .
  </persons>

Now I want to put, say, 10 person records in file 1, and so on. With VTD I will get the fragment and write it to a file (with SAX I can also get the entire person record), but with StAX I don't get the complete person record. So I have to create an XMLWriter and use functions like writeStartElement, writeAttribute, etc. -- basically building the entire structure which is already there.

Please let me know if there is a way to extract the complete person record.

Thanks.
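[For illustration, a rough sketch of the fragment-extraction approach described above, using the VTD-XML Java API (VTDGen/VTDNav/AutoPilot) against the <persons>/<person> layout from the example. The file names, the 10-records-per-chunk figure, and the wrapper handling are illustrative assumptions, and the method names are from memory and should be checked against the VTD-XML javadocs; this is a sketch, not tested code.]

  import com.ximpleware.*;   // VTDGen, VTDNav, AutoPilot
  import java.io.*;

  public class VtdSplit {
      public static void main(String[] args) throws Exception {
          // VTD-XML needs the whole document in memory as a byte array.
          File f = new File("persons.xml");
          byte[] ba = new byte[(int) f.length()];
          DataInputStream dis = new DataInputStream(new FileInputStream(f));
          dis.readFully(ba);
          dis.close();

          VTDGen vg = new VTDGen();
          vg.setDoc(ba);
          vg.parse(false);                 // no namespace awareness needed here
          VTDNav vn = vg.getNav();

          AutoPilot ap = new AutoPilot(vn);
          ap.selectElement("person");      // iterate over every <person> element

          int count = 0, fileNo = 0;
          OutputStream out = null;
          while (ap.iterate()) {
              if (count % 10 == 0) {       // start a new output file every 10 records
                  if (out != null) { out.write("</persons>".getBytes()); out.close(); }
                  out = new BufferedOutputStream(
                          new FileOutputStream("persons-" + (fileNo++) + ".xml"));
                  out.write("<persons>".getBytes());   // assumes an ASCII/UTF-8 document
              }
              // getElementFragment() packs offset (low 32 bits) and length (high 32 bits)
              long l = vn.getElementFragment();
              out.write(ba, (int) l, (int) (l >> 32));
              count++;
          }
          if (out != null) { out.write("</persons>".getBytes()); out.close(); }
      }
  }

The point of the sketch is that getElementFragment() hands back the offset and length of each <person> in the original byte array, so each fragment can be copied into the output file without re-serializing it.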
From: Tatu S. <cow...@ya...> - 2006-08-02 22:05:05
--- Din Sush <di...@ya...> wrote:
> The issue I am having with StAX parsers is that I need to build the
> complete XML, whereas with VTD or SAX I can get the fragment and write
> it to a file.

I'm pretty sure you are confusing SAX and DOM here, since SAX only gives you individual nodes, similar to what StAX does, not subtrees (you can of course reconstruct subtrees from events, but that's not the same thing). So I assume you mean 'VTD or DOM'.

But I'm not quite sure what would be complicated about building XML using the Event API of StAX (it is a bit more complicated if you use the raw cursor API, i.e. XMLStreamReader and XMLStreamWriter -- maybe you have only used that so far?). With the Event API it's just events in and events out: a bit of recursion for copying, and that's pretty much it for simple merging.

> Now I want to put, say, 10 person records in file 1, and so on. With VTD
> I will get the fragment and write it to a file

If a bitwise-exact copy does work, yes. This is not necessarily the case if namespaces are used (or if DTD-based entities are used).

> but with StAX I don't get the complete person record. So I have to create
> an XMLWriter and use functions like writeStartElement, writeAttribute,
> etc. -- basically building the entire structure which is already there.

Yes. That's streaming. With XMLEventWriter you just add the XMLEvents you get from XMLEventReader, but you do need to pipe them through in a loop. A bit more work, but not a lot (you just need to keep track of pairing start/end tags).

-+ Tatu +-
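[A minimal sketch of the "events in, events out" loop described above, using the standard javax.xml.stream Event API (Woodstox or any other StAX implementation would do) and the <persons>/<person> layout and 10-records-per-file split from the earlier example. The file names, the depth-based record detection, and the re-emitted <persons> wrapper are illustrative assumptions, not the one true way to do it.]

  import javax.xml.stream.*;
  import javax.xml.stream.events.XMLEvent;
  import java.io.*;

  public class StaxSplit {
      public static void main(String[] args) throws Exception {
          XMLInputFactory inf = XMLInputFactory.newInstance();   // picks up Woodstox if on the classpath
          XMLOutputFactory outf = XMLOutputFactory.newInstance();
          XMLEventFactory ef = XMLEventFactory.newInstance();

          // note: FileReader/FileWriter use the platform encoding; real code should fix the charset
          XMLEventReader reader = inf.createXMLEventReader(
                  new BufferedReader(new FileReader("persons.xml")));

          int depth = 0, records = 0, fileNo = 0;
          XMLEventWriter writer = null;

          while (reader.hasNext()) {
              XMLEvent e = reader.nextEvent();
              if (e.isStartElement()) {
                  depth++;
                  // depth 2 == a <person> directly under the <persons> root
                  if (depth == 2 && records % 10 == 0) {
                      if (writer != null) {                     // finish the previous chunk
                          writer.add(ef.createEndElement("", "", "persons"));
                          writer.close();
                      }
                      writer = outf.createXMLEventWriter(new BufferedWriter(
                              new FileWriter("persons-" + (fileNo++) + ".xml")));
                      writer.add(ef.createStartElement("", "", "persons"));
                  }
              }
              if (depth >= 2 && writer != null) {
                  writer.add(e);                                // copy the event through unchanged
              }
              if (e.isEndElement()) {
                  if (depth == 2) records++;                    // one complete <person> copied
                  depth--;
              }
          }
          if (writer != null) {
              writer.add(ef.createEndElement("", "", "persons"));
              writer.close();
          }
          reader.close();
      }
  }

The depth counter is the "pairing start/end tags" bookkeeping mentioned above: everything at depth 2 or deeper is copied through unchanged, and a new chunk file (with its own <persons> wrapper) is started every tenth record.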
From: Din S. <di...@ya...> - 2006-08-03 11:36:59
I tried the Woodstox parser; it seems to be working, and for a 1 GB file it is taking around 11 minutes to split it into multiple 1 MB files.

Thanks for your suggestion!! I was just wondering if I can make it any faster; I am using "copyEventFromEventMethod" to write to the file.

Thanks again.
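[If the Event API turns out to be the bottleneck, one option worth trying is the raw cursor API that was mentioned earlier in the thread (XMLStreamReader/XMLStreamWriter), which avoids allocating an XMLEvent object per node. Below is a hedged sketch of the inner copy routine only, assuming a namespace-free document like the persons example; whether it is actually faster in this setup would need measuring, as discussed in the next message.]

  import javax.xml.stream.*;

  public final class CursorCopy {
      // Copies one element (and everything nested in it) from reader to writer.
      // Precondition: the reader is positioned on the element's START_ELEMENT event.
      public static void copyElement(XMLStreamReader r, XMLStreamWriter w)
              throws XMLStreamException {
          int depth = 0;
          do {
              switch (r.getEventType()) {
              case XMLStreamConstants.START_ELEMENT:
                  depth++;
                  w.writeStartElement(r.getLocalName());   // assumes no namespaces, as in the example
                  for (int i = 0; i < r.getAttributeCount(); i++)
                      w.writeAttribute(r.getAttributeLocalName(i), r.getAttributeValue(i));
                  break;
              case XMLStreamConstants.CHARACTERS:
                  w.writeCharacters(r.getText());
                  break;
              case XMLStreamConstants.END_ELEMENT:
                  w.writeEndElement();
                  depth--;
                  break;
              default:
                  break;                                   // comments, PIs, etc. skipped for brevity
              }
              if (depth > 0) r.next();                     // stop once the matching end tag is written
          } while (depth > 0);
      }
  }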
From: Tatu S. <cow...@ya...> - 2006-08-03 18:19:06
--- Din Sush <di...@ya...> wrote:
> I tried the Woodstox parser; it seems to be working, and for a 1 GB file
> it is taking around 11 minutes to split it into multiple 1 MB files.

Hmmh. That sounds a bit slow, for typical disks and all (with maybe 30 MBps read speed, and a bit higher write speed). I'd expect it to take roughly a minute or so. Can you share the code? I would profile it (just using 'java -Xrunhprof:cpu=samples', running the code for a minute or so).

> Thanks for your suggestion!! I was just wondering if I can make it any
> faster; I am using "copyEventFromEventMethod" to write to the file.

I guess it all depends on the code in question (the file in question might also affect the speed a bit, though it shouldn't matter very much). Can you send the code? I could test it against the test files I have created.

-+ Tatu +-
From: Jimmy Z. <cra...@co...> - 2006-08-04 01:16:33
Is there any data on the performance of splitting files using VTD-XML? It would certainly be interesting to know about...