parse huge xml

2010-04-19
2013-05-15
  • Marko Debac

    Marko Debac - 2010-04-19

    Hi,
    can I use this code
    http://snippets.dzone.com/posts/show/8999

    to slice a 3 GB XML file, with a gazillion elements matched by aph.selectXPath("//PostCard"), into smaller XML files of about 100 MB each containing PostCard element fragments?

    Can somebody help me?

     
  • jimmy zhang

    jimmy zhang - 2010-04-20

    Yes, you just have to use VTD-XML's extended version … the ability to cut and split is identical to that of the standard version…
    Does my answer make sense?

     
  • Eric J Schwarzenbach

    You could also do it in a streaming fashion, with the SAX API, implementing an org.xml.sax.XMLFilter, and not use much memory at all.

     
  • Marko Debac

    Marko Debac - 2010-04-20

    The tricky part is this:

    File f = new File("huge3GB.xml");
    FileInputStream fis = new FileInputStream(f);
    byte[] b = new byte[(int) f.length()];  // this will not work for 3 GB: the length does not fit in an int
    fis.read(b);
    // get the fragment
    long l = vnh.getElementFragment();
    int offset = (int) l;
    int len = (int) (l >> 32);  // shift by 32, not 64, to reach the upper half

    // and write it into a new file
    fos.write(b, offset, len);
    fos.write('\n');

    So I didn't solve anything with the extended classes (getFragment).
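
The offset/length unpacking can be checked in isolation. This is a minimal sketch, assuming the standard VTD-XML convention of a fragment long that holds the offset in its lower 32 bits and the length in its upper 32 bits. `FragmentUnpack` and its helper methods are hypothetical names for illustration and are not part of VTD-XML.

```java
public class FragmentUnpack {
    /** Packs an offset (lower 32 bits) and a length (upper 32 bits) into one long. */
    public static long pack(int offset, int len) {
        return ((long) len << 32) | (offset & 0xFFFFFFFFL);
    }

    /** The lower 32 bits hold the offset. */
    public static int offsetOf(long fragment) {
        return (int) fragment;
    }

    /** The upper 32 bits hold the length. */
    public static int lengthOf(long fragment) {
        return (int) (fragment >> 32);
    }

    public static void main(String[] args) {
        long l = pack(1200, 350);
        System.out.println(offsetOf(l) + " " + lengthOf(l)); // prints "1200 350"
        // Java masks the shift distance of a long to 6 bits, so ">> 64" shifts by 0
        // and the cast returns the offset again instead of the length.
        System.out.println((int) (l >> 64)); // prints "1200"
    }
}
```

Because both halves are 32-bit ints, this packed form cannot describe a fragment beyond 2 GB, which is one reason a document of this size needs the extended classes.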

     
  • jimmy zhang

    jimmy zhang - 2010-04-20

    You need a 64-bit JVM to process a 3 GB file… for a 32-bit JVM, the max doc size is 2 GB.

     
  • Marko Debac

    Marko Debac - 2010-04-20

    Are you telling me that I could successfully initialize a byte array with a 3 GB file if I had 64-bit Java?

    From which version on is Java 64-bit?

     
  • jimmy zhang

    jimmy zhang - 2010-04-20

    No, I think you should instead use the memory-mapped mode of reading the XML document … does that make sense?

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    Doing it in SAX wouldn't cost much memory, but it would be tedious to write and slow because of the encoding/decoding/object allocation involved…

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    64-bit JVMs have been around for a while. It's not that you can allocate a single 3 GB buffer, but you can allocate multiple byte buffers with a combined capacity of more than 2 GB, which is what your use case needs.
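
The "multiple buffers" idea can be sketched with plain java.nio. A single `FileChannel.map` call is limited to `Integer.MAX_VALUE` bytes, so a larger file has to be mapped as several windows. This is only an illustration of the general technique, not necessarily how XMLMemMappedBuffer is implemented; `ChunkedMapping`, `mapInChunks`, and `byteAt` are hypothetical names.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMapping {
    /** Maps the file read-only as consecutive windows of at most chunkSize bytes each. */
    public static List<MappedByteBuffer> mapInChunks(Path file, int chunkSize) throws IOException {
        List<MappedByteBuffer> windows = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                // Each mapping stays valid after the channel is closed.
                windows.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
        return windows;
    }

    /** Reads one byte at a 64-bit file offset by selecting the window that contains it. */
    public static byte byteAt(List<MappedByteBuffer> windows, int chunkSize, long offset) {
        MappedByteBuffer window = windows.get((int) (offset / chunkSize));
        return window.get((int) (offset % chunkSize));
    }

    public static void main(String[] args) throws IOException {
        int chunk = 1 << 30; // 1 GiB windows, comfortably below the int limit
        List<MappedByteBuffer> windows = mapInChunks(Paths.get(args[0]), chunk);
        System.out.println(windows.size() + " windows mapped");
    }
}
```

The operating system pages the mapped windows in on demand, so the combined capacity can exceed 2 GB without a single 3 GB array ever being allocated on the Java heap.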

     
  • Marko Debac

    Marko Debac - 2010-04-21

    >>you should instead use memory mapping mode of reading XML document instead

    Please can you give me an example, or point me to examples of how to do it?

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    Below is an example that is part of the v2.8 distribution:

    /**
     * This is a demonstration of how to use the extended VTD parser
     * to process a large XML file.
     */
    import com.ximpleware.extended.*;

    public class mem_mapped_read {
        /* first_read is the longer version of loading the XML file */
        public static void first_read() throws Exception {
            XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
            VTDGenHuge vg = new VTDGenHuge();
            xb.readFile("test.xml");
            vg.setDoc(xb);
            vg.parse(true);
            VTDNavHuge vn = vg.getNav();
            System.out.println("text data ===>" + vn.toString(vn.getText()));
        }

        /* second_read is the shorter version of loading the XML file */
        public static void second_read() throws Exception {
            VTDGenHuge vg = new VTDGenHuge();
            if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
                VTDNavHuge vn = vg.getNav();
                System.out.println("text data ===>" + vn.toString(vn.getText()));
            }
        }

        public static void main(String[] s) throws Exception {
            first_read();
            second_read();
        }
    }

     
  • Marko Debac

    Marko Debac - 2010-04-21

    OK, but in the end I need to use

    fos.write(bytes, offset, len);

    to write to the file, and I still don't have the bytes. The doc for XMLMemMappedBuffer says:

    byte[] getBytes(int offset, int len)
              not implemented yet

     
  • Eric J Schwarzenbach

    > Doing it in SAX wouldn't cost much memory, but it would be tedious to write and slow because of the
    > encoding/decoding/object allocation involved…

    Maybe, but I'd have to see numbers to be convinced the difference is that significant. SAX is generally pretty fast. However, it's quite clear that loading a multi-gigabyte file into memory just to chop it up is a ludicrous expenditure of resources. Avoiding that kind of waste is what streaming is for. Besides SAX, there are streaming pull parsers, and if the XML is simple and predictable enough, chopping up such a file with "manual" text manipulation may even be worth considering. Nothing against VTD, but this just seems like a terrible use of a whole-document-in-memory parser, especially since random access is not needed at all.
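
A streaming split along these lines can be sketched with the JDK's built-in StAX pull parser. This is a minimal illustration under stated assumptions, not a tuned implementation: it rolls over to a new file after a fixed number of elements rather than at 100 MB, wraps each part in a hypothetical `PostCards` root element, and does not re-declare namespaces inherited from ancestors. `StaxSplitter` and the `part-N.xml` file names are made up for the example.

```java
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class StaxSplitter {
    static final String WRAPPER = "PostCards"; // hypothetical root element of each part

    /** Copies every element called name into part files holding at most perFile elements. */
    public static List<Path> split(InputStream in, Path outDir, String name, int perFile)
            throws XMLStreamException, IOException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        List<Path> parts = new ArrayList<>();
        OutputStream out = null;
        XMLEventWriter writer = null;
        int depth = 0;   // greater than 0 while inside a matching element
        int inPart = 0;  // matching elements written to the current part

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                if (depth > 0) {
                    depth++;
                } else if (e.asStartElement().getName().getLocalPart().equals(name)) {
                    depth = 1;
                    if (writer == null) { // open the next part lazily
                        Path p = outDir.resolve("part-" + parts.size() + ".xml");
                        parts.add(p);
                        out = new BufferedOutputStream(Files.newOutputStream(p));
                        writer = outputs.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(events.createStartElement("", "", WRAPPER));
                    }
                }
            }
            if (depth > 0) {
                writer.add(e); // only events inside a matching element are copied
                if (e.isEndElement() && --depth == 0 && ++inPart == perFile) {
                    finish(writer, out, events);
                    writer = null;
                    inPart = 0;
                }
            }
        }
        if (writer != null) finish(writer, out, events);
        reader.close();
        return parts;
    }

    private static void finish(XMLEventWriter w, OutputStream out, XMLEventFactory events)
            throws XMLStreamException, IOException {
        w.add(events.createEndElement("", "", WRAPPER));
        w.add(events.createEndDocument());
        w.flush();
        w.close();   // does not close the underlying stream
        out.close();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(split(in, Paths.get(args[1]), "PostCard", 100_000));
        }
    }
}
```

Memory use stays roughly constant regardless of input size, at the cost of parsing and re-serializing every copied element.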

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    I think you can call VTDNavHuge's getXML() and then call writeToFileOutputStream() on the buffer it returns…

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    Regarding the SAX comparison: to identify a fragment in XML, all you need is its offset and length. Once you have both values, you can do a straight byte copy from source to destination; this is what VTD-XML brings. If the only concern is that VTD-XML reads the whole document into memory (which may seem wasteful), extended VTD-XML no longer requires that: it uses memory mapping, which allows partial file loading. There are other issues with SAX-based splitting: can you use XPath? Can you skip elements easily? With VTD-XML, you don't have to worry about those issues.
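
The "straight byte copy" step can be illustrated independently of VTD-XML. This is a hedged sketch that assumes the offset and length are already known (for example from a parser); it copies that byte range from one file to the end of another with `FileChannel.transferTo`, without decoding or buffering the content on the Java heap. `RangeCopy` and `copyRange` are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RangeCopy {
    /** Copies bytes [offset, offset + len) of src to the end of dst, byte for byte. */
    public static void copyRange(Path src, long offset, long len, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE)) {
            out.position(out.size()); // continue after whatever dst already contains
            long done = 0;
            while (done < len) {
                // transferTo may move fewer bytes than requested, so loop until finished.
                long n = in.transferTo(offset + done, len - done, out);
                if (n <= 0) {
                    throw new IOException("range extends past the end of " + src);
                }
                done += n;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // usage: RangeCopy <source> <offset> <length> <destination>
        copyRange(Paths.get(args[0]), Long.parseLong(args[1]),
                Long.parseLong(args[2]), Paths.get(args[3]));
    }
}
```

Both offset and length are 64-bit here, so the same call works for ranges that start beyond the 2 GB mark.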

     
  • Marko Debac

    Marko Debac - 2010-04-21

    Have you ever tried to use getXML?

    public final com.ximpleware.extended.IByteBuffer getXML()
    Get the XML document

    Returns:
    IByteBuffer

    It returns null in the same snippet where long l = vnh.getElementFragment(); gives me a perfect offset and length, so getXML doesn't work.

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    Yes, we tried it before. If for any reason it doesn't work, a fix will be available ASAP!

     
  • jimmy zhang

    jimmy zhang - 2010-04-21

    It seems to work for me… I used the code below:

    VTDGenHuge vgh = new VTDGenHuge();
    if (vgh.parseFile("c:/xml/text1.xml", true, VTDGenHuge.MEM_MAPPED)) {
        VTDNavHuge vnh = vgh.getNav();
        vnh.toElement(VTDNavHuge.FC);
        // la[0] is the fragment's offset, la[1] is its length
        long[] la = vnh.getElementFragment();
        vnh.getXML().writeToFileOutputStream(new FileOutputStream("c:/xml/text2.xml"), la[0], la[1]);
    }

     
