I am attempting to parse the content of several extremely large .xml files (> 90 MB) and add the data to two RDBMS tables. To avoid the out-of-memory condition produced by using the XslTransform/SQLWriter stages, I have written a new pipeline stage that extends the PipelineStage class and implements a process() method, which uses Digester to parse the XML and seed Hibernator objects whose .save() method can be called to insert rows at a controlled rate of 100 or so per commit. Following the dev guide, I've placed all code and build instructions in a path structure under the modules folder in the BabelDoc project, and the 'build setup build' proceeds to successful completion. However, when I attempt to use the pipeline stage I get the message "invalid stage type: SaxLoader", where SaxLoader is the name I've specified in the MODULE name/value pair in my build.properties and the one returned in the new pipeline stage class constructor.
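For context, the core of the stage's process() method does roughly the following (a trimmed-down sketch; RecordRow and the element paths are placeholders, not the real mapping):

    import java.io.InputStream;
    import org.apache.commons.digester.Digester;

    // Sketch of the parse side. rowParsed(RecordRow) is the stage method
    // that saves each object via Hibernator, committing every 100 or so.
    public void parse(InputStream xml) throws Exception {
        Digester digester = new Digester();
        digester.push(this); // make this stage the target of addSetNext below

        // One RecordRow per <record> element, populated from its attributes,
        // then handed straight to rowParsed() -- the file is consumed as a
        // stream, so the document is never held whole in memory like a DOM.
        digester.addObjectCreate("batch/record", RecordRow.class);
        digester.addSetProperties("batch/record");
        digester.addSetNext("batch/record", "rowParsed");

        digester.parse(xml);
    }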
Can someone please tell me how exactly to configure the stageType name? I realize I've missed some small configuration piece, but I can't seem to find out what it is. I've tested the Digester/Hibernator components standalone and they work fine; I just need to get the new pipeline stage driver configured correctly. Any advice would be appreciated. Thanks
...John
I am not sure what exactly you are trying to do. Are you creating a new module, or just adding a new pipeline stage to an existing module?
In either case, you should define your pipeline stage type in the config/service/query.properties file of the corresponding module.
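For example, the entry just maps the stage type name to the implementing class, something along these lines (I am writing the key from memory, so copy the exact convention from a stage already registered in that module's query.properties):

    # config/service/query.properties -- key format assumed, check an existing entry
    pipelinestage.SaxLoader=com.yourcompany.babeldoc.stage.SaxLoaderStage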
Dejan
Thank you, that was the piece I was missing. I'm creating a new stage to support parsing rather large mainframe-generated .xml files into RDBMS rows. I tried using an XslTransform and SQLWriter combination but hit an out-of-memory condition on the JVM at about 17 MB of file size. I increased the JVM stack size, but the best I could parse was around 30 MB, and some of the host files run upwards of 90 MB. I've created a Digester-based stage which creates Hibernator-based objects. These can then persist themselves to any database supported by Hibernator, and because it's Digester I can externalize both the parsing rules and the number of inserts between commits, thus controlling how much memory is used during the stage.
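Externalizing the rules just means building the Digester from an external rules file rather than hard-coding the addRule calls; roughly like this (the file name and variables are made up for illustration):

    import org.apache.commons.digester.Digester;
    import org.apache.commons.digester.xmlrules.DigesterLoader;

    // Build the Digester from an external XML rules file so the mainframe
    // feed's layout can change without recompiling the stage.
    Digester digester = DigesterLoader.createDigester(
            new java.io.File("conf/saxloader-rules.xml").toURL());
    digester.push(callbackTarget);   // object whose methods receive each row
    digester.parse(recordStream);    // streams the input; no DOM is built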
Thanks again for the help, it's working now.
...John
When you say you are using "Hibernator", do you mean the Hibernate Object/Relational mapping framework? I have been doing something similar - cross-referencing XML documents - using Hibernate and Babeldoc.
Did the XPathSplitter stage not help you? I.e., split the large XML file into smaller chunks and process each chunk.
I would be interested in how you addressed the large-document problem you seem to have solved. Is this something you want to, or can, share with the community?
Cheers,
Sherman
Yes, that's what I meant by Hibernator.
To be honest, I didn't think about using an XPathSplitter stage structure. Being the lazy type, I took an all-or-nothing approach that makes recovery simple: an initial backup-tables step, and then either a successful load, or restore the original table set and fix the failure in the morning. Not a luxury everyone has. The only hook turned out to be regulating the number of Hibernate objects created before persisting them; much more than 150 before a commit produced the same out-of-memory condition as a DOM structure.
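For anyone who hits the same wall, the regulating hook is essentially the standard Hibernate batching pattern, roughly like this (the session, transaction, and counter handling are simplified for illustration; rowParsed() is the callback I sketched earlier in the thread):

    // Illustrative batching callback: session, tx, and count would be fields
    // of the stage. About 100 saves per commit worked; much past 150 produced
    // the same out-of-memory condition as building a DOM.
    private static final int BATCH_SIZE = 100;

    public void rowParsed(RecordRow row) {
        session.save(row);
        if (++count % BATCH_SIZE == 0) {
            tx.commit();                     // commit() flushes pending inserts
            session.clear();                 // release the persisted objects
            tx = session.beginTransaction(); // open the next batch
        }
    }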
I'll need to check whether it's something we can share, although presently it's very non-generic and certainly not ready for prime time. At this point it's just a feasibility prototype.
...John