Babeldoc: Universal Document Processor / Discussion / Help: Loading database from flat files

Mitch Christensen - 2003-12-19

Hi,

I need a pipeline that will scan a set of sub-directories, read a tab-delimited text file (*.ODI)and update a database based on the text file contents.

This seems right up BabelDoc's alley. I am new to BableDoc and was wondering if I could get some pointers. I was thinking of the following pipeline,

DirectoryScanner->FlatToXml->XpathSplitter->XslTransform->SqlWriter

Am I even in the ballpark?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sherman Wood - 2003-12-19
  
  I have not played with the FlatToXml stage, but the general pattern you outline is fine.
  
  Sherman
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - maxdaros - 2004-11-02
    
    Hi all, pheraps my question seems trivial (sorry for this)
    
    I have a pipeline like FlatToXml --> XslkTransform --> SqlWriter
    
    here is the pipeline configuration file:
    entryStage                 = convert
    
    convert.stageType       = FlatToXml
    convert.nextStage       = transform
    convert.flatToXmlFile    = convertions/file-convert.XML
    convert.tracked           = true
    
    transform.stageType     = XslTransform
    transform.nextStage       = dbwriter
    transform.transformationFile = dbloader/convertions/transform.xsl
    transform.tracked          = true
    
    dbwriter.stageType        = SqlWriter
    dbwriter.nextStage         = null
    dbwriter.messageTag      = messageTagErrorSql
    dbwriter.resourceName    = jdbcConnection
    dbwriter.sql                   = ${document}
    dbwriter.failOnFirst         = false
    dbwriter.useBatch          = true
    dbwriter.tracked            = true
    
    after the transform phase the output file is:
    <?xml version="1.0" encoding="UTF-8"?>
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Marco","Descrizione di Marco");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Mauro","Descrizione di Mauro");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Giovanni","Descrizione di Giovanni");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Francesco","Descrizione di Francesco");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Aldo","Descrizione di Aldo");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Luigi","Descrizione di Luigi");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Anna","Descrizione di Anna");
    
        INSERT INTO tb_Max ( Text, Description ) VALUES ("Eleonora","Descrizione di Eleonora");
    
    Unfortunately I have a problem submitting this document to the SqlWriter stage. Can anyone help me? How can I submit a sql query file obtained by a previous step?
    
    Thanks in advance ;)
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mitch Christensen - 2004-11-02
      
      We do this very thing (successfully).
      
      First off, get rid of the XML preamble (i.e. <?xml...?>) by setting the document output type to 'text' in your XSLT script. This is done by adding the following:
      
      <xsl:output method="text"/>
      
      to the top of your XSLT script.
      
      Once you have pure SQL in your output document the SqlWriter should handle it just fine.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Dejan Krsmanovic - 2003-12-20
  
  I have used Babeldoc for similiar tasks (altough I had XML documents instead of 'flat' files). Besides other tasks, I have used XslTransform stage to transform XML file to sql script that was executed inside SqlWriter stage. So, maybe you won't need XpathSplitter at all.
  
  On the other side I needed many other pipeline stages for performing other tasks (validations etc). Anyway, this was one year ago when Babeldoc was in 0.5 version and when many other features and useful pipeline stages haven't existed so I believe now it wouldn't be that complex...
  
  Dejan
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mitch Christensen - 2003-12-26
  
  Excellent solution.
  
  I now have a pipeline that reads a CSV flat-file, converts to XML using FlatToXml, then to an SQL script (multiple insert statements) using XslTransform. Now I want to execute that SQL script using SqlWriter. However, the 'sql' config parameter to SqlWriter takes an SQL script rather than processing the pipeline document contents.
  
  I tried doing the following:
  
  dbload.sql=$document
  
  however I always get the following exception:
  
  "Error executing script: Non supported SQL92 token at position: 71: mime-type"
  
  Is there a trick to getting SqlWriter to process the document contents as a multi-statment SQL script?
  
  TIA,
  -Mitch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Dejan Krsmanovic - 2003-12-26
  
  Try using braces. That is:
  dbload.sql=${document}
  or
  dbload.sql=${document.toString()}
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Dejan Krsmanovic - 2003-12-26
    
    I would also suggest you to use FileWriter stage after XslTrasnform, so you can check if your document is valid.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mitch Christensen - 2003-12-26
      
      I sincerely appreciate the response.
      
      Unfortunately, I've done all that. My SQL statement is simply "SELECT SYSDATE FROM DUAL", which works fine if I paste it directly in the loaddb.sql parameter in the pipeline.config file.
      
      When I step into SqlWriterPipelineStage.jave line 177, the sqlScript value is as follows (via debugger inspection):
      
      sqlScript= "com.babeldoc.core.pipeline.PipelineDocument@1938039[attributes={binary=false, file_name=\\dev\\rapid\\babeldoc\\io\\done\\results.txt},buffer={83,69,76,69,67,84,32,115,121,115,100,97,116,101,32,70,82,79,77,32,68,85,65,76}]"
      
      it is as if the toString() method is never getting called. The data in the 'buffer=' array is my SQL query. This is beginning to feel like a bug.
      
      Any other suggestions?
      
      Thanks again for your help.
      -Mitch
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mitch Christensen - 2003-12-26
  
  Actually, looking at the code, $document.toString() results in the creation of a jakarta commons ToStringBuilder() which is documented to do exactly what I'm seeing (i.e. string-ify complex objects). However, this is not the same as accessing the document contents as a string.
  
  It appears that there is no way (currently) to access the PipelineDocument contents as a string.
  
  Should I open a bug?
  
  -Mitch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Dejan Krsmanovic - 2003-12-27
    
    Yes, you are right. This is a bug. If you take a look at CVS repository you could notice that toString() method was changed in latest revision. Before that, toString would return document content.
    As I currently cannot see how the document content could be accessed I guess this should be reverted to the earlier revision, or new method should be introduced.
    So please submit this issues as new bug. I would like to discuss about this with Bruce, since he had changed toString() method. This is very serious issue since few examples from documentation uses this method too.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Mitch Christensen - 2003-12-27
      
      I did open bug #866167 last night. I forgot to update this thread with this fact.
      
      FWIW, since PipelineDocument is a complex object, I'm not sure that using toString() to represent the contents is such a good idea anyways. It may be better to go with the getContents() method I proposed in the bug report.
      
      Of course this doesn't address the broken examples (or outstanding code others may have written that relies on toString()).
      
      -Mitch
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- bruce mcdonald - 2003-12-27
  
  You are right - the toString on the PipelineDocument is the currently correct. What is needed is a way to correcly convert the array of bytes that to a string object.
  
  You can get the pipelinedocument payload as an array of bytes - can this be used directly?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mitch Christensen - 2003-12-27
    
    My description of bug #866167 describes the solution that I have implemented locally.
    
    This involves adding a getContents() method to PipelineDocument so I can do the following:
    
    dbload.sql=$document.contents
    
    This makes sense to me, and works fine.
    
    I did try to use the byte array to no avail. It needs to be a String, and I couldn't figure a way to do the conversion within a velocity reference.
    
    -Mitch
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Santiago Begué - 2004-03-04
  
  Scanning from a directory and sending whith a smtpwriter I've founded the same problem for write the document contents in the smtpMessage.
  I've implemented your solution as suggested in the bug track and works ok.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- maxdaros - 2004-11-02
  
  Many thanks, mitchellch
  
  I'm trying with your suggestion.
  Now the output file after the transform phase is pure sql, so, I'm a little bit near the solution.
  
  But the SqlWriter stage throws me an Exception:
  
  the command line with it's output is:
  babeldoc process -p dbloader -f data/inputData.txt
  <2004-11-02 17:58:33,900> INFO [main] : PipelineStage name: convert
  <2004-11-02 17:58:34,181> INFO [main] : PipelineStage name: fWriter
  <2004-11-02 17:58:34,197> INFO [main] : PipelineStage name: transform
  <2004-11-02 17:58:34,384> INFO [main] : PipelineStage name: fWriter2
  <2004-11-02 17:58:34,400> INFO [main] : PipelineStage name: dbwriter
  <2004-11-02 17:58:34,634> WARN [main] : Error executing script: Syntax error or access violation m
  essage from server: "You have an error in your SQL syntax. Check the manual that corresponds to you
  r MySQL server version for the right syntax to use near 'com.babeldoc.core.pipeline.PipelineDocument
  @e06940[attributes=,"
  <2004-11-02 17:58:34,634> ERROR [main] : [DefaultPipelineStageErrorHandler.handlePipelineStageError
  ] PipelineStage name: dbwriter Error: com.babeldoc.core.pipeline.PipelineException: Error executing
  sql: .... (continues) ...
  
  I'm not sure that I can forward the output document of the transorm phase to SqlWriter stage by the attribute dbwriter.sql=$[document}
  
  Is there any example (apart examples in the babeldoc user guide) that I can examinate?
  
  Thank you very much.
  
  Massimo
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - johnMac - 2004-11-02
    
    Try dbwriter.sql=$document.contents
    ...johnM
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- maxdaros - 2004-11-02
  
  mmmhh,
  
  $document.contents doesn't work.
  
  and dbwriter.sql={$document.contents}
  
  causes the following Exception:
  
  <2004-11-02 21:17:23,091> WARN [main] : Error executing script: Query can not be null or empty
  
  <2004-11-02 21:17:23,111> ERROR [main] : [DefaultPipelineStageErrorHandler.handlePipelineStageError] PipelineStage name
  
  dbwriter Error: com.babeldoc.core.pipeline.PipelineException: Error executing sql:
  <2004-11-02 21:17:23,111> ERROR [main] : [PipelineFeeder.process]
  com.babeldoc.core.pipeline.PipelineException: Error executing sql:
          at com.babeldoc.sql.pipeline.stage.SqlWriterPipelineStage.process(Unknown Source)
          at com.babeldoc.core.pipeline.PipelineStage.processStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
          at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
          at com.babeldoc.core.pipeline.PipelineStageFactory.process(Unknown Source)
          at com.babeldoc.core.pipeline.PipelineFactory.process(Unknown Source)
          at com.babeldoc.core.pipeline.PipelineFactoryFactory.process(Unknown Source)
          at com.babeldoc.core.pipeline.feeder.SynchronousFeeder.process(Unknown Source)
          at com.babeldoc.core.pipeline.feeder.FeederFactory.process(Unknown Source)
          at com.babeldoc.core.pipeline.command.PipelineFeeder.process(Unknown Source)
          at com.babeldoc.core.pipeline.command.PipelineFeeder.execute(Unknown Source)
          at com.babeldoc.core.BabeldocCommand.executeCommand(Unknown Source)
          at com.babeldoc.core.BabeldocCommand.<init>(Unknown Source)
          at com.babeldoc.core.pipeline.command.PipelineFeeder.<init>(Unknown Source)
          at com.babeldoc.core.pipeline.command.PipelineFeeder.main(Unknown Source)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at com.babeldoc.core.Main.handleCommand(Unknown Source)
          at com.babeldoc.core.Main.main(Unknown Source)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at com.babeldoc.init.Main.runBabeldocMain(Unknown Source)
          at com.babeldoc.init.Main.main(Unknown Source)
  Caused by: com.babeldoc.core.pipeline.PipelineException: Query can not be null or empty
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mitch Christensen - 2004-11-02
    
    Hey,
    
    I think you might be dealing with an old issue. I fixed this problem locally, and posted the solution to the bug tracker (see https://sourceforge.net/tracker/?group_id=56976&atid=482496&func=detail&aid=866167\). The fix requires building a new Babeldoc (very easy actually).
    
    I assume that becuase the bug still has a status of 'open' that the fix had not yet been incorporated back into the codebase.
    
    The problem is (or may be) that there was a bug in Babeldoc that results in $document (which really results in a call to document.toString()) returning an array of bytes, which isn't what your SqlWriter stage expects. My proposed fix simply fixes document.getString() to return the truly stringified document contents.
    
    In a nutshell, if all other things are correct, implement the fix described in bug #866167 and you should be good to go.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- maxdaros - 2004-11-02
  
  Fine!!!
  
  I've modified com.babeldoc.core.pipeline.PipelineDocument as you said in the bug tracker, and now the pipeline works fine.
  
  Thank you very much ;)
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Mitch Christensen - 2004-11-02
    
    Glad to hear it.
    
    It would be nice if I could appeal to someone with committer priveliges to apply this fix to the codebase.
    
    Is there anyone still supporting/working on Babeldoc?
    
    -Mitch
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Loading database from flat files

Forums

Help

Loading database from flat files document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Loading database from flat files