Strange parsing

Help
jet
2009-02-11
2013-05-28
  • jet
    2009-02-11

    Hello!
    Thanks for this library!
    I'm using jpod for text extraction in Lucene indexing.
    Because I'm new to PDF and jpod, I have a problem with parsing: jpod extracts the text of some documents one letter per line.
    I tried to solve it but failed, and I found no comments about this. I looked at the page orientation, its dimensions, etc.
    My code:
    <code>
    package de.intarsys.pdf;

    import java.io.File;
    import java.io.IOException;

    import de.intarsys.pdf.content.CSDeviceBasedInterpreter;
    import de.intarsys.pdf.content.text.CSTextExtractor;
    import de.intarsys.pdf.parser.COSLoadException;
    import de.intarsys.pdf.pd.PDDocument;
    import de.intarsys.pdf.pd.PDPage;
    import de.intarsys.pdf.pd.PDPageTree;
    import de.intarsys.pdf.pd.PDResources;
    import de.intarsys.tools.locator.FileLocator;

    public class Main2 {

      protected static PDDocument document;

        /**
         * @param args
         */
        public static void main(String[] args) {
            try {
                File file = new File("/home/jet/Desktop/pdf/6310i_usersguide_en.pdf");
                FileLocator locator = new FileLocator(file);
                locator.setReadOnly();
                document =  PDDocument.createFromLocator(locator);
             
                PDPageTree pageTree = document.getPageTree();
                PDPage page = pageTree.getPageAt(0);
              
                CSTextExtractor extractor = new CSTextExtractor();
                PDResources pdr = page.getResources();
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), pdr);

                String contents = extractor.getContent();
                System.out.println(contents);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (COSLoadException e) {
                e.printStackTrace();
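            } finally {
                // document is still null if loading failed, so guard against a NullPointerException
                if (document != null) {
                    try {
                        document.close();
                        document = null;
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }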
        }
    }
    </code>

    Here is an example PDF document: 6310i_usersguide_en.pdf (http://nds1.nokia.com/phones/files/guides/6310i_usersguide_en.pdf)

    Result of parsing:
    <code>
    E
    l e
    c
    t
    r
    o
    n i
    c

    u
    s
    e
    r

    s

    g u
    i
    d e

    r
    e
    l e a
    s
    e
    d

    s
    u
    b
    j e
    c
    t

    t
    o

    "
    N
    o
    k
    i a

    U
    s
    e
    r

    s

    G
    u
    i d
    e
    s

    T
    e
    r
    m s

    a
    n

    C
    o
    n d
    i
    t
    i
    o
    n s ,

    7
    t
    h

    J
    u
    n e ,

    1
    9 9
    8
    "
    U
    s
    e
    r
    ´
    s

    G
    u
    i
    d
    e 9
    3
    5
    4
    2
    6
    0
    I
    s
    s
    u
    e

    3
    </code>
    Question: what should I do to get "normal" text?
    Thanks a lot!

     
    • jet
      2009-02-11

      CABAReT PDF Viewer uses JPoD and the document looks fine there. So the solution is out there :)

       
    • Elfi Heck
      2009-02-13

      But the CABAReT Stage text extraction will put every character on its own line too ;-)

      I found that this is because the document's pages are rotated 90 degrees and the text is rotated back so it appears in a horizontal line.
      CSTextExtractor tries to guess whether a character is on a new line from its vertical distance to the previous character. The thresholds are computed from the current font size and the current scaling (and two rather arbitrary distance values). But because the text is rotated, the values to use for the computation would be not "scale" but "shear".
      The methods to change are CSTextExtractor.textSetFont() and CSTextExtractor.textSetTransform(). You can get a reasonable value to handle both "normal" and rotated text by using getDeterminant() instead of getScaleX()/getScaleY(). Of course you could also compute the actual scaling factor from getScaleX/Y and getShearX/Y (we don't do it for performance reasons).
      Why these two particular methods have getScale in them I don't know. Probably an oversight. We will fix this.
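
A minimal sketch of the point about scale versus shear, not jPod's actual code. It assumes the transform behaves like java.awt.geom.AffineTransform, whose getScaleX/Y, getShearX/Y and getDeterminant methods carry the names mentioned above. The class name, the helper methods and the choice to take the square root of the determinant are assumptions made for illustration.

```java
import java.awt.geom.AffineTransform;

final class EffectiveScaleSketch {

    // Rotation-independent scale estimate: square root of |determinant| (assumed variant).
    static double scaleFromDeterminant(AffineTransform tx) {
        return Math.sqrt(Math.abs(tx.getDeterminant()));
    }

    // Actual x scaling: length of the image of the unit vector (1, 0).
    static double actualScaleX(AffineTransform tx) {
        return Math.hypot(tx.getScaleX(), tx.getShearY());
    }

    // Actual y scaling: length of the image of the unit vector (0, 1).
    static double actualScaleY(AffineTransform tx) {
        return Math.hypot(tx.getShearX(), tx.getScaleY());
    }

    public static void main(String[] args) {
        // Upright text scaled by 12 (constructor order: m00, m10, m01, m11, m02, m12).
        AffineTransform upright = new AffineTransform(12, 0, 0, 12, 0, 0);
        // Same size rotated by 90 degrees: the scale entries are 0, the shear entries carry the size.
        AffineTransform rotated = new AffineTransform(0, 12, -12, 0, 0, 0);

        for (AffineTransform tx : new AffineTransform[] { upright, rotated }) {
            System.out.println("getScaleY=" + tx.getScaleY()
                    + " fromDeterminant=" + scaleFromDeterminant(tx)
                    + " actualScaleY=" + actualScaleY(tx));
        }
    }
}
```

For the rotated matrix getScaleY() is 0, so any threshold multiplied by it collapses to 0, while both alternatives still report 12. The determinant needs one multiplication and one subtraction; the hypot variant is exact per axis but costs more per call.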

       
    • jet
      2009-02-21

      I've found a solution based on your suggestion, and I hope it helps somebody else.
      You said I should experiment with fonts to get a reasonable value. I tried that but failed again: I didn't try to create my own font, only used a standard one, and got no result.

      I concluded that the font affects parsing because the text blocks depend on it. So I took a look at the CSTextExtractor.textSetFont() and CSTextExtractor.textSetTransform() methods.
      There I commented out all the offset calculations: there were 2 expressions in each method. I need just the text, and the font isn't important to me.
      Now it seems that parsing works fine.
      Thanks a lot for this idea.

       
      • Elfi Heck
        2009-02-24

        OK, if that works for you. It probably does because your document contains the spaces between words in the content stream (so they are sort of "drawn").
        Other documents just render the next glyph at an offset. That's what the distance guessing is for. If the distance between one glyph and the next exceeds a certain value a space is assumed. And the font size is used to compute that value. So that's why we need the font in the two methods.
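
The distance guessing described above can be sketched roughly as follows. This is an illustration only, not the code in CSTextExtractor: the Glyph class, the join method and the SPACE_FACTOR value are hypothetical names and numbers chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

final class SpaceGuessSketch {

    // Hypothetical glyph record: the character, the x position where it starts, and its advance width.
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Hypothetical tuning value: a gap wider than this fraction of the font size is read as a space.
    static final double SPACE_FACTOR = 0.25;

    // Joins glyphs into text, inserting a space wherever the gap exceeds the font-size-based threshold.
    static String join(List<Glyph> glyphs, double fontSize) {
        double threshold = fontSize * SPACE_FACTOR;
        StringBuilder text = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null) {
                // Distance between the end of the previous glyph and the start of this one.
                double gap = glyph.x - (previous.x + previous.width);
                if (gap > threshold) {
                    text.append(' ');
                }
            }
            text.append(glyph.ch);
            previous = glyph;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('H', 0, 6), new Glyph('i', 6, 3), new Glyph('y', 14, 6));
        System.out.println(join(glyphs, 12)); // prints "Hi y"
    }
}
```

With the threshold removed entirely, no space can ever be inferred from an offset, which is why dropping the calculations only works for documents that carry explicit space characters in the content stream.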

         
    • jet
      2009-03-02

      It was a quick fix, so it's not the best :)
      My friend and I worked on it to find a better solution.
      Finally we've found a new one: using a static transform matrix.
      So now CSPlainTextExtractor is used instead of CSTextExtractor:
      <code>
      public class CSPlainTextExtractor extends CSTextExtractor {

          @Override
          public void textSetTransform(float a, float b, float c, float d, float e,
                  float f) {
              super.textSetTransform(1, 0, 0, 1, 0, 0);
          }

      }
      </code>
      The result of the extraction is fine. This implementation also works a bit faster (of course, there is no calculation :) )

       
    • javatechman
      2009-08-11

      Thank you jet, I had the same problem and it worked for me.

       
    • javatechman
      2009-08-11

      Well, although it resolved my problem with one PDF, it started causing problems with another one that was otherwise working fine (1.6_Acrobat_7.x_Powered By Crystal.pdf). Any idea?

       
      • mtraut
        2009-08-11

        While I do not have access to a document "1.6_Acrobat_7.x_Powered By Crystal.pdf" :-) I assume that the problem is still the same. The simple text extractor (which more or less demonstrates the possibilities you have) does not take into account all transformation-related scenarios.

        MAYBE the most recent (not yet released) version has improvements, but to know that I need your document...

         
        • javatechman
          2009-08-12

          First, thanks for this lib. With the name I was trying to indicate how the document was created:
          PDF version: 1.6 (Acrobat 7.x), PDF producer: Powered by Crystal.
          Anyway, I can provide you with both documents, but I don't know how to send them to you. Can you help?
          Thanks

           
          • mtraut
            2009-08-12

            You can upload files using the Tracker - Feature Request.

            I can have a look, but I can't help if it's transformation related: there is simply no time at the moment. But with the code you should be able to adapt the transformation and extraction logic to whatever you need.

             
    • javatechman
      2009-08-13

      I uploaded the document. I am not familiar with the PDF specs, so I hope you can give me some trick to identify this kind of document so that I can apply different parsing.