Strange parsing

Help
jet
2009-02-11
2013-05-28
  • jet

    jet - 2009-02-11

    Hello!
    Thanks for this library!
    I'm using jpod for text extraction in lucene indexing.
    Because I'm new to PDF and jpod, I have one problem with parsing: jpod extracts the text of some documents as one letter per row.
    I tried to solve it but failed, and I found no comments about this. I looked at the page orientation, its dimensions, etc.
    My code:
    <code>
    package de.intarsys.pdf;

    import java.io.File;
    import java.io.IOException;

    import de.intarsys.pdf.content.CSDeviceBasedInterpreter;
    import de.intarsys.pdf.content.text.CSTextExtractor;
    import de.intarsys.pdf.parser.COSLoadException;
    import de.intarsys.pdf.pd.PDDocument;
    import de.intarsys.pdf.pd.PDPage;
    import de.intarsys.pdf.pd.PDPageTree;
    import de.intarsys.pdf.pd.PDResources;
    import de.intarsys.tools.locator.FileLocator;

    public class Main2 {

      protected static PDDocument document;

        /**
         * @param args
         */
        public static void main(String[] args) {
            try {
                File file = new File("/home/jet/Desktop/pdf/6310i_usersguide_en.pdf");
                FileLocator locator = new FileLocator(file);
                locator.setReadOnly();
                document =  PDDocument.createFromLocator(locator);
             
                PDPageTree pageTree = document.getPageTree();
                PDPage page = pageTree.getPageAt(0);
              
                CSTextExtractor extractor = new CSTextExtractor();
                PDResources pdr = page.getResources();
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), pdr);

                String contents = extractor.getContent();
                System.out.println(contents);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (COSLoadException e) {
                e.printStackTrace();
                } finally {
                    // document is still null if createFromLocator() failed
                    if (document != null) {
                        try {
                            document.close();
                            document = null;
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }
        }
    }
    </code>

    Here is an example of pdf document <a href="http://nds1.nokia.com/phones/files/guides/6310i_usersguide_en.pdf" target="_blank">6310i_usersguide_en.pdf</a>

    Result of parsing:
    <code>
    E
    l e
    c
    t
    r
    o
    n i
    c

    u
    s
    e
    r

    s

    g u
    i
    d e

    r
    e
    l e a
    s
    e
    d

    s
    u
    b
    j e
    c
    t

    t
    o

    "
    N
    o
    k
    i a

    U
    s
    e
    r

    s

    G
    u
    i d
    e
    s

    T
    e
    r
    m s

    a
    n

    C
    o
    n d
    i
    t
    i
    o
    n s ,

    7
    t
    h

    J
    u
    n e ,

    1
    9 9
    8
    "
    U
    s
    e
    r
    ´
    s

    G
    u
    i
    d
    e 9
    3
    5
    4
    2
    6
    0
    I
    s
    s
    u
    e

    3
    </code>
    Question: what should I do to get "normal" text?
    Thanks a lot!

     
    • jet

      jet - 2009-02-11

      CABAReT PDF Viewer uses JPoD and the document looks fine there. So the solution is out there :)

       
    • Elfi Heck

      Elfi Heck - 2009-02-13

      But the CABAReT Stage text extraction puts every character on a separate line too ;-)

      I found that this is because the document's pages are rotated 90 degrees and the text is rotated back so it appears in a horizontal line.
      CSTextExtractor tries to guess whether a character is on a new line by its vertical distance to the previous character. The thresholds are computed from the current font size and the current scaling (and two rather arbitrary distance values). However, because the text is rotated, the values to use for the computation would be "shear" rather than "scale".
      The methods to change are CSTextExtractor.textSetFont() and CSTextExtractor.textSetTransform(). You can get a reasonable value to handle both "normal" and rotated text by using getDeterminant() instead of getScaleX()/getScaleY(). Of course you could also compute the actual scaling factor from getScaleX/Y and getShearX/Y (we don't do it for performance reasons).
      Why these two particular methods have getScale in them I don't know. Probably an oversight. We will fix this.
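
To illustrate the scale-versus-shear point, here is a minimal sketch using `java.awt.geom.AffineTransform`, which has accessors with the names mentioned above. It is not jPod code: the class and helper names are hypothetical, and it assumes a uniform 12x scaling with an exact 90 degree rotation.

```java
import java.awt.geom.AffineTransform;

public class ScaleUnderRotation {

    // Rotation-independent estimate: square root of the absolute determinant.
    static double scaleFromDeterminant(AffineTransform t) {
        return Math.sqrt(Math.abs(t.getDeterminant()));
    }

    // Exact horizontal scaling factor: length of the transformed unit x vector.
    static double scaleXFromComponents(AffineTransform t) {
        return Math.hypot(t.getScaleX(), t.getShearY());
    }

    // Exact vertical scaling factor: length of the transformed unit y vector.
    static double scaleYFromComponents(AffineTransform t) {
        return Math.hypot(t.getShearX(), t.getScaleY());
    }

    public static void main(String[] args) {
        // Unrotated text scaled by 12.
        AffineTransform plain = AffineTransform.getScaleInstance(12, 12);

        // The same scaling on a page rotated by 90 degrees.
        AffineTransform rotated = AffineTransform.getQuadrantRotateInstance(1);
        rotated.scale(12, 12);

        // For the rotated matrix getScaleX() is 0, so a threshold derived
        // from it collapses; the other two measures still report 12.
        for (AffineTransform t : new AffineTransform[] { plain, rotated }) {
            System.out.println("getScaleX=" + t.getScaleX()
                    + " sqrt|det|=" + scaleFromDeterminant(t)
                    + " hypotX=" + scaleXFromComponents(t)
                    + " hypotY=" + scaleYFromComponents(t));
        }
    }
}
```

The determinant version needs only one value and works for both cases when the scaling is uniform; the component version also distinguishes horizontal from vertical scaling, at the cost of a little more arithmetic.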

       
    • jet

      jet - 2009-02-21

      I've come up with a solution based on your advice, and I hope it helps somebody else.
      You said that I should play with fonts to get a reasonable value. That could have helped, and I tried, but failed again: I didn't try to create my own font, just used a standard one, and got no result.

      I concluded that the font affects parsing because text blocks depend on it, so I took a look at the CSTextExtractor.textSetFont() and CSTextExtractor.textSetTransform() methods.
      There I commented out all the offset calculations (there were two expressions in each method). I need just the text, and the font isn't important to me.
      Now it seems that parsing works fine.
      Thanks a lot for this idea.

       
      • Elfi Heck

        Elfi Heck - 2009-02-24

        OK, if that works for you. It probably does because your document contains the spaces between words in the content stream (so they are sort of "drawn").
        Other documents just render the next glyph at an offset. That's what the distance guessing is for. If the distance between one glyph and the next exceeds a certain value a space is assumed. And the font size is used to compute that value. So that's why we need the font in the two methods.
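
The distance guessing described above can be sketched roughly as follows. This is a simplified illustration, not the actual CSTextExtractor logic: the `Glyph` class, the `SPACE_FACTOR` constant and its value, and the one-dimensional layout are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpaceGuess {

    // Hypothetical glyph record: character, left edge and advance width.
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed fraction of the font size that counts as a word gap.
    static final double SPACE_FACTOR = 0.25;

    // Joins glyphs on one line, inserting a space where the gap between
    // the end of one glyph and the start of the next exceeds the threshold.
    static String join(List<Glyph> glyphs, double fontSize) {
        double threshold = fontSize * SPACE_FACTOR;
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > threshold) {
                sb.append(' ');
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('H', 0, 6));
        glyphs.add(new Glyph('i', 6, 3));
        // A gap of 5 units follows, larger than 12 * 0.25 = 3.
        glyphs.add(new Glyph('y', 14, 6));
        glyphs.add(new Glyph('o', 20, 6));
        glyphs.add(new Glyph('u', 26, 6));
        System.out.println(join(glyphs, 12)); // prints "Hi you"
    }
}
```

Because the threshold scales with the font size, dropping the font-dependent calculations only works for documents that already contain explicit space characters in the content stream.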

         
    • jet

      jet - 2009-03-02

      It was a quick fix, so it's not the best one :)
      My friend and I worked on it to find a better solution.
      Finally we found a new one: using a static transform matrix.
      So now CSPlainTextExtractor is used instead of CSTextExtractor:
      <code>
      public class CSPlainTextExtractor extends CSTextExtractor {

          @Override
          public void textSetTransform(float a, float b, float c, float d, float e,
                  float f) {
              // Ignore the document's text matrix and always pass the identity matrix.
              super.textSetTransform(1, 0, 0, 1, 0, 0);
          }

      }
      </code>
      The extraction result is fine. This implementation also works a bit faster (of course, there is no calculation :) )

       
    • javatechman

      javatechman - 2009-08-11

      Thank you jet, I had the same problem and it worked for me.

       
    • javatechman

      javatechman - 2009-08-11

      Well, though it resolved my problem with one PDF, it started causing problems with another one that was otherwise working fine (1.6_Acrobat_7.x_Powered By Crystal.pdf). Any ideas?

       
      • mtraut

        mtraut - 2009-08-11

        While I do not have access to a document "1.6_Acrobat_7.x_Powered By Crystal.pdf" :-) I assume that the problem is still the same. The simple text extractor (which more or less demonstrates the possibilities you have) does not take into account all transformation-related scenarios.

        MAYBE the most recent (not yet released) version has improvements, but to know, I need your document...

         
        • javatechman

          javatechman - 2009-08-12

          First, thanks for this lib. With the name I meant to indicate how the document was created:
          PDF Version: 1.6 (Acrobat 7.x), PDF Producer: Powered by Crystal.
          Anyway, I can provide you with both documents, but I don't know how to pass them to you. Can you help?
          Thanks

           
          • mtraut

            mtraut - 2009-08-12

            You can upload files using the Tracker - Feature Request.

            I can have a look, but I can't help if it's transformation related; there is simply no time at the moment. But with the code you should be able to adapt the transformation and extraction logic to whatever you need.

             
    • javatechman

      javatechman - 2009-08-13

      I uploaded the document. I am not familiar with the PDF spec, so I hope you can give me some trick to identify such documents so that I can apply different parsing.

       
