
#4 pdf2xml crops lines prematurely

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-09-10
Created: 2009-09-22
Creator: Anonymous
Private: No

Hi all,

I am using the pdf2xml binary found at SourceForge. When I try to convert a PDF file to XML (pdf2xml -f 10 -l 10 -noImageInline 1.pdf 1.xml), it seems to randomly split some lines, i.e. it outputs a single line from the PDF file as two separate lines in the XML file. My understanding is that each <TEXT> tag in the XML should contain one line of text as it appears in the PDF file, but sometimes one line from the PDF ends up in two separate <TEXT> elements: one part of it is in one <TEXT> tag and the other part is in the next <TEXT> tag. Is this a known problem? Any suggestions?

Thanks
Andy

Discussion

  • Herve Dejean - 2009-09-23

    There is no notion of a line in PDF. The <TEXT> tag is based on a heuristic that groups nearby tokens into a single <TEXT>.
    One solution might be to make that threshold configurable on the command line (not possible currently).

    Robust detection of "real" lines requires a more complex algorithm (taking into account multi-column documents, among many other things).

     
  • Nobody/Anonymous

    Dejean,

    Thanks for your comments. I finally modified my code to revise the definition of a "line" (grouping all tokens with the same "y" coordinate value).
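
    For reference, here is a minimal sketch of that regrouping, assuming the tokens appear as TOKEN elements carrying an @y attribute (names taken from this discussion, so treat them as assumptions):

        # Group every TOKEN by its @y value and emit one merged line per
        # distinct y. Exact float equality is what I use; a small tolerance
        # may be safer on real files.
        import xml.etree.ElementTree as ET
        from collections import defaultdict

        def lines_by_y(xml_path):
            groups = defaultdict(list)
            for token in ET.parse(xml_path).iter("TOKEN"):
                y, text = token.get("y"), token.text
                if y is not None and text:
                    groups[float(y)].append(text)
            return [" ".join(words) for _, words in sorted(groups.items())]

        for line in lines_by_y("1.xml"):
            print(line)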

    I have one more question that I hope you or someone else can answer.

    It appears that when I convert PDFs to XML, some characters (single quotes, for example) get replaced with junk characters. What is the way around this?

    Other than that, I want to compliment the writers (and other contributors) of this software for doing such a magnificent job. Having looked at several other PDF manipulation tools (including commercial ones) that never did as good a job, I can really appreciate this one. Well done!

    Andy

     
  • Herve Dejean - 2009-09-25

    Andy,

    The XML encoding is UTF-8. View it with your browser (setting the encoding to UTF-8). If there are still junk characters, it means there is a font issue (Type 3, not embedded). I rely on the xpdf library to extract characters, and I have to say that font management is a real nightmare!
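
    A quick sanity check (just a sketch, nothing pdf2xml-specific): decode the whole file as strict UTF-8. If that passes and you still see junk, the problem is the font, not the encoding.

        # Strict UTF-8 decoding raises UnicodeDecodeError on any invalid byte.
        with open("1.xml", encoding="utf-8", errors="strict") as f:
            f.read()
        print("valid UTF-8; remaining junk characters point to a font issue")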

    Regarding lines, you can use the @baseline attribute instead of "y" if this information is present in the XML (it is not always present in the font information). It should be more reliable than @y.
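
    As a sketch, the grouping key from the snippet above could prefer @baseline and fall back to @y:

        def line_key(token):
            # Prefer @baseline; fall back to @y when no baseline was emitted.
            return float(token.get("baseline") or token.get("y"))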

    Thanks for your feedback, but most of the credit goes to xpdf.

    Hervé

     
  • Nobody/Anonymous

    Another quick question here.

    I have a set of OCR-ed documents, and it does not look like I am getting anything when I convert them to XML. Am I missing something here? Or is it just that the software does not handle OCR-ed documents yet?

    Thanks yet again.

     
  • Herve Dejean - 2009-10-19

    Are you sure text is present in the document?
    It should work. (I often apply an OCR engine to documents and then convert them to XML with pdftoxml.)

     
  • Nobody/Anonymous

    Herve,

    My apologies for the last note. The PDF files that I was talking about are in fact scanned image files.

    The question still remains, though. How can I handle these files? I have a large number of them and do not want to ignore them; I need to preserve the layout and font information.

    Thanks.

     
  • Herve Dejean - 2009-10-21

    The only solution is to apply an OCR engine to them.

     
  • Nobody/Anonymous

    Which one would you recommend?

     
  • Nobody/Anonymous

    One more thing. My collection has many such documents. What is a good way to determine which ones are scanned, given that I do the processing on the fly? At the moment I convert them to text anyway (using pdftotext), and if the text has fewer than, say, 100 characters, I discard it.
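
    In code, my check is roughly the following sketch (the 100-character threshold is just my own guess):

        # Run pdftotext and treat the PDF as scanned when almost no text
        # comes out. "pdftotext file.pdf -" writes the extracted text to stdout.
        import subprocess

        def looks_scanned(pdf_path, threshold=100):
            result = subprocess.run(["pdftotext", pdf_path, "-"],
                                    capture_output=True, text=True)
            return len(result.stdout.strip()) < threshold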

    Thanks.

     
  • Herve Dejean - 2009-10-22

    1. I usually use FineReader or ScanSoft (but they are not free). You can also try http://groups.google.com/group/ocropus (free), but I was not able to install it.

    2. Your approach for detecting scanned documents makes sense and should be robust.

     
  • Nobody/Anonymous

    Ocropus looks pretty good. I will play with it.

    Thank you very much for your advice.

    PS: I am doing extensive document processing. I will be back here very soon :)

     
  • Herve Dejean - 2009-10-27

    By the way, you need an OCR engine that generates PDF files.
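
    A hypothetical pipeline, with tesseract standing in for the engines mentioned above (it can write a searchable PDF via its "pdf" config; a multi-page scanned PDF would first need rasterizing, e.g. with pdftoppm):

        # OCR one page image into a searchable PDF, then convert it to XML.
        # The file names here are illustrative.
        import subprocess

        subprocess.run(["tesseract", "scan.png", "out", "pdf"], check=True)  # writes out.pdf
        subprocess.run(["pdf2xml", "out.pdf", "out.xml"], check=True)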

     
  • Nobody/Anonymous

    Herve,

    Back after a long hiatus, as the project had been put in cold storage :)

    One question I have is related to storage. I noticed that you have done some work on segmenting/chapterizing PDFs. Basically, I am trying to replicate that work on the PDF document collection at my disposal :) What would be an efficient way to store these segments/chapters? Are you storing them in a relational DB? An XML DB? Which one?

    Thanks.
    Andy

     
  • Herve Dejean - 2010-02-22

    Andy,

    I have no good advice on that. For our research we stop after recognizing the different structures, so we store them on a file system.

    Hervé

     
  • Nobody/Anonymous

    Herve,

    That's fine. Thanks for writing though.

    Andy

     
