Hi all,
I am using the pdf2xml binary found on SourceForge. When I convert a PDF file to XML (pdf2xml -f 10 -l 10 -noImageInline 1.pdf 1.xml), it seems to split some lines at random (i.e. it outputs a single line from the PDF file as 2 separate lines in the XML file). My understanding is that each <TEXT> tag in the XML contains one line of text as it appears in the PDF file, but sometimes a single line from the PDF ends up in 2 separate <TEXT> tags: one part of it is in one <TEXT> tag and the rest is in the next <TEXT> tag. Is this a known problem? Any suggestions?
Thanks
Andy
Anonymous
There is no notion of a line in PDF. The <TEXT> tag is based on a heuristic that groups nearby tokens into a single <TEXT>.
One solution might be to set the grouping threshold on the command line (not currently possible).
Robust detection of "real" lines has to be done by a more complex algorithm (taking into account multi-column documents, and many other things).
Dejean,
Thanks for your comments. I did finally modify my code to revise the definition of a "line" (all tokens with same "y" co-ordinate values).
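The revised definition of a "line" described above could be sketched roughly as follows. This is only an illustration, not the code actually used: it assumes the XML contains TOKEN elements carrying numeric "x" and "y" attributes, and the function name, the attr parameter and the small tolerance (for coordinates that differ by rounding) are all hypothetical.

```python
import xml.etree.ElementTree as ET


def group_tokens_into_lines(xml_text, attr="y", tolerance=0.5):
    """Group TOKEN elements into lines by a shared vertical coordinate.

    Assumptions (not confirmed by the tool's documentation): tokens are
    <TOKEN> elements with numeric "x" and `attr` attributes. `tolerance`
    is an arbitrary illustrative value.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter("TOKEN"):
        value = el.get(attr)
        if value is None:
            continue  # skip tokens that lack the chosen attribute
        tokens.append((float(value), float(el.get("x", "0")), (el.text or "").strip()))

    # Sort by vertical position so that tokens of one line are adjacent.
    tokens.sort(key=lambda t: t[0])

    lines = []
    for y, x, text in tokens:
        # Compare against the first token of the current line.
        if lines and abs(y - lines[-1]["y"]) <= tolerance:
            lines[-1]["tokens"].append((x, text))
        else:
            lines.append({"y": y, "tokens": [(x, text)]})

    # Within each line, order tokens left to right.
    return [" ".join(text for _, text in sorted(line["tokens"])) for line in lines]


sample = """<DOCUMENT><PAGE>
<TEXT><TOKEN x="10" y="100.0">Hello</TOKEN></TEXT>
<TEXT><TOKEN x="60" y="100.0">world</TOKEN></TEXT>
<TEXT><TOKEN x="10" y="120.0">Next</TOKEN><TOKEN x="50" y="120.2">line</TOKEN></TEXT>
</PAGE></DOCUMENT>"""

print(group_tokens_into_lines(sample))
```

Note that grouping purely on a vertical coordinate will merge tokens from different columns that happen to sit at the same height, so it only suits single-column pages.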
I have one more question that I would appreciate if you or someone else could answer.
It appears that when I convert PDFs to XMLs some characters (single quotes, for example) get replaced in the XML with some junk characters. What is the way around this?
Other than that, I want to compliment the writers (and other contributors) of this software for doing such a magnificent job. Having looked at several other PDF manipulation tools (including commercial ones), none of which did as good a job, I can really appreciate this tool. Well done!
Andy
Andy,
The XML encoding is UTF-8. View it in your browser (with the encoding set to UTF-8). If there are still junk characters, it means that there is a font issue (Type 3 fonts, fonts not embedded). I rely on the xpdf library to extract characters. And I have to say that font management is a real nightmare!
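To illustrate the encoding point, here is a small sketch (plain Python, unrelated to the tool's own code) of how a typographic single quote turns into junk when UTF-8 bytes are decoded with the wrong encoding. The choice of Windows-1252 as the "wrong" encoding is only an assumption about a typical viewer default.

```python
# A typographic apostrophe (U+2019), common in PDF text.
original = "it\u2019s"

# What a UTF-8 XML file contains on disk for that string.
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong encoding produces three junk characters per quote.
wrong = utf8_bytes.decode("cp1252")

# Decoding as UTF-8 restores the original text.
right = utf8_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

When reading the XML from a script, passing the encoding explicitly, for example open(path, encoding="utf-8"), avoids depending on the platform default.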
Regarding lines, you can use the @baseline attribute instead of "y" if this information is present in the XML (it is not always present in the font information). It should be more reliable than @y.
Thanks for your feedback, but most of the credits go to xpdf.
Hervé
Another quick question here.
I have a set of OCR-ed documents, and it does not look like I am getting anything when I convert them into xml. Am I missing something here? Is it just that the software does not handle OCR-ed documents yet?
Thanks yet again.
Are you sure text is present in the document?
It should work. (I often apply an OCR engine to documents and then convert them to XML with pdftoxml.)
Herve,
My apologies for the last note. The PDF files that I was talking about are in fact scanned image files.
The question still remains though. How can I handle these files? I have a large number of them, and do not want to ignore them. I need to preserve the layout and font information.
Thanks.
The only solution is to apply an OCR engine to them.
Which one would you recommend?
One more thing. My collection has many such documents. What is a good way to determine which ones are scanned, as I do the processing on the fly? At the moment, I convert them to txt anyway (using pdftotext) and if the txt has fewer than, say, 100 characters, I discard it.
Thanks.
1- I usually use FineReader or Scansoft (but they are not free). You can also try http://groups.google.com/group/ocropus (free), but I was not able to install it.
2- Your approach for detecting scanned documents makes sense, and should be robust.
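The detection heuristic discussed above might look something like this in Python. It is a sketch under assumptions: pdftotext must be on the PATH, the function names are hypothetical, and counting only non-whitespace characters (so that page breaks and blank lines do not count) is a small variation on the original idea. The 100-character threshold is the one mentioned in the question.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_scanned(text, min_chars=100):
    """Return True if the text has fewer than min_chars non-whitespace characters."""
    non_whitespace = "".join(text.split())
    return len(non_whitespace) < min_chars


# Usage, assuming a file named 1.pdf exists:
#   if looks_scanned(extract_text("1.pdf")):
#       ...send the file to an OCR engine instead of discarding it...
```

A per-page threshold may work better for documents that mix scanned and text pages, since a single text page can push the whole file over the limit.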
Ocropus looks pretty good. I will play with it.
Thank you very much for your advice.
PS: I am doing extensive document processing. I will be back here very soon :)
By the way, you need an OCR engine that generates PDF files.
Herve,
Back after a long hiatus, as the project had been put in cold storage :)
One question I have is related to storage. I noticed that you have done some work related to segmenting/chapterizing PDFs. Basically, I am trying to replicate that work on the PDF documents collection at my disposal :) What would be an efficient way to store these segments/chapters? Are you storing them in a relational DB? XML-DB? Which one?
Thanks.
Andy
Andy,
I have no good advice to offer. For our research, we stop after recognizing the different structures, so we simply store them on a file system.
Hervé
Herve,
That's fine. Thanks for writing though.
Andy