pdf2xml / Support Requests / #4 pdf2xml crops lines prematurely

Herve Dejean - 2009-09-23

There is no notion of a line in PDF. The <TEXT> tag is based on a heuristic that groups nearby tokens into a single <TEXT>.
One solution might be to make the grouping threshold configurable on the command line (not currently possible).

Robust detection of "real" lines has to be done by a more complex algorithm (one that takes multi-column documents and many other things into account).
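
To illustrate the kind of threshold-based grouping described above, here is a minimal Python sketch. It is not pdf2xml's actual heuristic: the `Token` class, the `group_tokens` function, and the `max_gap` and `y_tolerance` parameters are all hypothetical names chosen for the example. It only shows why a single horizontal threshold decides whether neighbouring tokens end up in one group or in several.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float      # left edge of the token
    y: float      # vertical position of the token
    width: float  # horizontal extent of the token


def group_tokens(tokens, max_gap=5.0, y_tolerance=1.0):
    """Group tokens into runs.

    A token joins the current run when it sits at (almost) the same y as the
    previous token and starts no more than max_gap after its right edge.
    Naive on purpose: tokens are sorted by (y, x), so the sketch assumes
    tokens on one row share practically the same y value.
    """
    groups = []
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        if groups:
            last = groups[-1][-1]
            same_row = abs(tok.y - last.y) <= y_tolerance
            gap = tok.x - (last.x + last.width)
            if same_row and gap <= max_gap:
                groups[-1].append(tok)
                continue
        groups.append([tok])
    return groups


tokens = [
    Token("Hello", 10, 100, 30),
    Token("world", 43, 100, 30),
    Token("Column2", 300, 100, 40),
    Token("next", 10, 115, 20),
]
for group in group_tokens(tokens, max_gap=5.0):
    print([t.text for t in group])
```

With a small `max_gap` the far-away token on the same row is split off, which is the "premature cropping" effect; with a very large `max_gap` everything on the row is merged, which would be wrong for multi-column pages. That trade-off is why a fixed threshold cannot replace a real line-detection algorithm.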

Nobody/Anonymous - 2009-09-24

Dejean,

Thanks for your comments. I did finally modify my code to revise the definition of a "line" (all tokens with the same "y" co-ordinate value).

I have one more question that I would appreciate if you or someone else could answer.

It appears that when I convert PDFs to XMLs some characters (single quotes, for example) get replaced in the XML with some junk characters. What is the way around this?

Other than that, I want to compliment the writers (and other contributors) of this software for doing such a magnificent job. Having looked at several other PDF manipulation tools (including commercial ones) which never did that good a job, I can really appreciate this tool. Well done!

Andy

Herve Dejean - 2009-09-25

Andy,

The XML encoding is UTF-8. View it in your browser (with the encoding set to UTF-8). If there are still junk characters, there is a font issue (a Type 3 font, or a font that is not embedded). I rely on the xpdf library to extract characters, and I have to say that font management is a real nightmare!
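
A quick way to tell the two cases apart outside a browser is to decode the file explicitly as UTF-8 and look for characters that rarely belong in running text. The sketch below is only an illustration; `suspicious_chars` is a hypothetical helper, and the choice of "suspicious" categories (control, private-use, unassigned, and the U+FFFD replacement character) is an assumption, not something pdf2xml or xpdf defines.

```python
import unicodedata


def suspicious_chars(data: bytes):
    """Decode bytes as UTF-8 and return (index, char) pairs worth inspecting.

    U+FFFD appears where the bytes were not valid UTF-8; control, private-use
    and unassigned code points often point to a font mapping problem.
    """
    text = data.decode("utf-8", errors="replace")
    found = []
    for i, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace control characters are fine
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in ("Cc", "Co", "Cn"):
            found.append((i, ch))
    return found


# A correctly encoded right single quote is not flagged.
print(suspicious_chars("it\u2019s".encode("utf-8")))

# The same quote read with the wrong encoding turns into three junk characters.
print("\u2019".encode("utf-8").decode("cp1252"))
```

If the second kind of junk (a quote showing up as three odd characters) is what you see, the file is fine and the viewer is using the wrong encoding. If `suspicious_chars` reports private-use or replacement characters in a file read as UTF-8, the problem is more likely in the font information of the PDF itself.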

Regarding lines, you can use the @baseline attribute instead of "y" if this information is present in the XML (it is not always present in the font information). It should be more reliable than @y.
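
As a rough sketch of that suggestion, the Python below groups tokens by their baseline and falls back to "y" when no baseline is present. The sample XML is invented for the example, and the element and attribute names (`TOKEN`, `baseline`, `y`, `x`) simply follow the wording of this thread; check them against a real output file, since the actual names may differ. `lines_by_baseline` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample: two tokens share a baseline although their y values differ.
SAMPLE = """<PAGE>
  <TEXT><TOKEN x="10" y="100.2" baseline="108.0">Hello</TOKEN></TEXT>
  <TEXT><TOKEN x="60" y="99.1" baseline="108.0">world</TOKEN></TEXT>
  <TEXT><TOKEN x="10" y="120.0">next</TOKEN></TEXT>
</PAGE>"""


def lines_by_baseline(xml_text, baseline_attr="baseline", y_attr="y", precision=0):
    """Return one string per line, grouping tokens by baseline (or y if missing).

    Assumption: mixing baseline and y keys is acceptable for a rough grouping;
    a careful implementation would not compare the two directly.
    """
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for tok in root.iter("TOKEN"):
        raw = tok.get(baseline_attr) or tok.get(y_attr)
        if raw is None:
            continue  # no vertical position available for this token
        key = round(float(raw), precision)
        rows[key].append((float(tok.get("x", 0)), tok.text or ""))
    # Sort rows top to bottom, and tokens within a row left to right.
    return [" ".join(text for _, text in sorted(items))
            for _, items in sorted(rows.items())]


print(lines_by_baseline(SAMPLE))
```

In the sample, grouping on "y" alone would put "Hello" and "world" on different lines because their y values differ by about one unit, while the shared baseline keeps them together.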

Thanks for your feedback, but most of the credit goes to xpdf.

Hervé

Nobody/Anonymous - 2009-10-16

Another quick question here.

I have a set of OCR-ed documents, and it does not look like I am getting anything when I convert them into xml. Am I missing something here? Is it just that the software does not handle OCR-ed documents yet?

Thanks yet again.

Herve Dejean - 2009-10-19

Are you sure text is present in the document?
It should work. (I often apply an OCR engine to documents and then convert them to XML with pdftoxml.)

Nobody/Anonymous - 2009-10-20

Herve,

My apologies for the last note. The PDF files that I was talking about are in fact scanned image files.

The question still remains though. How can I handle these files? I have a large number of them, and do not want to ignore them. I need to preserve the layout and font information.

Thanks.

Herve Dejean - 2009-10-21

The only solution is to apply an OCR engine to them.

Nobody/Anonymous - 2009-10-21

Which one would you recommend?

Nobody/Anonymous - 2009-10-22

One more thing. My collection has many such documents. What is a good way to determine which ones are scanned, as I do processing on the fly? At the moment, I convert them to txt anyway (using pdftotext) and if the txt has fewer than, say, 100 characters I discard it.

Thanks.

Herve Dejean - 2009-10-22

1- I usually use FineReader or Scansoft (but they are not free). You can also try http://groups.google.com/group/ocropus (free), but I was not able to install it.

2- Your approach for detecting scanned documents makes sense, and should be robust.
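
For reference, the detection approach discussed here (run pdftotext and treat a nearly empty result as a scanned document) could look roughly like the Python sketch below. It assumes the `pdftotext` command-line tool is installed and on the PATH; `extract_text` and `looks_scanned` are hypothetical helper names, and counting only non-whitespace characters is a small variation on the "fewer than 100 characters" rule, chosen so that page breaks and blank lines do not count as text.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return its text; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_scanned(text, min_chars=100):
    """True when the text holds fewer than min_chars non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


# Usage, assuming a file named example.pdf exists:
#     if looks_scanned(extract_text("example.pdf")):
#         ...send the file to an OCR engine instead...
```

The threshold of 100 is the value mentioned in this thread, not a tuned constant; short documents such as single-page cover sheets may need a lower value.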

Nobody/Anonymous - 2009-10-23

Ocropus looks pretty good. I will play with it.

Thank you very much for your advice.

PS: I am doing extensive document processing. I will be back here very soon :)

Herve Dejean - 2009-10-27

By the way, you need an OCR engine that generates PDF files.

Nobody/Anonymous - 2010-02-17

Herve,

Back after a long hiatus, as the project had been put in cold storage :)

One question I have is related to storage. I noticed that you have done some work related to segmenting/chapterizing PDFs. Basically, I am trying to replicate that work on the PDF documents collection at my disposal :) What would be an efficient way to store these segments/chapters? Are you storing them in a relational DB? XML-DB? Which one?

Thanks.
Andy

Herve Dejean - 2010-02-22

Andy,

I have no good advice to offer. For our research we stop after recognizing the different structures, so we simply store them on the file system.

Hervé

Nobody/Anonymous - 2010-02-23

Herve,

That's fine. Thanks for writing though.

Andy
