I wonder how anyone can choose such a headache as PDF as their occupation for so long. Stefano, you can certainly be proud of yourself!
OK, I'm trying to arrange text extracted from a PDF in its natural order. We know that the order of occurrence in the original file doesn't always reflect a string's position on the page, because each string has its own x and y. So I thought ordering by y would give me what I need, and it does work for most PDFs, but unfortunately not for all. For instance, the PDFs that Wikipedia generates when you download an article as PDF (the 3 I tried) even have negative values for the "y" property of textString.Box.Value. The absolute values looked as if the minus sign could be ignored, so I tried sorting the text lines by Math.Abs(y); after that failed too, I couldn't find anything better than asking here.
How can I find the real text position?
TextExtractor is supposed to properly order extracted text (although its implementation is currently incomplete, as it doesn't care about columns and table layouts - see ISSUES file in the downloadable distribution); the anomalous coordinates you got are presumably caused by the wrong handling of a peculiar coordinate transformation, so please open a Bug Tracker report attaching your problematic PDF file.
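For reference, here is a minimal, library-neutral sketch of ordering text by position once the coordinates are trustworthy. It is not PDF Clown's TextExtractor implementation and does not use its API: `TextBox`, `reading_order`, `to_text` and the `tolerance` parameter are hypothetical names introduced for illustration. It assumes y grows downward from the top of the page, and, like the current extractor, it ignores columns and tables. It groups strings into lines with a vertical tolerance before sorting each line by x, since sorting by y alone misorders strings whose vertical positions differ slightly within the same line.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for an extracted string and its bounding box."""
    text: str
    x: float       # left edge
    y: float       # top edge; assumed to grow downward from the page top
    height: float  # box height, used to scale the line tolerance


def reading_order(boxes, tolerance=0.5):
    """Group boxes into lines by vertical position, then sort each line by x.

    A box joins the current line when its top edge lies within
    `tolerance * height` of the first box placed on that line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if lines and abs(box.y - lines[-1][0].y) <= tolerance * lines[-1][0].height:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within a line, left-to-right order is decided by x alone.
    return [sorted(line, key=lambda b: b.x) for line in lines]


def to_text(boxes):
    """Join the ordered lines into plain text, one line per row."""
    return "\n".join(
        " ".join(b.text for b in line) for line in reading_order(boxes)
    )
```

With slightly jittered y values, a plain sort by y would put "line" before "Second" in the sample below, while the grouped version keeps the left-to-right order:

```python
boxes = [
    TextBox("world", 60, 101, 12),
    TextBox("Hello", 10, 100, 12),
    TextBox("line", 50, 120, 12),
    TextBox("Second", 10, 120.5, 12),
]
print(to_text(boxes))
```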