textString.Box Coordinates

toshy kava
2011-01-20
2013-01-26
  • toshy kava
    2011-01-20

    Hi everyone

    I wonder how one can choose such a headache as PDF as one's occupation for such a long time. Stefano, you can certainly be proud of yourself!

    OK, I'm trying to arrange text extracted from a PDF in its natural reading order. We know that the order in which text occurs in the original file does not always reflect its position on the page, because each piece carries its own x and y coordinates. So I thought that ordering by y would give me what I need, and it does work for most PDFs, but unfortunately not for all. For instance, in the PDFs that Wikipedia generates when you download an article as PDF, the "y" property of textString.Box.Value even had negative values in the 3 files I tried. The absolute values looked as if the minus sign could simply be ignored, so I tried sorting the text lines by Math.Abs(y); when that failed too, I couldn't find anything better than asking here.

    How can I find out the real text position?

    Thanks.

     
  • TextExtractor is supposed to order extracted text properly (although its implementation is currently incomplete, as it does not yet handle columns and table layouts; see the ISSUES file in the downloadable distribution). The anomalous coordinates you got are presumably caused by incorrect handling of a peculiar coordinate transformation, so please open a Bug Tracker report and attach your problematic PDF file.

    Thank you!
    Stefano

    http://clown.sourceforge.net/API/it/stefanochizzolini/clown/tools/TextExtractor.html
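
For readers following the thread, the "order by y" idea from the question can be sketched roughly as below. This is only an illustration under stated assumptions, not PDF Clown's TextExtractor implementation: `TextBox`, `ReadingOrder` and the `tolerance` parameter are hypothetical names, and the coordinates are assumed to be already expressed in one consistent page space where y grows downward. It does not repair the negative-y values discussed above, which come from the coordinate transformation itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: groups text fragments into lines by y, then orders each line by x.
final class ReadingOrder {

    // Hypothetical stand-in for an extracted text fragment; not a PDF Clown type.
    static final class TextBox {
        final String text;
        final double x;
        final double y;

        TextBox(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the text lines from top to bottom. Fragments whose y differs from
    // the first fragment of the current line by at most `tolerance` share that line.
    static List<String> lines(List<TextBox> boxes, double tolerance) {
        List<TextBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((TextBox b) -> b.y));

        List<String> result = new ArrayList<>();
        List<TextBox> current = new ArrayList<>();
        double lineY = 0;
        for (TextBox box : sorted) {
            // Too far from the current line's y: close that line and start a new one.
            if (!current.isEmpty() && Math.abs(box.y - lineY) > tolerance) {
                result.add(join(current));
                current.clear();
            }
            if (current.isEmpty()) {
                lineY = box.y;
            }
            current.add(box);
        }
        if (!current.isEmpty()) {
            result.add(join(current));
        }
        return result;
    }

    // Orders one line's fragments left to right and joins them with single spaces.
    private static String join(List<TextBox> line) {
        line.sort(Comparator.comparingDouble((TextBox b) -> b.x));
        StringBuilder sb = new StringBuilder();
        for (TextBox b : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b.text);
        }
        return sb.toString();
    }
}
```

The tolerance matters because fragments on the same visual line rarely share exactly the same y. As the reply notes for TextExtractor itself, a simple scheme like this one also ignores columns and table layouts.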