I am using PDF Clown to find and highlight text in PDF documents. For the most part it works great. I am running into an issue with some PDF documents, though. It appears that the text on the page has a transformation applied to it, i.e. the text appears to be in an object that is placed at X=10, and the text is then placed at X=5 within that object, which puts the text at X=15 on the page. The problem is that the text extractor reports the text's location within the object (X=5), so when I try to place the highlight at the same point it lands at X=5 on the page, not at X=15 where the text actually is.
Is there an easy way to have a page get rid of transformations and use page coordinates?
Your request needs some clarification, as you are apparently misinterpreting the PDF graphics model.
The positions of all text characters described by PDF content streams are the result of transformations; PDF Clown takes care of resolving those transformations into the corresponding page coordinates, so no further processing is needed on your side: you already get the actual position of each single character (and of aggregated text strings).
Therefore, if you experience wrong placements, there's a bug that has to be tracked down: please open a bug tracker entry and attach a sample PDF file so we can reproduce the behavior.
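To make the "X=10 plus X=5 gives X=15" case concrete, here is a minimal sketch of how nested PDF transformations combine into page coordinates. It is plain Python arithmetic, not PDF Clown's API; the names concat, apply, object_matrix and text_matrix are made up for illustration, and the matrices are assumed to be simple translations as in the question.

```python
# Affine matrices follow the PDF convention [a b c d e f], applied as:
#   x' = a*x + c*y + e
#   y' = b*x + d*y + f

def concat(inner, outer):
    """Return the matrix that applies `inner` first, then `outer`."""
    a1, b1, c1, d1, e1, f1 = inner
    a2, b2, c2, d2, e2, f2 = outer
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply(matrix, x, y):
    """Map a point through an affine matrix."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

# Hypothetical values taken from the question: the containing object is
# placed at X=10 on the page, and the text is placed at X=5 inside it.
object_matrix = (1, 0, 0, 1, 10, 0)
text_matrix = (1, 0, 0, 1, 5, 0)

# The text origin (0, 0) is mapped by the text matrix first, then by the
# matrix of the object that contains it.
page_matrix = concat(text_matrix, object_matrix)
page_x, page_y = apply(page_matrix, 0, 0)
print(page_x, page_y)
```

A highlight has to be placed at the point produced by the combined matrix. If a tool reports only the inner offset (X=5 here), the outer transformation was not taken into account for that file, which is the kind of bug a sample PDF would help to track down.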
Thanks for the response, Stechio. I will add the PDF to the bug tracker as you suggested.
I spent some time hunting down the source of the PDFs and found out they were being generated by some version of Ghostscript. As a workaround I simply had Acrobat re-save them all as reduced-size PDFs and poof! They all read fine.