I am developing a PDF touch-up module in which I load an existing PDF document using jPod, and this works in most cases. However, I have some scanned PDF documents (with OCR) in which the document has a text layer (the OCR step generates a PDF file with only text) and the scanned image is overlaid on top of it. In this case, when I get the PDF content stream using the "getContentStream" method of PDPage, I do not get any COSString operands. How do I access these objects? Could anyone give me some pointers?
Maybe the generator program puts the text in an annotation or a form. Hard to say without actually looking at the document.
Thanks for your quick comments. Since I don't know how to attach documents here, I have sent the test document that I am using by e-mail instead. Please accept my apologies for this. Please guide me in resolving this issue, or let me know if there are any other approaches to solving my problem. BTW, I am trying to build a touch-up tool to correct typos.
Yes, the text is in a form.
The "Contents" of page 1 consist of two streams: one draws the image (/Im0) and the other draws the form (/Xi0).
You can use the "COS Browser" tool in CABAReT Stage (which uses the jPod library and has a free download for evaluation) to get a hierarchical view of the PDF structure. Navigate to "Root->Pages->Kids->0->Contents" to see the page contents and to "Root->Pages->Kids->0->Resources->XObject->Xi0->Contents" to see the form contents. You can get CABAReT Stage from http://www.cabaret-solutions.com .
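To illustrate why the page-level content stream shows no string operands, here is a minimal plain-Java sketch. It does not use the jPod API. It only scans already-decoded content stream text for the standard PDF operators "Do" (invoke an XObject) and "Tj" (show a string). The class name, the method names and the sample stream text are made up for illustration and are not taken from the test document; only the names /Im0 and /Xi0 come from the analysis above. The regular expressions are deliberately simplistic and ignore nested parentheses, escapes, hex strings and TJ arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jPod: a rough scan of decoded content stream text.
public class ContentStreamSketch {
    // Matches "/Name Do", the operator that paints an XObject (image or form).
    private static final Pattern DO_OP =
            Pattern.compile("/([^\\s/\\[\\]()<>{}%]+)\\s+Do\\b");
    // Matches "(literal) Tj" for simple literal strings without escapes or nesting.
    private static final Pattern TJ_OP =
            Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj\\b");

    // Returns the names of all XObjects invoked with "Do" in the given stream text.
    public static List<String> invokedXObjects(String content) {
        return collect(DO_OP, content);
    }

    // Returns the literal strings shown with "Tj" in the given stream text.
    public static List<String> shownStrings(String content) {
        return collect(TJ_OP, content);
    }

    private static List<String> collect(Pattern pattern, String content) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed page content: two streams, one painting the image, one the form.
        String page = "q 612 0 0 792 0 0 cm /Im0 Do Q\nq /Xi0 Do Q";
        // Assumed form content: this is where the OCR text would actually live.
        String form = "BT /F1 12 Tf 72 700 Td (Helo world) Tj ET";

        System.out.println(invokedXObjects(page)); // XObjects referenced by the page
        System.out.println(shownStrings(page));    // no strings at page level
        System.out.println(shownStrings(form));    // strings inside the form
    }
}
```

With these sample inputs the page stream yields the two XObject names and no strings, while the form stream yields the text. The point is that a touch-up tool has to follow the "/Xi0 Do" reference into the form's own content stream and look for the string operands there, rather than only in the page's content stream.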
Thanks for taking the time to analyze my test document. Is it possible to access the text content stream using jPod?