Hi, I have a question about using this package for editing text in a PDF.
For example, if a page in a document contained the text "This is a book about a dog"
and I wanted to change the word "dog" to "cat" and then resave the entire PDF document, would this be possible?
Can this be done with the classes in "de.intarsys.pdf.content.text"? I know that extracting text can be done easily, but can the text be changed and then reinserted in some way?
Any help would be appreciated, I just need an idea of what classes in the package would enable me to do this, if it is possible at all.
To create a content stream there is the CSCreator (another subclass of CSDeviceAdapter). There should be some examples of how to use this from scratch.
The CSCreator is comparable to java.awt.Graphics - it transforms its method calls to the operation tags in a content stream that can be extracted after finishing. The operations accepted mirror closely the PDF content stream operation set.
A simple filter could be implemented by interpreting an existing content stream on a filter CSDeviceAdapter (that is your part) that emits its output to a CSCreator instance.
When the interpreter is done, the new content stream is created.
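The filter idea described above can be sketched in plain Java. This is only an illustration of the pattern, not jPod code: TextDevice, RecordingDevice, ReplacingFilter and FilterDemo are hypothetical stand-ins invented for this sketch. In a real implementation the filter would presumably subclass CSDeviceAdapter and forward to a CSCreator, and an interpreter would drive the calls instead of the hand-written ones in run().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a content stream device. NOT a jPod interface.
interface TextDevice {
    void moveTo(int x, int y);
    void showText(String text);
}

// Plays the role the thread assigns to CSCreator: every method call is
// recorded as an operation that can be read back afterwards.
// Simplified: no escaping of special characters in the string operand.
class RecordingDevice implements TextDevice {
    private final List<String> operations = new ArrayList<>();

    public void moveTo(int x, int y) {
        operations.add(x + " " + y + " Td");
    }

    public void showText(String text) {
        operations.add("(" + text + ") Tj");
    }

    List<String> getOperations() {
        return operations;
    }
}

// The filter: receives the calls an interpreter would make, changes the
// text it is interested in, and passes everything on to the target device.
class ReplacingFilter implements TextDevice {
    private final TextDevice target;
    private final String search;
    private final String replacement;

    ReplacingFilter(TextDevice target, String search, String replacement) {
        this.target = target;
        this.search = search;
        this.replacement = replacement;
    }

    public void moveTo(int x, int y) {
        target.moveTo(x, y); // untouched operations are forwarded as-is
    }

    public void showText(String text) {
        target.showText(text.replace(search, replacement));
    }
}

class FilterDemo {
    static List<String> run() {
        RecordingDevice output = new RecordingDevice();
        TextDevice filter = new ReplacingFilter(output, "dog", "cat");
        // An interpreter would issue these calls while reading the
        // original content stream; here they are written by hand.
        filter.moveTo(72, 720);
        filter.showText("This is a book about a dog");
        return output.getOperations();
    }
}
```

Note that a real content stream may split one visible word across several text-showing operations, in which case a per-call replacement like this one would not find it.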
Thanks for the reply. I have another question about the Textsearcher.
From what I can see, it works in the same way as the Textextractor (in terms of how it is linked with the interpreter), but I'm having issues getting it to find text. This is the code I have where I access the searcher:
PDPage page = (PDPage) node;
CSTextSearcher searcher = new CSTextSearcher();
CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, searcher);
ArrayList dire = new ArrayList();
dire = searcher.getHits();
int size = dire.size();
With this code the ArrayList is always empty, so the size is always 0. Is this the correct way to use the Textsearcher, or am I mistaken?
This pattern is correct.
But - if you don't get what you expect, remember to take a look at the content stream itself (using the Cabaret Stage PDF Browser, for example). There are a lot of reasons why a document may not contain what you think you see:
- it's an image / images
- it has a strange encoding set by the source system, without a Unicode mapping to reverse the mapping function
- the letters are strangely scattered or the distance between them is too great