I have a little question, maybe someone can find the time to help… I've been working on a project where one of the goals is to analyze an existing PDF document. Not just getting the text (simple, using TextExtractor), but analyzing it smartly: recognizing titles, subtitles, etc. So I've looked at the Bookmark class, but I still don't know how to connect the title (bookmark) with the text underneath.
Can you help with this issue?
If your goal is to analyze the structure of a PDF document, two main approaches can be followed:
- deterministic: text hierarchy extraction relies on metainformation provided by the document itself (see "Logical Structure", § 10.6, PDF Reference 1.7) - accuracy depends on the file generator;
- heuristic: text hierarchy extraction derives from rules implying an inherent level of arbitrariness - accuracy depends on the file analyzer.
Assuming that your documents don't provide structural metainformation, bookmark-driven (aka outline-driven) hierarchy extraction could be a feasible strategy. I suggest getting the Destination (or Action) object associated with the bookmark (Bookmark.getTarget()): that will give you the target location, including page coordinates, to spot the start of your section. Then you can extract that page's text with the above-mentioned TextExtractor, and iterate over the extracted text looking for contents below the target location. A simple way to infer text hierarchy could be, for example, to compare relative text sizes, but actual rules are up to you…
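To make the "relative text sizes" idea a bit more concrete, here is a minimal sketch in plain Java. It doesn't touch the PDF Clown API at all: HeadingHeuristic and TextRun are hypothetical names, and the font sizes are assumed to have been collected already during extraction. The rule itself (the size carrying the most characters is body text, anything larger is a heading, ranked by size) is just one arbitrary rule of thumb among many.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper (not part of PDF Clown): ranks text runs by relative font size. */
final class HeadingHeuristic
{
  /** A piece of extracted text with its font size (illustrative type). */
  static final class TextRun
  {
    final String text;
    final double fontSize;

    TextRun(String text, double fontSize)
    {
      this.text = text;
      this.fontSize = fontSize;
    }
  }

  /** Returns the font size carrying the most characters, assumed to be body text. */
  static double bodySize(List<TextRun> runs)
  {
    Map<Double, Integer> charsPerSize = new HashMap<Double, Integer>();
    for(TextRun run : runs)
      charsPerSize.merge(run.fontSize, run.text.length(), Integer::sum);

    double best = 0;
    int bestCount = -1;
    for(Map.Entry<Double, Integer> entry : charsPerSize.entrySet())
    {
      if(entry.getValue() > bestCount)
      {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  /**
   * Returns one level per run: 0 for body text (or anything smaller),
   * 1 for the largest heading size, 2 for the next largest, and so on.
   */
  static List<Integer> levels(List<TextRun> runs)
  {
    double body = bodySize(runs);

    // Distinct sizes larger than body text, sorted from largest to smallest.
    TreeSet<Double> headingSizes = new TreeSet<Double>(Comparator.<Double>reverseOrder());
    for(TextRun run : runs)
    {
      if(run.fontSize > body)
        headingSizes.add(run.fontSize);
    }

    List<Integer> levels = new ArrayList<Integer>();
    for(TextRun run : runs)
    {
      if(run.fontSize <= body)
        levels.add(0);
      else
        // The number of strictly larger heading sizes gives the nesting depth.
        levels.add(headingSizes.headSet(run.fontSize).size() + 1);
    }
    return levels;
  }
}
```

Real documents will probably need more than this (bold text at body size, captions, running headers), so treat the size ranking as a starting point only.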
Thanks for the reply.
You are right - I deal with docs with no metainformation. So my strategy is to iterate over the bookmarks (outlines), if they exist, and get the data for each bookmark. Unfortunately, and that is my problem, there is no relationship between a bookmark and the text underneath it (please correct me if I'm wrong here). So one needs to work a bit harder to get this connection. AFAIU, the bookmark contains a "pointer" to the place where the text exists. (Is that right?)
So I followed your advice: I get the list of bookmarks, and for each one of them I get its Target (by the way - why should I take the Target and then cast it to Destination? Why can't I call
? Is there a difference?)
After I get the destination, I do some "analysis" on it, something like:
PdfArray baseDataObject = destination.getBaseDataObject();
and then if the element at index 1 is "XYZ", I take the coordinates (the elements at indices 2 and 3) of the place the bookmark points to. Am I right so far? Is this the right way to get to the location in the page where the text exists?
Once I have this location, I guess there is a way to get the text, but I haven't found it yet… Could you help me with that?
Is there any other way to extract the text from a bookmark? Is there another way to connect text to a title?
When working with ILink implementations, the target property is the way to go, since you cannot assume that a link is associated with a Destination: as I stressed above, it could be an Action too! Furthermore, the destination and action properties of the ILink interface are going to be deprecated and removed in the next release, as they are unnecessarily redundant.
Your low-level approach to the extraction of destination parameters is fine (up to version 0.1.1, view parameters aren't exposed at high level; I'll add them in 0.1.2). I suggest getting the mode property (Destination.getMode()), then collecting the parameters according to the ModeEnum documentation. Keep in mind that native Y-axis coordinates are bottom-up-oriented, so you have to convert them this way in order to work consistently with text extraction: pageHeight - yParam.
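As a tiny illustration of that conversion, here is a sketch in plain Java. DestinationY is a hypothetical helper, not a PDF Clown class. It also covers one case worth keeping in mind: in the PDF Reference, the left, top and zoom parameters of an XYZ destination may be null (meaning "keep the current value") and may be real numbers rather than integers. Falling back to the top of the page for a null top is purely my assumption here.

```java
/** Hypothetical helper (not part of PDF Clown). */
final class DestinationY
{
  /** Flips a bottom-up PDF y coordinate into the top-down space used for extracted text. */
  static double toTopDown(double pageHeight, double yParam)
  {
    return pageHeight - yParam;
  }

  /**
   * Normalizes the "top" parameter of an XYZ destination.
   * A null parameter means "leave the current value unchanged" in the PDF Reference;
   * this sketch assumes the top of the page (0 in top-down space) in that case.
   */
  static double normalizedTop(double pageHeight, Number top)
  {
    if(top == null)
      return 0;

    // Number covers both integer and real parameter values.
    return toTopDown(pageHeight, top.doubleValue());
  }

  public static void main(String[] args)
  {
    double pageHeight = 792; // Letter-size page height in points, used only as sample data.
    System.out.println(normalizedTop(pageHeight, 700));
    System.out.println(normalizedTop(pageHeight, 650.5));
    System.out.println(normalizedTop(pageHeight, null));
  }
}
```

Accepting a Number instead of casting straight to an integer type avoids a ClassCastException when the generator wrote a real value such as 650.5.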
Once you've got the normalized location, you can extract the text from the page referenced by the same Destination object (see getPageRef()) using the TextExtractor class (there's plenty of sample code about that in the downloadable distribution). As I wrote in my initial reply, you have to iterate over the extracted text looking for contents below the target location. A simple way to infer text hierarchy could be, for example, to compare relative text sizes, but actual rules are up to you…
1. iterate the bookmarks;
2. extract the destination location from each bookmark;
3. extract the text from the destination page using TextExtractor;
4. filter text based on the destination location, classifying it through some rule of thumb such as relative font size.
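Step 4 can be sketched roughly like this, again in plain Java with hypothetical names (SectionFilter and Line are not PDF Clown types). It assumes the extracted text has already been reduced to lines carrying a top-down y coordinate. The upper bound endY is an extra assumption of mine: it stops the section at the next bookmark that targets the same page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDF Clown). */
final class SectionFilter
{
  /** Extracted text with the top-down y of its bounding box (illustrative type). */
  static final class Line
  {
    final String text;
    final double y;

    Line(String text, double y)
    {
      this.text = text;
      this.y = y;
    }
  }

  /**
   * Keeps the lines lying at or below startY and above endY, both expressed in
   * top-down coordinates (a larger y means lower on the page). Pass
   * Double.POSITIVE_INFINITY as endY when no further bookmark targets the page.
   */
  static List<String> between(List<Line> lines, double startY, double endY)
  {
    List<String> result = new ArrayList<String>();
    for(Line line : lines)
    {
      if(line.y >= startY && line.y < endY)
        result.add(line.text);
    }
    return result;
  }
}
```

For a bookmark whose normalized location is y = 92, followed by another bookmark at y = 290 on the same page, between(lines, 92, 290) would return only the lines in that band. Sections that spill over to the following pages need an extra loop over those pages, which is left out here.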
You are the best! Thanks for the help, much appreciated. I feel like I'm getting close to the solution, thanks to you! I've ended up with the code below, as you suggested. I iterate the bookmarks (recursively), then get the Destination (hope I'm doing it right…), then get the Page, with its "box" (dimensions), and the location of the text.
Could you please have a look and tell me if this is the right way to do things? Even if it is, I get an exception after several iterations, something in the TextExtractor (the exception is attached…)
private void printBookmarks(Bookmarks bookmarks)
{
  if(bookmarks == null)
    return;

  for(Bookmark bookmark : bookmarks)
  {
    // Show current bookmark!
    System.out.println("Bookmark: '" + bookmark.getTitle() + "'");
    PdfObjectWrapper<?> target = bookmark.getTarget();
    // Destination destination = bookmark.getDestination(); //the location in the page
    if(target instanceof Destination)
    { printDestination((Destination)target); }
    else if(target instanceof Action)
    { /* ... */ }
    else if(target == null)
    { /* ... */ }
    else
    { System.out.println("[unknown type: " + target.getClass().getSimpleName() + "]"); }

    // Show child bookmarks!
    printBookmarks(bookmark.getBookmarks());
  }
}

private void printDestination(Destination destination)
{
  PdfArray baseDataObject = destination.getBaseDataObject();
  System.out.println(destination.getClass().getSimpleName() + " " + destination.getBaseObject());
  if(baseDataObject != null)
  {
    // Destination array layout: [page, mode, parameters...]
    PdfName pdfDirectObject = (PdfName)baseDataObject.get(1);
    Object pageRef = destination.getPageRef();
    if(!(pageRef instanceof Page))
    {
      System.err.println("the page ref is not a Page object. cannot extract text from this object");
      return;
    }
    Page refPage = (Page)pageRef;
    Rectangle2D box = refPage.getBox();
    if(pdfDirectObject.compareTo(PdfName.XYZ) == 0)
    {
      // XYZ mode: the parameters at indices 2 and 3 are left and top.
      PdfInteger pdfDirectObjectX = (PdfInteger)baseDataObject.get(2);
      PdfInteger pdfDirectObjectY = (PdfInteger)baseDataObject.get(3);
      Rectangle2D rect = new Rectangle(
        // Flip the bottom-up y coordinate: pageHeight - yParam.
        (int)box.getHeight() - pdfDirectObjectY.getIntValue(),
        ...
        );
      List<Rectangle2D> list = new ArrayList<Rectangle2D>();
      list.add(rect);
      TextExtractor extractor = new TextExtractor(list, false, false);
      int index = refPage.getIndex();
      StringBuffer sb = new StringBuffer();
      Map<Rectangle2D, List<ITextString>> extract = extractor.extract( refPage );
      Collection<List<ITextString>> values = extract.values();
      for(List<ITextString> strings : values)
      {
        for(ITextString textString : strings)
        { sb.append(textString.getText()); }
      }
      System.out.println( sb );
    }
  }
}
The exception I get:
Your code is substantially OK. In order to debug the CffParser exception I need to reproduce its behavior: could you please open a bug tracker entry and attach your problematic PDF file? Alternatively, if you'd rather not make it public, you could send it to me via email.
Stefano, I can do both.
1. How do I open a bug tracker entry? Give me a link and I'll open one.
2. What is your email? :-)