attaching text while parsing to its bookmark

Help
Anonymous
2012-01-26
2013-01-26

  • Anonymous
    2012-01-26

    Hi there,

    I have a quick question; maybe someone can find the time to help. One of the goals of the project I'm working on is to analyze an existing PDF document: not just getting the text (simple, using TextExtractor), but analyzing it smartly, recognizing titles, subtitles, etc. I've looked at the Bookmark class, but I still don't know how to connect a title (bookmark) with the text underneath it.
    Can you help with this issue?

    Thanks!
    Ohad

     

  • Anonymous
    2012-01-29

    Hi Stefano,

    Thanks for the reply.
    You are right: I deal with documents that have no metainformation. My strategy is to iterate over the bookmarks (outlines), if they exist, and get the data for each one. Unfortunately, and this is my problem, there is no direct relationship between a bookmark and the text underneath it (please correct me if I'm wrong here), so one has to work a bit to establish the connection. AFAIU, the bookmark contains a "pointer" to the place where the text is located. (Is that right?)

    So I followed your advice: I get the list of bookmarks, and for each one I get its Target. (By the way, why should I take the Target and then cast it to Destination? Why can't I call

    bookmark.getDestination()

    instead? Is there a difference?)
    After I get the destination, I do some "analysis" on it, something like:

    PdfArray baseDataObject = destination.getBaseDataObject();
    

    and then, if the element at index 1 is "XYZ", I take the coordinates (the elements at indexes 2 and 3) of the place the bookmark points to. Am I right so far? Is this the right way to get the location on the page where the text is?
    Once I have this location, I guess there is a way to get the text, but I have not found it yet… Could you help me with that?
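
    For illustration, a minimal sketch of the coordinate handling described above. It is plain Java, not PDF Clown code. It assumes the usual layout of an XYZ destination array, [page /XYZ left top zoom], where left and top are measured from the bottom-left corner of the page and may be null, and it assumes that the area handed to the text extractor is measured from the top-left corner. The class and method names are hypothetical.

```java
import java.awt.geom.Rectangle2D;

final class XyzDestinationArea {
    /**
     * Converts the left/top of an XYZ destination (origin at the bottom-left
     * of the page) into an area measured from the top-left corner, spanning
     * from the bookmark position down to the bottom of the page.
     */
    static Rectangle2D areaBelow(Double left, Double top, double pageWidth, double pageHeight) {
        // Assumption: a null left/top means "not specified", so this sketch
        // falls back to the left edge and the top edge of the page.
        double x = (left == null) ? 0 : Math.max(0, Math.min(pageWidth, left));
        double yFromBottom = (top == null) ? pageHeight : top;
        // Flip the vertical axis and clamp the result to the page.
        double y = Math.max(0, Math.min(pageHeight, pageHeight - yFromBottom));
        // Width and height are the remaining space, so the area stays on the page.
        return new Rectangle2D.Double(x, y, pageWidth - x, pageHeight - y);
    }

    public static void main(String[] args) {
        // Hypothetical values: a US Letter page with a bookmark near the top.
        Rectangle2D area = areaBelow(72.0, 700.0, 612, 792);
        System.out.println(area);
    }
}
```

    The resulting rectangle could then be placed in the list of areas given to the extractor. Note that the coordinates are kept as doubles, since destination coordinates are not guaranteed to be integers.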

    Is there any other way to extract the text for a bookmark? Is there another way to connect the text to its title?

    thanks

    Ohad

     

  • Anonymous
    2012-01-30

    Hi Stefano,

    You are the best! Thanks for the help, much appreciated. I feel like I'm getting close to the solution, thanks to you! I've ended up with the code below, as you suggested. I iterate over the bookmarks (recursively), then get the Destination (I hope I'm doing it right…), then get the Page with its "box" (dimensions) and the location of the text.
    Could you please check whether this is the right way to do things? If it is, why do I get an exception after several iterations, somewhere in the TextExtractor (the stack trace is attached below…)?
    Thanks again!

    Ohad

        private void printBookmarks(Bookmarks bookmarks)
        {
            if(bookmarks == null)
                return;
            for(Bookmark bookmark : bookmarks)
            {
                // Show current bookmark!
                System.out.println("Bookmark: '" + bookmark.getTitle() + "'");
                PdfObjectWrapper<?> target = bookmark.getTarget();
                // Destination destination = bookmark.getDestination();    // the location in the page
                if(target instanceof Destination)
                {
                    printDestination((Destination)target);
                }
                else if(target instanceof Action)
                {
                    // printAction((Action)target);
                }
                else if(target == null)
                {
                    System.out.println("[not available]");
                }
                else
                {
                    System.out.println("[unknown type: " + target.getClass().getSimpleName() + "]");
                }
                // Show child bookmarks!
                printBookmarks(bookmark.getBookmarks());
            }
        }
        private void printDestination(Destination destination)
        {
            PdfArray baseDataObject = destination.getBaseDataObject();
            System.out.println(destination.getClass().getSimpleName() + " " + destination.getBaseObject());
            System.out.print("Page ");
            if(baseDataObject != null)
            {
                PdfName pdfDirectObject = (PdfName)baseDataObject.get(1);
                Object pageRef = destination.getPageRef();
                if(!(pageRef instanceof Page))
                {
                    System.err.println("the page ref is not a Page object. cannot extract text from this object");
                    return;
                }
                Page refPage = (Page)pageRef;
                Rectangle2D box = refPage.getBox();
                if(pdfDirectObject.compareTo(PdfName.XYZ) == 0)
                {
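                    // NOTE: these casts assume integer coordinates; a real-valued or
                    // missing (null) entry in the destination array would fail here.
                    PdfInteger pdfDirectObjectX = (PdfInteger)baseDataObject.get(2);
                    PdfInteger pdfDirectObjectY = (PdfInteger)baseDataObject.get(3);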
                    Rectangle2D rect = new Rectangle(
                            pdfDirectObjectX.getIntValue(),
                            (int)box.getHeight() - pdfDirectObjectY.getIntValue(),
                            (int)box.getWidth(),
                            (int)box.getHeight());
                    List<Rectangle2D> list = new ArrayList<Rectangle2D>();
                    list.add(rect);
                    TextExtractor extractor = new TextExtractor(list, false, false);
                    int index = refPage.getIndex();
                    System.out.println((index+1));
                    StringBuilder sb = new StringBuilder();
                    Map<Rectangle2D, List<ITextString>> extract = extractor.extract( refPage );
                    Collection<List<ITextString>> values = extract.values();
                    for(List<ITextString> strings : values)
                    {
                        for(ITextString textString : strings)
                        {
                            sb.append(textString.getText());
                        }
                    }
                    System.out.println( sb );
                }
            }
        }
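
    As a possible refinement of the area computed above (which runs from the bookmark position to the bottom of the page), here is a hedged sketch in plain Java that ends each area where the next bookmark on the same page begins. The Anchor class, its fields, and the method name are hypothetical; the sketch assumes the bookmarks are already in reading order, that their "top" values are measured from the bottom of the page, and that areas are measured from the top-left corner. Sections that continue onto following pages are not handled.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class BookmarkSections {
    /** Hypothetical holder: a bookmark title, its page index, and its XYZ "top". */
    static final class Anchor {
        final String title;
        final int pageIndex;
        final double top; // measured from the bottom of the page

        Anchor(String title, int pageIndex, double top) {
            this.title = title;
            this.pageIndex = pageIndex;
            this.top = top;
        }
    }

    /**
     * For each anchor, builds an area from its position down to the next anchor
     * on the same page, or down to the page bottom when there is none.
     */
    static List<Rectangle2D> sectionAreas(List<Anchor> anchors, double pageWidth, double pageHeight) {
        List<Rectangle2D> areas = new ArrayList<>();
        for (int i = 0; i < anchors.size(); i++) {
            Anchor current = anchors.get(i);
            double yStart = pageHeight - current.top; // flip to top-left origin
            double yEnd = pageHeight;                 // default: bottom of the page
            if (i + 1 < anchors.size() && anchors.get(i + 1).pageIndex == current.pageIndex) {
                yEnd = pageHeight - anchors.get(i + 1).top;
            }
            // Guard against a negative height if the anchors are out of order.
            areas.add(new Rectangle2D.Double(0, yStart, pageWidth, Math.max(0, yEnd - yStart)));
        }
        return areas;
    }

    public static void main(String[] args) {
        List<Anchor> anchors = new ArrayList<>();
        anchors.add(new Anchor("Intro", 0, 700));
        anchors.add(new Anchor("Methods", 0, 400));
        anchors.add(new Anchor("Results", 1, 750));
        List<Rectangle2D> areas = sectionAreas(anchors, 612, 792);
        for (int i = 0; i < areas.size(); i++) {
            System.out.println(anchors.get(i).title + " -> " + areas.get(i));
        }
    }
}
```

    Each rectangle could be passed to the extractor for its own page, so that the text collected for one title stops where the next title starts.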
    

    The exception I get:

    java.lang.RuntimeException
    at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:703)
    at org.pdfclown.documents.contents.fonts.CffParser.<init>(CffParser.java:640)
    at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:104)
    at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:151)
    at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
    at org.pdfclown.documents.contents.fonts.Font.load(Font.java:737)
    at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
    at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
    at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
    at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:64)
    at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
    at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:158)
    at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
    at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
    at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
    at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1310)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:791)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:757)
    at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:750)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:670)
    at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:662)
    at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1134)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
    at TestLogic.printDestination(TestLogic.java:257)
    at TestLogic.printBookmarks(TestLogic.java:200)

     
  • Your code is substantially OK. To debug the CffParser exception I need to reproduce it: could you please open a bug tracker entry and attach the problematic PDF file? Alternatively, if you don't want to make it public, you could send it to me via email.

    Thank you
    Stefano

     

  • Anonymous
    2012-01-31

    Stefano, I can do both.
    1. How do I open a bug tracker entry? Give me a link and I will open one.
    2. What is your email? :-)