copying a page to a new document

2009-11-22
2013-05-28
  • Sergey Ushakov
    2009-11-22

    Hi, sorry for a newbie's question.

    I'm trying to write a basic application that would extract pages from one pdf file into another. The code is the following:

        FileLocator srcLocator = new FileLocator (sSrcName);
        PDDocument srcDocument = PDDocument.createFromLocator (srcLocator);
        int iTotalPages = srcDocument.getPageTree ().getCount ();
        for (int iPage = 0; iPage < iTotalPages; ++iPage) {
          PDDocument newDocument = PDDocument.createNew ();
          PDPage srcPage = srcDocument.getPageTree ().getPageAt (iPage);
          CSContent cSContent = srcPage.getContentStream ();
          PDPage newPage = (PDPage) PDPage.META.createNew ();
          newPage.setContentStream (cSContent);
          newDocument.addPageNode (newPage);
          String sOutFileFullPathAndName = …;
          FileLocator newLocator = new FileLocator (sOutFileFullPathAndName);
          newDocument.save (newLocator, null);
          newDocument.close ();
        }

    No errors occur, but all the output pages are empty.

    What may be missing? I believe the mistake is silly, sorry for that…

    Many thanks anyway.

    Sergey

     
  • mtraut
    2009-11-23

    You must at least copy the resources from the old page (containing the resources needed to render the content stream).

        FileLocator srcLocator = new FileLocator("c:/temp/test.pdf");
        PDDocument srcDocument = PDDocument.createFromLocator(srcLocator);
        int iTotalPages = srcDocument.getPageTree().getCount();
        for (int iPage = 0; iPage < iTotalPages; ++iPage) {
          PDDocument newDocument = PDDocument.createNew();
          PDPage srcPage = srcDocument.getPageTree().getPageAt(iPage);

          // get resources
          PDResources srcResources = srcPage.getResources();

          CSContent cSContent = srcPage.getContentStream();
          PDPage newPage = (PDPage) PDPage.META.createNew();

          // set resources
          PDResources newResources = (PDResources) PDResources.META
              .createFromCos(srcResources.cosGetObject().copyDeep());
          newPage.setResources(newResources);

          newPage.setContentStream(cSContent);
          newDocument.addPageNode(newPage);
          String sOutFileFullPathAndName = "c:/temp/split." + iPage + ".pdf";
          FileLocator newLocator = new FileLocator(sOutFileFullPathAndName);
          newDocument.save(newLocator, null);
          newDocument.close();
        }

    Another problem may arise when your page contains annotations - these are not part of the page content…
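
    Why the missing resources matter can be sketched without any PDF library: a content stream only names its fonts and images (for example /F1 or /Im1) and relies on the page's resources to resolve those names. The following is a rough, standalone illustration, not jPod code. The class MissingResources, its methods and the sample content string are hypothetical, and the regular expression only recognises the Tf and Do operators, ignoring strings and every other operator that takes a name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of jPod.
class MissingResources {
    // "/Name <size> Tf" selects a font, "/Name Do" paints an XObject (e.g. an image).
    private static final Pattern USE =
        Pattern.compile("/(\\S+)\\s+(?:[-+]?[0-9.]+\\s+Tf|Do)\\b");

    /** Resource names that the content stream text refers to via Tf or Do. */
    static Set<String> referenced(String content) {
        Set<String> names = new TreeSet<>();
        Matcher m = USE.matcher(content);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Referenced names that the given resource names cannot resolve. */
    static Set<String> missing(String content, Set<String> available) {
        Set<String> result = referenced(content);
        result.removeAll(available);
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample content: one line of text and one image.
        String content = "BT /F1 12 Tf (Hello) Tj ET q 100 0 0 100 0 0 cm /Im1 Do Q";
        // New page without any resources: every name is unresolved.
        System.out.println(missing(content, new HashSet<String>()));
        // Page with the source resources copied over: nothing is unresolved.
        System.out.println(missing(content, new HashSet<>(Arrays.asList("F1", "Im1"))));
    }
}
```

    With an empty resource set both names stay unresolved, which matches the symptom in the first post: the content stream was copied, but nothing it refers to could be found.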

     
  • Sergey Ushakov
    2009-11-23

    Yes, many thanks. This got my small program working, and my idea of the API clearer :)

    Regards,
    Sergey

     
  • Sergey Ushakov
    2009-11-25

    Hi again,

    I've got everything working but noticed a curious thing. When I split the original file into pages, every single page file is almost as big as the original multipage file. When I merge them back into a new file, this new file is much bigger than the original. Looks like something gets copied multiple times, most probably without much use…

    What may be wrong?

    Many thanks,
    Sergey

     
  • Sergey Ushakov
    2009-11-26

    Well, I may have found a partial answer myself :)

    When one big PDF file is split into pages, the common resources are inevitably copied into each separate page file and thus multiplied.

    When individual pages are joined together again, the straightforward way is to keep their resources separate, so these resources are present in the resulting file several times.
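
    The arithmetic behind this can be sketched roughly as follows. All numbers are made up for illustration (900 KB of shared resources, 10 KB of content per page, 10 pages), the class SplitMergeSize is hypothetical, and file overhead such as the cross-reference table is ignored.

```java
// Hypothetical back-of-the-envelope model; sizes are in KB and purely illustrative.
class SplitMergeSize {
    /** Original file: shared resources stored once, plus the content of every page. */
    static long original(long shared, long perPage, int pages) {
        return shared + perPage * pages;
    }

    /** One split-off page: its own content plus a full copy of the shared resources. */
    static long singlePageFile(long shared, long perPage) {
        return shared + perPage;
    }

    /** Naive merge: every page keeps its private copy of the resources. */
    static long naiveMerge(long shared, long perPage, int pages) {
        return pages * singlePageFile(shared, perPage);
    }

    public static void main(String[] args) {
        long shared = 900;  // assumed: fonts and images used by all pages
        long perPage = 10;  // assumed: content stream of one page
        int pages = 10;
        System.out.println("original:    " + original(shared, perPage, pages));
        System.out.println("single page: " + singlePageFile(shared, perPage));
        System.out.println("naive merge: " + naiveMerge(shared, perPage, pages));
    }
}
```

    Under these assumptions the original is 1000 KB, each single-page file is 910 KB, and the naively merged file is 9100 KB, which is the pattern described above.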

    Matching and re-referencing the resources that are already present in the file may be a tricky task when adding a new page, but perhaps realistic, I don't know.

    Could anybody comment please?

    Thanks,
    Sergey

     
  • mtraut
    2009-11-27

    hi - i've been with a customer this week, so you had the chance to find the answer yourself :-)

    yes, the most obvious cause is the common resources (like fonts). When copying PD objects such as a page with copyDeep, you can end up much worse off and copy the whole source document. This happens with things like annotations that reference another page, which references the root page tree, which in turn references nearly all resources in the document… There are lots of other caveats, and you have to study the PDF reference thoroughly to avoid failures.
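
    The reference chain described here can be illustrated with a toy object graph. This is only a sketch: Node stands in for a PDF object, the sample document is invented, and it assumes the copy routine does not follow the copied page's own parent link. None of this is jPod API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model for illustration only; it is not part of jPod.
class DeepCopyReach {
    /** Toy stand-in for a PDF object: a name plus outgoing references. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
        void link(Node... targets) { refs.addAll(Arrays.asList(targets)); }
    }

    /** Names of all objects a naive deep copy starting at root would visit. */
    static Set<String> reachable(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> todo = new ArrayDeque<>();
        Set<String> names = new TreeSet<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Node n = todo.pop();
            if (!seen.add(n)) {
                continue; // already visited, avoids looping on cycles
            }
            names.add(n.name);
            for (Node r : n.refs) {
                todo.push(r);
            }
        }
        return names;
    }

    /** Invented two-page document; page 1 carries a link annotation to page 2. */
    static Map<String, Node> sampleDocument() {
        Map<String, Node> d = new LinkedHashMap<>();
        for (String n : new String[] {"Pages", "Page1", "Page2", "Resources1",
                                      "Resources2", "Font1", "Font2", "Image2", "LinkAnnot"}) {
            d.put(n, new Node(n));
        }
        d.get("Pages").link(d.get("Page1"), d.get("Page2"));
        d.get("Page1").link(d.get("Resources1"), d.get("LinkAnnot"));
        d.get("Page2").link(d.get("Pages"), d.get("Resources2")); // parent link
        d.get("Resources1").link(d.get("Font1"));
        d.get("Resources2").link(d.get("Font2"), d.get("Image2"));
        d.get("LinkAnnot").link(d.get("Page2"));
        return d;
    }

    public static void main(String[] args) {
        Map<String, Node> doc = sampleDocument();
        System.out.println(reachable(doc.get("Resources1"))); // resources only
        System.out.println(reachable(doc.get("Page1")));      // page including its annotation
    }
}
```

    Copying only the resources of page 1 touches two objects, while copying the page together with its annotation reaches all nine, including the font and image of page 2.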

    Matching resources when joining a page is complicated if you want to do it exactly. Two fonts with the same name need not be the same font, or have the same encoding or even the same font program… But if you have a priori knowledge about the resources, technically it's no problem to change the reference in the resources dictionary to the shared one.
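
    One conservative way to approach such matching, sketched here without any PDF library, is to key resources by a digest of their content instead of by their name. The class ResourcePool and its methods are hypothetical, and a font is reduced to an encoding name plus the raw bytes of its font program, which is far simpler than a real PDF font dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool for illustration only; it is not part of jPod.
class ResourcePool {
    private final Map<String, Integer> idByDigest = new HashMap<>();
    private int nextId = 1;

    /**
     * Returns the id of the shared copy for this content, registering it if unseen.
     * The resource name is deliberately not part of the key.
     */
    int intern(String encoding, byte[] fontProgram) {
        String key = digest(encoding, fontProgram);
        Integer id = idByDigest.get(key);
        if (id == null) {
            id = nextId++;
            idByDigest.put(key, id);
        }
        return id;
    }

    /** Number of distinct resources stored so far. */
    int size() {
        return idByDigest.size();
    }

    private static String digest(String encoding, byte[] fontProgram) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(encoding.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between encoding and program bytes
            md.update(fontProgram);
            return Base64.getEncoder().encodeToString(md.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        ResourcePool pool = new ResourcePool();
        int first = pool.intern("WinAnsiEncoding", new byte[] {1, 2, 3});
        int same = pool.intern("WinAnsiEncoding", new byte[] {1, 2, 3});
        int other = pool.intern("WinAnsiEncoding", new byte[] {1, 2, 4});
        System.out.println(first == same);  // identical content shares one entry
        System.out.println(first == other); // different font program gets its own entry
    }
}
```

    Byte-identical matching never merges two resources that differ, but it also misses resources that are equivalent without being byte-identical, so it is only a starting point.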

     
  • Sergey Ushakov
    2009-11-27

    Ok, many thanks again.

    Now it's really a good point to read the spec better :)

    Regards, Sergey

     
  • abevig
    2010-07-13

    Don't forget to set the new page's size to the source page's size:

        newPage.setMediaBox(srcPage.getMediaBox().copy());