PDF Clown / Discussion / Help: Split by Size

Wilson - 2010-12-11

Hi,

Thanks very much for the work you have put into the project.

Our IT department has just started using PDF Clown, and we have run into a problem that we hope you can shed some light on.

We are using PDF Clown to merge tens of PDF files into one massive file before we email it out.

However, we might need to split the file if it is too big to be received by the email recipients.

Say they can only receive file attachments (PDFs, in our case) of up to 4 MB: how could we split the massive file into parts of at most 4 MB each?

Best regards,
Wilson

Hi Wilson,

Your case seemed quite interesting, so I put together a working sample to test it. The following code is encapsulated in a Sample-derived class (it.stefanochizzolini.clown.samples.Sample is part of the clown.samples.cli project included in the downloadable distribution). The data size calculation algorithm will be added to the org.pdfclown.tools.PageManager implementation in the next release, 0.1.0.

public class SplitSample
  extends Sample
{
  private static final long MaxDataSize = 4 << 20;
  private PageManager manager;
  @Override
  public boolean run(
    )
  {
    // 1. Opening the PDF file...
    File file;
    {
      String filePath = promptPdfFileChoice("Please select a PDF file");
      try
      {file = new File(filePath);}
      catch(Exception e)
      {throw new RuntimeException(filePath + " file access error.",e);}
    }
    Document document = file.getDocument();
    Pages pages = document.getPages();
    // 2. Splitting the document...
    manager = new PageManager(document);
    int splitIndex = 0;
    long incrementalDataSize = 0;
    int beginPageIndex = 0;
    Set<PdfReference> visitedReferences = new HashSet<PdfReference>();
    for(Page page : pages)
    {
      long pageDifferentialDataSize = getSize(page,visitedReferences);
      incrementalDataSize += pageDifferentialDataSize;
      if(incrementalDataSize > MaxDataSize) // Data size limit reached.
      {
        int endPageIndex = page.getIndex();
        // Split the current document page range!
        splitDocument(++splitIndex,beginPageIndex,endPageIndex);
        beginPageIndex = endPageIndex;
        incrementalDataSize = getSize(page);
      }
    }
    // Split the last document page range!
    splitDocument(++splitIndex,beginPageIndex,pages.size());
    return true;
  }
  private void splitDocument(
    int splitIndex,
    int beginPageIndex,
    int endPageIndex
    )
  {
    System.out.println("Split " + splitIndex + ": " + (beginPageIndex+1) + "-" + endPageIndex);
    // 1. Split the document!
    Document splitDocument = manager.extract(beginPageIndex,endPageIndex);
    // (boilerplate metadata insertion -- ignore it)
    buildAccessories(splitDocument,"Split (" + (splitIndex) + ")","splitting a PDF document into 4MB files");
    // 2. Serialize the split file!
    serialize(splitDocument.getFile(),this.getClass().getSimpleName() + "." + (splitIndex),false);
  }
  /**
    Gets the data size of the specified page expressed in bytes.
    @param page Page whose data size has to be calculated.
  */
  private static long getSize(
    Page page
    )
  {return getSize(page,new HashSet<PdfReference>());}
  /**
    Gets the data size of the specified page expressed in bytes.
    @param page Page whose data size has to be calculated.
    @param visitedReferences References to data objects excluded from calculation.
      This set is useful, for example, to avoid recalculating the data size of shared resources.
      During the operation, this set is populated with references to visited data objects. 
  */
  private static long getSize(
    Page page,
    Set<PdfReference> visitedReferences
    )
  {return getSize(page.getBaseObject(),visitedReferences,true);}
  /**
    Gets the data size of the specified data object expressed in bytes.
    @param object Data object whose size has to be calculated.
    @param visitedReferences References to data objects excluded from calculation.
      This set is useful, for example, to avoid recalculating the data size of shared resources.
      During the operation, this set is populated with references to visited data objects. 
    @param isRoot Whether this data object represents the page root.
  */
  private static long getSize(
    PdfDirectObject object,
    Set<PdfReference> visitedReferences,
    boolean isRoot
    )
  {
    long dataSize = 0;
    {
      PdfDataObject dataObject = File.resolve(object);
      // 1. Evaluating the current object...
      if(object instanceof PdfReference)
      {
        PdfReference reference = (PdfReference)object;
        if(visitedReferences.contains(reference))
          return 0; // Avoids circular references.
        if(dataObject instanceof PdfDictionary
          && PdfName.Page.equals(((PdfDictionary)dataObject).get(PdfName.Type))
          && !isRoot)
          return 0; // Avoids references to other pages.
        visitedReferences.add(reference);
        // Calculate the data size of the current object!
        IOutputStream out = new Buffer();
        reference.getIndirectObject().writeTo(out);
        dataSize += out.getLength();
      }
      // 2. Evaluating the current object's children...
      Collection<PdfDirectObject> values = null;
      {
        if(dataObject instanceof PdfStream)
        {
          PdfStream streamDataObject = (PdfStream)dataObject;
          dataObject = streamDataObject.getHeader();
        }
        if(dataObject instanceof PdfDictionary)
        {values = ((PdfDictionary)dataObject).values();}
        else if(dataObject instanceof PdfArray)
        {values = (PdfArray)dataObject;}
      }
      if(values != null)
      {
        // Calculate the data size of the current object's children!
        for(PdfDirectObject value : values)
        {dataSize += getSize(value,visitedReferences,false);}
      }
    }
    return dataSize;
  }
}

Stefano *<:o)

PS: If PDF Clown has been beneficial to your work, please consider making a donation via PayPal to show your appreciation and your willingness to support its further development. The amount is up to you. Thank you!
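
The role of the visitedReferences set in the sample above can be illustrated without the library. The following is only a minimal sketch under stated assumptions: Node and sizeOf are hypothetical stand-ins for indirect PDF objects and the recursive getSize method, not PDF Clown API, and the byte sizes are made up. It shows that a resource shared by two pages is counted once while the same set is reused, and counted again when a fresh set is supplied.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SharedSizeSketch {
  /** Hypothetical stand-in for an indirect object: a byte size plus referenced children. */
  static final class Node {
    final String id;
    final long size;
    final List<Node> children = new ArrayList<>();

    Node(String id, long size) {
      this.id = id;
      this.size = size;
    }
  }

  /** Sums the sizes of all nodes reachable from root that are not yet in visited. */
  static long sizeOf(Node root, Set<Node> visited) {
    if (!visited.add(root)) {
      return 0; // Already counted: a shared resource or a circular reference.
    }
    long total = root.size;
    for (Node child : root.children) {
      total += sizeOf(child, visited);
    }
    return total;
  }

  public static void main(String[] args) {
    // Two pages referencing the same (illustrative) font object.
    Node font = new Node("font", 50_000);
    Node page1 = new Node("page1", 1_000);
    Node page2 = new Node("page2", 2_000);
    page1.children.add(font);
    page2.children.add(font);

    Set<Node> visited = new HashSet<>();
    System.out.println(sizeOf(page1, visited));         // 51000: page plus font
    System.out.println(sizeOf(page2, visited));         // 2000: font already counted
    System.out.println(sizeOf(page2, new HashSet<>())); // 52000: fresh set, font counted again
  }
}
```

The last call is the relevant case for splitting: a page that opens a new output file has to carry its shared resources again, so its size must be measured against an empty set.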

I'm back again: the code sample I previously posted had a flaw (it didn't reinitialize visitedReferences on each split, so resources shared with earlier pages were left out of the size of the new part). Here is the corrected version:

public class SplitSample
  extends Sample
{
  private static final long MaxDataSize = 4 << 20;
  private PageManager manager;
  @Override
  public boolean run(
    )
  {
    // 1. Opening the PDF file...
    File file;
    {
      String filePath = promptPdfFileChoice("Please select a PDF file");
      try
      {file = new File(filePath);}
      catch(Exception e)
      {throw new RuntimeException(filePath + " file access error.",e);}
    }
    Document document = file.getDocument();
    Pages pages = document.getPages();
    // 2. Splitting the document...
    manager = new PageManager(document);
    int splitIndex = 0;
    long incrementalDataSize = 0;
    int beginPageIndex = 0;
    Set<PdfReference> visitedReferences = new HashSet<PdfReference>();
    for(Page page : pages)
    {
      long pageDifferentialDataSize = getSize(page,visitedReferences);
      incrementalDataSize += pageDifferentialDataSize;
      if(incrementalDataSize > MaxDataSize) // Data size limit reached.
      {
        int endPageIndex = page.getIndex();
        // Split the current document page range!
        splitDocument(++splitIndex,beginPageIndex,endPageIndex);
        beginPageIndex = endPageIndex;
        incrementalDataSize = getSize(page,visitedReferences = new HashSet<PdfReference>());
      }
    }
    // Split the last document page range!
    splitDocument(++splitIndex,beginPageIndex,pages.size());
    return true;
  }
  private void splitDocument(
    int splitIndex,
    int beginPageIndex,
    int endPageIndex
    )
  {
    System.out.println("Split " + splitIndex + ": " + (beginPageIndex+1) + "-" + endPageIndex);
    // 1. Split the document!
    Document splitDocument = manager.extract(beginPageIndex,endPageIndex);
    // (boilerplate metadata insertion -- ignore it)
    buildAccessories(splitDocument,"Split (" + (splitIndex) + ")","splitting a PDF document into 4MB files");
    // 2. Serialize the split file!
    serialize(splitDocument.getFile(),this.getClass().getSimpleName() + "." + (splitIndex),false);
  }
  /**
    Gets the data size of the specified page expressed in bytes.
    @param page Page whose data size has to be calculated.
    @param visitedReferences References to data objects excluded from calculation.
      This set is useful, for example, to avoid recalculating the data size of shared resources.
      During the operation, this set is populated with references to visited data objects. 
  */
  private static long getSize(
    Page page,
    Set<PdfReference> visitedReferences
    )
  {return getSize(page.getBaseObject(),visitedReferences,true);}
  /**
    Gets the data size of the specified data object expressed in bytes.
    @param object Data object whose size has to be calculated.
    @param visitedReferences References to data objects excluded from calculation.
      This set is useful, for example, to avoid recalculating the data size of shared resources.
      During the operation, this set is populated with references to visited data objects. 
    @param isRoot Whether this data object represents the page root.
  */
  private static long getSize(
    PdfDirectObject object,
    Set<PdfReference> visitedReferences,
    boolean isRoot
    )
  {
    long dataSize = 0;
    {
      PdfDataObject dataObject = File.resolve(object);
      // 1. Evaluating the current object...
      if(object instanceof PdfReference)
      {
        PdfReference reference = (PdfReference)object;
        if(visitedReferences.contains(reference))
          return 0; // Avoids circular references.
        if(dataObject instanceof PdfDictionary
          && PdfName.Page.equals(((PdfDictionary)dataObject).get(PdfName.Type))
          && !isRoot)
          return 0; // Avoids references to other pages.
        visitedReferences.add(reference);
        // Calculate the data size of the current object!
        IOutputStream out = new Buffer();
        reference.getIndirectObject().writeTo(out);
        dataSize += out.getLength();
      }
      // 2. Evaluating the current object's children...
      Collection<PdfDirectObject> values = null;
      {
        if(dataObject instanceof PdfStream)
        {
          PdfStream streamDataObject = (PdfStream)dataObject;
          dataObject = streamDataObject.getHeader();
        }
        if(dataObject instanceof PdfDictionary)
        {values = ((PdfDictionary)dataObject).values();}
        else if(dataObject instanceof PdfArray)
        {values = (PdfArray)dataObject;}
      }
      if(values != null)
      {
        // Calculate the data size of the current object's children!
        for(PdfDirectObject value : values)
        {dataSize += getSize(value,visitedReferences,false);}
      }
    }
    return dataSize;
  }
}

enjoy!
Stefano *<:o)
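
For readers who want to follow the splitting loop without the library, here is a rough, self-contained sketch of the same greedy idea. Everything in it is an assumption made for illustration: PageInfo, differentialSize and split are hypothetical names, pages are modelled as an own size plus a map of shared resource ids to sizes, and the numbers are arbitrary. It also adds a guard (i > begin) that is not in the posted sample, so that a single page exceeding the limit forms its own part instead of producing an empty range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GreedySplitSketch {
  /** Hypothetical page model: its own byte size plus the resources it uses (id to size). */
  static final class PageInfo {
    final long ownSize;
    final Map<String, Long> resources;

    PageInfo(long ownSize, Map<String, Long> resources) {
      this.ownSize = ownSize;
      this.resources = resources;
    }
  }

  /** Size the page adds on top of the resources already recorded in visited. */
  static long differentialSize(PageInfo page, Set<String> visited) {
    long size = page.ownSize;
    for (Map.Entry<String, Long> resource : page.resources.entrySet()) {
      if (visited.add(resource.getKey())) {
        size += resource.getValue();
      }
    }
    return size;
  }

  /** Returns [begin, end) page index ranges whose estimated size stays within maxSize where possible. */
  static List<int[]> split(List<PageInfo> pages, long maxSize) {
    List<int[]> ranges = new ArrayList<>();
    Set<String> visited = new HashSet<>();
    long runningSize = 0;
    int begin = 0;
    for (int i = 0; i < pages.size(); i++) {
      runningSize += differentialSize(pages.get(i), visited);
      if (runningSize > maxSize && i > begin) {
        ranges.add(new int[] {begin, i}); // Close the current part before page i.
        begin = i;
        visited = new HashSet<>();        // New part: shared resources must be counted again.
        runningSize = differentialSize(pages.get(i), visited);
      }
    }
    if (begin < pages.size()) {
      ranges.add(new int[] {begin, pages.size()}); // Last part.
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Four pages of size 2, all sharing one resource of size 3; limit of 8.
    List<PageInfo> pages = new ArrayList<>();
    for (int i = 0; i < 4; i++) {
      pages.add(new PageInfo(2, Collections.singletonMap("font", 3L)));
    }
    for (int[] range : split(pages, 8)) {
      System.out.println("[" + range[0] + ", " + range[1] + ")");
    }
    // Prints "[0, 2)" then "[2, 4)".
  }
}
```

Without the reset of visited at the split, the third page would be measured as 2 instead of 5, which is the kind of underestimate the corrected sample avoids.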

Wilson - 2010-12-14

Hi Stefano,

Really appreciate your assistance on this.

Any idea when 0.1.0 will be released?

Our IT department couldn't use the sample because they couldn't convert it to VB.NET.

And yes, we would certainly contribute.

Best regards,
Wilson Fung

Stefano Chizzolini - 2010-12-14

0.1.0 should be released in late January 2011.
Stefano *<:o)
