ToUnicode C# Question about PDFClown

  • Parham Mohammadi

    Dear Stefano,
    First of all, thank you, and thank you again, for your great work on PDF Clown and for your professional programming.
    I am a C# programmer and know the PDF format fairly well (I studied the Adobe PDF Specification 3 years ago).
    I am working on an application that focuses on the Unicode features of PDF.
    I have some PDF files, in Arabic and some other languages, whose fonts have incorrect ToUnicode streams. As a result, every application that converts these files to Word produces wrong output, and the content of the output files is unreadable.
    So my application has to edit the ToUnicode streams of these PDFs. I have already computed the correct ToUnicode table, so I now have a valid ToUnicode stream.
    My only problem is the replacement itself: putting my new ToUnicode stream in place of the PDF's existing ToUnicode stream.
    I know, in PDF every modification and editing will not remove existing objects. Instead of removing or replacing the content of an object's stream, a new object with the same ID is written to the PDF, and then a new XRef table with a new trailer is appended to the end of the PDF file.
    Now, my question:
    Is there a way to do this with PDF Clown?
    In other words, can PDF Clown define a new ToUnicode stream (or any other object), append it to an existing PDF, and then write a new XRef table and trailer to the file?
    Please help me... If there is a way, please write a small sample in C#.
    Again, thank you for your support.

    • Stefano Chizzolini

      Hi Parham,
      sorry for my laaaaaate reply...

      Before answering your question, I want to clarify a misleading assumption you seem to make in your preamble: when you say "I know, in PDF every modification and editing will not remove existing objects...", what you describe is just the way *incremental updates* work [PDF:1.6:2.2.7]; besides that, a PDF file can also be serialized *compactly*, automatically dropping removed/replaced contents *without* appending anything!

      PDF Clown relieves you from managing such low-level syntax elements as xref tables and trailers: you can just work with the files.IndirectObjects class [1] (which represents the collection of all the top-level objects contained within a PDF file), which you can access through the IndirectObjects property of the File class [2].
      To accomplish your specific requirement (that is, replacing ToUnicode streams), you have to get the indirect object containing the old ToUnicode stream and modify that existing stream; this way, you can correct your ToUnicode stream while keeping its indirect references alive.
      NOTE: The current version of PDF Clown (0.0.6) doesn't implement the substitution of the data object contained within an indirect object; however, it's a trivial operation that I'll include in the next release (0.0.7). In the meantime you should follow my suggestion above: modify the existing stream instead of replacing it with a new one.

      Here is a fully functional sample that does exactly what you need (enjoy! -- please look at the existing samples in the downloadable distribution of PDF Clown for any reference to such entities as PDFClownSampleLoader):

      using it.stefanochizzolini.clown.documents;
      using it.stefanochizzolini.clown.files;
      using it.stefanochizzolini.clown.objects;

      namespace it.stefanochizzolini.clown.samples
      {
        /// <summary>This sample demonstrates how to modify existing indirect objects.</summary>
        public class ObjectEditingSample
          : ISample
        {
          public void Run(
            PDFClownSampleLoader loader
            )
          {
            // (boilerplate user choice -- ignore it)
            string filePath = loader.GetPdfFileChoice("Please select a PDF file");

            // 1. Open the PDF file!
            File file = new File(filePath);

            // 2. Iterate through the indirect objects to discover existing ToUnicode streams!
            /*
              NOTE: For the sake of simplicity, I assume that all font objects
              typically reside in distinct indirect objects.
            */
            foreach(PdfIndirectObject indirectObject in file.IndirectObjects)
            {
              // Filter font objects!
              PdfDictionary dataObject = indirectObject.DataObject as PdfDictionary; // NOTE: Font object is a dictionary.
              if(dataObject == null // Data object is NOT a dictionary.
                || !PdfName.Font.Equals(dataObject[PdfName.Type])) // Dictionary is NOT a font object.
                continue; // Skip to the next indirect object.

              // Get the indirect reference to the ToUnicode stream associated to the font object!
              PdfReference toUnicodeReference = (PdfReference)dataObject[PdfName.ToUnicode];
              if(toUnicodeReference == null) // No ToUnicode stream.
                continue; // Nothing to fix for this font.

              // Get the ToUnicode stream!
              PdfStream toUnicodeStream = (PdfStream)toUnicodeReference.DataObject;
              toUnicodeStream.Body.SetLength(0); // Erases the stream content to prepare it for new content insertion.
              toUnicodeStream.Body.Append("..."); // Adds arbitrary contents (HERE you can add your ToUnicode map content!!!).
              toUnicodeReference.IndirectObject.Update(); // Ensures that the indirect object is updated.
            }

            // 3. Serialize the PDF file (again, boilerplate code -- see the PDFClownSampleLoader class source code)!
          }
        }
      }
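
The "..." placeholder in the sample stands for the text of the ToUnicode map itself. As a rough sketch of how that text could be assembled from a table of character codes, something like the following might serve as a starting point. The names ToUnicodeCMapBuilder, Build and codeByteLength are hypothetical and not part of PDF Clown; the layout follows the usual ToUnicode CMap structure (a codespace range followed by bfchar blocks of at most 100 entries each), and it assumes every code of the font has the same byte length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, NOT part of PDF Clown: builds the text of a ToUnicode CMap.
public static class ToUnicodeCMapBuilder
{
    // A single beginbfchar/endbfchar block may hold at most 100 entries.
    private const int MaxEntriesPerBlock = 100;

    // codeToText: character code -> Unicode text it represents (may be several
    // characters, e.g. a ligature).
    // codeByteLength: assumed fixed size of the font's character codes (1 or 2).
    public static string Build(IDictionary<int, string> codeToText, int codeByteLength)
    {
        int hexDigits = codeByteLength * 2;
        string codeFormat = "X" + hexDigits;
        var sb = new StringBuilder();

        // Fixed CMap header.
        sb.Append("/CIDInit /ProcSet findresource begin\n");
        sb.Append("12 dict begin\n");
        sb.Append("begincmap\n");
        sb.Append("/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n");
        sb.Append("/CMapName /Adobe-Identity-UCS def\n");
        sb.Append("/CMapType 2 def\n");

        // One codespace range covering every code of the given byte length.
        sb.Append("1 begincodespacerange\n");
        sb.Append('<').Append(new string('0', hexDigits)).Append("> <")
          .Append(new string('F', hexDigits)).Append(">\n");
        sb.Append("endcodespacerange\n");

        // Mappings sorted by code and split into blocks of at most 100 entries.
        var entries = codeToText.OrderBy(e => e.Key).ToList();
        for (int start = 0; start < entries.Count; start += MaxEntriesPerBlock)
        {
            int count = Math.Min(MaxEntriesPerBlock, entries.Count - start);
            sb.Append(count).Append(" beginbfchar\n");
            for (int i = start; i < start + count; i++)
            {
                sb.Append('<').Append(entries[i].Key.ToString(codeFormat)).Append("> <")
                  .Append(ToUtf16BeHex(entries[i].Value)).Append(">\n");
            }
            sb.Append("endbfchar\n");
        }

        // Fixed CMap footer.
        sb.Append("endcmap\n");
        sb.Append("CMapName currentdict /CMap defineresource pop\n");
        sb.Append("end\n");
        sb.Append("end\n");
        return sb.ToString();
    }

    // Destination strings are written as UTF-16BE hex, without a byte order mark.
    private static string ToUtf16BeHex(string text)
    {
        var hex = new StringBuilder();
        foreach (byte b in Encoding.BigEndianUnicode.GetBytes(text))
            hex.Append(b.ToString("X2"));
        return hex.ToString();
    }
}
```

With such a helper, the returned string could be passed to toUnicodeStream.Body.Append(...) in place of "...". For a font that uses single-byte codes, codeByteLength would be 1, which yields a codespace range of <00> <FF>.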

      Best regards


