How to compare two pdfs

2008-11-19
2013-01-26
  • Allen Moore
    2008-11-19

    Stefano,

    Greetings! And congratulations on what appears to be a great project.

    I have been tasked with providing a PDF comparison solution for a large set of nightly reports. Currently I have a bit stream comparison routine that tells me at a high level whether two PDFs match. What I hope PDF Clown can provide is a way to display the differences between two PDF files when they fail the bit stream compare.

    Can you point me in the right direction for using PDF Clown to compare two PDFs (A and B) and produce a third (C), which is a copy of A with the differences between A and B highlighted? Unfortunately, I have not yet had time to dig very far into the library, so at this point I am asking for high-level direction on what the basic process would look like and which PDF Clown objects/methods best lend themselves to this solution. One additional note: I will be dealing with roughly 2000 PDFs at a time, so speed is important.

    I am using C#.

    Any help you can provide is much appreciated.

    Thanks,

    Allen

     
    • Allen Moore
      2008-11-20

      More info.

      I realize that comparing two structured objects like PDF documents is semantically very involved. One possible solution I have come up with is to store a list of differences (byte ranges) from the bit stream compare. If there is a way to translate those byte ranges into the PDF objects defined in each range, then I could derive page coordinates and overlay those objects with a highlight box.
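
As a rough illustration of the byte-range idea, here is a minimal sketch. It is written in Python for brevity (the logic ports directly to C#), and the name diff_ranges is hypothetical. It assumes both files fit in memory and compares them position by position, so a single inserted byte makes everything after it count as different.

```python
from typing import List, Tuple


def diff_ranges(a: bytes, b: bytes) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges where a and b differ.

    Purely positional comparison: no attempt is made to re-align the
    two inputs after an insertion or deletion.
    """
    ranges: List[Tuple[int, int]] = []
    start = None  # start offset of the differing run currently open, if any
    common = min(len(a), len(b))

    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # open a new differing run
        elif start is not None:
            ranges.append((start, i))  # close the run at the first equal byte
            start = None

    longest = max(len(a), len(b))
    if start is not None:
        # The open run reaches the end of the shorter input; any extra
        # tail in the longer input is part of the same differing range.
        ranges.append((start, longest))
    elif longest > common:
        # Inputs agree up to the shorter length; only the tail differs.
        ranges.append((common, longest))
    return ranges
```

Note that PDF content streams are often compressed, so a small visible change can alter most of the bytes in a stream; the ranges only say where the files differ, not what changed on the page.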

      Is there any way, given a byte offset in the file stream, to determine which object(s) it belongs to?
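
Independently of any library, one heuristic way to sketch the offset-to-object lookup is to scan the raw file for "N G obj ... endobj" spans and test which span contains a given offset. The Python sketch below does that (the logic ports directly to C#); index_objects and object_at are hypothetical names, not PDF Clown API. It is only an approximation: it does not see objects packed inside compressed object streams, it can be fooled by the bytes "endobj" occurring inside stream data, and it ignores the cross-reference table, so superseded objects in incrementally updated files are indexed too.

```python
import re
from typing import List, Optional, Tuple

# "12 0 obj" header: object number, generation number, keyword.
_OBJ_START = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
_OBJ_END = re.compile(rb"\bendobj\b")

# (start offset, end offset, object number, generation number)
Span = Tuple[int, int, int, int]


def index_objects(data: bytes) -> List[Span]:
    """Collect half-open byte spans of indirect objects by scanning raw bytes."""
    spans: List[Span] = []
    pos = 0
    while True:
        head = _OBJ_START.search(data, pos)
        if head is None:
            break
        tail = _OBJ_END.search(data, head.end())
        if tail is None:
            break  # unterminated object: stop rather than guess
        spans.append(
            (head.start(), tail.end(), int(head.group(1)), int(head.group(2)))
        )
        pos = tail.end()
    return spans


def object_at(spans: List[Span], offset: int) -> Optional[Tuple[int, int]]:
    """Return (object number, generation) for the span containing offset."""
    for start, stop, num, gen in spans:
        if start <= offset < stop:
            return (num, gen)
    return None  # offset falls in the header, xref table, trailer, etc.
```

Getting from an object number to page coordinates is a separate step that needs a real PDF parser, since the object may be a content stream, a font, or a resource shared by several pages.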

      I know this is probably a "shot in the dark", but maybe I will be pleasantly surprised.

      Thanks,

      Allen