About readXRefTable

2009-04-17
2013-01-26
  • exazettayotta
    2009-04-17

    Hello,

    I've been using PDF Clown and finding it useful.

    However, when I read a fairly large file (more than 1 MB), the readXRefTable method takes too much time.

    Some PDF files have more than 5000 xref entries.

    I think it could be changed to read entries only when they are actually needed, rather than reading all of them up front.

    Is this possible? If so, I'm willing to write the code, make it open, and send it to you as the author.

    If it is possible and you have no plans to implement it yourself, please point me in the right direction.

    Thank you.

     
    • Hi,

      Your optimization proposal is absolutely reasonable: according to the PDF spec [PDF:1.6:3.4.3], cross-reference tables are actually intended to be accessed randomly. The current implementation is just a (robust) simplification that retrieves all the cross-reference entries at once.

      As readXRefTable's purpose is to populate the cross-reference entries used by the files.IndirectObjects collection [1], I suggest you examine the way this class consumes such entries (see instance variable 'xrefEntries'): you'll find that there's just one place that uses them (I love simplicity!).
      Instead of getting the entry directly from an array (as it currently does), you could implement a proxy class that hides the on-demand reading operations, caching the entries already read and submitting the pending requests to the tokens.Reader [2] class.
      So the initial ReadXRefTable method should retrieve only the byte offset and object number range of each xref section, so that later calls from the proxy can be routed efficiently to the proper file locations.
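
      The routing described above could be sketched roughly as follows. This is only an illustration, not PDF Clown code: the names XRefLocator, Section, entryPosition and parseEntry are hypothetical. It relies on the fact that each entry in a classic cross-reference table is exactly 20 bytes long (10-digit offset, space, 5-digit generation, space, 'n' or 'f', 2-character end-of-line), so the position of an entry can be computed from the section's byte offset and first object number. Cross-reference streams (PDF 1.5+) are not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index of xref subsections; it stores no entries, only where to find them. */
class XRefLocator {
    /** Width of one classic xref-table entry, end-of-line marker included. */
    static final int ENTRY_SIZE = 20;

    /** One xref subsection: where its first entry starts and which objects it covers. */
    static final class Section {
        final long firstEntryOffset; // byte offset of the subsection's first entry
        final int firstObject;       // object number of that first entry
        final int count;             // number of entries in the subsection

        Section(long firstEntryOffset, int firstObject, int count) {
            this.firstEntryOffset = firstEntryOffset;
            this.firstObject = firstObject;
            this.count = count;
        }

        boolean contains(int objectNumber) {
            return objectNumber >= firstObject && objectNumber < firstObject + count;
        }
    }

    // Searched in insertion order: add sections from the most recent
    // xref table first so that incremental updates win over older entries.
    private final List<Section> sections = new ArrayList<>();

    void addSection(long firstEntryOffset, int firstObject, int count) {
        sections.add(new Section(firstEntryOffset, firstObject, count));
    }

    /** Returns the file position of the entry for objectNumber, or -1 if no section covers it. */
    long entryPosition(int objectNumber) {
        for (Section s : sections) {
            if (s.contains(objectNumber)) {
                return s.firstEntryOffset + (long) (objectNumber - s.firstObject) * ENTRY_SIZE;
            }
        }
        return -1;
    }

    /** Parses one 20-byte entry and returns {offset, generation, inUse (1 or 0)}. */
    static long[] parseEntry(byte[] raw) {
        // Bytes 18-19 are the end-of-line marker and are ignored.
        String text = new String(raw, 0, 18, StandardCharsets.US_ASCII);
        long offset = Long.parseLong(text.substring(0, 10));
        long generation = Long.parseLong(text.substring(11, 16));
        long inUse = text.charAt(17) == 'n' ? 1 : 0;
        return new long[] {offset, generation, inUse};
    }
}
```

      With such an index, the initial pass only has to record one (offset, first object, count) triple per subsection; the 20 bytes of an individual entry would be read and parsed later, when the entry is first requested.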

      To summarize, this is what needs to be done:
      1) Reader.ReadXRefTable must only read xref section information (byte offset and object number range);
      2) a proxy class delegates the requests of xref entries to the Reader, caching the retrieved entries;
      3) IndirectObjects class uses the proxy to retrieve xref entries on demand.
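
      A minimal sketch of step 2, under stated assumptions: XRefEntryLoader and LazyXRefEntries are hypothetical names, and the loader stands in for whatever on-demand read the Reader would expose. For brevity the entry is reduced to its byte offset, and I/O error handling is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical callback standing in for the on-demand read done by the Reader. */
interface XRefEntryLoader {
    long loadOffset(int objectNumber);
}

/** Proxy that fetches xref entries lazily and remembers those already read. */
class LazyXRefEntries {
    private final XRefEntryLoader loader;
    private final Map<Integer, Long> cache = new HashMap<>();

    LazyXRefEntries(XRefEntryLoader loader) {
        this.loader = loader;
    }

    /** Returns the byte offset for objectNumber, reading it from the file only on first use. */
    long offsetOf(int objectNumber) {
        Long cached = cache.get(objectNumber);
        if (cached == null) {
            cached = loader.loadOffset(objectNumber); // pending request goes to the reader
            cache.put(objectNumber, cached);          // later requests are served from memory
        }
        return cached;
    }

    /** Number of entries read so far (useful for checking that loading really is lazy). */
    int cachedCount() {
        return cache.size();
    }
}
```

      The consumer (step 3) would then call offsetOf(n) where it currently indexes the array, so only the entries of objects that are actually dereferenced get read.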

      I think it's quite feasible: please keep the KISS principle in mind as our main rule.

      Thank you!
      Stefano

      [1] http://clown.sourceforge.net/API/it/stefanochizzolini/clown/files/IndirectObjects.html
      [2] http://clown.sourceforge.net/API/it/stefanochizzolini/clown/tokens/Reader.html