#5 text extraction unnecessarily loads images and other resources

Status: closed
Owner: mtraut
Labels: None
Priority: 5
Updated: 2010-06-25
Created: 2009-12-18
Creator: Paul Pogonyshev
Private: No

This is a feature request to reduce memory consumption when extracting text from PDF files. A diagnosis of the problem and a possible solution follow.

I'm using jPod to index PDF files, so I need only the text content. However, memory consumption grows enormously on files that contain images or other drawing-like content (e.g. AutoCAD drawings). I tracked the problem down to CSDeviceBasedInterpreter: its rendering operation methods always load resources, even if the device (CSTextExtractor in my case, but the same holds for any CSTextDevice) does nothing with them.

One possible approach is to extend the ICSDevice interface with methods like "boolean isInlineImageImplemented()". If it returns false, render_EI() in CSDeviceBasedInterpreter does nothing and returns immediately. Similar "implemented" methods could be added for other operations, at least for those that use expensive resources, like doXObject().
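A minimal sketch of this idea, assuming simplified stand-in types rather than jPod's real classes: SketchDevice, TextOnlySketchDevice, RenderingSketchDevice and SketchInterpreter are all hypothetical names, and the Supplier stands in for whatever expensive step actually loads the image data. Only the shape of the check is meant to carry over.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for ICSDevice, reduced to what this sketch needs.
interface SketchDevice {
    // Proposed capability query: does this device do anything with inline images?
    boolean isInlineImageImplemented();

    void inlineImage(byte[] imageData);
}

// Text-only device, in the spirit of a text extractor: images are ignored.
class TextOnlySketchDevice implements SketchDevice {
    @Override
    public boolean isInlineImageImplemented() {
        return false;
    }

    @Override
    public void inlineImage(byte[] imageData) {
        // Nothing to do for text extraction.
    }
}

// Device that really consumes image data.
class RenderingSketchDevice implements SketchDevice {
    int imagesReceived = 0;

    @Override
    public boolean isInlineImageImplemented() {
        return true;
    }

    @Override
    public void inlineImage(byte[] imageData) {
        imagesReceived++;
    }
}

// Hypothetical stand-in for the device-based interpreter.
class SketchInterpreter {
    private final SketchDevice device;
    int imagesLoaded = 0; // counts how often the expensive load actually ran

    SketchInterpreter(SketchDevice device) {
        this.device = device;
    }

    // Analogue of render_EI(): the loader represents the expensive resource load.
    void renderInlineImage(Supplier<byte[]> loader) {
        if (!device.isInlineImageImplemented()) {
            return; // device ignores images, so never touch the resource
        }
        byte[] data = loader.get();
        imagesLoaded++;
        device.inlineImage(data);
    }
}
```

With a TextOnlySketchDevice the loader is never invoked, so the image bytes are never allocated; with a RenderingSketchDevice the behaviour is unchanged. The same guard could be applied to other expensive operations such as the doXObject() case.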

BTW, thank you for the library!

Discussion

  • mtraut
    2010-06-25

    ICSDeviceFeatures added in 5.3.0 (to be released soon)

     
  • mtraut
    2010-06-25

    • assigned_to: nobody --> mtraut
     
  • mtraut
    2010-06-25

    • status: open --> closed