#5 text extraction unnecessarily loads images and other resources

Status: closed
Owner: mtraut
Labels: None
Priority: 5
Updated: 2010-06-25
Created: 2009-12-18
Private: No

This is a feature request to reduce memory consumption when extracting text from PDF files. A diagnosis of the problem and a possible solution follow.

I'm using jPod to index PDF files, so I need only the text content. However, memory consumption grows enormously on files that contain images or other drawing-heavy content (e.g. AutoCAD drawings). I tracked the problem down to CSDeviceBasedInterpreter: its rendering operation methods always load resources, even if the device (CSTextExtractor in my case, but any CSTextDevice will do) does nothing with them.

One possible solution is to extend the ICSDevice interface with methods like "boolean isInlineImageImplemented()". If it returns false, render_EI() in CSDeviceBasedInterpreter does nothing and returns immediately. Similar "implemented" methods could be added for other operations, at least for those that use expensive resources, like doXObject().
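
To make the idea concrete, here is a minimal sketch of the capability check. It does not use the real jPod classes: Device, TextOnlyDevice, ImageDevice and SketchInterpreter are hypothetical, simplified stand-ins for ICSDevice, a text device and CSDeviceBasedInterpreter, and the Supplier stands in for whatever actually loads the image resource.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for ICSDevice (NOT the real jPod interface). */
interface Device {
    /** Proposed capability query: does this device use inline images at all? */
    boolean isInlineImageImplemented();

    /** Called by the interpreter with the loaded image data. */
    void inlineImage(byte[] imageData);
}

/** Hypothetical text-only device: it ignores images entirely. */
class TextOnlyDevice implements Device {
    @Override
    public boolean isInlineImageImplemented() {
        return false;
    }

    @Override
    public void inlineImage(byte[] imageData) {
        // Nothing to do: a text device has no use for image data.
    }
}

/** Hypothetical rendering device that does consume images. */
class ImageDevice implements Device {
    int imagesDrawn = 0;

    @Override
    public boolean isInlineImageImplemented() {
        return true;
    }

    @Override
    public void inlineImage(byte[] imageData) {
        imagesDrawn++;
    }
}

/** Hypothetical stand-in for the interpreter's render_EI() logic. */
class SketchInterpreter {
    private final Device device;
    private int resourcesLoaded = 0;

    SketchInterpreter(Device device) {
        this.device = device;
    }

    /** The loader is only invoked if the device declares it needs the image. */
    void renderInlineImage(Supplier<byte[]> loader) {
        if (!device.isInlineImageImplemented()) {
            return; // skip the expensive load altogether
        }
        byte[] data = loader.get();
        resourcesLoaded++;
        device.inlineImage(data);
    }

    /** Number of times the expensive loader was actually invoked. */
    int resourcesLoaded() {
        return resourcesLoaded;
    }
}
```

With a TextOnlyDevice the loader is never called, so the image bytes are never allocated; with an ImageDevice the behaviour is unchanged. The same guard could be applied to other expensive operations such as doXObject().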

BTW, thank you for the library!

Discussion

  • mtraut

    mtraut - 2010-06-25

    ICSDeviceFeatures added in 5.3.0 (coming soon)

     
  • mtraut

    mtraut - 2010-06-25
    • assigned_to: nobody --> mtraut
     
  • mtraut

    mtraut - 2010-06-25
    • status: open --> closed
     
