I have been scanning a lot of papers with the venerable gscan2pdf and needed to automate more of the task. The attached patch adds automatic selection of all blank pages. It uses Image::Magick->Statistics() to get the standard deviation of the image; if the standard deviation is below a threshold, the page is considered blank. For now, the threshold can only be changed by editing the .gscan2pdf file.
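As a rough illustration of the blank-page test described above (not the Perl code from the patch), here is a minimal Python sketch. It assumes each page is available as a flat list of 8-bit grayscale pixel values; the function names and the default threshold of 0.01 are made up for the example and are not taken from gscan2pdf.

```python
from statistics import pstdev


def is_blank(pixels, threshold=0.01, max_value=255):
    """Return True if the page looks blank.

    pixels: flat list of grayscale values in 0..max_value (assumed layout).
    threshold: hypothetical cut-off on the normalized standard deviation.
    """
    if not pixels:
        raise ValueError("page has no pixels")
    # Normalize to 0..1 so the threshold does not depend on bit depth.
    normalized = [p / max_value for p in pixels]
    # A nearly uniform page has a very small standard deviation.
    return pstdev(normalized) < threshold


def select_blank_pages(pages, threshold=0.01):
    """Return the indices of all pages whose std dev is below the threshold."""
    return [i for i, px in enumerate(pages) if is_blank(px, threshold)]


if __name__ == "__main__":
    pages = [
        [255] * 100,     # pure white page
        [255, 0] * 50,   # page with heavy black/white content
        [254, 255] * 50, # almost white page with slight scanner noise
    ]
    print(select_blank_pages(pages))
```

A single global standard deviation is cheap to compute, but a page with only a small mark (a page number, say) can also fall below the threshold, so the cut-off probably needs tuning per scanner.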
This patch also keeps track of the validity of the OCR text. If the page has been rotated or thresholded, the page is flagged for update. A new menu item "Update OCR and Analyze" will update those pages which need it. When a page is re-OCRed, the results replace any old text.
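The bookkeeping described above can be sketched roughly as follows. This is only an illustration in Python, not the patch's implementation: the `Page` class, its fields, and the `run_ocr` callback are hypothetical names, and the sketch assumes a page that has never been OCRed counts as needing an update.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record: OCR text plus a 'needs update' flag."""
    text: str = ""
    ocr_stale: bool = True  # assumption: a fresh page has no valid OCR yet

    def mark_modified(self):
        """Call after an image change (rotate, threshold) to flag the page."""
        self.ocr_stale = True

    def set_ocr_text(self, text):
        """Store new OCR output, replacing any old text, and clear the flag."""
        self.text = text
        self.ocr_stale = False


def update_stale_pages(pages, run_ocr):
    """Re-OCR only the flagged pages; return the indices that were updated.

    run_ocr is a caller-supplied function taking a Page and returning text.
    """
    updated = []
    for index, page in enumerate(pages):
        if page.ocr_stale:
            page.set_ocr_text(run_ocr(page))
            updated.append(index)
    return updated


if __name__ == "__main__":
    pages = [Page(), Page()]
    pages[0].set_ocr_text("old text")
    pages[1].set_ocr_text("unchanged text")
    pages[0].mark_modified()  # e.g. the first page was rotated
    print(update_stale_pages(pages, lambda page: "fresh text"))
```

Keeping the flag on the page itself means the update action can skip pages whose text is still valid, which matters when OCR is the slowest step.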
Other changes include an experimental feature to automatically select pages with landscape orientation. The current algorithm uses the OCR text to identify mis-oriented pages. This may be replaced with a more reliable algorithm in the future. There are also some new shortcut keys to speed up the process of re-orienting pages and deleting blank pages.
I consider this a work in progress. Any feedback is welcome.
There are two patches: one for 0.9.26, which I have tested, and one for 0.9.27, which still needs testing if someone can help.