Thread: [Pdfedit-support] Reducing disk space of images | PDFedit


[Pdfedit-support] Reducing disk space of images

From: Reiner M. <re...@mi...> - 2011-05-05 11:28:50

I have some scanned books containing black/white pages. Currently a page
takes about 1 MB of disk space.
* Is there a way to reduce that to e.g. 500 kB without losing the (OCR)
text?

Regards
Reiner Miericke

Re: [Pdfedit-support] Reducing disk space of images

From: Reiner M. <re...@mi...> - 2011-05-13 13:31:57

Is there really nobody around who knows how to replace images with smaller ones
(gray --> black/white) in PDF files without losing the OCR text?


Kind regards
Reiner Miericke

Re: [Pdfedit-support] Reducing disk space of images

From: Jozef <mis...@ho...> - 2011-05-13 14:45:18

On 13.5.2011 15:34, Reiner Miericke wrote:
> Is there really nobody around who knows how to replace images with smaller ones
> (gray --> black/white) in PDF files without losing the OCR text?

Well, this is a difficult question. It depends on how the OCR software
built the file, but there is currently no straightforward way to do this
in pdfedit.
If it comes down to removing an image, shrinking it, and inserting it
back, then the scripting language could help you, but it would not be easy.

That would require deeper investigation, and I doubt that anybody would
sacrifice their free time on such a marginal issue. Sorry.

jozef



Re: [Pdfedit-support] Reducing disk space of images

From: Reiner M. <re...@mi...> - 2011-05-15 12:51:36

Thanks for answering, Jozef.

I thought the image is stored in one piece somewhere on a page and all I have
to do is
- extract the image
- apply a better compression
- encode the image for PDF
- and just swap it in (same position, dimensions, etc.)

Isn't it easy enough that it could be done e.g. by a Perl script?
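
The first two steps in that list might be sketched roughly as follows. This
is only a sketch under assumptions: it assumes pdfimages (from poppler-utils)
and convert (from ImageMagick) are installed, and recompress_images is a
made-up helper name. It does not encode the result for PDF or put it back
into the file.

```shell
# Sketch of the "extract" and "better compression" steps only.
# Assumes pdfimages (poppler-utils) and convert (ImageMagick) are installed.
# recompress_images is a hypothetical helper name.
recompress_images() {
    local pdf="$1" outdir="$2" img
    mkdir -p "$outdir" || return 1

    # pdfimages writes the embedded images as <prefix>-NNN.pbm/.pgm/.ppm
    pdfimages "$pdf" "$outdir/img" || return 1

    for img in "$outdir"/img-*.p?m; do
        [ -f "$img" ] || continue
        # threshold to pure black/white, then store with CCITT Group 4
        # (fax) compression, which suits bilevel page scans
        convert "$img" -threshold 50% -compress Group4 "${img%.*}.tif" || return 1
    done
}
```

It could be called as "recompress_images book.pdf /tmp/book-images". Putting
the smaller images back at the same position while keeping the OCR text
layer is the part this sketch leaves open.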



-- 
Kind regards
Reiner Miericke

Re: [Pdfedit-support] Reducing disk space of images

From: Eric D. <er...@do...> - 2011-05-15 13:15:14

On 05/15/2011 08:54 AM, Reiner Miericke wrote:
> I thought the image is stored in one piece somewhere on a page and all I have
> to do is
> - extract the image
> - apply a better compression
> - encode the image for PDF
> - and just swap it in (same position, dimensions, etc.)


I've never done anything like this before, but below are some tips that 
I found on this page:  http://forums.debian.net/viewtopic.php?f=10&t=55341

The script shows how OCR software can extract the text and how the file 
size of the images can be reduced. You could probably modify it to meet 
your needs.

Good luck!
- Eric



Step One:  Install the necessary packages:

apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils
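
Before running the script from Step Two, it may help to confirm that the
commands it calls are actually on the PATH. A minimal sketch, where
missing_tools is a made-up helper name and not part of any of the packages:

```shell
# Hypothetical helper: print each named command that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The commands the script in Step Two calls, one per installed package:
#   missing_tools gocr convert djpeg pdftk pdftoppm
```

If the call prints nothing, all five commands were found.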


Step Two:  Create a script like the following:

#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
INFILE=$1

if [ -z "$INFILE" ] ; then
     printf "No file specified. Exiting.\n"
     exit 1
fi

if [ ! -f "$INFILE" ] ; then
     printf "%s is not a file. Exiting.\n" "$INFILE"
     exit 1
fi

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
     rm -rf /tmp/image2text
fi

mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text || exit 1


## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
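## (use the base name: the copy sits directly in the temp directory)
pdftk "$(basename "$INFILE")" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
     printf "Failed to burst %s. Exiting.\n" "$INFILE"
     exit 1
else
     ## doc_data.txt holds pdftk's document metadata; not needed here
     rm -f doc_data.txt
fi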


## now let's turn each PDF page into text
for i in pg*.pdf ; do

     ## convert it to a PPM image file at 600 dots per inch
     pdftoppm -r 600 "$i" "${i%.pdf}.ppm"

     ## make sure the command worked
     if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then
          ## change the goofy file name
          mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"
     else
          printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
          exit 1
     fi

     ## convert the file to a JPEG image with ImageMagick
     ## scanning the JPEG yields slightly better results
     ## and you get a much smaller file size
     convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"

     ## make sure the command worked
     if [ -f "${i%.pdf}.jpg" ] ; then
          ## get rid of the massive PPM file and the PDF file
          rm "${i%.pdf}.ppm" "$i"
     else
          printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
          exit 1
     fi

     ## read text from the page
     djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"

     ## make sure the command worked
     if [ -f "${i%.pdf}.txt" ] ; then
          ## get rid of the JPG file
          rm "${i%.pdf}.jpg"
     else
          printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
          exit 1
     fi

done

## concatenate the pages into a single text file,
## written to the directory the script was started from
OUTFILE="$DIR/$(basename "$INFILE" .pdf).txt"
cat pg*.txt > "$OUTFILE"

## remove the temporary directory
cd "$DIR"

if [ -f "$OUTFILE" ] ; then

     rm -rf /tmp/image2text

     ## get out of here!
     printf "All done. Have fun! \n"

else
     printf "Failed to generate %s\n" "$OUTFILE"
     printf "Individual text files can be found in: /tmp/image2text/ \n"
     exit 1
fi

exit

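
The script above works in the fixed directory /tmp/image2text and deletes
whatever is already there. One possible variation, assuming mktemp is
available on the system, is to work in a freshly created directory and
remove it afterwards. This is a sketch, not part of the original script,
and with_tempdir is a made-up helper name:

```shell
# Hypothetical variation: run a command inside a fresh temporary directory
# and remove that directory afterwards, whatever the command's exit status.
with_tempdir() {
    local workdir status
    workdir=$(mktemp -d) || return 1
    # the subshell keeps the caller's working directory unchanged
    ( cd "$workdir" && "$@" )
    status=$?
    rm -rf "$workdir"
    return "$status"
}
```

For example, "with_tempdir pwd" prints the name of a directory that no
longer exists once the call returns. Anything the command should keep has
to be written outside the temporary directory before it finishes.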