From: Reiner M. <re...@mi...> - 2011-05-05 11:28:50

I have some scanned books containing black-and-white pages. Currently a page takes about 1 MB of disk space.

* Is there a way to reduce that to, e.g., 500 kB without losing the (OCR) text?

Regards,
Reiner Miericke

From: Reiner M. <re...@mi...> - 2011-05-13 13:31:57

Is there really nobody around who knows how to replace the images in PDF files with smaller ones (gray --> black/white) without losing the OCR text?

On Thursday, 5 May 2011, 13:11:56, Reiner Miericke wrote:
> I have some scanned books, containing black/white pages. Currently a page
> takes about 1MB of disk space.
> * Is there a way to reduce the amount to eg. 500 kB without losing the
> (OCR-) text ?

Kind regards,
Reiner Miericke

From: Jozef <mis...@ho...> - 2011-05-13 14:45:18

On 13.5.2011 15:34, Reiner Miericke wrote:
> Is there really nobody around who knows how to exchange images by smaller ones
> (gray --> black/white) in PDF files without losing the OCR-Text?

Well, this is a difficult question. It depends on how the OCR software produced the file, but at the moment there is no easy, straightforward way to do it in pdfedit.
If it is a matter of removing an image, reducing it, and inserting it back, then the scripting language could help you, but it is really not an easy route.

That would require deeper investigation, and I doubt that anybody would sacrifice their free time on this marginal issue. Sorry.

jozef

>> I have some scanned books, containing black/white pages. Currently a page
>> takes about 1MB of disk space.
>> * Is there a way to reduce the amount to eg. 500 kB without losing the
>> (OCR-) text ?
>
> Kind regards,
> Reiner Miericke
>
> _______________________________________________
> Pdfedit-support mailing list
> Pdf...@li...
> https://lists.sourceforge.net/lists/listinfo/pdfedit-support

From: Reiner M. <re...@mi...> - 2011-05-15 12:51:36

Thanks for answering, Jozef.

I thought the image is stored in one piece somewhere on a page, and all I have to do is:
- extract the image
- recompress it with a better compression
- encode the image for PDF
- and just swap it in (same position, dimensions, etc.)

Isn't it that easy, so that it could be done by, e.g., a Perl script?

On Friday, 13 May 2011, 16:45:08, Jozef wrote:
> On 13.5.2011 15:34, Reiner Miericke wrote:
> > Is there really nobody around who knows how to exchange images by smaller
> > ones (gray --> black/white) in PDF files without losing the OCR-Text?
>
> Well, this question is a difficult one. It depends on how the OCR
> software has done it but there is no easy straightforward way how to do
> it in pdfedit at this moment.
> If it is about removing, reducing, inserting back an image then the
> script language could help you but really not an easy way.
>
> that would require deeper investigation which I doubt that somebody
> would sacrifice his free time on this marginal issue. Sorry.
>
> jozef
>
> >> I have some scanned books, containing black/white pages. Currently a
> >> page takes about 1MB of disk space.
> >> * Is there a way to reduce the amount to eg. 500 kB without losing the
> >> (OCR-) text ?

--
Kind regards,
Reiner Miericke

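The four steps listed in this message (extract, recompress, re-encode, swap in) are not something pdfedit offers as a single operation, as Jozef says. A different route, which nobody in the thread proposed and which is shown here only as a rough sketch, is to let Ghostscript rewrite the whole file and downsample the page images on the way. The sketch assumes that Ghostscript (`gs`) is installed; `shrink_pdf` and `DRY_RUN` are hypothetical names chosen for this example. It lowers the image resolution; it does not convert gray images to black/white. Whether the hidden OCR text survives depends on how the OCR software embedded it, so the result should be checked before the original is discarded.

```shell
#!/bin/bash
# Sketch only: rewrite a PDF with downsampled gray and monochrome images.
# Assumption: Ghostscript ("gs") is installed; it is not part of pdfedit.

# shrink_pdf INPUT OUTPUT [DPI]
# With DRY_RUN set to a non-empty value the command is printed, not run.
shrink_pdf() {
    local in="$1" out="$2" dpi="${3:-300}"

    # Build the command in the positional parameters so that file names
    # containing spaces are passed through intact.
    set -- gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -dDownsampleGrayImages=true "-dGrayImageResolution=$dpi" \
        -dDownsampleMonoImages=true "-dMonoImageResolution=$dpi" \
        "-sOutputFile=$out" "$in"

    if [ -n "$DRY_RUN" ] ; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

A cautious way to use it is `DRY_RUN=1 shrink_pdf book.pdf book-small.pdf 300` to review the command first, then the same call without `DRY_RUN`, always writing to a new file so the original scan stays untouched.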
From: Eric D. <er...@do...> - 2011-05-15 13:15:14

On 05/15/2011 08:54 AM, Reiner Miericke wrote:
> I thought the image is stored in one bunch somewhere on a page and all I have
> to do is
> - extract the image
> - to a better compression
> - encode the image for PDF
> - and just exchange is (same position and dimensions etc)

I've never done anything like this before, but below are some tips that I found on this page:
http://forums.debian.net/viewtopic.php?f=10&t=55341

The script shows how OCR software can extract the text and how the file size of the images can be reduced. You could probably modify it to meet your needs.

Good luck!
- Eric

Step One: Install the necessary packages:

apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils

Step Two: Create a script like the following:

#!/bin/bash

## script to:
##  * split a PDF up by pages
##  * convert them to an image format
##  * read the text from each page
##  * concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
## (the file is expected to be in the current directory)
INFILE=$1
if [ -z "$INFILE" ] ; then
    printf "No file specified. Exiting.\n"
    exit 1
fi
if [ ! -f "$INFILE" ] ; then
    printf "%s is not a file. Exiting.\n" "$INFILE"
    exit 1
fi

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
    rm -rf /tmp/image2text
fi
mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text

## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf pg_0002.pdf pg_0003.pdf
pdftk "$INFILE" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
    printf "Failed to burst %s. Exiting.\n" "$INFILE"
    exit 1
else
    ## do you really need doc_data.txt ???
    rm doc_data.txt
fi

## now let's turn each PDF page into text
for i in pg*.pdf ; do

    ## convert it to a PPM image file at 600 dots per inch
    pdftoppm -r 600 $i ${i%.pdf}.ppm

    ## make sure the command worked
    if [ -f ${i%.pdf}.ppm-1.ppm ] ; then
        ## change the goofy file name
        mv ${i%.pdf}.ppm-1.ppm ${i%.pdf}.ppm
    else
        printf "The PPM file: ${i%.pdf}.ppm-1.ppm was not created. Exiting.\n"
        exit 1
    fi

    ## convert the file to a JPEG image with ImageMagick
    ## scanning the JPEG yields slightly better results
    ## and you get a much smaller file size
    convert ${i%.pdf}.ppm ${i%.pdf}.jpg

    ## make sure the command worked
    if [ -f ${i%.pdf}.jpg ] ; then
        ## get rid of the massive PPM file and the PDF file
        rm ${i%.pdf}.ppm $i
    else
        printf "The JPG file: ${i%.pdf}.jpg was not created. Exiting.\n"
        exit 1
    fi

    ## read text from the page
    djpeg -pnm ${i%.pdf}.jpg | gocr - > ${i%.pdf}.txt

    ## make sure the command worked
    if [ -f ${i%.pdf}.txt ] ; then
        ## get rid of the JPG file
        rm ${i%.pdf}.jpg
    else
        printf "The TXT file: ${i%.pdf}.txt was not created. Exiting.\n"
        exit 1
    fi
done

## concatenate the pages into a single text file
cat pg*.txt > "$DIR/${INFILE%.pdf}.txt"

## remove the temporary directory
cd "$DIR"
if [ -f "${INFILE%.pdf}.txt" ] ; then
    rm -rf /tmp/image2text
    ## get out of here!
    printf "All done. Have fun! \n"
else
    printf "Failed to generate %s\n" "${INFILE%.pdf}.txt"
    printf "Individual text files can be found in: /tmp/image2text/ \n"
fi
exit

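Eric's script ends with a plain-text file rather than a smaller PDF, so any modification of it still needs a way to tell whether a result meets the original goal: roughly 500 kB per page with the OCR text still present. The following is a small sketch for that check, assuming `pdfinfo` and `pdftotext` from the poppler-utils package already named in Step One. The function names are made up for this example.

```shell
#!/bin/bash
# Sketch: report average size per page and the amount of extractable text.
# Assumption: pdfinfo and pdftotext (poppler-utils) are installed.

# Average bytes per page, from a total size in bytes and a page count.
bytes_per_page() {
    echo $(( $1 / $2 ))
}

# Number of pages, as reported by pdfinfo.
page_count() {
    pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Number of words pdftotext can extract; 0 suggests the OCR layer is gone.
text_words() {
    pdftotext "$1" - | wc -w
}

# Print one summary line for a PDF file.
report() {
    local pdf="$1" size pages
    size=$(wc -c < "$pdf")
    pages=$(page_count "$pdf")
    printf '%s: %s bytes/page, %s words of text\n' \
        "$pdf" "$(bytes_per_page "$size" "$pages")" "$(text_words "$pdf")"
}
```

Running `report book.pdf` and then `report book-small.pdf` puts the two files side by side: the bytes-per-page figure should drop, while the word count should stay about the same if the text layer was kept.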