A single PDF file exposes many bugs. You should add this file to your test suite.
Two different workflows were used: one with gscan2pdf at the beginning of the workflow, and another with it at the end.
Characteristics of the file:
* image is negated and shown inverted
* image is mirrored and shown mirrored
* all the pages are rotated, in two different directions
* two pages per sheet
* looks like a staple-bound manual was unstitched and scanned by another person?
* scan was manually duplexed, so the sheets are in the wrong order
  * manual duplexing resulted in half the pages being in reverse order
Some of these are probably unusual, but some are very common, such as those that result from unstitching and manual duplexing.
The goal was to produce a corrected file where the pages were 1-up, right side up, in the right order, and had text sandwiched behind so that these work: search, copy and paste, and screen readers (the person who owns the radio is legally blind).
Bugs include
1. Two error messages from unpaper (quoted under "Some error messages" below).
2. Unpaper has to be run 4 times on the same file because the program stops and leaves a dormant unpaper process.
3. Error message popups from tesseract cause the program to effectively crash. You can't dismiss all the popups, no matter how many times you click close.
4. When you move the window, one set of popups moves with the window and another doesn't.
5. You can't select multiple pages without losing your place.
6. You can't preview the files while you are selecting them (so you can tell which are upside down, for example). This compounds the problem from the next bug.
7. You can't autoselect ODD PAIRS and EVEN PAIRS. After you have converted from 2-up to 1-up, you will need to.
8. Some of the filters, such as negate, are undone without consent after you run other filters.
9. Sometimes, when you select "Double" on Clean Up, you can't select 2 pages and left to right because they stay greyed out, and sometimes you can.
10. If you have 2-up pages that have been virtually cropped by other software to make 1-up pages, they show up as 2-up pages in gscan2pdf. So each page appears twice, which will screw up the OCR, etc.
11. Double-negated pages show up as negated in preview (and probably in processing as well).
12. It appears that unless you use mogrify as a user filter for your processing, instead of the built-in filters, you may end up with a low-quality file with negated and backwards or rotated bitmaps embedded, which will cause trouble later.
13. Formatting of OCR'ed paragraphs is horrible when cutting and pasting.
14. If you run Clean Up on this file and select "Single", you end up with a bunch of blank pages except for the binding shadow.
15. Could not recollate the file to compensate for oddball page ordering with interleaved pages, half of them reversed, etc. You need to be able to handle any page order that can result from running mpage or similar N-up software, and other order and orientation aberrations that occur when scanning.
gscan2pdf 2.1.0
Ubuntu 18.04.3 LTS
sudo apt-get install pdfsam
sudo add-apt-repository ppa:malteworld/ppa
sudo apt-get install pdftk
sudo apt-get install pdfmod pdfsam qpdf pdfshuffler
sudo apt-get install mupdf mupdf-tools &
sudo apt-get install gscan2pdf unpaper
sudo apt-get install pdfsandwich
pip install pypdfocr
sudo apt-get install ocrfeeder
sudo apt install ocrmypdf
wget http://www.radiomanual.info/schemi/YAESU_VU/FT-208R_user.pdf
mv FT-208R_user.pdf FT-208R_user_bad.pdf
# this command line based workflow fails at ocr stage
# could probably make it work by bursting the PDF to individual images and
# cleaning those up with mogrify first. I tried another workflow with gscan2pdf
# Surprise: x and y in poster are based on raw page rotation
# instead of viewed page rotation, so we had to cut in Y instead of X.
# Note that poster just virtually crops the page; the other half is still
# there, which causes trouble later in gscan2pdf.
mutool poster -y 2 FT-208R_user_step1.pdf FT-208R_user_step2.pdf
mutool merge -o FT-208R_user_step3.pdf FT-208R_user_step2.pdf 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,52,50,48,46,44,42,40,38,36,34,32,30,28,26,24,22,20,18,16,14,12,10,8,6,4,2
# think this next one did nothing
pdftk FT-208R_user_step3.pdf rotate evensouth output FT-208R_user_step4.pdf
# this did the work of rotating every other page upside down
pdftk FT-208R_user_step3.pdf rotate 1-52evendown output FT-208R_user_step
# echo pypdfocr no longer supported
mutool clean FT-208R_user_step4.pdf FT-208R_user_step5.pdf
time ocrmypdf --force-ocr FT-208R_user_step5.pdf FT-208R_user_step6.pdf
echo ^errors no visible pages
time pdfsandwich FT-208R_user_step6.pdf
echo ^sandwich failed, too
apt-cache search ocrfeeder
man ocrfeeder
ocrfeeder FT-208R_user_step5.pdf &
# ocrfeeder nearly crashed system?
# unpaper warns that using AVStream.codec to pass codec parameters is deprecated
pdftk A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output FT-208R_user_step9.pdf
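
The 52-entry page list given to `mutool merge` above follows a simple pattern (odd pages ascending, then even pages descending), so it could be generated rather than typed by hand. A minimal sketch, assuming GNU `seq` and an even page count; `page_order` is a made-up helper name, not part of any of the tools used here.

```shell
# Hypothetical helper: print "1,3,...,N-1,N,N-2,...,2" for an even page count N.
# Assumes GNU seq, which puts the -s separator between numbers only.
page_order() {
    n=$1
    odd=$(seq -s, 1 2 $((n - 1)))   # ascending odd pages: 1,3,5,...,N-1
    even=$(seq -s, "$n" -2 2)       # descending even pages: N,N-2,...,2
    printf '%s,%s\n' "$odd" "$even"
}

# Possible usage (file names as in the commands above):
#   mutool merge -o FT-208R_user_step3.pdf FT-208R_user_step2.pdf "$(page_order 52)"
```

For a differently scanned document the pattern would differ, so the generated list is worth checking against a few pages before running the merge.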
Some error messages:
ERROR - Warning running unpaper: out of deviation range - NO ROTATING
ERROR - Error running unpaper: [image2 @ 0x555ed97d3500] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x555ed97d3500] Encoder did not produce proper pts, making some up.
[image2 @ 0x555ed97d32e0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
ERROR - Error processing with tesseract: Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 608
ERROR - Error processing with tesseract: Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 294
Thanks for the report. v2.1.0 was released in April 2018 and many bugs have been fixed since then. Please upgrade to v2.6.3 via the PPA.
I did upgrade to gscan2pdf 2.6.3. However, I will point out that I am
running the latest LTS version of Ubuntu. If there were major
improvements in usability, the program should have been pushed to the main
repository. It did go smoother but most of the bugs still exist. Some
weren't tested again because they had been eliminated from the workflow.
A few others were noticed. Again, you should add this file to your test
corpus.
There was probably a second copy of gscan2pdf running at the time. Second
time it imported ok.
thng. There is no stop button. Undo didn't work; it reverted the pages
that had been processed already but new ones continued to be altered. Had
to kill the program. The filter was a user-defined mogrify.
the old text on the pull down menu when you try to run it.
do one page per cpu core, assuming your code is reentrant. Replace a for
loop with a driver function. Mogrify itself does use multiple CPUs, at
least on some operations. Tesseract seems to be using multiple CPUs
although most are running considerably lower than 100%. Should be used
where appropriate. You probably don't want to run 6 copies of tesseract
but do need to run 6 copies of unpaper.
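
The one-job-per-core idea can at least be tried outside gscan2pdf. A rough sketch, assuming GNU `xargs` and `nproc` are available; `run_per_core` is a hypothetical helper, and this says nothing about how gscan2pdf schedules its own jobs.

```shell
# Hypothetical helper: run a shell snippet once per file, with at most one
# job per CPU core. Inside the snippet, "$1" is the current file name.
# Usage: run_per_core 'SNIPPET' FILE...
run_per_core() {
    snippet=$1
    shift
    [ "$#" -gt 0 ] || return 0      # nothing to do without files
    # NUL-delimited names survive spaces; -n 1 hands each job one file.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$(nproc)" sh -c "$snippet" _
}

# Possible usage, assuming the pages were already burst to page-*.pnm
# in the current directory:
#   run_per_core 'unpaper "$1" "clean-$1"' page-*.pnm
```

Capping the job count at `nproc` suits a single-threaded tool like unpaper; for tesseract, which already spreads over several cores, a lower `-P` value is probably the better choice.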
jumps to the top each time
see next item.
made a window larger than the screen that you cannot scroll and you lose
access to the control to close the window. For the average user, the
program is now hung. I was able to recover by using ALT-F7 to pan the
window to expose the close. However, if there had been more pages/errors
I would have lost the main window entirely off the edge of the screen as it
moves too when using ALT-F7. So, the worst bug of the previous version
still exists but in a very different form.
but author and title are now mickey mouse. Putting bogus data in fields
that end up on public search engines is not a good idea. Should have left
those blank if you didn't have anything useful.
commas, and other inappropriate characters. see detox(1).
generated.
A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output
FT-208R_user_step9.pdf
setup to avoid them.
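
On the filename point above (commas and other inappropriate characters), here is a rough sketch of the kind of clean-up meant. It is not detox's actual algorithm, only an approximation using `tr`, and `safe_name` is a hypothetical helper name.

```shell
# Hypothetical helper: replace every run of characters outside
# A-Z a-z 0-9 . _ - with a single underscore.
safe_name() {
    printf '%s' "$1" | LC_ALL=C tr -cs 'A-Za-z0-9._-' '_'
}

# Possible usage:
#   mv "$f" "$(safe_name "$f")"
```

Note that `tr -s` also collapses runs of underscores that were already in the name.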
Copied text
The
FT-208R
is
an
all-new
microprocessor-based
2m
FM
transceiver
for
the
demanding
amateur
operator.
Featuring
2.5
watts
of RF
output, the
FT-208R
provides
4 MHz (2 MHz) coverage
in 5 kHz or
10 kHz (12.5 kHz
or
25 kHz) steps,
along
with
10
memories
for
storage
of
favorite
channels.
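
Until the text layer is laid out better, text copied like this can be rejoined after pasting. A stopgap sketch, assuming the pasted text arrives on standard input; `reflow` is a hypothetical helper, and it also discards any real paragraph breaks.

```shell
# Hypothetical helper: join all input lines into one line, collapsing
# every run of whitespace (including newlines) into a single space.
reflow() {
    tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Possible usage, with the copied text saved to a file:
#   reflow < copied.txt
```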
On Sun, Jan 12, 2020 at 6:32 AM Jeffrey Ratcliffe ra28145@users.sourceforge.net wrote:
Related
Bugs: #340
Obviously, I've got no control over what Ubuntu pushes to LTS and what it doesn't. But generally, only security bugs and the like get fixed. The only programs to get bigger upgrades are those that are hard to patch, such as Firefox.
If you import a PDF or DjVu, gscan2pdf remembers the metadata, i.e. "MICKEY MOUSE" was the title and author in the PDF metadata. You can confirm this yourself with:
pdfinfo FT-208R_user.pdf
You can change the default filename in Edit/Preferences.
The problem with the images importing negated is due to pdfimages (part of the poppler library) extracting them that way. You can confirm this for yourself with:
pdfimages FT-208R_user.pdf FT-208R_user
The cancel button in the bottom right-hand corner worked for me.
I'm not sure what you mean by this. In your example, the odd and even pages are rotated in opposite directions. I fixed this by selecting all odd pages with Edit/Select/Odd, and rotated them in one direction, before selecting all even pages with Edit/Select/Even, and rotating them in the opposite direction.
Thanks for pointing out that when you select "Double" on Clean Up, you can't always select 2 output pages.
I've just fixed this. You'll see the fix in the next release in the next couple of days.
With 2.6.4, this all works for me:
The renumber step revealed a bug where no page was selected, and thus the thumbnails scrolled to the top every time. This is of course really annoying for a long list of pages and I've fixed it for the next release.
I've just fixed this by putting the messages in a scrolled window. The scrollbars show automatically if the size of the window would exceed what the user has resized the parent window to. You'll see this in the next release.
The problem with the message window was fixed in v2.6.5, which was released on Friday. This fixes all the remaining issues for me. Please give it a try and let me know how you got on.