gscan2pdf / Bugs / #340 single file exposes over a dozen bugs

I did upgrade to gscan2pdf 2.6.3. However, I will point out that I am
running the latest LTS version of Ubuntu. If there were major
improvements in usability, the program should have been pushed to the main
repository. It did go smoother but most of the bugs still exist. Some
weren't tested again because they had been eliminated from the workflow.
A few others were noticed. Again, you should add this file to your test
corpus.

* The first time I ran it, it stopped importing after the first page.
  There was probably a second copy of gscan2pdf running at the time. The
  second time it imported OK.
* There is no obvious way to stop a filter that is obviously doing the
  wrong thing. There is no stop button. Undo didn't work; it reverted the
  pages that had been processed already, but new ones continued to be
  altered. I had to kill the program. The filter was a user-defined mogrify
  command.
* User-defined filters: if you edit a filter, it still shows up with the
  old text on the pull-down menu when you try to run it.
* Clean up: processing is slow while 5 cores are idle. You could easily do
  one page per CPU core, assuming your code is reentrant: replace a for
  loop with a driver function. Mogrify itself does use multiple CPUs, at
  least on some operations. Tesseract seems to be using multiple CPUs,
  although most are running considerably lower than 100%. Parallelism
  should be used where appropriate. You probably don't want to run 6 copies
  of tesseract but do need to run 6 copies of unpaper.
* Selecting ODD PAIRS and EVEN PAIRS is still missing.
* Selecting multiple pages with ctrl-click is still broken; it still jumps
  to the top each time.
* GOOD: Error messages were combined, which is good, but this is still
  broken; see the next item.
* Unfortunately, all the errors for tesseract (I didn't hit ignore) made a
  window larger than the screen that you cannot scroll, and you lose access
  to the control to close the window. For the average user, the program is
  now hung. I was able to recover by using ALT-F7 to pan the window to
  expose the close button. However, if there had been more pages/errors I
  would have lost the main window entirely off the edge of the screen, as
  it moves too when using ALT-F7. So, the worst bug of the previous version
  still exists but in a very different form.
* It remembered the subject from the previous session/program version, but
  author and title are now mickey mouse. Putting bogus data in fields that
  end up on public search engines is not a good idea. These should have
  been left blank if you didn't have anything useful.
* The copied text is still badly formatted. See below.
* Default save file name is based on author, not title.
* Default save file name is not detoxified. It contains spaces, commas, and
  other inappropriate characters. See detox(1).
* The same 3 error messages from unpaper and tesseract are still being
  generated.
* GOOD: I didn't need to restart Clean Up (unpaper) 4 times.
* Double-negated pages still show up as negated in preview.
* I still have to shuffle the pages using:
  pdftk A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output FT-208R_user_step9.pdf
* Some of the bugs weren't tested because the more recent workflow was set
  up to avoid them.
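
To make the file name complaint above concrete, here is a minimal sketch of the kind of cleanup meant by "detoxified". It is only an approximation of the idea, not the actual rules of detox(1) and not gscan2pdf code; the function name `detox_filename` and the `fallback` parameter are made up for illustration.

```python
import re
import unicodedata


def detox_filename(name: str, fallback: str = "untitled") -> str:
    """Return a shell-friendly file name stem (illustrative rules only)."""
    # Fold accented characters to their closest ASCII form where possible.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_name)
    # Trim separators left at the ends; a leading dot would hide the file.
    cleaned = cleaned.strip("._-")
    return cleaned or fallback


# Example metadata string, assumed for illustration.
print(detox_filename("FT-208R Instruction Manual, Yaesu Musen Co") + ".pdf")
```

With rules like these, spaces and commas collapse into single underscores, and an empty or all-punctuation title falls back to a placeholder instead of producing an empty name.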

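The "one page per CPU core" suggestion in the Clean up item can be sketched as a small driver function. This is an assumption about how such a driver could look, not how gscan2pdf is structured (it is not written in Python); `run_per_page` and `unpaper_page` are hypothetical names, and the output file naming is invented. Threads are enough here because the real work happens in external processes.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_per_page(pages, worker, max_workers=None):
    """Apply worker to every page concurrently; results keep the input order."""
    if max_workers is None:
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, pages))


def unpaper_page(path):
    """Hypothetical worker: run one unpaper process on one page image."""
    out = path + ".cleaned.pnm"  # invented naming scheme
    subprocess.run(["unpaper", path, out], check=True)
    return out


# Usage (not executed here): run_per_page(list_of_page_files, unpaper_page)
```

For a tool that is already multi-threaded, such as tesseract, `max_workers` could be set lower so the machine is not oversubscribed.
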
Copied text

The
FT-208R
is
an
all-new
microprocessor-based
2m
FM
transceiver
for
the
demanding
amateur
operator.
Featuring
2.5
watts
of RF
output, the
FT-208R
provides
4 MHz (2 MHz) coverage
in 5 kHz or
10 kHz (12.5 kHz
or
25 kHz) steps,
along
with
10
memories
for
storage
of
favorite
channels.
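
For comparison, the one-word-per-line text above can be joined back into paragraphs with very little logic. This is a minimal sketch on the pasted text only, assuming blank lines mark paragraph breaks; a real fix would more likely use the line and paragraph information in the OCR output, and the function name `reflow` is made up.

```python
def reflow(text: str) -> str:
    """Join word-per-line text into paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        words = line.split()
        if words:
            current.extend(words)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


sample = "The\nFT-208R\nis\nan\nall-new\n\nFeaturing\n2.5\nwatts\nof RF\noutput, the"
print(reflow(sample))
```

This does not repair words hyphenated across lines; it only shows that the pasted words are in the right order and just need to be joined.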

On Sun, Jan 12, 2020 at 6:32 AM Jeffrey Ratcliffe ra28145@users.sourceforge.net wrote:

Thanks for the report. v2.1.0 was released in April 2018 and many bugs
have been fixed since then. Please upgrade to v2.6.3 via the PPA.

[bugs:#340] single file exposes over a dozen bugs
https://sourceforge.net/p/gscan2pdf/bugs/340/

Status: open
Group: v1.0_(example)
Created: Sun Jan 12, 2020 05:51 AM UTC by Mark Whitis
Last Updated: Sun Jan 12, 2020 05:51 AM UTC
Owner: nobody
Attachments:

FT-208R_user_bad.pdf
https://sourceforge.net/p/gscan2pdf/bugs/340/attachment/FT-208R_user_bad.pdf
(2.2 MB; application/pdf)

A single PDF file exposes many bugs. You should add this file to your test
suite.

Two different workflows were used, one with gscan2pdf at the beginning of
the workflow, and another at the end.
Characteristics of the file:

* image is negated and shown inverted
* image is mirrored and shown mirrored
* all the pages are rotated, in two different directions
* two pages per sheet
* looks like a staple-bound manual was unstitched and scanned by another
  person?
* scan was manually duplexed, so the sheets are in the wrong order
* manual duplexing resulted in half the pages being in reverse order

Some of these are probably unusual, but some are very common, such as those
that result from unstitching and manual duplexing.

The goal was to produce a corrected file where the pages were 1-up, right
side up, in the right order, and had text sandwiched behind so that these
work: search, copy and paste, screen reader (the person who owns the radio
is legally blind).

Bugs include:
1. Two error messages from unpaper (see "Some error messages" below).
2. Unpaper has to be run 4 times on the same file because the program stops
and leaves a dormant unpaper process.
3. Error message popups from tesseract cause the program to effectively
crash. You can't dismiss all the popups, no matter how many times you click
close.
4. When you move the window, one set of popups moves with the window and
another doesn't.
5. You can't select multiple pages without losing your place.
6. You can't preview the files while you are selecting them (so you can
tell which are upside down, for example). This compounds the problem from
the next bug.
7. You can't autoselect ODD PAIRS and EVEN PAIRS. After you have converted
2-up pages to 1-up, you will need to.
8. Some of the filters, such as negate, are undone without consent after
you run other filters.
9. Sometimes, when you select "Double" on Clean Up, you can't select 2
pages and left to right because they stay greyed out, and sometimes you
can.
10. If you have 2-up pages that have been virtually cropped by other
software to make 1-up pages, they show up as 2-up pages in gscan2pdf. So
each page is duplicated, which will screw up the OCR, etc.
11. Double-negated pages show up as negated in preview (and probably in
processing as well).
12. It appears that unless you use mogrify as a user-defined filter instead
of the built-in filters, you may end up with a low quality file with
negated and backwards or rotated bitmaps embedded that will cause trouble
later.
13. Formatting of OCR'ed paragraphs is horrible when cutting and pasting.
14. If you run Clean Up on this file and select "Single", you end up with a
bunch of blank pages except for the binding shadow.
15. Could not recollate the file to compensate for oddball page ordering
with interleaved pages and half reversed, etc. You need to be able to
handle any page order that can result from running mpage or similar N-up
software, and other order and orientation aberrations that occur when
scanning.
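
The ODD PAIRS / EVEN PAIRS selection asked for in item 7 is simple to state in code. This sketch assumes pages are numbered from 1 and grouped as (1,2), (3,4), (5,6), ..., with "odd pairs" meaning the 1st, 3rd, 5th pair and "even pairs" the 2nd, 4th, 6th; that reading and the function name `select_pairs` are assumptions, not an existing gscan2pdf feature.

```python
def select_pairs(page_count: int, which: str = "odd") -> list:
    """Return 1-based page numbers belonging to the odd or even pairs."""
    if which not in ("odd", "even"):
        raise ValueError("which must be 'odd' or 'even'")
    # Within every block of four pages, the first two form an odd pair
    # and the last two form an even pair.
    offset = 0 if which == "odd" else 2
    return [
        page
        for page in range(1, page_count + 1)
        if (page - 1) % 4 in (offset, offset + 1)
    ]


print(select_pairs(10, "odd"))   # pages in the 1st, 3rd, 5th pair
print(select_pairs(10, "even"))  # "skip two, select two, repeat"
```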

gscan2pdf 2.1.0
Ubuntu 18.04.3 LTS

sudo apt-get install pdfsam
sudo add-apt-repository ppa:malteworld/ppa
sudo apt-get install pdftk
sudo apt-get install pdfmod pdfsam qpdf pdfshuffler
sudo apt-get install mupdf mupdf-tools &
sudo apt-get install gscan2pdf unpaper
sudo apt-get install pdfsandwich
pip install pypdfocr
sudo apt-get install ocrfeeder
sudo apt install ocrmypdf

wget http://www.radiomanual.info/schemi/YAESU_VU/FT-208R_user.pdf
mv FT-208R_user.pdf FT-208R_user_bad.pdf
# tried gscan2pdf first, didn't look like it would work so I tried another
# approach.
# this command line based workflow fails at the OCR stage
# could probably make it work by bursting the PDF to individual images and
# cleaning those up with mogrify first. I tried another workflow with
# gscan2pdf
# Surprise, x and y on poster are based on raw page rotation
# instead of viewed page rotation so we had to cut in Y instead of X
# note that poster just virtually crops the page, the other half is still
# there, which causes trouble later in gscan2pdf
mutool poster -y 2 FT-208R_user_step1.pdf FT-208R_user_step2.pdf
# Note, pdftk with its odd and even ranges could have done this without
# listing all the actual page numbers.
mutool merge -o FT-208R_user_step3.pdf FT-208R_user_step2.pdf \
  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,52,50,48,46,44,42,40,38,36,34,32,30,28,26,24,22,20,18,16,14,12,10,8,6,4,2
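
The page list passed to mutool merge above (odd pages ascending, then even pages descending) does not need to be typed by hand. A minimal sketch that generates it, assuming only a known page count; the function name `duplex_order` is made up for illustration.

```python
def duplex_order(page_count: int) -> list:
    """Odd pages in ascending order followed by even pages in descending order."""
    odds = list(range(1, page_count + 1, 2))
    evens = list(range(2, page_count + 1, 2))
    return odds + evens[::-1]


# Comma-separated form, as used on the mutool merge command line.
print(",".join(str(page) for page in duplex_order(52)))
```

For 52 pages this reproduces the list used in the command, and the same function covers other page counts.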

# think this next one did nothing
pdftk FT-208R_user_step3.pdf rotate evensouth output FT-208R_user_step4.pdf
# this did the work of rotating every other page upside down
pdftk FT-208R_user_step3.pdf rotate 1-52evendown output FT-208R_user_step
# echo pypdfocr no longer supported
mutool clean FT-208R_user_step4.pdf FT-208R_user_step5.pdf
time ocrmypdf --force-ocr FT-208R_user_step5.pdf FT-208R_user_step6.pdf
echo ^errors no visible pages
time pdfsandwich FT-208R_user_step6.pdf
echo ^sandwich failed, too
apt-cache search ocrfeeder
man ocrfeeder
ocrfeeder FT-208R_user_step5.pdf &
# ocrfeeder nearly crashed system?

# ocrfeeder made blank pages with no text.
# another attempt in gscan2pdf
# run gscan2pdf
# open FT-208R_user_bad.pdf, import all pages
# these next steps were skipped because the first time, when they were used,
# it seemed that we weren't actually changing the pixels but only how they
# were displayed, thus producing problems later.
# skip: edit -> select -> all
# skip: view -> rotate 90 clockwise
# note, it might be better to use a user defined mogrify command to change
# the actual image rather than the view
# image is negated
# skip: tools -> negate -> all
# skip: image is mirrored left to right
# skip: preferences -> general -> user defined tools, add "mogrify -flop %i"
# do all the page transforms in one
# preferences -> general -> user defined, add "mogrify -flop -rotate 270 -negate %i"
# tools -> user defined -> all pages -> mogrify
# skip: need to run negate again
# edit -> select -> all
# tools -> cleanup
# page range all, layout: double
# output pages, 2, writing systems left to right, ok
# runs unpaper. Warning out of deviation range - no rotating
# using AVStream.codec to pass parameters is deprecated
# select don't show this error again as it repeats for each page
# unpaper runs slowly
# had to restart (selecting the unprocessed pages only) because it failed
# half way through
# had to restart a second time
# Edit -> renumber, all, 1, 1,
# using ctrl-shift, select the upside down pages only (i.e. skip two,
# select two, repeat)
# bug: when you select a page, it jumps back up to near the top each time,
# making it exponentially more difficult as the number of pages increases
# edit preferences, general, user defined tools, add:
# "mogrify -rotate 180 %i"
# tools -> user defined -> Page range: selected: -> mogrify -rotate 180 %i
# tools -> ocr, all, tesseract, english, start
# warnings about "invalid resolution 0 dpi"
# now the whole program is locked up because of 26 invalid resolution
# popups that popped up before I could select don't show this message again,
# and some of them I can't get rid of.
# apparently, I can only close them in a specific order, and if you click
# near the edge you can end up with the one on top being out of order.
# so, now, after an hour of work, I can't save my results.
# fortunately, it has crash session recovery, so after killing the process
# using the kill command, I can resume
# file save
# all, pdf,
# date: 2020-01-11
# title FT-20R Instruction Manual
# Author Yaesu Musen Co, scanned, cleaned up
# saved as ...step8.pdf
# opened in okular, test copied first paragraph to an editor. Ugly
# formatting but it worked.
# pages are still out of order.

pdftk A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output
FT-208R_user_step9.pdf

Some error messages:

ERROR - Warning running unpaper: out of deviation range - NO ROTATING

ERROR - Error running unpaper: [image2 @ 0x555ed97d3500] Using
AVStream.codec to pass codec parameters to muxers is deprecated, use
AVStream.codecpar instead.
[image2 @ 0x555ed97d3500] Encoder did not produce proper pts, making some
up.
[image2 @ 0x555ed97d32e0] Using AVStream.codec to pass codec parameters
to muxers is deprecated, use AVStream.codecpar instead.

ERROR - Error processing with tesseract: Warning. Invalid resolution 0
dpi. Using 70 instead.
Estimating resolution as 608

ERROR - Error processing with tesseract: Warning. Invalid resolution 0
dpi. Using 70 instead.
Estimating resolution as 294
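
The messages above differ from page to page only in pointer addresses and numbers, which is why they pile up once per page. A minimal sketch of grouping such messages so each distinct problem is reported once with a count; this is an illustration only, not a description of how gscan2pdf combines its messages, and the function name `summarize` is made up.

```python
import re
from collections import Counter


def summarize(messages):
    """Count messages that are identical after masking addresses and numbers."""
    counts = Counter()
    for message in messages:
        # Mask pointer values such as 0x555ed97d3500.
        key = re.sub(r"0x[0-9a-fA-F]+", "0x?", message)
        # Mask standalone numbers such as the estimated resolution.
        key = re.sub(r"\b\d+\b", "N", key)
        counts[key] += 1
    return counts


messages = [
    "Warning. Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 608",
    "Warning. Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 294",
    "[image2 @ 0x555ed97d3500] Encoder did not produce proper pts, making some up.",
    "[image2 @ 0x555ed97d32e0] Encoder did not produce proper pts, making some up.",
]
for text, count in summarize(messages).items():
    print(f"{count} x {text}")
```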



Unfortunately, all the errors for tesseract (I didn't hit ignore)
made a window larger than the screen that you cannot scroll and you lose
access to the control to close the window. For the average user, the
program is now hung. I was able to recover by using ALT-F7 to pan the
window to expose the close. However, if there had been more pages/errors
I would have lost the main window entirely off the edge of the screen as it
moves too when using ALT-F7. So, the worst bug of the previous version
still exists but in a very different form.

I've just fixed this by putting the messages in a scrolled window. The scrollbars show automatically if the size of the window would exceed what the user has resized the parent window to. You'll see this in the next release.

