#340 single file exposes over a dozen bugs

Milestone: v1.0_(example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2020-03-09
Created: 2020-01-12
Creator: Mark Whitis
Private: No

A single PDF file exposes many bugs. You should add this file to your test suite.

Two different workflows were used, one with gscan2pdf at the beginning of the workflow, and another with it at the end.
Characteristics of the file:
* image is negated and shown inverted
* image is mirrored and shown mirrored
* all the pages are rotated, in two different directions
* two pages per sheet
* looks like a staple-bound manual was unstitched and scanned by another person?
* scan was manually duplexed, so the sheets are in the wrong order
* manual duplexing resulted in half the pages being in reverse order
Some of these are probably unusual, but some are very common, such as those that result from unstitching and manual duplexing.
The goal was to produce a corrected file where the pages were 1-up, right side up, in the right order, and had text sandwiched behind so that these work: search, copy and paste, screen reader (the person who owns the radio is legally blind).

Bugs include:
1. Two error messages from unpaper (quoted under "Some error messages" below).
2. Unpaper has to be run 4 times on the same file because the program stops and leaves a dormant unpaper process.
3. Error message popups from tesseract cause the program to effectively crash. You can't dismiss all the popups, no matter how many times you click close.
4. When you move the window, one set of popups moves with the window and another doesn't.
5. You can't select multiple pages without losing your place.
6. You can't preview the pages while you are selecting them (so you can tell which are upside down, for example). This compounds the problem from the next bug.
7. You can't auto-select ODD PAIRS and EVEN PAIRS, which you will need to do after converting 2-up pages to 1-up.
8. Some of the filters, such as negate, are undone without consent after you run other filters.
9. Sometimes, when you select "Double" on Clean Up, you can't select 2 output pages and left to right because they stay greyed out; at other times you can.
10. If you have 2-up pages that have been virtually cropped by other software to make 1-up pages, they show up as 2-up pages in gscan2pdf. So each page is duplicated, which will screw up the OCR, etc.
11. Double-negated pages show up as negated in preview (and probably in processing as well).
12. It appears that if you use the built-in filters instead of mogrify as a user-defined filter, you may end up with a low-quality file with negated, mirrored, or rotated bitmaps embedded, which will cause trouble later.
13. Formatting of OCR'ed paragraphs is horrible when copying and pasting.
14. If you run Clean Up on this file and select "Single", you end up with a bunch of pages that are blank except for the binding shadow.
15. Could not recollate the file to compensate for the oddball page ordering, with interleaved pages, half of them reversed, etc. You need to be able to handle any page order that can result from running mpage or similar N-up software, and other order and orientation aberrations that occur when scanning.

gscan2pdf 2.1.0
Ubuntu 18.04.3 LTS

sudo apt-get install pdfsam
sudo add-apt-repository ppa:malteworld/ppa
sudo apt-get install pdftk
sudo apt-get install pdfmod pdfsam qpdf pdfshuffler
sudo apt-get install mupdf mupdf-tools
sudo apt-get install gscan2pdf unpaper
sudo apt-get install pdfsandwich
pip install pypdfocr
sudo apt-get install ocrfeeder
sudo apt install ocrmypdf

wget http://www.radiomanual.info/schemi/YAESU_VU/FT-208R_user.pdf
mv FT-208R_user.pdf FT-208R_user_bad.pdf

Tried gscan2pdf first; it didn't look like it would work, so I tried another approach.

# This command-line workflow fails at the OCR stage.
# Could probably make it work by bursting the PDF to individual images and
# cleaning those up with mogrify first. I tried another workflow with gscan2pdf.
# Surprise: x and y on poster are based on raw page rotation
# instead of viewed page rotation, so we had to cut in Y instead of X.
# Note that poster just virtually crops the page; the other half is still
# there, which causes trouble later in gscan2pdf.
mutool poster -y 2 FT-208R_user_step1.pdf FT-208R_user_step2.pdf

Note: pdftk, with its odd and even ranges, could have done the next step without listing all the actual page numbers.

mutool merge -o FT-208R_user_step3.pdf FT-208R_user_step2.pdf 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,52,50,48,46,44,42,40,38,36,34,32,30,28,26,24,22,20,18,16,14,12,10,8,6,4,2
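The long page list in the merge command does not have to be typed by hand. As a minimal sketch (the function name `page_order` is made up for this example, and an even page count is assumed), the same "odd pages ascending, then even pages descending" list can be generated with seq:

```shell
# page_order N: print "1,3,...,N-1,N,N-2,...,2" for an even page count N.
# Hypothetical helper for illustration; not part of mutool or gscan2pdf.
page_order() {
    local n=$1
    { seq 1 2 "$((n - 1))"; seq "$n" -2 2; } | paste -sd, -
}

# For the 52-page file this reproduces the list used in the merge command.
page_order 52
```

The result could then be passed in place of the literal list, e.g. `mutool merge -o FT-208R_user_step3.pdf FT-208R_user_step2.pdf "$(page_order 52)"`.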
# think this next one did nothing
pdftk FT-208R_user_step3.pdf rotate evensouth output FT-208R_user_step4.pdf
# this did the work of rotating every other page upside down
pdftk FT-208R_user_step3.pdf rotate 1-52evendown output FT-208R_user_step

# pypdfocr no longer supported

mutool clean FT-208R_user_step4.pdf FT-208R_user_step5.pdf
time ocrmypdf --force-ocr FT-208R_user_step5.pdf FT-208R_user_step6.pdf
# ^ errors: no visible pages

time pdfsandwich FT-208R_user_step6.pdf
# ^ pdfsandwich failed, too
apt-cache search ocrfeeder

man ocrfeeder
ocrfeeder FT-208R_user_step5.pdf &
# ocrfeeder nearly crashed system?

ocrfeeder made blank pages with no text.

Another attempt, in gscan2pdf:

Run gscan2pdf.

Open FT-208R_user_bad.pdf, import all pages.

The next steps were skipped because the first time they were used, they seemed to change only how the pages were displayed rather than the actual pixels, which produced problems later.

skip: edit -> select -> all

skip: view -> rotate 90 clockwise

Note: it might be better to use a user-defined mogrify command to change the actual image rather than the view.

The image is negated.

skip: tools -> negate -> all

The image is mirrored left to right.

skip: preferences -> general -> user defined tools, add "mogrify -flop %i"

Do all the page transforms in one:

preferences -> general -> user defined tools, add "mogrify -flop -rotate 270 -negate %i"

tools -> user defined -> all pages -> mogrify

skip: need to run negate again

edit -> select -> all

tools -> clean up

page range: all, layout: double

output pages: 2, writing system: left to right, ok

This runs unpaper. Warning: "out of deviation range - NO ROTATING".

"Using AVStream.codec to pass codec parameters to muxers is deprecated"

Select "don't show this error again", as it repeats for each page.

unpaper runs slowly.

Had to restart (selecting the unprocessed pages only) because it failed halfway through.

Had to restart a second time.

edit -> renumber, all, 1, 1

Using ctrl-shift, select the upside-down pages only (i.e. skip two, select two, repeat).

Bug: when you select a page, the thumbnail list jumps back up to near the top each time, making selection increasingly tedious as the number of pages increases.
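To make the "skip two, select two" pattern concrete, here is a small sketch that lists the page numbers involved. It is only an illustration of the requested ODD PAIRS / EVEN PAIRS selection: the function name `pair_pages` and the convention that pair k holds pages 2k-1 and 2k are assumptions of this example, and per the report gscan2pdf offers no such selection.

```shell
# pair_pages odd|even N: list the pages in every odd- or even-numbered pair
# among pages 1..N, where pair k is assumed to hold pages 2k-1 and 2k.
# Hypothetical helper for illustration only.
pair_pages() {
    local want=0
    [ "$1" = odd ] && want=1
    seq 1 "$2" | awk -v want="$want" '(int(($1 + 1) / 2) % 2) == want' | paste -sd, -
}

pair_pages even 12   # the "skip two, select two" pages
pair_pages odd 12    # the remaining pages
```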

edit -> preferences -> general -> user defined tools, add:

"mogrify -rotate 180 %i"

tools -> user defined -> page range: selected -> mogrify -rotate 180 %i

tools -> ocr, all, tesseract, english, start

Warnings about "Invalid resolution 0 dpi".

Now the whole program is locked up because of 26 invalid-resolution popups that appeared before I could select "don't show this message again", and some of them I can't get rid of.

Apparently, I can only close them in a specific order, and if you click near the edge you can end up with the one on top being out of order.

So now, after an hour of work, I can't save my results.

Fortunately, it has crashed-session recovery, so after killing the process with the kill command, I can resume.

file -> save

all, pdf

date: 2020-01-11

title: FT-20R Instruction Manual

author: Yaesu Musen Co, scanned, cleaned up

saved as ...step8.pdf

Opened in okular; test-copied the first paragraph to an editor. Ugly formatting, but it worked.

Pages are still out of order.

pdftk A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output FT-208R_user_step9.pdf

Some error messages:

ERROR - Warning running unpaper: out of deviation range - NO ROTATING

ERROR - Error running unpaper: [image2 @ 0x555ed97d3500] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x555ed97d3500] Encoder did not produce proper pts, making some up.
[image2 @ 0x555ed97d32e0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.

ERROR - Error processing with tesseract: Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 608

ERROR - Error processing with tesseract: Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 294

1 Attachments

Discussion

  • Jeffrey Ratcliffe

    Thanks for the report. v2.1.0 was released in April 2018 and many bugs have been fixed since then. Please upgrade to v2.6.3 via the PPA.

     
    • Mark Whitis

      Mark Whitis - 2020-01-13

      I did upgrade to gscan2pdf 2.6.3. However, I will point out that I am
      running the latest LTS version of Ubuntu. If there were major
      improvements in usability, the program should have been pushed to the main
      repository. It did go smoother but most of the bugs still exist. Some
      weren't tested again because they had been eliminated from the workflow.
      A few others were noticed. Again, you should add this file to your test
      corpus.

      1. First time I ran it, it stopped importing after the first page.
        There was probably a second copy of gscan2pdf running at the time. Second
        time it imported ok.
      2. No obvious way to stop a filter that is obviously doing the wrong
        thing. There is no stop button. Undo didn't work; it reverted the pages
        that had been processed already but new ones continued to be altered. Had
        to kill the program. The filter was a user-defined mogrify.
      3. User defined filters: If you edit a filter, it still shows up as
        the old text on the pull down menu when you try to run it.
      4. Clean up: processing is slow. 5 cores are idle. You could easily
        do one page per cpu core, assuming your code is reentrant. Replace a for
        loop with a driver function. Mogrify itself does use multiple CPUs, at
        least on some operations. Tesseract seems to be using multiple CPUs
        although most are running considerably lower than 100%. Should be used
        where appropriate. You probably don't want to run 6 copies of tesseract
        but do need to run 6 copies of unpaper.
      5. the select ODD PAIRS and EVEN pairs is still missing
      6. selecting multiple pages with ctrl-click is still busted, it still
        jumps to the top each time
      7. GOOD: Error messages were combined, which is good; but still broken,
        see next item.
      8. Unfortunately, all the errors for tesseract (I didn't hit ignore)
        made a window larger than the screen that you cannot scroll and you lose
        access to the control to close the window. For the average user, the
        program is now hung. I was able to recover by using ALT-F7 to pan the
        window to expose the close. However, if there had been more pages/errors
        I would have lost the main window entirely off the edge of the screen as it
        moves too when using ALT-F7. So, the worst bug of the previous version
        still exists but in a very different form.
      9. It remembered the subject from the previous session/program version
        but author and title are now mickey mouse. Putting bogus data in fields
        that end up on public search engines is not a good idea. Should have left
        those blank if you didn't have anything useful.
      10. The copied text still is badly formatted. See below
      11. Default save file name is based on author, not title.
      12. Default save file name is not detoxified. It contains spaces,
        commas, and other inappropriate characters. see detox(1).
      13. The same 3 error messages from unpaper and tesseract are still being
        generated.
      14. GOOD: I didn't need to restart Clean Up (unpaper) 4 times.
      15. double negated pages still show up as negated in preview
      16. still have to shuffle the pages using: pdftk
        A=FT-208R_user_step8.pdf cat A1-endeven Aend-1odd output
        FT-208R_user_step9.pdf
      17. Some of the bugs weren't tested because the more recent workflow was
        set up to avoid them.
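The "one page per CPU core" idea from item 4 can be sketched outside gscan2pdf with xargs. This is only an illustration of the suggestion, not how gscan2pdf schedules its jobs: `each_page` is a made-up helper name, and `echo` stands in for the real per-page command (a real unpaper run would need a small wrapper that supplies both an input and an output file name).

```shell
# each_page CMD [ARGS...]: read one file name per line on stdin and run
# "CMD ARGS file" for each, with up to one job per online CPU core.
# Hypothetical helper for illustration; not part of gscan2pdf.
each_page() {
    tr '\n' '\0' | xargs -0 -n 1 -P "$(getconf _NPROCESSORS_ONLN)" "$@"
}

# Demo with a harmless stand-in command; the output order may vary
# because the jobs run concurrently.
printf '%s\n' page1.pnm page2.pnm page3.pnm | each_page echo cleaning
```

Capping the job count at the number of cores suits a single-threaded tool; for a program that already uses several threads per page, a lower limit would be the safer choice.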

      Copied text

      The
      FT-208R
      is
      an
      all-new
      microprocessor-based
      2m
      FM
      transceiver
      for
      the
      demanding
      amateur
      operator.
      Featuring
      2.5
      watts
      of RF
      output, the
      FT-208R
      provides
      4 MHz (2 MHz) coverage
      in 5 kHz or
      10 kHz (12.5 kHz
      or
      25 kHz) steps,
      along
      with
      10
      memories
      for
      storage
      of
      favorite
      channels.
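Until the text layer itself is better laid out, text copied one word per line like the sample above can be rejoined after the fact. A minimal sketch, assuming the copied text arrives on stdin (`reflow` is a made-up name, and paragraph breaks are not preserved):

```shell
# reflow: join all whitespace-separated words on stdin into one line,
# separated by single spaces. Hypothetical helper for illustration.
reflow() {
    awk '{ for (i = 1; i <= NF; i++) printf "%s%s", (n++ ? " " : ""), $i }
         END { print "" }'
}

printf 'The\nFT-208R\nis\nan\nall-new\nmicroprocessor-based\n2m\nFM\ntransceiver\n' | reflow
```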

  • Jeffrey Ratcliffe

    If there were major improvements in usability, the program should
    have been pushed to the main repository

    Obviously, I've got no control over what Ubuntu pushes to LTS and what not. But generally, only security bugs and the like get fixed. The only programs to get bigger upgrades are those that are hard to patch, such as Firefox.

     
  • Jeffrey Ratcliffe

    9. It remembered the subject from the previous session/program version
      but author and title are now mickey mouse. Putting bogus data in fields
      that end up on public search engines is not a good idea. Should have left
      those blank if you didn't have anything useful.

    If you import a PDF or DjVu, gscan2pdf remembers the metadata. i.e. "MICKEY MOUSE" was the title and author in the PDF metadata. You can confirm this yourself with

    pdfinfo FT-208R_user.pdf
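pdfinfo prints one "Key: value" line per field, so the two fields in question can be picked out with a filter. A small sketch (`pdf_title_author` is a made-up name, and the sample input in the demo is illustrative, not captured output from this file):

```shell
# pdf_title_author: keep only the Title and Author lines of pdfinfo output
# read on stdin. Hypothetical helper for illustration.
pdf_title_author() {
    grep -E '^(Title|Author):'
}

# Real use (requires poppler-utils): pdfinfo FT-208R_user.pdf | pdf_title_author
# Demo on illustrative input:
printf 'Title:          MICKEY MOUSE\nAuthor:         MICKEY MOUSE\nPages:          26\n' | pdf_title_author
```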

     
  • Jeffrey Ratcliffe

    11. Default save file name is based on author, not title.

    You can change the default filename in Edit/Preferences.

     
  • Jeffrey Ratcliffe

    The problem with the images importing negated is due to pdfimages (part of the poppler library) extracting them that way. You can confirm this for yourself with:

    pdfimages FT-208R_user.pdf

     
  • Jeffrey Ratcliffe

    No obvious way to stop a filter that is obviously doing the wrong
    thng. There is no stop button. Undo didn't work; it reverted the pages
    that had been processed already but new ones continued to be altered. Had
    to kill the program. filter was user defined mogify

    The cancel button in the bottom right-hand corner worked for me.

     
  • Jeffrey Ratcliffe

    the select ODD PAIRS and EVEN pairs is still missing

    I'm not sure what you mean by this. In your example, the odd and even pages are rotated in opposite directions. I fixed this by selecting all odd pages with Edit/Select/Odd, and rotated them in one direction, before selecting all even pages with Edit/Select/Even, and rotating them in the opposite direction.

     
  • Jeffrey Ratcliffe

    Thanks for pointing out that when you select "Double" on Clean Up, you can't always select 2 output pages.

    I've just fixed this. You'll see the fix in the next release in the next couple of days.

     
  • Jeffrey Ratcliffe

    With 2.6.4, this all works for me:

    • import the PDF into gscan2pdf
    • Tools/Negate all pages
    • Set up user-defined tool "convert %i -flip %o" and run it on all pages
    • Edit/Select/Odd pages and rotate right (clockwise) 90°
    • Edit/Select/Even pages and rotate left (anticlockwise) 90°
    • Tools/Clean Up layout=double, output pages=2 all pages
    • The resulting page numbering and order is a mess, of course, but this can be fixed by clicking on the page number in the thumbnail and typing the required page number
    • OCR
    • Save

    The renumber step revealed a bug where no page was selected, and thus the thumbnails scrolled to the top every time. This is of course really annoying for a long list of pages and I've fixed it for the next release.

     
  • Jeffrey Ratcliffe

    Unfortunately, all the errors for tesseract (I didn't hit ignore)
    made a window larger than the screen that you cannot scroll and you lose
    access to the control to close the window. For the average user, the
    program is now hung. I was able to recover by using ALT-F7 to pan the
    window to expose the close. However, if there had been more pages/errors
    I would have lost the main window entirely off the edge of the screen as it
    moves too when using ALT-F7. So, the worst bug of the previous version
    still exists but in a very different form.

    I've just fixed this by putting the messages in a scrolled window. The scrollbars show automatically if the size of the window would exceed what the user has resized the parent window to. You'll see this in the next release.

     
  • Jeffrey Ratcliffe

    The problem with the message window was fixed in v2.6.5, which was released on Friday. This fixes all the remaining issues for me. Please give it a try and let me know how you got on.

     