#14 PDF processing with 2.1

Milestone: v1.3.7
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2018-04-26
Created: 2018-02-23
Private: No

Version 2.1 behaves strangely on PDFs:
it produces many more false positives compared to 2.0
reporting to a text file is broken; the image coordinates are always "-2147483648x-2147483648--2147483648x-2147483648"

How to reproduce:
osra --learn -i -u 2 -c -e -p -g -f smi -o test -w test21.csv annurev.pdf

test21.csv - output of version 2.1
test20.csv - output of version 2.0
annurev.pdf is attached as well

Related

Bugs: #16

Discussion

  • Vsevolod Eremenko

    Attachments (continued)

     
  • Igor

    Igor - 2018-02-23

    --learn is a debug output option; it should not be used to obtain structures.
    Also, -i and -u 2 should only be used if you know you really need them.

     
  • Igor

    Igor - 2018-02-23

    I suggest the following command line options:
    osra -f sdf -w output.sdf input.pdf

     
  • Igor

    Igor - 2018-02-23
    • status: open --> wont-fix
     
  • Vsevolod Eremenko

    Thank you for such a detailed answer!

    I tried:
    "osra -c -f sdf -w test.sdf -o test annurev.pdf"

    1. Indeed, structure detection is improved: with 2.1 I get more true positives and a comparable number of false positives.
    2. But the '-c' option is not working for me in 2.1. It always produces "-2147483648x-2147483648--2147483648x-2147483648".
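
The reported value -2147483648 is the minimum 32-bit signed integer (-2^31), which usually points to an unset or overflowed value rather than a real coordinate. As a minimal sketch, the following Python helper parses a bounding-box string of the shape shown above and rejects boxes that contain this value. The function name `parse_box` and the reading of the four numbers as (x1, y1, x2, y2) are assumptions made for illustration; they are not taken from OSRA's documentation.

```python
import re

# Minimum 32-bit signed integer, the value seen in the broken output.
INT32_MIN = -2**31

# Assumed layout: "<x1>x<y1>-<x2>x<y2>", each number optionally negative.
_BOX_RE = re.compile(r"^(-?\d+)x(-?\d+)-(-?\d+)x(-?\d+)$")


def parse_box(text):
    """Return (x1, y1, x2, y2), or None for a malformed or sentinel box."""
    match = _BOX_RE.match(text.strip())
    if match is None:
        return None
    coords = tuple(int(group) for group in match.groups())
    # Treat any INT32_MIN component as "no valid bounding box".
    if any(value == INT32_MIN for value in coords):
        return None
    return coords
```

A check like this can be used to filter the rows of a results file before the coordinates are used for anything else.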
     
  • Igor

    Igor - 2018-02-24
    • status: wont-fix --> accepted
     
  • Igor

    Igor - 2018-02-24

    I will take a look at the bounding box problem.

     
  • Igor

    Igor - 2018-02-27

    Fixed in r1036.
    Please note that the coordinates for PDF (and PS) files depend on the resolution at which the rasterized image is rendered. The resolution used by OSRA may not be the same as the resolution used by a user's PDF viewer, so the printed out coordinates will not correspond to the on-screen locations.
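
To illustrate the resolution dependence, here is a minimal sketch of rescaling a bounding box from one rendering resolution to another. The function name `rescale_box` and the example resolutions are hypothetical; the resolution OSRA actually used for a given file is not stated in this thread and has to be determined separately. The sketch also assumes both coordinate systems share the same origin and axis directions, which may not hold for a particular PDF viewer.

```python
def rescale_box(box, source_dpi, target_dpi):
    """Scale (x1, y1, x2, y2) pixel coordinates from source_dpi to target_dpi.

    Assumes both renderings share the same origin and axis directions.
    """
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("resolutions must be positive")
    # Multiply before dividing to keep exact results for integer inputs
    # whenever the division comes out even.
    return tuple(round(value * target_dpi / source_dpi) for value in box)


# Hypothetical example: a box found on a 300 dpi rendering, expressed
# at 72 units per inch (the size of a PDF point).
example = rescale_box((300, 600, 900, 1200), source_dpi=300, target_dpi=72)
```

With these example numbers the box shrinks by a factor of 72/300, so a structure located at pixel 300 in the rasterized image sits one inch (72 units) from the origin.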

     
  • Igor

    Igor - 2018-02-27
    • status: accepted --> closed-fixed
     
  • Vsevolod Eremenko

    Thanks a lot, Igor!

    Yes, I see that I need to explore further to make use of these coordinates.

     
  • Vsevolod Eremenko

    I can't compile src/osra_lib.cpp after [r1036]:
    in lines 363 and 364, the namespace qualifier 'std::' should be added before 'cout' and 'endl'.

     

    Related

    Commit: [r1036]

  • Igor

    Igor - 2018-04-26

    Oops, my bad.
    Added the missing namespace in r1038.

     
