PDF processing with 2.1

Brought to you by: igor_filippov

#14 PDF processing with 2.1

Milestone: v1.3.7

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2018-04-26

Created: 2018-02-23

Creator: Vsevolod Eremenko

Private: No

Version 2.1 behaves strangely on PDFs:
many more false positives compared to 2.0
reporting to a text file is broken; image coordinates are always: "-2147483648x-2147483648--2147483648x-2147483648"

How to reproduce:
osra --learn -i -u 2 -c -e -p -g -f smi -o test -w test21.csv annurev.pdf

test21.csv - output of version 2.1
test20.csv - output of version 2.0
annurev.pdf is attached as well

1 Attachments

test20.csv

Discussion

Vsevolod Eremenko - 2018-02-23

Attachments (continued)

annurev.pdf

test21.csv

Igor - 2018-02-23

--learn is a debug output option; it should not be used to obtain structures.
Also, -i and -u 2 should only be used if you know you really need them.

Igor - 2018-02-23

I suggest the following command line options:
osra -f sdf -w output.sdf input.pdf

Igor - 2018-02-23

status: open --> wont-fix

Vsevolod Eremenko - 2018-02-23

Thank you for such a detailed answer!

I tried:
"osra -c -f sdf -w test.sdf -o test annurev.pdf"

Indeed, structure detection is improved: with 2.1 I get more true positives and a comparable number of false positives.

But the '-c' option is not working for me in 2.1. It always produces "-2147483648x-2147483648--2147483648x-2147483648".
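
For anyone post-processing the '-c' output, here is a minimal Python sketch for spotting the broken boxes. It assumes the box layout is `<x1>x<y1>-<x2>x<y2>`, which is only inferred from the string quoted above, and it treats -2147483648 (the smallest 32-bit signed integer) as a marker for an invalid box. The name `parse_box` is made up for this example and is not part of OSRA.

```python
import re

# Smallest 32-bit signed integer: the value that fills the broken 2.1 output.
INT32_MIN = -2**31  # -2147483648

# Assumed layout, inferred from the reported string: <x1>x<y1>-<x2>x<y2>
BOX_RE = re.compile(r"^(-?\d+)x(-?\d+)-(-?\d+)x(-?\d+)$")


def parse_box(text):
    """Return (x1, y1, x2, y2) as ints, or None if the box is unusable.

    A box is rejected when it does not match the assumed layout or when
    any coordinate equals INT32_MIN.
    """
    match = BOX_RE.match(text.strip())
    if match is None:
        return None
    coords = tuple(int(group) for group in match.groups())
    if INT32_MIN in coords:
        return None
    return coords
```

Calling `parse_box` on the string from this report returns `None`, while a well-formed box such as "10x20-110x220" comes back as a tuple of four integers.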

Igor - 2018-02-24

status: wont-fix --> accepted

Igor - 2018-02-24

I will take a look at the bounding box problem.

Igor - 2018-02-27

Fixed in r1036.
Please note that the coordinates for PDF (and PS) files depend on the resolution at which the rasterized image is rendered. The resolution used by OSRA may not be the same as the resolution used by a user's PDF viewer, so the printed out coordinates will not correspond to the on-screen locations.
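
To illustrate the resolution dependence, here is a rough Python sketch that converts a pixel box into PDF points (1 point = 1/72 inch). The rendering resolution is not stated in this thread, so `dpi` has to be supplied by the caller. The optional y-flip assumes the raster origin is at the top left and the PDF origin at the bottom left, with no page rotation or crop offset. `pixels_to_points` is a hypothetical helper, not something OSRA provides.

```python
def pixels_to_points(box, dpi, page_height_pt=None):
    """Convert a pixel box (x1, y1, x2, y2) into PDF points.

    dpi is the resolution the page was rasterized at (an input the caller
    must know; it is not given in this thread).
    If page_height_pt is given, the y axis is flipped so the result uses a
    bottom-left origin instead of the raster's top-left origin (assumption:
    unrotated page, no crop box offset).
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # 72 points per inch, dpi pixels per inch.
    x1, y1, x2, y2 = (value * 72.0 / dpi for value in box)
    if page_height_pt is not None:
        # Flipping swaps which edge is the lower one.
        y1, y2 = page_height_pt - y2, page_height_pt - y1
    return (x1, y1, x2, y2)
```

The same pixel box maps to different point coordinates at 150 dpi than at 300 dpi, which is why the printed numbers cannot be compared directly with positions in a PDF viewer.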

Igor - 2018-02-27

status: accepted --> closed-fixed

Vsevolod Eremenko - 2018-03-06

Thanks a lot, Igor!

Yes, I see I need to explore further to make use of these coordinates.

Vsevolod Eremenko - 2018-04-25

I can't compile src/osra_lib.cpp after [r1036]:
in lines 363 and 364, the namespace 'std::' should be added before 'cout' and 'endl'

Related

Commit: [r1036]

Igor - 2018-04-26

Oops, my bad.
Added the missing namespace in r1038.
