PDF processing with 2.1
Brought to you by:
igor_filippov
Version 2.1 behaves strangely on PDFs:
a lot of false positives compared to 2.0
reporting to a text file is broken; image coordinates are always: "-2147483648x-2147483648--2147483648x-2147483648"
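For reference, -2147483648 is the minimum value of a signed 32-bit integer (-2^31), so it looks like a sentinel or overflow value rather than a real position. Below is a minimal sketch for flagging such rows when checking the report. It assumes the coordinate field has the form `x1xy1-x2xy2`, inferred only from the string quoted above. The helper names `parse_box` and `is_bogus_box` are hypothetical and not part of OSRA.

```python
import re

INT32_MIN = -2**31  # -2147483648, the value seen in the broken report

# Assumed field layout: "x1xy1-x2xy2", where each number may be negative.
_BOX_RE = re.compile(r"^(-?\d+)x(-?\d+)-(-?\d+)x(-?\d+)$")


def parse_box(field):
    """Return (x1, y1, x2, y2) as ints, or None if the field does not match."""
    m = _BOX_RE.match(field.strip())
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def is_bogus_box(field):
    """True if the field cannot be parsed or contains the INT32_MIN value."""
    box = parse_box(field)
    return box is None or INT32_MIN in box
```

Running `is_bogus_box` over the coordinate column of test21.csv and test20.csv would show whether every row is affected or only some of them.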
How to reproduce:
osra --learn -i -u 2 -c -e -p -g -f smi -o test -w test21.csv annurev.pdf
test21.csv - output of version 2.1
test20.csv - output of version 2.0
annurev.pdf is attached as well
--learn is a debug output option; it should not be used to obtain structures.
Also -i and -u 2 should only be used if you know you really need them.
I suggest the following command line options:
osra -f sdf -w output.sdf input.pdf
Thank you for such a detailed answer!
I tried:
"osra -c -f sdf -w test.sdf -o test annurev.pdf"
I will take a look at the bounding box problem.
Fixed in r1036.
Please note that the coordinates for PDF (and PS) files depend on the resolution at which the rasterized image is rendered. The resolution used by OSRA may not be the same as the resolution used by a user's PDF viewer, so the printed out coordinates will not correspond to the on-screen locations.
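To map the reported coordinates onto a page rendered at a different resolution, they can be rescaled by the ratio of the two resolutions. The sketch below assumes that the resolution OSRA used for the page is known, that both renderings share the same origin, and that scaling is uniform. None of this is stated in the thread, and the function names and DPI values are hypothetical.

```python
def scale_box(box, from_dpi, to_dpi):
    """Rescale (x1, y1, x2, y2) pixel coordinates between rendering resolutions."""
    factor = to_dpi / from_dpi
    return tuple(round(v * factor) for v in box)


def box_to_points(box, dpi):
    """Convert pixel coordinates to PDF points (1 point = 1/72 inch)."""
    return tuple(v * 72.0 / dpi for v in box)


# Example: a box found on a 300 dpi rasterization, shown in a 150 dpi viewer.
osra_box = (300, 600, 900, 1200)      # hypothetical values
viewer_box = scale_box(osra_box, from_dpi=300, to_dpi=150)
```

Converting to points with `box_to_points` gives resolution-independent values, which may be easier to compare across viewers.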
Thanks a lot, Igor!
Yes, I see that I need to explore further to make use of these coordinates.
I can't compile src/osra_lib.cpp after [r1036]:
in lines 363 and 364, the namespace qualifier 'std::' should be added before 'cout' and 'endl'
Related
Commit: [r1036]
Oops, my bad.
Added missing namespace in 1038.