#2 spacing and line break problems

Milestone: v1.0 (example)
Status: wont-fix
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-07-09
Created: 2013-08-29
Creator: Dowcet
Private: No

I ran pdfsandwich with default options on a basic test file: http://www.watchocr.com/files/WatchOCRTestDoc.pdf

I am attaching the resulting PDF, and also pasting the first paragraph of text below. The words seem to be identified correctly but the spacing and line breaks are quite wrong. When the text is selected in the PDF, it becomes an unreadable jumble, apparently because the font size becomes too large.

We the
People
ofthe
United
States,
inOrder
toform
amore
perfect
Union,
establish
Justice,
insure
domestic
Tranquility,
provide
for thecommon
defence,
promote
thegeneral
Welfare,
andsecure
theBlessings
of Liberty
to ourselves and our Posterity, do ordain and establish this Constitution for the United Statesof America.
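
The observation that the words are right and only the spacing is off can be checked by comparing the pasted text with the expected wording while ignoring all whitespace. The following is a minimal sketch: the reference string is assumed to be the preamble text that the pasted extract evidently corresponds to, and the helper names are made up for illustration.

```python
import re

# Assumed reference wording of the first paragraph of the test document.
REFERENCE = (
    "We the People of the United States, in Order to form a more perfect "
    "Union, establish Justice, insure domestic Tranquility, provide for the "
    "common defence, promote the general Welfare, and secure the Blessings "
    "of Liberty to ourselves and our Posterity, do ordain and establish this "
    "Constitution for the United States of America."
)

# Text as selected and copied from the generated PDF (as pasted in this report).
EXTRACTED = """We the
People
ofthe
United
States,
inOrder
toform
amore
perfect
Union,
establish
Justice,
insure
domestic
Tranquility,
provide
for thecommon
defence,
promote
thegeneral
Welfare,
andsecure
theBlessings
of Liberty
to ourselves and our Posterity, do ordain and establish this Constitution for the United Statesof America.
"""


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, line breaks)."""
    return re.sub(r"\s+", "", text)


def same_characters(a: str, b: str) -> bool:
    """True if both texts contain the same characters in the same order,
    disregarding whitespace."""
    return strip_whitespace(a) == strip_whitespace(b)


# Characters agree once whitespace is ignored ...
print(same_characters(REFERENCE, EXTRACTED))
# ... but the word boundaries do not.
print(REFERENCE.split() == EXTRACTED.split())
```

If the first check passes and the second fails, the recognized characters are fine and only spaces and line breaks were placed differently.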

1 Attachments

Discussion

  • Tobias Elze

    Tobias Elze - 2013-08-30

    Could you please provide version information? For instance, could you post the output of

    pdfsandwich -version
    tesseract --version
    hocr2pdf -h

    Thanks, Tobias
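
For convenience, the three commands above can be run in one go and their output collected for pasting into a report. This is only a sketch: the function name and the report format are made up, and it assumes the tools are on the PATH (missing ones are reported instead of raising an error).

```python
import shutil
import subprocess

# The three commands requested above.
VERSION_COMMANDS = [
    ["pdfsandwich", "-version"],
    ["tesseract", "--version"],
    ["hocr2pdf", "-h"],
]


def collect_versions(commands=VERSION_COMMANDS):
    """Run each command and return {program name: combined stdout/stderr}."""
    report = {}
    for cmd in commands:
        name = cmd[0]
        if shutil.which(name) is None:
            report[name] = "(not found on PATH)"
            continue
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # some tools print version info to stderr
                text=True,
                check=False,  # a usage/help call may exit with a non-zero status
                timeout=30,
            )
            report[name] = result.stdout.strip()
        except subprocess.TimeoutExpired:
            report[name] = "(timed out)"
    return report


if __name__ == "__main__":
    for tool, output in collect_versions().items():
        print(f"== {tool} ==\n{output}\n")
```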

     
  • Dowcet

    Dowcet - 2013-08-30

    pdfsandwich version 0.0.8

    tesseract 3.02.01
    leptonica-1.69
    libgif 4.1.6 : libjpeg 8b : libpng 1.2.49 : libtiff 4.0.2 : zlib 1.2.7

    hOCR to PDF converter, version 0.8.5

  • Tobias Elze

    Tobias Elze - 2013-08-30

    Thanks, I'll look into this. As a workaround for now, does the -sloppy_text option help? That is, try calling

    pdfsandwich -sloppy_text input_file.pdf

     
  • Dowcet

    Dowcet - 2013-08-30

    The result with the "-sloppy_text" option is about the same. By the way, the output includes the following pair of warnings, repeated four times (presumably once for each page).

    Warning: Image x/y resolution not set, defaulting to: 300
    Warning: unclosed tag: '?xml'

    • Tobias Elze

      Tobias Elze - 2013-09-01

      Warning: Image x/y resolution not set, defaulting to: 300
      Warning: unclosed tag: '?xml'

      You can ignore the first warning; it has nothing to do with the problems you describe above. In the next version of pdfsandwich, this warning will no longer appear.

      The second warning is produced by hocr2pdf. I don't know why it appears, but the program does not stop working, so it can probably be ignored as well.

       
  • Tobias Elze

    Tobias Elze - 2013-09-01

    I investigated the problems you describe. They have nothing to do with pdfsandwich itself. Tesseract gives nearly perfect output for your attached file, and hocr2pdf uses this output to generate the "sandwich PDF". However, it would be too simple to blame hocr2pdf here: paradoxically, text extraction from PDF files depends strongly on the PDF reader used to display the file.

    If I open the PDF in Okular and select the first paragraph, I get:

    We the People
    of the United
    States,
    in Order
    to form
    a more
    perfect
    Union,
    establish
    Justice,
    insure
    domesTranquility,
    provide for the common
    defence,
    promote
    the general
    Welfare, and secure
    the Blessings
    of Liberty
    to ourselves and our Posterity, do ordain and establish this Constitution for the United Statesof America.

    The same in zathura looks like this:

    We thePeople
    oftheUnited
    States,
    inOrder
    toform
    amore
    perfect
    Union,
    establish
    Justice,
    insure
    domestic
    Tranquility,
    providefor thecommon
    defence,
    promote
    thegeneral
    Welfare,andsecure
    theBlessings
    of Liberty

    Now the same in acroread 9.5.5:

    ThCeo nstiftoutrth ioUen n itSetda tes
    We theP eopolfet h eU niteSdt ateinsO , rdetorf ormam orpee rfeUctn ioens, tablJisuhs ticines, udreo mestic
    Tranquilityp, rovidefo r thec ommodne fencep,r omoteth eg eneraWl elfarea, nds ecurteh eB lessingosf Liberty
    to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

    The last one looks quite disastrous. However, all three were copied and pasted from exactly the same output file, just opened with different PDF readers. Interestingly, full-text search still works quite well in all three readers: if you search for a word, it is usually found at the right position in the file.
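
One plausible reason for such differences is that the text layer of a PDF consists of positioned fragments rather than a running string, so each reader has to guess where spaces and line breaks belong. The toy model below is not the algorithm of Okular, zathura, or acroread; the fragment coordinates, the function name, and the thresholds are all invented for illustration. It only shows that the same positioned fragments can yield different text when the gap threshold for inserting a space changes.

```python
from typing import List, Tuple

# (text, x_start, x_end) for fragments on one line; all coordinates are made up.
Fragment = Tuple[str, float, float]

LINE: List[Fragment] = [
    ("of", 0.0, 10.0),
    ("the", 12.0, 27.0),      # gap of 2.0 to the previous fragment
    ("United", 33.0, 63.0),   # gap of 6.0 to the previous fragment
]


def join_fragments(fragments: List[Fragment], space_gap: float) -> str:
    """Join fragments from left to right, inserting a space whenever the
    horizontal gap to the previous fragment is at least `space_gap`."""
    pieces = []
    prev_end = None
    for text, x_start, x_end in sorted(fragments, key=lambda f: f[1]):
        if prev_end is not None and x_start - prev_end >= space_gap:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


# Same fragments, three hypothetical readers with different thresholds.
print(join_fragments(LINE, 1.5))  # of the United
print(join_fragments(LINE, 4.0))  # ofthe United
print(join_fragments(LINE, 8.0))  # oftheUnited
```

If the text layer is written with a font size that is too large, the gaps between fragments shrink or become negative, which would push a reader with a fixed threshold toward the merged variants.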

    I don't know how we can deal with this, but I'm afraid this is beyond the scope of pdfsandwich.

    Which pdf reader did you use, and could you try it with different readers as well and see if you can confirm my findings?

    Tobias

     
  • Dowcet

    Dowcet - 2013-09-04

    The original output I pasted was from Evince.

    I'm just amazed how difficult it is to get a decent OCR sandwiched PDF without switching to Windows and using Adobe software... Yours is the only free software I know of that even attempts to do this. Thanks for your efforts!

  • Tobias Elze

    Tobias Elze - 2015-07-09
    • status: open --> wont-fix
     
