#200 djvused extracts text layer w/artifacts

djvulibre
Status: wont-fix
Owner: nobody
Labels: utilities (27)
Priority: 5
Updated: 2013-06-30
Created: 2012-10-06
Creator: NBell
Private: No

In the attached file, text copied to MS Word is correct.
If the text layer is output and then set back with djvused, additional spaces appear when the text is copied to MS Word.

Discussion

  • Leon Bottou

    Leon Bottou - 2012-10-06

    I am not sure I understand the bug.

    However I am sure that none of these text layers are correct
    because they contain non-printable characters.
    Things like:

    (page .....
    (word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
    (char 1524 3875 1573 3930 "\t") <------- this is wrong. tab does not print anything: no bounding box should be given.
    .... )

    How did you produce these files in the first place?

    - L.
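
    To make the point above concrete, here is a minimal Python sketch (not part of djvused) that scans an output-txt dump for leaf zones whose string holds only whitespace or control characters, such as the tab quoted above. The names find_blank_zones and unescape and the regular expression are assumptions made for illustration; the pattern only covers single-line leaf zones of the form shown in the excerpt.

```python
import re

# Leaf zone as printed in the excerpt above: (type x1 y1 x2 y2 "text").
ZONE = re.compile(
    r'\(\s*(page|column|region|para|line|word|char)'
    r'\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)'
    r'\s+"((?:[^"\\]|\\.)*)"\s*\)'
)
ESCAPE = re.compile(r'\\([0-7]{1,3}|.)')
SIMPLE = {"t": "\t", "n": "\n", "r": "\r", "f": "\f", "v": "\v"}


def unescape(text):
    """Expand octal and single-letter backslash escapes into characters."""
    def repl(match):
        body = match.group(1)
        if body[0] in "01234567":
            return chr(int(body, 8))
        return SIMPLE.get(body, body)
    return ESCAPE.sub(repl, text)


def find_blank_zones(dump):
    """Return (zone type, bounding box) for zones with no visible text."""
    hits = []
    for match in ZONE.finditer(dump):
        kind = match.group(1)
        box = tuple(int(match.group(i)) for i in range(2, 6))
        text = unescape(match.group(6))
        # An empty string also counts: it prints nothing either.
        if all(ch.isspace() or not ch.isprintable() for ch in text):
            hits.append((kind, box))
    return hits


sample = r'''
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t")
'''
print(find_blank_zones(sample))
```

    On the two lines quoted above, only the char zone holding the tab is reported; the word zone is left alone.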

     
  • NBell

    NBell - 2012-10-06

    ABBYY FineReader 11.0.102.583 OCR -> save as DjVu -> 0000.djvu
    open 0000.djvu -> select text -> copy to MS Word XP -> the text is good
    djvuocr 2.4 b4: extract text from 0000.djvu -> txtfile
    djvuocr 2.4 b4: embed text from txtfile -> 0000.djvu
    open 0000.djvu -> select text -> copy to MS Word XP -> the text has additional spaces

    I like hand-coded DjVu, so I am testing how the text layer copies over after OCR.

     
  • NBell

    NBell - 2012-10-06

    I apologize for bothering you.
    The cause is the buggy djvused bundled with djvuocr (it is hard to tell which version of djvused it is, because it does not report one; I have asked for that feature to be added).
    When I replace it with the djvused from the latest Windows release (May 2012), I get exactly what I want.
    Can you tell me which DLLs must be copied along with djvused.exe to use it standalone? I copied all of them, but I doubt they are all necessary.

     
  • NBell

    NBell - 2012-10-06

    Sorry, but the bug still exists with the latest djvused release.

    Run this simple batch file on 0000.djvu:

    @djvused 0000.djvu -e "output-txt" > 01.djvu.txt
    @copy 0000.djvu 0000-retexted.djvu
    @djvused 0000-retexted.djvu -f 01.djvu.txt -s
    @pause

    Then open both files and compare the text: you can see additional spaces between words |-(
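
    The comparison can also be done mechanically instead of by eye. The following is a minimal Python sketch, assuming the plain text of both files has already been obtained as two strings (however they were extracted); the function name inserted_whitespace is hypothetical. It reports only whitespace that is present in the round-tripped text and absent from the original.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def inserted_whitespace(original: str, roundtrip: str) -> List[Tuple[int, str]]:
    """Return (offset in original, inserted text) for whitespace-only insertions."""
    matcher = SequenceMatcher(None, original, roundtrip, autojunk=False)
    found = []
    for tag, i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means roundtrip[j1:j2] has no counterpart in the original.
        if tag == "insert" and roundtrip[j1:j2].isspace():
            found.append((i1, roundtrip[j1:j2]))
    return found


if __name__ == "__main__":
    for offset, text in inserted_whitespace("ab", "a b"):
        print(offset, repr(text))
```

    An empty result means the round trip added no spaces; any entry points at the position in the original text where extra whitespace showed up.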

     
  • Janusz

    Janusz - 2012-10-07

    Looking at the problem from a different angle, I have several questions and comments:

    1. Which version of FineReader is used? Version 11 allows saving the results directly in the DjVu format.

    2. For FineReader versions earlier than 11 you can convert the PDF output with Jakub Wilk's pdf2djvu. Are there some important reasons to use djvuocr?

    3. Is using Linux out of the question? Jakub Wilk's djvusmooth, a graphical DjVu editor, is available for Debian, Ubuntu and OpenSuse. A virtual machine is available for demonstration, cf. https://bitbucket.org/jsbien/ndt/wiki/wyniki

    4. Recently I presented Jakub Wilk's tools at a virtual conference; you can read the slides and watch the video: http://bc.klf.uw.edu.pl/298/

    5. Last but not least, in the example

    (page .....
    (word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
    (char 1524 3875 1573 3930 "\t")

    I'm surprised to see Cyrillic instead of escape sequences like \320\224\320\265\321\200\321\217\320\263\320\270\320\27

    Actually with e.g. output-all I get escape sequences also for Polish letters but of course "clean" UTF-8 would be more convenient. Am I missing something?

    Best regards

    Janusz

     
  • NBell

    NBell - 2012-10-07

    Dear Janusz

    Looking at the problem from a different angle,

    The problem: djvused produces an erroneous dsed file when output-txt is run on a DjVu file produced by ABBYY FineReader 11.0.102.583.

    It means that the extracted text differs from the text that is actually in the file.
    And it means there is an error in djvused, which re-encodes the text and adds artifacts to it.
    It is an error and a bug from any angle.

    1. Which version of FineReader is used? Version 11 allows saving the results directly in the DjVu format.

    Yes, of course (says John Rider in "Hitcher").
    So? It produces a strange DjVu: no Djbz, but IFF chunks that look like Djbz, and no control over the encoding.
    I don't like that; I am not such a lamer.

    2. For FineReader versions earlier than 11 you can convert the PDF output with Jakub Wilk's pdf2djvu. Are there some important reasons to use djvuocr?

    Pdftodjvu produces a blank DjVu file from the FR11 PDF.

    3. Is using Linux out of the question? Jakub Wilk's djvusmooth, a graphical DjVu editor, is available for Debian, Ubuntu and OpenSuse. A virtual machine is available for demonstration, cf. https://bitbucket.org/jsbien/ndt/wiki/wyniki

    Linux is great, but it has no features that would make me change my OS. Time is money; in this case I see no money, only time spent on learning.

    4. Recently I presented Jakub Wilk's tools at a virtual conference; you can read the slides and watch the video: http://bc.klf.uw.edu.pl/298/

    I will try to watch it. But the problem remains: there is no tool except djvused for transferring text between files.

    And a question: why does djvused use page names instead of page numbers in the dsed file? Djvuocr produces page numbers, which helps when transferring text between differently encoded files.

    5. Last but not least, in the example
      I'm surprised to see Cyrillic instead of escape sequences like \320\224\320\265\321\200\321\217\320\263\320\270\320\27

    Actually with e.g. output-all I get escape sequences also for Polish letters but of course "clean" UTF-8 would be more convenient. Am I missing something?

    No difference. Use the -u option (djvused 0000.djvu -u -e "output-txt" > 0000.djvu.txt) and you get UTF-8 text that is easy to edit. djvused takes it back into the DjVu file without complaint (djvused 0000.djvu -f 0000.djvu.txt -s).
    Surprised?
    Convenient means old. It saves time when you need to edit the OCR text layer.
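
    For dumps that were already written without -u, the escape sequences can be turned back into readable text afterwards. A minimal Python sketch, assuming the escapes are three-digit octal values of UTF-8 bytes (as in the sequence quoted above) and that the rest of the input is plain ASCII; the function name decode_octal_escapes is hypothetical.

```python
import re

OCTAL = re.compile(r"\\([0-7]{3})")


def decode_octal_escapes(text: str) -> str:
    """Turn \\NNN octal byte escapes into the UTF-8 characters they encode."""
    # Step 1: each \NNN becomes a single character with that byte value.
    raw = OCTAL.sub(lambda m: chr(int(m.group(1), 8)), text)
    # Step 2: reinterpret those byte values as UTF-8.
    # Raises UnicodeEncodeError if the input already contains characters
    # above U+00FF; only three-digit escapes are handled.
    return raw.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    print(decode_octal_escapes(r"\320\224\320\265\321\200\321\217\320\263\320\270"))
```

    Only complete escapes can be decoded; a sequence cut off in the middle of a byte pair, like the truncated one quoted above, fails at the UTF-8 step.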

    Best regards

    NBell

     
  • NBell

    NBell - 2012-10-07

    Dear Janusz
    I do not have Linux.
    So I tried to convert the FR11-produced PDF to DjVu with pdftodjvuguile01 and got a blank file.

     
  • Janusz

    Janusz - 2012-10-07

    Let's first settle the pdf2djvu issue.

    I don't know pdftodjvuguile01, do you mean Pdf To Djvu GUI v. 2.1 (http://www.trustfm.net/GeneralTools/SoftwarePdfToDjvuGUI.php)?

    It is based on pdf2djvu (http://jwilk.net/software/pdf2djvu), which I used (under Linux) on several occasions to convert successfully the PDF output to DjVu.

    If it is really the fault of Jakub Wilk's pdf2djvu program, then I'm sure he will investigate the problem.

     
  • Janusz

    Janusz - 2012-10-07

    I mean of course the PDF output of FineReader 11.

     
  • NBell

    NBell - 2012-10-07

    The tool I mentioned is Pdftodjvu LE v0.1
    http://minus.com/mte8hk07J/1f
    As for the trustfm product: it does not work well, so it is not useful to me. I have no time for bug testing when a more workable solution exists.

     
  • Janusz

    Janusz - 2012-10-07

    pdf2djvu is not a trustfm product.

    You can download the executable (without GUI) directly from its home site: https://code.google.com/p/pdf2djvu/. You can also report your problems in the issue tracker.

    I suspect Pdftodjvu is based on djvudigital. If so, then the comparison

    https://code.google.com/p/pdf2djvu/wiki/DjVuDigital

    may be of some interest to you.

     
  • NBell

    NBell - 2012-10-07

    PDF conversion is not the topic here, and it is the more complicated route:
    OCR -- save PDF -- convert to DjVu using command-line pdftodjvu -- extract text with djvuocr -- embed text into DjVu using djvuocr
    Do you really think that is easy and simple?
    My way:
    OCR -- save DjVu -- extract text with djvuocr -- embed text into DjVu using djvuocr
    Because of djvused there is added Sisyphean work: remove the spaces from the dsed file, then embed.
    So:
    1. djvused extracts the text incorrectly.
    2. Which symbol is converted to a space?
    3. Why?
    4. When will djvused output-txt become correct?

    P.S. Command-line tools are for enthusiasts; ordinary people use GUI tools.

     
  • NBell

    NBell - 2012-10-07

    I saw no help, so I found the answer myself.
    If the lines starting with
    (char.....
    are commented out like
    # (char....

    then djvused produces a correct text layer.
    It is manual work, but such is life.

    Waiting for a corrected win32 djvused...
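
    The manual step above could be scripted. This is a minimal Python sketch of that workaround, not a fix for djvused. It assumes, as reported above, that djvused accepts # comment lines inside the text data, and that every char zone in the file is a separator that can be dropped. The function names are hypothetical. If a commented line also carried closing parentheses of its parent zones, they are re-emitted on a separate line so the expression stays balanced.

```python
def paren_balance(line: str) -> int:
    """Count '(' minus ')' on a line, ignoring parentheses inside strings."""
    depth = 0
    in_string = False
    escaped = False
    for ch in line:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
    return depth


def comment_out_char_zones(dsed_text: str) -> str:
    """Prefix every line that starts a (char ...) zone with '# '."""
    out = []
    for line in dsed_text.splitlines():
        if line.lstrip().startswith("(char"):
            out.append("# " + line)
            surplus = -paren_balance(line)
            if surplus > 0:
                # Keep the parent's closing parentheses.
                out.append(")" * surplus)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

    A file whose text is stored at char level throughout would lose all of its text this way, so the sketch only fits files shaped like the one discussed here.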

     
  • NBell

    NBell - 2012-10-07

    Janusz
    I tested pdf2djvu-0.7.14: it produces good, readable text, but it added colors to a b/w image PDF.
    For a critical situation it is good enough.
    The results do not match exactly.
     
  • NBell

    NBell - 2012-10-07

    Janusz
    thanks for informing me about other djvu tools - helpful.

     
  • Janusz

    Janusz - 2012-10-08

    Glad you found my information useful.

    I've posted it here because I understand the main problem is correcting OCR and djvused is just a tool.

    It's a pity there is no place to discuss DjVu in general, the postings on djvu.org are delayed by the moderator so long that any discussion is practically impossible :-(

    Regards

    Janusz

     
  • NBell

    NBell - 2012-10-13

    I made the test file as simple as I could.
    Analysis of the TXTz chunk shows that the FR11 TXTz exactly matches the embedded text,
    while djvused embeds additional spaces in some pattern.

    Please, correct djvused!!!

     
  • NBell

    NBell - 2012-10-15

    txtz decode.xls describes the TXTz chunks of DjVu files whose text layer was written by FineReader, djvused, and djvuocr; the additional spaces are highlighted in color.

     
  • NBell

    NBell - 2012-10-15

    Changed the example to make it easier to understand.

     
  • Leon Bottou

    Leon Bottou - 2013-06-30
    • status: open --> wont-fix
     
