DjVuLibre / Bugs / #200 djvused extracts text layer w/artifacts

Leon Bottou - 2012-10-06

I am not sure I understand the bug.

However I am sure that none of these text layers are correct
because they contain non-printable characters.
Things like:

(page .....
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t") <------- this is wrong. tab does not print anything: no bounding box should be given.
.... )
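
A minimal sketch of how such zones could be spotted in an output-txt dump, assuming the dump is plain text with one leaf zone per line, as in the excerpt above. The name find_nonprinting_zones and both regular expressions are illustrative assumptions; nothing here is part of djvused.

```python
import re

# A leaf zone with a quoted string, e.g. (char 1524 3875 1573 3930 "\t").
# Assumption: coordinates are integers and the string uses backslash escapes.
LEAF_ZONE = re.compile(
    r'\((\w+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+"((?:[^"\\]|\\.)*)"\s*\)'
)

# Strings made only of whitespace, whitespace escapes, or octal escapes
# below \040 (ASCII control characters).
NONPRINTING = re.compile(r'(?:\\[tnrfv]|\\0[0-3][0-7]|\s)+')


def find_nonprinting_zones(dump):
    """Return (zone type, bounding box, raw string) for suspicious leaf zones."""
    hits = []
    for m in LEAF_ZONE.finditer(dump):
        kind, raw = m.group(1), m.group(6)
        if NONPRINTING.fullmatch(raw):
            bbox = tuple(int(m.group(i)) for i in range(2, 6))
            hits.append((kind, bbox, raw))
    return hits


sample = r'''
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t")
'''
for kind, bbox, raw in find_nonprinting_zones(sample):
    print(kind, bbox, raw)
```

Only leaf zones that carry a quoted string are inspected; zones that contain nested zones are skipped by this sketch.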

How did you produce these files in the first place?

- L.

NBell - 2012-10-06

ABBYY FineReader 11.0.102.583: OCR -> save as DjVu -> 0000.djvu
open 0000.djvu -> select text -> copy to MS Word XP -> the text is good
djvuocr 2.4 b4: extract text from 0000.djvu -> txtfile
djvuocr 2.4 b4: embed text from txtfile -> 0000.djvu
open 0000.djvu -> select text -> copy to MS Word XP -> the text has additional spaces

I prefer hand-encoded DjVu, so I am testing how the text layer survives copying after OCR.

NBell - 2012-10-06

I apologize for bothering you.
The culprit is the buggy djvused bundled with djvuocr (it is hard to tell which version of djvused it is, because it does not report one; I have asked for this feature to be added).
When I replace it with the djvused from the latest Windows release (May 2012), I get just what I want.
Can you tell me which DLLs must be copied along with djvused.exe to use it on its own? I copied all of them, but I think not all are necessary.

NBell - 2012-10-06

Sorry, but the bug also exists with the latest djvused release.

Run this simple batch file with 0000.djvu:

@djvused 0000.djvu -e "output-txt" > 01.djvu.txt
@copy 0000.djvu 0000-retexted.djvu
@djvused 0000-retexted.djvu -f 01.djvu.txt -s
@pause

Then open both files and compare the text: you can see additional spaces between words |-(

Janusz - 2012-10-07

Looking at the problem from a different angle, I have several questions and comments:

1. Which version of FineReader is used? Version 11 allows saving the results directly in the DjVu format.

2. For FineReader versions earlier than 11 you can convert the PDF output with Jakub Wilk's pdf2djvu. Are there some important reasons to use djvuocr?

3. Is using Linux out of the question? Jakub Wilk's djvusmooth, a graphical DjVu editor, is available for Debian, Ubuntu and openSUSE. A virtual machine is available for demonstration, cf. https://bitbucket.org/jsbien/ndt/wiki/wyniki

4. Recently I presented Jakub Wilk's tools at a virtual conference; you can read the slides and watch the video: http://bc.klf.uw.edu.pl/298/

5. Last but not least, in the example

(page .....
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t")

I'm surprised to see Cyrillic characters instead of escape sequences like \320\224\320\265\321\200\321\217\320\263\320\270\320\27

Actually with e.g. output-all I get escape sequences also for Polish letters but of course "clean" UTF-8 would be more convenient. Am I missing something?
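
For readers who only have the escaped form, here is a minimal sketch of turning such octal escape sequences back into readable text. It assumes each \ooo group is one byte of a UTF-8 sequence; the name decode_octal_escapes is illustrative and not part of any DjVu tool.

```python
import re

# One or more consecutive three-digit octal escapes (\000 to \377).
OCTAL_RUN = re.compile(r'(?:\\[0-3][0-7]{2})+')


def decode_octal_escapes(text):
    """Replace runs of \\ooo escapes with the UTF-8 text they encode.

    Limitation of this sketch: an escaped backslash followed by digits
    (for example \\\\320) is not treated specially.
    """
    def to_text(match):
        octal_groups = re.findall(r'\\([0-3][0-7]{2})', match.group(0))
        raw = bytes(int(group, 8) for group in octal_groups)
        # Truncated or invalid sequences become U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")

    return OCTAL_RUN.sub(to_text, text)


print(decode_octal_escapes(r'(word 0 0 10 10 "\320\224\320\265")'))
```

Each pair of escapes in the example is one two-byte UTF-8 character, so the four escapes decode to two Cyrillic letters.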

Best regards

Janusz

NBell - 2012-10-07

Dear Janusz

Looking at the problem from a different angle,

The problem: djvused produces an erroneous dsed script when output-txt is run on a DjVu file produced by ABBYY FineReader 11.0.102.583.

It means that the extracted text <> the text that exists in the file.
And it means there is some error in djvused that re-encodes the text and adds artifacts to it.
It is an error and a bug from any angle.

Which version of FineReader is used? Version 11 allows saving the results directly in the DjVu format.

Yes, of course (as John Ryder says in "The Hitcher").
So? It is a strange DjVu: no Djbz, although the IFF chunks look like Djbz, and no control over the encoding.
I do not like that; I am not such a lamer.

For FineReader versions earlier than 11 you can convert the PDF output with Jakub Wilk's pdf2djvu. Are there some important reasons to use djvuocr?
Pdftodjvu produces a blank DjVu file from the FR11 PDF.

Is using Linux out of the question? Jakub Wilk's djvusmooth, a graphical DjVu editor, is available for Debian, Ubuntu and openSUSE. A virtual machine is available for demonstration, cf. https://bitbucket.org/jsbien/ndt/wiki/wyniki

Linux is great, but it has no features that would make me change the OS. Time is money; in this case I see no money, only time spent studying.

Recently I presented Jakub Wilk's tools at a virtual conference; you can read the slides and watch the video: http://bc.klf.uw.edu.pl/298/

I will try to look. But the problem remains: there is no tool except djvused to transfer text between files.

And a question: why does djvused use page names instead of page numbers in dsed? Djvuocr produces page numbers, which helps when transferring text between differently encoded files.

Last but not least, in the example
I'm surprised to see Cyrillic characters instead of escape sequences like \320\224\320\265\321\200\321\217\320\263\320\270\320\27

Actually with e.g. output-all I get escape sequences also for Polish letters but of course "clean" UTF-8 would be more convenient. Am I missing something?

No difference. Use the -u option (djvused 0000.djvu -u -e "output-txt" > 0000.djvu.txt) and you get UTF-8 text that is easy to edit. Djvused puts it back into the DjVu file without any complaint (djvused 0000.djvu -f 0000.djvu.txt -s).
Surprised?
Convenient means old. It saves time if you need to edit the OCR text layer.
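
The two commands above can also be driven from a script. This is a minimal sketch that assumes djvused is on the PATH and takes the arguments exactly as written in this comment; the helper names extract_cmd, embed_cmd and round_trip are illustrative only.

```python
import subprocess


def extract_cmd(djvu_path, utf8=True):
    """Build the argument list for dumping the text layer with output-txt."""
    cmd = ["djvused", djvu_path]
    if utf8:
        cmd.append("-u")  # unescaped UTF-8 output, as used in this comment
    return cmd + ["-e", "output-txt"]


def embed_cmd(djvu_path, script_path):
    """Build the argument list for running a dsed script and saving (-s)."""
    return ["djvused", djvu_path, "-f", script_path, "-s"]


def round_trip(djvu_path, script_path):
    """Dump the text layer to script_path, then run that script on the same file."""
    with open(script_path, "wb") as out:
        subprocess.run(extract_cmd(djvu_path), stdout=out, check=True)
    subprocess.run(embed_cmd(djvu_path, script_path), check=True)
```

Calling round_trip("0000.djvu", "0000.djvu.txt") would modify 0000.djvu in place, so it is safer to run it on a copy.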

Best regards

NBell

NBell - 2012-10-07

The Hitcher described at http://yandex.ru/clck/redir/AiuY0DBWFJ4ePaEse6rgeAjgs2pI3DW99KUdgowt9XvqxGyo_rnZJtx63N-JKenu-WTtzhMV438OtL2f2uYFiDjTBXicN3jaAJh2qpahK6Quqvj2uxnCXV6LJdZ5EjNHHwFgwbcYGFZwY-M-eIOztS1ibq5AmwqnActJ-FbiPv-5y5zwRHQ2mT8DFWrfyqL3ShM5OJpLhHU?data=UlNrNmk5WktYejR0eWJFYk1LdmtxcGpBbHdIa25tWmNLUkplUlo5VEhHRlB3UXQ2TS1DVnp5QU9aM1ctVkFENmFRbnEzYkF0cnpyWUQtclpFQXlnZEVvOERVOWhCSmtQeWVya1o5RzZoc3Y3Qk1icXhyd3ZJUQ&b64e=2&sign=c5e738a1455c608a931558bd13d5ced7&keyno=8&l10n=ru&i=2

NBell - 2012-10-07

Dear Janusz
I have no Linux.
So I tried to convert the FR11-produced PDF to DjVu. I used pdftodjvuguile01 and received a blank file.

Janusz - 2012-10-07

Let's first settle the pdf2djvu issue.

I don't know pdftodjvuguile01, do you mean Pdf To Djvu GUI v. 2.1 (http://www.trustfm.net/GeneralTools/SoftwarePdfToDjvuGUI.php)?

It is based on pdf2djvu (http://jwilk.net/software/pdf2djvu), which I used (under Linux) on several occasions to convert successfully the PDF output to DjVu.

If it is really the fault of Jakub Wilk's pdf2djvu program, then I'm sure he will investigate the problem.

Janusz - 2012-10-07

I mean of course the PDF output of FineReader 11.

NBell - 2012-10-07

remarkable Pdftodjvu LE v0.1
http://minus.com/mte8hk07J/1f
As for the trustfm product: it does not work well, so it is not useful. I have no time for bug testing, and more workable solutions exist.

Janusz - 2012-10-07

pdf2djvu is not a trustfm product.

You can download the executable (without GUI) directly from its home site: https://code.google.com/p/pdf2djvu/. You can also report your problems in the issue tracker.

I suspect Pdftodjvu is based on djvudigital. If so, then the comparison

https://code.google.com/p/pdf2djvu/wiki/DjVuDigital

may be of some interest to you.

NBell - 2012-10-07

PDF conversion is not the topic, and it is the more complicated way:
OCR -- save PDF -- transform to DjVu using command-line pdftodjvu -- extract text with djvuocr -- embed text into DjVu using djvuocr
Do you really think that is easy and simple?
My way:
OCR -- save DjVu -- extract text with djvuocr -- embed text into DjVu using djvuocr
Because of djvused there is added Sisyphean work: remove the spaces from the dsed file and then embed.
So:
1. djvused makes an incorrect text extraction
2. Which symbol is converted to a space?
3. Why?
4. When will djvused output-txt become correct?

P.S. Command-line tools are for their enthusiasts; ordinary people use GUI tools.

NBell - 2012-10-07

I see no help, so I found it myself.
If the lines starting with
(char.....
are commented out like
# (char....

djvused produces a correct text layer.
It is manual work, but such is life.

Waiting for a corrected win32 djvused....
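
The manual workaround described above could be automated roughly as follows. This is only a sketch: it assumes one zone per line, relies on the report in this thread that djvused accepts lines commented out with #, and the name comment_out_char_zones is illustrative. It comments out every char zone, including ones that hold real characters, so it only suits files where the char zones are the unwanted separators.

```python
import re

# Quoted strings are removed before counting parentheses, so that a
# parenthesis inside the text itself is not counted.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')


def comment_out_char_zones(dsed_text):
    """Prefix every line that starts a (char ...) zone with '# '.

    If the commented line also closed one or more parent zones, the
    surplus closing parentheses are re-emitted on a separate line so
    that the script stays balanced.
    """
    out = []
    for line in dsed_text.splitlines():
        stripped = line.lstrip()
        if not stripped.startswith("(char"):
            out.append(line)
            continue
        indent = line[: len(line) - len(stripped)]
        out.append(indent + "# " + stripped)
        bare = QUOTED.sub("", stripped)
        surplus = bare.count(")") - bare.count("(")
        if surplus > 0:
            out.append(indent + ")" * surplus)
    return "\n".join(out) + "\n"


example = '(line 0 0 10 10\n (word 0 0 5 10 "ab")\n (char 6 0 10 10 "\\t"))\n'
print(comment_out_char_zones(example))
```

A zone whose only children were char zones is left empty by this sketch and is not cleaned up.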

NBell - 2012-10-07

Janusz
I tested pdf2djvu-0.7.14: it produces good, readable text, but it adds colors to a b/w image PDF.
For a critical situation it is good.
The results do not match exactly.

NBell - 2012-10-07

Janusz
thanks for informing me about other djvu tools - helpful.

Janusz - 2012-10-08

Glad you found my information useful.

I've posted it here because I understand the main problem is correcting OCR and djvused is just a tool.

It's a pity there is no place to discuss DjVu in general, the postings on djvu.org are delayed by the moderator so long that any discussion is practically impossible :-(

Regards

Janusz

NBell - 2012-10-13

I made the test file as simple as I could.
Analysis of the TXTz chunk shows that the FR11 TXTz exactly matches the embedded text,
and that djvused embeds additional spaces in some pattern.

please, correct djvused!!!

NBell - 2012-10-15

txtz decode.xls describes the TXTz chunks of DjVu files whose text was added by FineReader, djvused, and djvuocr; the additional spaces are highlighted in color.

newtestv2.rar

NBell - 2012-10-15

Changed the example to make it more understandable.

Leon Bottou - 2013-06-30

status: open --> wont-fix