I am not sure I understand the bug.
However I am sure that none of these text layers are correct
because they contain non-printable characters.
Things like:
(page .....
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t") <------- this is wrong: a tab does not print anything, so no bounding box should be given.
.... )
How did you produce these files in the first place?
- L.
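A minimal sketch of how one might look for such zones automatically, assuming the text layer has been dumped with djvused's output-txt and that each zone sits on one line as in the excerpt above. The function name `blank_zones` and the list of escapes treated as blank are illustrative choices, not part of djvused.

```python
import re

# One zone per match: (type xmin ymin xmax ymax "text"), as in the excerpt above.
# Container zones without a string of their own (e.g. "(page ....") are skipped.
ZONE = re.compile(
    r'\((page|column|region|para|line|word|char)\s+'
    r'(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+'
    r'"((?:[^"\\]|\\.)*)"'
)

# Backslash escapes that stand for whitespace or control characters
# (an illustrative subset; extend as needed).
BLANK_ESCAPES = re.compile(r'\\[abtnvfr]')


def blank_zones(dump):
    """Return (line_number, zone_type, bbox, raw_text) for zones with no printable text."""
    hits = []
    for lineno, line in enumerate(dump.splitlines(), start=1):
        for m in ZONE.finditer(line):
            raw = m.group(6)
            # Drop the blank escapes, then see whether anything printable is left.
            if BLANK_ESCAPES.sub("", raw).strip() == "":
                bbox = tuple(int(m.group(i)) for i in range(2, 6))
                hits.append((lineno, m.group(1), bbox, raw))
    return hits
```

Running it over a dump lists every zone that carries a bounding box but only tabs, spaces or similar characters, which is the pattern flagged as wrong above.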
ABBYY FineReader 11.0.102.583 OCR -> save as DjVu -> 0000.djvu
open 0000.djvu -> select text -> copy to MS Word XP -> the text looks good
djvuocr 2.4 b4: extract text from 0000.djvu -> txtfile
djvuocr 2.4 b4: embed text from txtfile -> 0000.djvu
open 0000.djvu -> select text -> copy to MS Word XP -> the text has additional spaces
I like hand-coded DjVu, so I am testing a copy of the TXT layer after OCR.
I greatly apologize for bothering you.
It is the buggy djvused shipped with djvuocr (it is hard to tell the version of djvused because it does not show it; I have asked for this feature to be added).
When I replace it with the djvused from the latest Windows release (May 2012), I get just what I want...
Can you tell me which DLLs must be copied along with djvused.exe to use it on its own? I copied all of them, but I think not all are necessary.
Sorry, but the bug also exists with the latest djvused release.
Run this simple .bat file next to 0000.djvu:
@djvused 0000.djvu -e "output-txt" > 01.djvu.txt
@copy 0000.djvu 0000-retexted.djvu
@djvused 0000-retexted.djvu -f 01.djvu.txt -s
@pause
Then open both files and compare the text: you can see additional spaces between words |-(
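For anyone reproducing this outside a Windows batch file, the same round trip can be sketched in Python. This is only a translation of the batch file above under the assumption that djvused is on the PATH; the function names are hypothetical, and the optional -u switch is the one mentioned later in this thread.

```python
import shutil
import subprocess


def roundtrip_commands(src, copy, dump, utf8=False):
    """Build the two djvused invocations used in the batch file above."""
    extract = ["djvused", src]
    if utf8:
        extract.append("-u")  # UTF-8 output instead of octal escapes
    extract += ["-e", "output-txt"]
    embed = ["djvused", copy, "-f", dump, "-s"]
    return extract, embed


def roundtrip(src, copy, dump, utf8=False):
    """Extract the text layer of src, then write it back into a copy of src."""
    extract, embed = roundtrip_commands(src, copy, dump, utf8)
    with open(dump, "wb") as out:
        # Equivalent of: djvused src -e "output-txt" > dump
        subprocess.run(extract, stdout=out, check=True)
    shutil.copyfile(src, copy)
    # Equivalent of: djvused copy -f dump -s
    subprocess.run(embed, check=True)
```

Calling `roundtrip("0000.djvu", "0000-retexted.djvu", "01.djvu.txt")` performs the same three steps; `check=True` makes a failing djvused call raise instead of passing silently.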
Looking at the problem from a different angle, I have several questions and comments:
1. Which version of FineReader is used? Version 11 allows saving the results directly in the DjVu format.
2. For FineReader versions earlier than 11 you can convert the PDF output with Jakub Wilk's pdf2djvu. Are there important reasons to use djvuocr?
3. Is using Linux out of the question? Jakub Wilk's djvusmooth, a graphical DjVu editor, is available for Debian, Ubuntu and openSUSE. A virtual machine is available for demonstration, cf. https://bitbucket.org/jsbien/ndt/wiki/wyniki
4. Recently I presented Jakub Wilk's tools at a virtual conference; you can read the slides and watch the video: http://bc.klf.uw.edu.pl/298/
5. Last but not least, in the example
(page .....
(word 684 3872 1523 3930 "БИБЛИОГРАФИЧЕСКИЙ")
(char 1524 3875 1573 3930 "\t")
I'm surprised to see Cyrillic characters instead of escape sequences like \320\224\320\265\321\200\321\217\320\263\320\270\320\27
Actually with e.g. output-all I get escape sequences also for Polish letters, but of course "clean" UTF-8 would be more convenient. Am I missing something?
Best regards
Janusz
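As an illustration of what those escape sequences contain: each \ooo group is one byte in octal, and the bytes together form UTF-8. The sketch below decodes them; the function name is hypothetical, and it deliberately ignores other escapes (such as an escaped backslash followed by digits), so treat it as a reading aid rather than a full parser.

```python
import re

# Three-digit octal escapes in the byte range \000 .. \377.
OCTAL = re.compile(r'\\([0-3][0-7]{2})')


def decode_octal_escapes(text):
    """Turn \\ooo escapes into bytes and decode the result as UTF-8."""
    raw = bytearray()
    pos = 0
    for m in OCTAL.finditer(text):
        # Copy any plain text between escapes unchanged.
        raw += text[pos:m.start()].encode("utf-8")
        raw.append(int(m.group(1), 8))
        pos = m.end()
    raw += text[pos:].encode("utf-8")
    # Incomplete byte sequences become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


print(decode_octal_escapes(r"\320\224\320\265\321\200\321\217\320\263\320\270"))
# prints: Деряги
```

The input here is only the complete part of the sequence quoted above; the truncated tail is left out.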
Dear Janusz,
The problem: djvused produces an erroneous dsed script when output-txt is run on a DjVu file produced by ABBYY FineReader 11.0.102.583.
That means the extracted text <> the text that exists in the file.
And it means there are errors in djvused, which recodes the text and adds artifacts to it.
It is an error and a bug from any angle.
Re "Which version of FineReader is used?": version 11. Yes, of course (as John Ryder says in "The Hitcher").
So? It produces a strange DjVu: no Djbz, but the IFFs look like Djbz, and there is no control over the coding.
I don't like that; I am not such a lamer.
Re pdf2djvu: Pdftodjvu produces a blank DjVu from the FR11 PDF.
Re Linux: Linux is great, but it has no features that would make me change the OS. Time is money; in this case I see no money, only time spent studying.
Re the slides and the video: I will try to have a look. But the problem remains: there is no tool except djvused to transfer text between files.
And a question: why does djvused use page names instead of page numbers in dsed? djvuocr produces page numbers, which helps when transferring text between differently coded files.
Re escape sequences: no difference. Use the -u option (djvused 0000.djvu -u -e "output-txt" > 0000.djvu.txt) and you get UTF-8 text that is easy to edit. djvused puts it back into the DjVu without any question (djvused 0000.djvu -f 0000.djvu.txt -s).
Surprised?
Convenient means old. It saves time if you need to edit the OCR text layer.
Best regards
NBell
The Hitcher described at http://yandex.ru/clck/redir/AiuY0DBWFJ4ePaEse6rgeAjgs2pI3DW99KUdgowt9XvqxGyo_rnZJtx63N-JKenu-WTtzhMV438OtL2f2uYFiDjTBXicN3jaAJh2qpahK6Quqvj2uxnCXV6LJdZ5EjNHHwFgwbcYGFZwY-M-eIOztS1ibq5AmwqnActJ-FbiPv-5y5zwRHQ2mT8DFWrfyqL3ShM5OJpLhHU?data=UlNrNmk5WktYejR0eWJFYk1LdmtxcGpBbHdIa25tWmNLUkplUlo5VEhHRlB3UXQ2TS1DVnp5QU9aM1ctVkFENmFRbnEzYkF0cnpyWUQtclpFQXlnZEVvOERVOWhCSmtQeWVya1o5RzZoc3Y3Qk1icXhyd3ZJUQ&b64e=2&sign=c5e738a1455c608a931558bd13d5ced7&keyno=8&l10n=ru&i=2
Dear Janusz,
I have no Linux.
So I tried to convert the FR11-produced PDF to DjVu with pdftodjvuguile01 and received a blank file.
Let's first settle the pdf2djvu issue.
I don't know pdftodjvuguile01, do you mean Pdf To Djvu GUI v. 2.1 (http://www.trustfm.net/GeneralTools/SoftwarePdfToDjvuGUI.php)?
It is based on pdf2djvu (http://jwilk.net/software/pdf2djvu), which I have used (under Linux) on several occasions to successfully convert the PDF output to DjVu.
If it is really the fault of Jakub Wilk's pdf2djvu program, then I'm sure he will investigate the problem.
I mean of course the PDF output of FineReader 11.
I mean the remarkable Pdftodjvu LE v0.1:
http://minus.com/mte8hk07J/1f
As for the trustfm product: it does not work well, so it is not useful. I have no time for bug testing when a more workable solution exists.
pdf2djvu is not a trustfm product.
You can download the executable (without GUI) directly from its home site: https://code.google.com/p/pdf2djvu/. You can also report your problems in the issue tracker.
I suspect Pdftodjvu is based on djvudigital. If so, then the comparison
https://code.google.com/p/pdf2djvu/wiki/DjVuDigital
may be of some interest to you.
PDF conversion is not the topic, and it is the more complicated way:
OCR -- save PDF -- transform to DjVu using command-line pdftodjvu -- extract text with djvuocr -- embed text into DjVu using djvuocr
Do you really think that is easy and simple?
My way:
OCR -- save DjVu -- extract text with djvuocr -- embed text into DjVu using djvuocr
Because of djvused there is extra Sisyphean work: remove the spaces from the dsed and only then embed.
So:
1. djvused makes an incorrect text extraction.
2. Which symbol is converted to a space?
3. Why?
4. When will djvused output-txt become correct?
P.S. Command-line tools are for their enthusiasts; simple people use GUI tools.
I see no help, so I found it myself:
if the lines starting with
(char.....
are commented out like
# (char....
djvused produces a correct text layer.
It is manual work, but such is life.
Waiting for a corrected win32 djvused....
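The manual workaround above could be automated along these lines. This is a sketch under two assumptions: each (char ...) zone sits on its own line in the dump, and djvused accepts "#" comments inside the text data as reported above. The function name is hypothetical. Closing parentheses that follow a commented zone are moved to a separate line so the expression stays balanced.

```python
import re

# A line holding one char zone, optionally followed by closing parentheses
# that belong to enclosing zones.
CHAR_ZONE = re.compile(
    r'^(\s*)(\(char\s+-?\d+\s+-?\d+\s+-?\d+\s+-?\d+\s+"(?:[^"\\]|\\.)*"\))(.*)$'
)


def comment_out_chars(dump):
    """Prefix every (char ...) line with '# ', keeping trailing parentheses live."""
    out = []
    for line in dump.splitlines():
        m = CHAR_ZONE.match(line)
        if m is None:
            out.append(line)
            continue
        indent, zone, rest = m.groups()
        out.append(f"{indent}# {zone}")
        if rest.strip():
            # Parentheses closing the parent zones must not be commented out.
            out.append(indent + rest.strip())
    return "\n".join(out) + "\n"
```

The filtered dump can then be fed back with djvused -f as in the batch file earlier in the thread.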
Janusz,
I tested pdf2djvu-0.7.14: it produces good readable text, but it adds colors to a b/w image PDF.
For a critical situation it is good.
The results do not match exactly.
Janusz,
thanks for informing me about other DjVu tools; that was helpful.
Glad you found my information useful.
I've posted it here because I understand the main problem is correcting OCR and djvused is just a tool.
It's a pity there is no place to discuss DjVu in general, the postings on djvu.org are delayed by the moderator so long that any discussion is practically impossible :-(
Regards
Janusz
I made the test file as simple as I could.
Analysis of the TXT chunk shows that the FR11 TXTz exactly matches the embedded text,
and djvused embeds additional spaces in some order.
Please correct djvused!!!
txtz decode.xls describes the TXTz chunks of DjVu files whose text was embedded by FineReader, djvused and djvuocr; additional spaces are shown in color.
I changed the example to make it understandable.
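The same kind of comparison can be sketched without a spreadsheet, assuming the two text layers have already been extracted as plain strings. The helper below is hypothetical; it uses Python's difflib to report where the round-tripped text has whitespace that the original lacks.

```python
import difflib


def extra_spaces(original, roundtripped):
    """Return (offset_in_original, inserted_text) for whitespace-only insertions."""
    matcher = difflib.SequenceMatcher(None, original, roundtripped, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        inserted = roundtripped[j1:j2]
        # Keep only pure insertions that consist of whitespace.
        if tag == "insert" and inserted.strip() == "":
            found.append((i1, inserted))
    return found
```

For example, `extra_spaces("foo bar", "foo  bar")` reports one inserted space, while identical strings give an empty list.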