DjVu (pronounced "déjà vu") is an image compression technology and an open standard. The file format specification, as well as open-source implementations of the decoder (and part of the encoder), are available.
Are there plans to add support for indexing DjVu files?
If a good Java library cannot be found, would it be possible to use external programs for text extraction? For DjVu, the djvutxt(.exe) program can be used, which prints the text to stdout.
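For illustration only, the external-program idea could look roughly like the sketch below. It is written in Python for brevity and is not DocFetcher code. The function name `extract_text` and its parameters are hypothetical, and it assumes that DjVuLibre's `djvutxt` is on the PATH and that its output is UTF-8.

```python
import subprocess


def extract_text(path, command=("djvutxt",), encoding="utf-8", timeout=60):
    """Run an external extractor on `path` and return what it prints to stdout.

    `command` defaults to DjVuLibre's djvutxt, which prints the document's
    text layer to stdout when no output file is given. The function name,
    parameters, and timeout are illustrative assumptions.
    """
    result = subprocess.run(
        [*command, str(path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Decode leniently so a single bad byte does not abort the whole file.
    return result.stdout.decode(encoding, errors="replace")
```

A call such as `extract_text("book.djvu")` would then return the text layer as a string. Timeouts and non-zero exit codes surface as exceptions, which hints at the kind of error handling an indexer would have to add around every external call.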
Sorry, there are currently no such plans, and the main reason is still the lack of DjVu Java libraries. Falling back on external programs is - at least for now - not an option either, as this would add a significant layer of complexity to the program and lead to all kinds of problems. (It may look simple on the outside, but internally it certainly isn't.)
@ Ji Ling: As far as I can tell, there are no open-source DjVu libraries for Java out there due to patent issues. As long as those issues remain unresolved, DocFetcher cannot have DjVu support. Therefore, you'll have to either convert your DjVu files to other formats such as PDF, or find another program with DjVu support.
For now I use batch files to generate .meta files containing the text from DjVu and MCDX (Mathcad) files. For DjVu I use DjvuLibre\djvutxt.exe, and for Mathcad I use 7-Zip. I also have a special batch file associated with the .meta extension that finds and opens the original document. It works, but it is neither easy nor efficient.
Is it possible to configure user-defined document-to-text converters (DjVu, Mathcad, etc.)?
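The sidecar-file workaround described here can be sketched as follows. This is a Python approximation, not the poster's actual batch files. The `<name>.<ext>.meta` naming scheme and the helper names `write_sidecars` and `original_for` are assumptions made for the example, and the extractor is passed in as a callable so that any tool (djvutxt, 7-Zip, and so on) can be plugged in.

```python
from pathlib import Path


def write_sidecars(root, extract, suffixes=(".djvu", ".djv")):
    """Write a plain-text .meta file next to every matching document.

    `extract` is any callable that takes a Path and returns the document's
    text. The `<name>.<ext>.meta` naming scheme is an assumption.
    """
    written = []
    # sorted() snapshots the tree first, so new .meta files are not revisited.
    for doc in sorted(Path(root).rglob("*")):
        if doc.is_file() and doc.suffix.lower() in suffixes:
            meta = doc.with_name(doc.name + ".meta")
            meta.write_text(extract(doc), encoding="utf-8")
            written.append(meta)
    return written


def original_for(meta_path):
    """Map book.djvu.meta back to book.djvu so the source can be opened."""
    meta_path = Path(meta_path)
    return meta_path.with_name(meta_path.name[: -len(".meta")])
```

The indexer then only ever sees the .meta text files, and `original_for` plays the role of the batch file that opens the real document from a search hit.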
I'm not quite sure what you're talking about. But if you want to integrate some custom DjVu processing into DocFetcher, the only way is to modify the source code and build your own version of DocFetcher.
No, I don't want a version that supports DjVu. I want a version that supports external user-defined converters (which convert a document to plain text, so that DocFetcher can use their stdout for indexing).
I.e., the user just sets up a table:
| Filter | Command | Stdout encoding |
|---|---|---|
| *.djvu;*.djv | c:\tooling\convert_djvu_to_txt.bat {file} | cp1251 |
| *.mcdx | c:\tooling\convert_mathcad_to_txt.bat {file} | utf-8 |
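To make the proposal concrete, here is a minimal sketch of how such a filter/command/encoding table might be evaluated. It is in Python and is not how DocFetcher or DocFetcher Pro works. The names `CONVERTERS`, `find_converter`, and `convert_to_text` are hypothetical, and each command is stored as an argument list instead of a single string, which is a deviation from the table made to avoid quoting problems with paths.

```python
import fnmatch
import os
import subprocess

# Hypothetical converter table mirroring the proposal: semicolon-separated
# filename filters, a command with a {file} placeholder, stdout encoding.
CONVERTERS = [
    ("*.djvu;*.djv", [r"c:\tooling\convert_djvu_to_txt.bat", "{file}"], "cp1251"),
    ("*.mcdx", [r"c:\tooling\convert_mathcad_to_txt.bat", "{file}"], "utf-8"),
]


def find_converter(filename, table):
    """Return (command, encoding) for the first matching filter, else None."""
    name = filename.lower()
    for patterns, command, encoding in table:
        if any(fnmatch.fnmatch(name, p.strip().lower())
               for p in patterns.split(";")):
            return command, encoding
    return None


def convert_to_text(path, table):
    """Run the matching converter on `path` and decode its stdout."""
    match = find_converter(os.path.basename(str(path)), table)
    if match is None:
        return None  # no user-defined converter for this file type
    command, encoding = match
    argv = [arg.replace("{file}", str(path)) for arg in command]
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding, errors="replace")
```

With this shape, supporting another format would mean adding one more row to the table, for example a row for `*.png` that points at an OCR script.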
Maybe it would be clearer to convert documents via an intermediate file.
This support for external parsers might come to DocFetcher Pro someday (probably in the far future...). DocFetcher on the other hand is no longer being developed and will only receive bugfixes, not new features.
The main problem is finding a decent DjVu library for Java. I'll see what I can do about it :-)
I have tons of books in DjVu, but DocFetcher turns into a pumpkin on them. The problem is still relevant.
And with that, the user could at least run OCR on *.png files if he wants (for example, as a row the user adds to the Filter/Command table).
Last edit: Sergey Chelnokov 2022-11-03
OCR support integrated into DocFetcher Pro is also under consideration.
It's annoying, but okay. I'll continue to use text file generators and run DocFetcher on their output.