After converting to PNG I get:
Augezdecki-01a_PT08_403_mask.png PNG 1711x353 1711x353+0+0 8-bit Gray 2c 9285B 0.000u 0:00.000
I just noted that for the original mask I get:
Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit Bilevel Gray 75554B 0.000u 0:00.001
ChatGPT says: "The output you provided from the identify command in ImageMagick indicates that the image is in 8-bit grayscale with two colors (2c), which means it is not truly binary."
Last edit: Janusz 2024-08-26
ddjvu -1 should always return a binary mask as a PBM file.
PBM files can only represent binary images anyway. https://linux.die.net/man/5/pbm
On 2024-08-26 11:26, Janusz wrote:
Perhaps I misinterpreted the identify output:
Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit
Bilevel Gray 75554B 0.000u 0:00.001. Moreover the mask file was
rejected as non-binary by a Python script written for me by ChatGPT. I
have to investigate the problem more closely.
[BUGS:#357][1] THE MASK EXTRACTED WITH DDJVU IS NOT BINARY, BUT
GREYSCALE
STATUS: open
GROUP: djvulibre
CREATED: Mon Aug 26, 2024 09:47 AM UTC by Janusz
LAST UPDATED: Mon Aug 26, 2024 09:47 AM UTC
OWNER: nobody
Thanks for your quick answer, but what exactly do you mean by ddjvu -1? What is the complete invocation? In Python I use
subprocess.run(["ddjvu", "-format=pbm", "-mode=mask", djvu_file, pbm_file])
and the output does not seem to be binary, especially when converted to PNG (this could of course be the fault of the conversion).
Let me repeat what I posted 2 hours ago, as this doesn't seem to have been distributed by mail. I just noted that for the original mask I get:
Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit Bilevel Gray 75554B 0.000u 0:00.001
ChatGPT says: "The output you provided from the identify command in ImageMagick indicates that the image is in 8-bit grayscale with two colors (2c), which means it is not truly binary."
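A side note on the "8-bit Gray 2c" reading: an image stored with 8 bits per pixel but only two distinct gray levels carries the same information as a 1-bit image, so a strict "is it binary?" test can look at the pixel values rather than the storage depth. A minimal sketch in Python, assuming the gray values have already been read into a flat list by whatever imaging library is in use (the function names here are hypothetical, not part of ddjvu or ImageMagick):

```python
def is_effectively_binary(gray_values):
    """Return True if the pixels use at most two distinct gray levels."""
    return len(set(gray_values)) <= 2


def to_bilevel(gray_values):
    """Map a two-level gray image to 0/1 values (darker level -> 0)."""
    levels = sorted(set(gray_values))
    if len(levels) > 2:
        raise ValueError("more than two gray levels: not a binary image")
    if not levels:
        return []
    dark = levels[0]
    # Every pixel equal to the darker level becomes 0, the rest become 1.
    return [0 if v == dark else 1 for v in gray_values]
```

With this kind of check, a PNG reported as "8-bit Gray 2c" would pass, while a genuinely gray-level image would not.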
Option -1 or --subsample=1 ensures that the output resolution matches
that of the input image.
Note that this is the default when you use the --format option (which
you should always use.)
Note that when you use --format=pbm, you force the output to be a pbm
file which can only encode binary images.
Even if you were to subsample the image (making the mask gray level), it
would be thresholded into a binary image to produce a pbm file.
These PBM files start with the two letters "P4". If this is the case,
this is a binary image and nothing else. If the png is not binary, then
this must come from the downstream conversion. Note that you can also
use "--verbose" to let ddjvu tell you.
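The "P4" check described above can be sketched in a few lines of Python. This is only an illustration: the function name pbm_kind is hypothetical, and the sketch also recognises "P1", the plain-text PBM variant, which is equally bilevel.

```python
def pbm_kind(path):
    """Return 'raw' for a P4 file, 'plain' for P1, or None otherwise."""
    # A Netpbm file begins with a two-byte magic number.
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"P4":
        return "raw"    # packed binary PBM, one bit per pixel
    if magic == b"P1":
        return "plain"  # ASCII PBM, still strictly bilevel
    return None         # P2/P5 (gray), P3/P6 (color), or not Netpbm at all
```

If pbm_kind(pbm_file) returns "raw" for the file written by ddjvu, any gray levels seen later must have been introduced by a downstream conversion step.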
If you do not like these pbm/pgm/ppm files, you might like --format=tiff,
which is usually smart enough to decide on a binary, grey-level, or color
encoding. You can then use the tiffinfo program to see what you got. The
TIFF output automatically uses lossless CCITT-T4 compression for binary
images. For other images, it uses packbits (the most portable lossless
scheme) by default, jpeg compression with option "--quality=<1to100>", and
additionally understands "--quality=deflate|lzw|raw". People usually
say that this works nicely for them. Then you might use a program like
tiff2png, which I believe carries the tiff setup over into the png (as much
as png allows). Best would be to use the tiff files directly though
(they're more flexible).
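For reference, the Python call quoted earlier in this thread could be adapted to the TIFF suggestion roughly as follows. This is a sketch, not a tested recipe: the helper names are hypothetical, and it assumes ddjvu is on the PATH and accepts the single-dash option spellings already used in that call.

```python
import subprocess


def ddjvu_mask_command(djvu_file, out_file, verbose=False):
    """Build the ddjvu argument list for writing the mask layer as TIFF."""
    cmd = ["ddjvu", "-format=tiff", "-mode=mask"]
    if verbose:
        # Ask ddjvu to report what it is doing while it renders.
        cmd.append("-verbose")
    cmd += [djvu_file, out_file]
    return cmd


def extract_mask(djvu_file, out_file, verbose=False):
    """Run ddjvu; raises CalledProcessError if it exits with an error."""
    subprocess.run(ddjvu_mask_command(djvu_file, out_file, verbose), check=True)
```

Running tiffinfo on the resulting file should then show which encoding was chosen.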
Thank you very much for your patience and the detailed explanation! My problems were caused by the conversion steps.
I will follow your suggestion and use the TIFF output. To my pleasant surprise the next program in the pipeline accepts TIFFs as the input, so conversion is not needed.
FYI, I intend to process font tables uploaded to https://github.com/jsbien/early_fonts_inventory/tree/main/font_tables/oDjvu. I need binary files for processing and, instead of running some binarization tool, I prefer to use the mask created while converting scans with didjvu.
Please close the issue.
I was too optimistic: the impression that the program accepts ddjvu-produced TIFFs as binary graphics was an illusion due to some mistake of mine. I am trying to persuade ChatGPT to correct the script (https://github.com/jsbien/tmp); we'll see what happens.
The problem turned out not to be related to the file format; it is now fixed.