
#357 The mask extracted with ddjvu is not binary, but greyscale

Group: djvulibre
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2024-10-18
Created: 2024-08-26
Creator: Janusz
Private: No

The title says all.

Discussion

  • Janusz

    Janusz - 2024-08-26

    After converting to PNG, identify reports:
    Augezdecki-01a_PT08_403_mask.png PNG 1711x353 1711x353+0+0 8-bit Gray 2c 9285B 0.000u 0:00.000
    I just noticed that for the original mask I get:
    Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit Bilevel Gray 75554B 0.000u 0:00.001
    ChatGPT says: "The output you provided from the identify command in ImageMagick indicates that the image is in 8-bit grayscale with two colors (2c), which means it is not truly binary."

     

    Last edit: Janusz 2024-08-26
    • Leon Bottou

      Leon Bottou - 2024-08-26

      ddjvu -1 should always return a binary mask as a PBM file.
      PBM files can only represent binary images anyway.
      https://linux.die.net/man/5/pbm

      On 2024-08-26 11:26, Janusz wrote:

      Perhaps I misinterpreted the identify output:
      Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit
      Bilevel Gray 75554B 0.000u 0:00.001. Moreover the mask file was
      rejected as non-binary by a Python script written for me by ChatGPT. I
      have to investigate the problem more closely.



       
  • Janusz

    Janusz - 2024-08-26

    Thanks for your quick answer, but what exactly do you mean by ddjvu -1? What is the complete invocation? I use in Python
    subprocess.run(["ddjvu", "-format=pbm", "-mode=mask", djvu_file, pbm_file])
    and the output does not seem to be binary, especially when converted to PNG (this may of course be the fault of the conversion).
    Let me repeat what I posted 2 hours ago, as it does not seem to have been distributed by mail. For the original mask I get:
    Augezdecki-01a_PT08_403_mask.pbm PBM 1711x353 1711x353+0+0 1-bit Bilevel Gray 75554B 0.000u 0:00.001
    ChatGPT says: "The output you provided from the identify command in ImageMagick indicates that the image is in 8-bit grayscale with two colors (2c), which means it is not truly binary."

     
    • Leon Bottou

      Leon Bottou - 2024-08-27

      Option -1 or --subsample=1 ensures that the output resolution matches
      that of the input image.
      Note that this is the default when you use the --format option (which
      you should always use.)

      Note that when you use --format=pbm, you force the output to be a pbm
      file which can only encode binary images.
      Even if you were to subsample the image (making the mask gray level), it
      would be thresholded into a binary image to produce a pbm file.
      These PBM files start with the two letters "P4". If this is the case,
      this is a binary image and nothing else. If the png is not binary, then
      this must come from the downstream conversion. Note that you can also
      use "--verbose" to let ddjvu tell you.
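The "P4" check described above can be sketched in Python, the language of the script quoted in this thread. This is only a minimal sketch and not part of ddjvu: the helper name is_raw_pbm is hypothetical, and it inspects nothing but the first two bytes of the file.

```python
def is_raw_pbm(path):
    """Return True if the file starts with the raw PBM magic number "P4".

    Hypothetical helper: it reads only the two-byte magic number and does
    not validate the rest of the header or the pixel data.
    """
    with open(path, "rb") as f:
        return f.read(2) == b"P4"
```

If this returns True for the file written by ddjvu but a later PNG is reported as 8-bit gray, the change happened in the conversion step, not in ddjvu.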

      If you do not like these pbm/pgm/ppm files, you might like --format=tiff
      which is usually smart enough to decide a binary, grey level, or color
      encoding. You can then use the program tiffinfo to know what you got. The
      TIFF output automatically uses lossless CCITT-T4 compression for binary images. For
      other images, it uses packbits (the most portable lossless scheme) by
      default, jpeg compression with option "--quality=<1to100>", and
      additionally understands "--quality=deflate|lzw|raw". People usually
      say that this works nicely for them. Then you might use a program like
      tiff2png which I believe keeps the tiff setup into the png (as much as
      png allows). Best would be to use the tiff files directly though
      (they're more flexible).
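Where tiffinfo is not at hand, the encoding chosen for a TIFF file can also be checked from a script by reading its BitsPerSample tag. The following is a rough sketch using only the Python standard library. It assumes a classic (non-BigTIFF) file and looks only at the first image directory; the function name tiff_bits_per_sample is hypothetical.

```python
import struct


def tiff_bits_per_sample(path):
    """Return the BitsPerSample values of the first image in a TIFF file.

    Hypothetical helper; [1] indicates a bilevel (binary) image.
    """
    with open(path, "rb") as f:
        data = f.read()

    # The first two bytes give the byte order: "II" little, "MM" big endian.
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF file")

    # Next come the magic number 42 and the offset of the first directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(n_entries):
        pos = ifd_offset + 2 + 12 * i
        tag, typ, count = struct.unpack_from(endian + "HHI", data, pos)
        if tag == 258 and typ == 3:  # BitsPerSample, stored as SHORT values
            start = pos + 8  # values are inline when they fit in 4 bytes
            if count * 2 > 4:
                # Otherwise the field holds an offset to the values.
                (start,) = struct.unpack_from(endian + "I", data, pos + 8)
            return list(struct.unpack_from(endian + "%dH" % count, data, start))

    return [1]  # the TIFF default when the tag is absent
```

A result of [1] means one bit per pixel, [8] a grey-level image, and [8, 8, 8] a color image.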


       
  • Janusz

    Janusz - 2024-08-28

    Thank you very much for your patience and the detailed explanation! My problems were caused by the conversion steps.
    I will follow your suggestion and use the TIFF output. To my pleasant surprise the next program in the pipeline accepts TIFFs as the input, so conversion is not needed.
    FYI, I intend to process font tables uploaded to https://github.com/jsbien/early_fonts_inventory/tree/main/font_tables/oDjvu. I need binary files for processing and, instead of running some binarization tool, I prefer to use the mask created while converting scans with didjvu.
    Please close the issue.

     
  • Janusz

    Janusz - 2024-08-28

    I was too optimistic: the impression that the program accepts the ddjvu-produced TIFF as binary graphics was an illusion due to some mistake of mine. I am trying to persuade ChatGPT to correct the script (https://github.com/jsbien/tmp); we'll see what happens.

     
  • Janusz

    Janusz - 2024-08-28

    The problem turned out to be unrelated to the file format; it is now fixed.

     
  • Leon Bottou

    Leon Bottou - 2024-10-18
    • status: open --> closed
     
