
About djvu4shape and lossless dictionaries for a large number of pages

2019-08-20
2020-10-13
  • Alexander Trufanov

    Some thoughts about djvu4shape and the creation of a lossless common
    shapes dictionary for a large number of pages.

    in reply to
    https://sourceforge.net/p/djvu/discussion/103286/thread/1b0de7aa93/#6856

    Building issues

    Well, I was able to build it.

    I ran into "undefined symbol: atomicDecrement" and "Could not get fbjb"
    issues. To bypass them I copied the up-to-date (v3.5.27) sources of
    libDjVuLibre into the corresponding subfolder of src/ and replaced '#define
    THREADMODEL POSIXTHREADS' with '#define HAVE_PTHREAD 1' in src/config.h

    It works: it loads a DjVu, displays the letters and highlights them. The
    only issue is that the list of page thumbnails is empty; I used PgUp/PgDn
    to navigate pages. I don't know whether that's a bug or expected behavior.

    I think the problem with the code is that it contains a copy of all
    libDjVuLibre internal headers, in order to access its private functions in
    addition to the public API. Those headers contain some inline or external
    (depending on build configuration) functions. As far as I can see,
    djvu4shapes expects 'libdjvulibre21 (>= 3.5.25.3)' (as stated in the deb
    control file), while modern systems ship 3.5.27.1-10 in their
    repositories. Their APIs seem to be binary compatible, but the headers are
    not compatible in their private internals, in particular atomic.h.

    djvudict

    As for the app itself: I made something quite similar, but based on
    minidjvu instead of libDjVuLibre. In fact libDjVuLibre contains part of
    minidjvu's code, but as far as I remember uses it only for polishing local
    Sjbz dictionaries. My app can be found here:
    https://github.com/trufanov-nok/djvudict It's organized as a patch to the
    minidjvu codebase. It's alpha quality, as I never finished it: I am its
    only user, and I'm fine with its current functionality. It has no GUI and
    just extracts blits and their metadata, not only per page but also per
    shared dictionary. That was the main idea: to find out why similar letters
    are not encoded into the shared dictionary and instead end up in a page's
    local dictionary. I thought about visualizing this, but my first idea was
    to export djvudict's metadata to some database file and use it in a
    compatible viewer (another djview fork?) to highlight the letters. In the
    end it turned out to be unnecessary, once I realized that minidjvu is a
    good enough encoder and no further detailed comparison of commercially
    encoded and open-source encoded DjVus is required.

    djvu4shapes and shared dictionary info

    As for djvu4shapes, its ability to highlight blits is just great. But as
    far as I can see it focuses on gathering prototype (parent) statistics and
    counting occurrences. I have a concern about that:

    The app doesn't go deep enough into libdjvulibre's guts. It doesn't get
    information from the library about the shared dictionary used by the
    current page (if any).

    Just in case some encoding details:

    Each page is a JB2 image. A JB2 image is just a sequence of instructions,
    each of which may be one of the following:

    {

    render new shape at (x,y) ;

    render new shape at (x,y) and remember it with id N for further use;

    just remember a new shape with id N;

    just render already known shape id N at (x,y);

    and a few more..

    }

    so each JB2 image has a local dictionary that fills up while the page is
    rendered; it lets the encoder avoid re-encoding the same or similar
    shapes, minimizing file size.

    One can scan a book, encode each scanned page image to DjVu with
    DjVuLibre's cjb2 encoder (it can encode only single pages) and then
    combine the pages with DjVuLibre's djvm tool into one multipage bundled
    document. It will be a valid document, but its file size will still be far
    from ideal, because the local dictionaries of the pages still contain many
    shapes that are similar across pages.

    Afaik this was addressed in the next DjVu standard by introducing shared
    dictionaries (Djbz). A Djbz is basically a JB2 image too, but it has size
    (0,0) and contains only instructions of the type {just remember a new
    shape with id N;}. So it's essentially a hidden page in the document. Any
    real page can refer to a shared dictionary by name and request that its
    own local dictionary be prefilled with the first M shapes from it, and
    then start rendering. The page can still add new shapes to its local
    dictionary, starting from index M+1.
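    The prefill mechanism can be sketched in a few lines of Python. The names
    are hypothetical, not DjVuLibre API; this only illustrates the M-shape
    prefill followed by page-local additions.

```python
# Sketch of shared-dictionary prefill: a page copies the first M shapes
# of a Djbz into its local dictionary, then keeps adding its own shapes.

shared_djbz = ["A", "B", "C", "D"]   # a Djbz: shapes only, page size (0,0)

def start_page(djbz, m):
    """Prefill the page's local dictionary with the first m shared shapes."""
    return list(djbz[:m])

local = start_page(shared_djbz, m=3)   # ids 0..2 come from the Djbz
local.append("page-only-shape")        # page's own shapes occupy slot M+1 on
print(local)
```

    Note that the page may request fewer shapes than the Djbz holds (here 3
    of 4), which is why viewers that ignore M and copy everything waste RAM.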

    So djvu4shapes doesn't go deep enough into djvulibre to obtain this
    information. It just gets a shape and its parent id, but not whether the
    shape belongs to the local dictionary proper or to the part prefilled from
    a shared dictionary - nor which shared dictionary that was.

    For example, a 100-page document with 10 occurrences of the same letter A
    on each page can be encoded without shared dictionaries at all. In that
    case it will contain 100 local dictionaries, which means 100 copies of A,
    each reused 10 times. Or it may be encoded with shared dictionaries at 10
    pages per dictionary. In that case it will have 100 pages plus 10 hidden
    pages - the shared dictionaries. The letter A will be stored in the shared
    dictionaries, so there will be 10 copies of A in the document, each used
    100 times (10 occurrences on each of 10 pages).
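    The arithmetic of that example, spelled out: both encodings place the same
    1000 blits, but they store a different number of copies of the bitmap.

```python
# Both encodings of the example produce 1000 placements of 'A',
# but differ in how many copies of the bitmap are stored.

pages, per_page = 100, 10
total_blits = pages * per_page        # 1000 placements of 'A' overall

# Without shared dictionaries: one stored copy of 'A' per page.
copies_local = pages                  # 100 stored bitmaps
uses_per_copy_local = per_page        # each reused 10 times

# With shared dictionaries at 10 pages per dictionary:
dicts = pages // 10                   # 10 hidden Djbz "pages"
copies_shared = dicts                 # 10 stored bitmaps
uses_per_copy_shared = per_page * 10  # each used 100 times

assert copies_local * uses_per_copy_local == total_blits
assert copies_shared * uses_per_copy_shared == total_blits
print(copies_local, copies_shared)  # 100 vs 10 stored copies
```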

    But djvu4shapes just looks shapes up in a QHash by their image bytes,
    ignoring the dictionary origin of those bytes. So for either of these
    documents it will display one shape A used 1000 times - which is what
    ideal encoding would give, but not what is actually in the file.

    So my concern is that the statistics it displays look better than the
    really encoded data.
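    The byte-only lookup effect can be demonstrated with two counters over the
    same blit list. The per-dictionary metadata shown here is hypothetical
    (it's exactly what djvu4shapes currently doesn't have); djvu4shapes itself
    is C++/Qt, this is just a Python illustration.

```python
# Counting shapes by raw bytes overstates sharing: identical bitmaps
# stored in 10 different dictionaries collapse into a single hash key.
from collections import Counter

# Each blit records its shape's bytes plus (hypothetically) the
# dictionary the shape came from: 10 Djbz copies, 100 uses each.
blits = [(f"djbz{i}", b"bitmap-A") for i in range(10) for _ in range(100)]

by_bytes = Counter(shape for _, shape in blits)   # what djvu4shapes does
by_origin = Counter(blits)                        # dictionary-aware count

print(by_bytes[b"bitmap-A"])              # 1000: looks like one shape
print(by_origin[("djbz0", b"bitmap-A")])  # 100: the real per-copy usage
```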

    About performance and the possibility of creating a lossless common
    shapes dictionary for a large number of pages:

    1. DjVu encoders use only 10 or 20 pages per shared dictionary because it
      is very time- and sometimes memory-consuming. The percentage of exactly
      equal characters on page scans is quite low: they always differ by a few
      bits of shade or angle. So a good encoder must use smart image-comparison
      methods.

    djvu4shape, by contrast, works with a ready-made DjVu that has already
    been encoded, so the percentage of exactly equal images in it will be much
    higher - the encoder has already replaced the near-duplicates.

    Thus simply comparing shapes byte by byte will work well, but we should
    remember that this is a second pass over an already processed document,
    and the result depends heavily on how aggressive the initial encoding was.

    2. Creating a lossless common shapes dictionary for a large number of
      pages technically seems to require one of the following:

    a. a shared dictionary for shared dictionaries;

    b. one big shared dictionary for all pages, produced by merging all
    shared dictionaries in the document;

    c. a new global shared dictionary created for pages that had no shared
    dictionaries before.

    The simplest case is c. There is a project, JB2Unify, a patch for
    djvulibre adding a new tool that does pretty much the same:
    https://github.com/velwant/jb2unify

    It just looks through all local dictionaries and pages and forms a new
    shared dictionary that contains only the exactly equal shapes. The
    compression gains from this are reportedly modest.
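    My reading of that approach (not JB2Unify's actual code) can be sketched
    as: count byte-identical shapes across all local dictionaries and promote
    the ones occurring more than once into a new shared dictionary.

```python
# Sketch of the "equal shapes only" unification idea: shapes that are
# byte-identical across pages move to a new shared dictionary; anything
# that differs by even one pixel stays in its page's local dictionary.
from collections import Counter

local_dicts = {
    "page1": [b"A", b"B", b"x1"],
    "page2": [b"A", b"C", b"x2"],
    "page3": [b"A", b"B", b"x3"],
}

counts = Counter(s for shapes in local_dicts.values() for s in shapes)
shared = sorted(s for s, n in counts.items() if n > 1)  # byte-equal only

print(shared)  # [b'A', b'B']: the tilted/shaded near-duplicates never qualify
```

    This also shows why the gains are modest: a slightly tilted 'A' is not
    byte-equal to a straight one, so it stays local.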

    In case a, I'd say I have never seen such a document. A shared dictionary
    for shared dictionaries may be outside the DjVu standard (though most
    probably isn't). Most importantly, DjVuLibre must support rendering such
    documents: if its implementation doesn't, no one will be able to read such
    a document in the next 10 years, regardless of whether it fits the
    standard.

    Case b: if one just combines all shared dictionaries into one by merging
    the equal shapes, it won't be a good design. First of all, it will
    increase RAM usage in viewers (because, as far as I have seen, most of
    them don't bother honoring the number of shapes to prefill from the shared
    dictionary and simply copy all of them). It will also increase the maximum
    shape id in the document, and as I understand it these ids are encoded in
    a tricky way: the bigger the id, the more bits it requires. I once
    experimented with this, and with a big shared dictionary the effect on
    file size was noticeable.
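    The id-cost effect can be illustrated with a back-of-the-envelope model.
    JB2 actually encodes ids with an adaptive number coder, so the exact bit
    counts below are not real; the trend (a larger dictionary makes every
    shape reference more expensive) is the point.

```python
# Illustration only: approximate the cost of a shape-id reference as
# ceil(log2(dictionary size)). The real JB2 number coder is adaptive,
# but shows the same trend: bigger id space -> more bits per reference.
from math import ceil, log2

def bits_for_ids(dict_size):
    return ceil(log2(dict_size)) if dict_size > 1 else 1

small = bits_for_ids(256)      # a modest per-chunk dictionary
big = bits_for_ids(100_000)    # one merged global dictionary
print(small, big)  # 8 vs 17 bits per shape reference
```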

    So the merged shared dictionary (which is in fact a hidden JB2 image)
    should at least be re-encoded to find new prototypes within it, because
    encoders usually look for prototypes across the shared dictionary plus the
    page that uses it. Thus pages that use different shared dictionaries never
    share prototypes.

    So

    I. From a performance point of view, creating a lossless common shapes
    dictionary for a large number of pages is doable if it's only for
    informational purposes, as the image-comparison function is simple.

    II. Using the information obtained in step I for DjVu re-encoding or
    optimization is problematic; it should be researched.

    III. If II is doable, then the more interesting task is not searching for
    equal images but searching for images that can serve as global prototypes
    (parents). That isn't as simple and may in turn seriously hurt the
    performance of step I.

    P.S. If you just need to gather statistics and display/highlight
    something, it may be interesting to define a standard for such
    information, extract it with the djvudict tool (or a fork of it), and
    implement only a metadata display layer in viewers like djview.
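    As a strawman for such a standard, a line-oriented JSON export would be
    easy for both a djvudict fork to emit and a viewer to consume. All field
    names below are made up for illustration; nothing like this exists yet.

```python
# Hypothetical per-shape metadata record for a djvudict export format.
# Field names are invented; a viewer would only need a JSON parser.
import json

record = {
    "shape_id": 42,
    "dictionary": "djbz3",   # origin: shared-dict name, or null for local
    "parent_id": 17,         # prototype (parent) shape, if any
    "blits": [{"page": 5, "x": 100, "y": 240}],
}
line = json.dumps(record)    # one record per line (JSON Lines style)
print(line)
```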

    --
    With best regards,
    Alexander Trufanov

     

    Last edit: Alexander Trufanov 2019-08-20
  • Janusz

    Janusz - 2019-08-20

    Thank you very much for your interest in djview4shapes. To make a long story short, it was done in a quick and dirty way due to various limitations. Unfortunately, after the project ended I was unable to get funds to sponsor improvements.
    BTW, another program of ours (https://bitbucket.org/jsbien/ndt/wiki/wyniki#!djview-for-poliqarp-zdalny-klient-graficzny-serwera-poliqarp-for-djvu-remote-graphical-client-for-poliqarp-for-djvu) received some improvements from a volunteer (a Computer Science student in Warsaw). Perhaps his changes in the code of djview4poliqarp may be of some interest to you (he claimed he found a strange bug/feature in the DjVu library, but had no motivation to report it).

    Best regards

    Janusz

     
  • Janusz

    Janusz - 2020-10-11

    Thank you once again for your interest in djview4shapes.

    Everything takes longer... :-)
    Only today did I apply the changes you proposed and successfully compile on Debian buster. I'm curious whether you also encountered the problems described in https://github.com/jsbien/djview4shapes/issues/1. Please comment on the issue; I would appreciate that very much.

     
  • Alexander Trufanov

    Hi! Commented on github.

     
  • Janusz

    Janusz - 2020-10-13

    I thank you very much for your important contribution.

     
