
About djvu4shape and lossless dictionaries for a large number of pages

2019-08-20
2020-10-13
  • Alexander Trufanov

    Some thoughts about djvu4shape and the creation of a lossless common
    shapes dictionary for a large number of pages.

    in reply to
    https://sourceforge.net/p/djvu/discussion/103286/thread/1b0de7aa93/#6856

    Building issues

    Well, I was able to build it.

    I ran into "undefined symbol: atomicDecrement" and "Could not get fbjb"
    issues. To bypass them I copied the up-to-date (v3.5.27) sources of
    libDjVuLibre into the corresponding subfolder of src/ and replaced '#define
    THREADMODEL POSIXTHREADS' with '#define HAVE_PTHREAD 1' in src/config.h

    It works: it loads a DjVu, displays the letters and highlights them. The
    only issue is that the list of page thumbnails is empty; I used PgUp/PgDn
    to navigate pages. I don't know whether that's a bug or expected behavior.

    I think the problem with the code is that it contains a copy of all
    libDjVuLibre internal headers, in order to access its private functions in
    addition to the public API. Those headers contain some inline or external
    (depending on build configuration) functions. As far as I can see,
    djvu4shapes expects 'libdjvulibre21 (>= 3.5.25.3)' (as stated in the deb
    control file), while modern systems ship 3.5.27.1-10 in their
    repositories. Their APIs seem to be binary compatible, but the headers are
    not compatible in their private internals, in particular atomic.h.

    djvudict

    As for the app itself: I made something quite similar, but based on
    minidjvu instead of libDjVuLibre. In fact libDjVuLibre contains part of
    minidjvu's code, but as far as I remember uses it only for polishing local
    Sjbz dictionaries. My app can be found here:
    https://github.com/trufanov-nok/djvudict It's organized as a patch to the
    minidjvu codebase. It's alpha quality, as I never finished it: I am its
    only user, and I'm fine with its current functionality. It has no GUI and
    just extracts blits and their metadata, not only per page but also per
    shared dictionary. That was the main idea: to find out why similar letters
    are not encoded into the shared dictionary and instead end up in a page's
    local dictionary. I thought about visualizing this, but my first idea was
    to export djvudict's metadata to some database file and use it in a
    compatible viewer (another djview fork?) to highlight the letters. In the
    end it turned out to be unnecessary, once I realized that minidjvu is a
    good enough encoder and no further detailed comparison of commercially
    encoded and open-source encoded DjVus is required.

    djvu4shapes and shared dictionary info

    As for djvu4shapes, its ability to highlight blits is just great. But as
    far as I can see it focuses on gathering prototype (parent) statistics and
    counting occurrences. I have a concern about that:

    The app doesn't go deep enough into libdjvulibre's guts. It doesn't get
    information from the library about the shared dictionary used by the
    current page (if any).

    Just in case some encoding details:

    Each page is a JB2 image. A JB2 image is just a sequence of instructions,
    each of which may be one of the following:

    {

    render new shape at (x,y) ;

    render new shape at (x,y) and remember it with id N for further use;

    just remember a new shape with id N;

    just render already known shape id N at (x,y);

    and a few more..

    }

    so each JB2 image has a local dictionary that fills up while the page is
    rendered; it lets the encoder avoid re-encoding the same or similar
    shapes, minimizing file size.

    One can scan a book, encode each scanned page image to DjVu with
    DjVuLibre's cjb2 encoder (it can encode only single pages) and then
    combine the pages with DjVuLibre's djvm tool into one multipage bundled
    document. It will be a valid document, but its file size will still be far
    from ideal, because the local dictionaries of the pages still contain many
    shapes that are similar across pages.

    Afaik this was addressed in the next DjVu standard by introducing shared
    dictionaries (Djbz). A Djbz is basically a JB2 image too, but it has size
    (0,0) and contains only instructions of the type {just remember a new
    shape with id N;}. So it's essentially a hidden page in the document. Any
    real page can refer to a shared dictionary by name and request that its
    own local dictionary be prefilled with the first M shapes from it, and
    then start rendering. The page can still add new shapes to its local
    dictionary, starting from index M+1.
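    The prefill mechanism can be sketched in a few lines of Python. The names
    are hypothetical, not DjVuLibre API; this only illustrates the M-shape
    prefill followed by page-local additions.

```python
# Sketch of shared-dictionary prefill: a page copies the first M shapes
# of a Djbz into its local dictionary, then keeps adding its own shapes.

shared_djbz = ["A", "B", "C", "D"]   # a Djbz: shapes only, page size (0,0)

def start_page(djbz, m):
    """Prefill the page's local dictionary with the first m shared shapes."""
    return list(djbz[:m])

local = start_page(shared_djbz, m=3)   # ids 0..2 come from the Djbz
local.append("page-only-shape")        # page's own shapes occupy slot M+1 on
print(local)
```

    Note that the page may request fewer shapes than the Djbz holds (here 3
    of 4), which is why viewers that ignore M and copy everything waste RAM.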

    So djvu4shapes doesn't go deep enough into djvulibre to obtain this
    information. It just gets a shape and its parent id, but not whether the
    shape belongs to the local dictionary proper or to the part prefilled from
    a shared dictionary - nor which shared dictionary that was.

    For example, a 100-page document with 10 occurrences of the same letter A
    on each page can be encoded without shared dictionaries at all. In that
    case it will contain 100 local dictionaries, which means 100 copies of A,
    each reused 10 times. Or it may be encoded with shared dictionaries at 10
    pages per dictionary. In that case it will have 100 pages plus 10 hidden
    pages - the shared dictionaries. The letter A will be stored in the shared
    dictionaries, so there will be 10 copies of A in the document, each used
    100 times (10 occurrences on each of 10 pages).
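    The arithmetic of that example, spelled out: both encodings place the same
    1000 blits, but they store a different number of copies of the bitmap.

```python
# Both encodings of the example produce 1000 placements of 'A',
# but differ in how many copies of the bitmap are stored.

pages, per_page = 100, 10
total_blits = pages * per_page        # 1000 placements of 'A' overall

# Without shared dictionaries: one stored copy of 'A' per page.
copies_local = pages                  # 100 stored bitmaps
uses_per_copy_local = per_page        # each reused 10 times

# With shared dictionaries at 10 pages per dictionary:
dicts = pages // 10                   # 10 hidden Djbz "pages"
copies_shared = dicts                 # 10 stored bitmaps
uses_per_copy_shared = per_page * 10  # each used 100 times

assert copies_local * uses_per_copy_local == total_blits
assert copies_shared * uses_per_copy_shared == total_blits
print(copies_local, copies_shared)  # 100 vs 10 stored copies
```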

    But djvu4shapes just looks shapes up in a QHash by their image bytes,
    ignoring the dictionary origin of those bytes. So for either of these
    documents it will display one shape A used 1000 times - which is what
    ideal encoding would give, but not what is actually in the file.

    So my concern is that the statistics it displays look better than the
    really encoded data.
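    The byte-only lookup effect can be demonstrated with two counters over the
    same blit list. The per-dictionary metadata shown here is hypothetical
    (it's exactly what djvu4shapes currently doesn't have); djvu4shapes itself
    is C++/Qt, this is just a Python illustration.

```python
# Counting shapes by raw bytes overstates sharing: identical bitmaps
# stored in 10 different dictionaries collapse into a single hash key.
from collections import Counter

# Each blit records its shape's bytes plus (hypothetically) the
# dictionary the shape came from: 10 Djbz copies, 100 uses each.
blits = [(f"djbz{i}", b"bitmap-A") for i in range(10) for _ in range(100)]

by_bytes = Counter(shape for _, shape in blits)   # what djvu4shapes does
by_origin = Counter(blits)                        # dictionary-aware count

print(by_bytes[b"bitmap-A"])              # 1000: looks like one shape
print(by_origin[("djbz0", b"bitmap-A")])  # 100: the real per-copy usage
```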

    About performance and the possibility of creating a lossless common
    shapes dictionary for a large number of pages:

    1. DjVu encoders use only 10 or 20 pages per shared dictionary because it
      is very time- and sometimes memory-consuming. The percentage of exactly
      equal characters on page scans is quite low: they always differ by a few
      bits of shade or angle. So a good encoder must use smart image-comparison
      methods.

    djvu4shape, by contrast, works with a ready-made DjVu that has already
    been encoded, so the percentage of exactly equal images in it will be much
    higher - the encoder has already replaced the near-duplicates.

    Thus simply comparing shapes byte by byte will work well, but we should
    remember that this is a second pass over an already processed document,
    and the result depends heavily on how aggressive the initial encoding was.

    2. Creating a lossless common shapes dictionary for a large number of
      pages technically seems to require one of the following:

    a. a shared dictionary for shared dictionaries;

    b. one big shared dictionary for all pages, produced by merging all
    shared dictionaries in the document;

    c. a new global shared dictionary created for pages that had no shared
    dictionaries before.

    The simplest case is c. There is a project, JB2Unify, a patch for
    djvulibre adding a new tool that does pretty much the same:
    https://github.com/velwant/jb2unify

    It just looks through all local dictionaries and pages and forms a new
    shared dictionary that contains only the exactly equal shapes. The
    compression gains from this are reportedly modest.
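    My reading of that approach (not JB2Unify's actual code) can be sketched
    as: count byte-identical shapes across all local dictionaries and promote
    the ones occurring more than once into a new shared dictionary.

```python
# Sketch of the "equal shapes only" unification idea: shapes that are
# byte-identical across pages move to a new shared dictionary; anything
# that differs by even one pixel stays in its page's local dictionary.
from collections import Counter

local_dicts = {
    "page1": [b"A", b"B", b"x1"],
    "page2": [b"A", b"C", b"x2"],
    "page3": [b"A", b"B", b"x3"],
}

counts = Counter(s for shapes in local_dicts.values() for s in shapes)
shared = sorted(s for s, n in counts.items() if n > 1)  # byte-equal only

print(shared)  # [b'A', b'B']: the tilted/shaded near-duplicates never qualify
```

    This also shows why the gains are modest: a slightly tilted 'A' is not
    byte-equal to a straight one, so it stays local.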

    In case a, I'd say I have never seen such a document. A shared dictionary
    for shared dictionaries may be outside the DjVu standard (though most
    probably isn't). Most importantly, DjVuLibre must support rendering such
    documents: if its implementation doesn't, no one will be able to read such
    a document in the next 10 years, regardless of whether it fits the
    standard.

    Case b: if one just combines all shared dictionaries into one by merging
    the equal shapes, it won't be a good design. First of all, it will
    increase RAM usage in viewers (because, as far as I have seen, most of
    them don't bother honoring the number of shapes to prefill from the shared
    dictionary and simply copy all of them). It will also increase the maximum
    shape id in the document, and as I understand it these ids are encoded in
    a tricky way: the bigger the id, the more bits it requires. I once
    experimented with this, and with a big shared dictionary the effect on
    file size was noticeable.
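    The id-cost effect can be illustrated with a back-of-the-envelope model.
    JB2 actually encodes ids with an adaptive number coder, so the exact bit
    counts below are not real; the trend (a larger dictionary makes every
    shape reference more expensive) is the point.

```python
# Illustration only: approximate the cost of a shape-id reference as
# ceil(log2(dictionary size)). The real JB2 number coder is adaptive,
# but shows the same trend: bigger id space -> more bits per reference.
from math import ceil, log2

def bits_for_ids(dict_size):
    return ceil(log2(dict_size)) if dict_size > 1 else 1

small = bits_for_ids(256)      # a modest per-chunk dictionary
big = bits_for_ids(100_000)    # one merged global dictionary
print(small, big)  # 8 vs 17 bits per shape reference
```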

    So the merged shared dictionary (which is in fact a hidden JB2 image)
    should at least be re-encoded to find new prototypes within it, because
    encoders usually look for prototypes across the shared dictionary plus the
    page that uses it. Thus pages that use different shared dictionaries never
    share prototypes.

    So

    I. From a performance point of view, creating a lossless common shapes
    dictionary for a large number of pages is doable if it's only for
    informational purposes, as the image-comparison function is simple.

    II. Using the information obtained in step I for DjVu re-encoding or
    optimization is problematic; it should be researched.

    III. If II is doable, then the more interesting task is not searching for
    equal images but searching for images that can serve as global prototypes
    (parents). That isn't as simple and may in turn seriously hurt the
    performance of step I.

    P.S. If you just need to gather statistics and display/highlight
    something, it may be interesting to define a standard for such
    information, extract it with the djvudict tool (or a fork of it), and
    implement only a metadata display layer in viewers like djview.
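    As a strawman for such a standard, a line-oriented JSON export would be
    easy for both a djvudict fork to emit and a viewer to consume. All field
    names below are made up for illustration; nothing like this exists yet.

```python
# Hypothetical per-shape metadata record for a djvudict export format.
# Field names are invented; a viewer would only need a JSON parser.
import json

record = {
    "shape_id": 42,
    "dictionary": "djbz3",   # origin: shared-dict name, or null for local
    "parent_id": 17,         # prototype (parent) shape, if any
    "blits": [{"page": 5, "x": 100, "y": 240}],
}
line = json.dumps(record)    # one record per line (JSON Lines style)
print(line)
```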

    --
    With best regards,
    Alexander Trufanov

     

    Last edit: Alexander Trufanov 2019-08-20
  • Janusz

    Janusz - 2019-08-20

    Thank you very much for your interest in djview4shapes. To make a long story short, it was done in a quick and dirty way due to various limitations. Unfortunately, after the project ended I was unable to get funds to sponsor improvements.
    BTW, another program of ours (https://bitbucket.org/jsbien/ndt/wiki/wyniki#!djview-for-poliqarp-zdalny-klient-graficzny-serwera-poliqarp-for-djvu-remote-graphical-client-for-poliqarp-for-djvu) received some improvements from a volunteer (a Computer Science student in Warsaw). Perhaps his changes in the code of djview4poliqarp may be of some interest to you (he claimed he found a strange bug/feature in the DjVu library, but had no motivation to report it).

    Best regards

    Janusz

     
  • Janusz

    Janusz - 2020-10-11

    Thank you once again for your interest in djview4shapes.

    Everything takes longer... :-)
    Only today did I apply the changes you proposed and successfully compile on Debian buster. I'm curious whether you also encountered the problems described in https://github.com/jsbien/djview4shapes/issues/1. Please comment on the issue; I would appreciate that very much.

     
  • Alexander Trufanov

    Hi! Commented on github.

     
  • Janusz

    Janusz - 2020-10-13

    I thank you very much for your important contribution.

     
