Files are too big

ManDay
2011-01-12
2013-03-05
  • ManDay

    ManDay - 2011-01-12

    A 9-page grayscale document with 30 lines of text per page, at a resolution good enough to distinguish the details of the letters, is about 200K in size.

    A single page of a Xournal .xoj file is about the same size with only half as much text on it. I wouldn't want to know how many megabytes 9 pages with 30 lines per page would take…

    Can this please be resolved? I don't know where all this superfluous data goes, but it surely must be spent in the wrong place, given that the quality / informational content of the grayscale PDF and the XOJ are the same, yet the XOJ is more than 10 times the size of the bitmapped PDF (and so is the PDF exported by Xournal).

    Something is wrong here. An inefficient file format, an inefficient storage algorithm, an enormous amount of redundant samples (a 10 inch long line should not take any more data than a 0.1 inch long line), or an inefficient sampling algorithm?

    Striving for "superior graphical appeal", as stated in Xournal's mission, is not a noteworthy feature per se, not if you only achieve it by wasting resources. It's a reasonable balance between quality and efficiency that sets Xournal apart from other programs.
    That means sampling and storing data so that the perceived quality is what is required, and yet no data is wasted on redundancy.

    A simple criterion is, for example, that if the user makes input at a certain zoom level, the sampling range SHALL NOT BE GREATER THAN HALF A PIXEL! (Hopefully that is already true)

    Not to mention that the sampled data then has to be optimized for vector storage. If three successive points lie on a line, where the second point is offset from the line BY HALF A PIXEL OR LESS, the second point should not be stored!
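
    As a rough illustration of this criterion, here is a minimal Python sketch. It is not Xournal code: the function names and the 0.5 default tolerance are assumptions chosen to match the "half a pixel" wording above, and coordinates are taken to be in pixels at the zoom level where the stroke was drawn.

```python
import math


def perpendicular_offset(p, a, b):
    """Distance from point p to the straight line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        # a and b coincide, so fall back to the plain point distance
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def drop_near_collinear(points, tolerance=0.5):
    """Drop every interior point that lies within `tolerance` of the line
    joining the last kept point and the next raw point."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        if perpendicular_offset(points[i], kept[-1], points[i + 1]) > tolerance:
            kept.append(points[i])
    kept.append(points[-1])
    return kept


straight = [(float(x), 0.0) for x in range(60)]   # 60 samples on one line
corner = [(0.0, 0.0), (5.0, 0.0), (5.0, 5.0)]     # a real change of direction
print(drop_near_collinear(straight))
print(drop_near_collinear(corner))
```

    This is a greedy single pass, so on a long, gently curving stroke the accumulated deviation can exceed the tolerance. A recursive scheme such as Ramer-Douglas-Peucker bounds the error over the whole stroke, at the cost of more work per stroke.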

     
  • Denis Auroux

    Denis Auroux - 2011-01-12

    With my fairly dense handwriting, a page is about 100K. Given that such a page consists of about 20000 line segments, I don't consider the amount to be unreasonable.

    The sampling range needs to be less than a pixel, otherwise graphical quality is not nearly as good. With a tablet, the input resolution is extremely high; my chickenscratch can be zoomed to 5 times the original size and remains acceptable-looking.  Just try disabling xinput support (which decreases sampling to 1 pixel) and you'll see. (If you don't see a difference, then probably your tablet device is not configured properly to send high resolution events, which is just too bad).

    If you want something slightly more compact, try changing the constant PIXEL_MOTION_THRESHOLD from 0.3 to a higher value in src/xournal.h, but I personally find that it's just not worth it. There are more optimizations that could be done, which I again find not to be worth it.

    My experience and that of most users is that the storage size is comparable to that of other note-taking software (MS Journal, Jarnal, etc.). If you really need something super-compact, feel free to tweak the code; it's just not a priority for most of us.

    Denis

     
  • ManDay

    ManDay - 2011-01-18

    Thanks, Denis, for your reply. I'll have to try what you said. I can't remember being anywhere close to 100K; maybe I just have a bad memory.

     
  • ManDay

    ManDay - 2011-01-19

    Denis, you must have a significantly different version of Xournal than the current stable release. I just wrote a three-page document with a few lines of text at best and it's 600 KB. Let me see whether I can upload or attach it somewhere…

     
  • ManDay

    ManDay - 2011-01-19

    My bad about the 600 KB: my estimate was based on the PDF, not the XOJ. However, the XOJ (this time with "fairly dense handwriting") is about twice the size you said, unless we differ in our definition of fairly dense.

    http://ompldr.org/iNzIxdg

     
  • ManDay

    ManDay - 2011-01-19

    However, it's about 16*65 ~ 1000 lines, so I guess 10000 line segments (from what is visible, assuming that the algorithm does indeed not store more data on a stroke than the stroke actually has, which is what I'm doubting in this thread in the first place).

     
  • ManDay

    ManDay - 2011-01-19

    I zoomed in very far and it appears that the lines are indeed sampled at a fixed interval (apparently time-based). If that's true, then that is what I meant: it's not optimized. If I draw a perfect line, Xournal will store 5 dozen line segments instead of storing one line, which is the actual data we need.

    PostScript supports Bezier splines; Xournal appears to just blindly store line elements. It's like wanting to cook chicken and throwing everything from the freezer into the pan, because you are too lazy to pick out the package with the chicken.

    You will eventually get fried chicken but at what cost?

    I think it's very much worth a second thought whether to spend time on an optimizer which stores splines and samples based upon change of pressure and change of position.

     
  • Denis Auroux

    Denis Auroux - 2011-01-19

    Your 1088 strokes (68 x 16) seem to consist of a total of 34408 coordinate pairs, so about 30 line segments per stroke.

    File size seems larger than for me because you probably wrote at a higher resolution level than I do (sampling size is relative to pixel size when the stroke is written, not an absolute distance on the page), and because you use pressure sensitivity (I don't => no pressure data to store). Still surprised it came out quite a bit bigger than I expected - your handwriting is not dense…

    I agree this is not optimized - but there's also a space vs. speed tradeoff. For instance:

    - long straight lines are split into short segments, because doing so speeds up the eraser algorithm that detects which strokes are near the eraser. I suppose the splitting need not be stored in the saved file, but I am not sure how much this really would save in common use scenarios.

    - using polygonal lines instead of Bezier curves is faster for rendering, I believe (though I'm not sure what algorithm is used for anti-aliasing a Bezier curve). Not clear by how much since there'd be fewer Bezier curves to render, so it's actually not clear-cut.
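
    The splitting described in the first point can be pictured with a small Python sketch. It is only an illustration, not Xournal's actual code: `split_long_segments` and `max_length` are hypothetical names. Once no piece is longer than `max_length`, a cheap bounding-box test against the eraser position rejects most pieces without any further geometry.

```python
import math


def split_long_segments(points, max_length):
    """Subdivide each segment of a polyline into equal pieces so that
    no piece is longer than max_length. The drawn shape is unchanged."""
    if not points:
        return []
    out = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # number of equal pieces needed for this segment (at least one)
        n = max(1, math.ceil(math.hypot(x1 - x0, y1 - y0) / max_length))
        for k in range(1, n + 1):
            t = k / n
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# One 10-unit line becomes four 2.5-unit pieces (five points).
print(split_long_segments([(0.0, 0.0), (10.0, 0.0)], 2.5))
```

    The extra points carry no new information about the stroke, which is why they would not strictly need to be written to the saved file.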

    Anyway: yes, optimizing strokes to reduce their number seems to be worth thinking about. I don't view it as very high priority though (and I'd rather do it to speed up rendering than to reduce file size, though of course reducing file size would be a good by-product. Storage capacity has improved much faster than CPU speed over the last few years).

    (Another way to reduce file size would be to store relative motion instead of absolute coordinates along the path, and of course also to use something smarter than gzipped ascii; but the current format has the advantage of being very easy to process if anyone wants to write a script that reads path data.)
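
    The relative-motion idea can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of the .xoj format: coordinates are taken to be integers in some hypothetical fixed-point unit so that the round trip is exact, each axis is handled as its own list, and all function names are made up for the example.

```python
import gzip


def to_deltas(values):
    """Keep the first value; replace each later value by its difference
    from the value before it."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def from_deltas(deltas):
    """Inverse of to_deltas: a running sum restores the absolute values."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def gzipped_ascii_size(values):
    """Size in bytes of the values written as space-separated ASCII, then gzipped."""
    text = " ".join(str(v) for v in values)
    return len(gzip.compress(text.encode("ascii")))


xs = [1000, 1003, 1007, 1006, 1010, 1015]   # absolute x coordinates of one stroke
dx = to_deltas(xs)
print(dx)                                   # small numbers after the first entry
print(gzipped_ascii_size(xs), gzipped_ascii_size(dx))
```

    Neighbouring points are close together, so the deltas are short numbers that take fewer ASCII characters. How much this saves after gzip depends on the data, so it is worth measuring on real files before changing a format that is currently easy to parse.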

    Denis

     
  • Anonymous

    Anonymous - 2011-05-18

    This is an interesting debate, and the point about the eraser algorithm is a really good one that I wouldn't have thought of.  I would point out that the file size seems to depend heavily on how fast you write - as a new user of tablets, I am still writing a little slowly to make sure I get things right, and since the sampling is time-based, this creates larger files.

    Could some of the ideas here be the basis of a simple post-stroke analysis?  I'm not (yet) familiar with the code, but am starting to poke around a bit.  But it seems that you could do some interesting things by sending each "Item" through a post-process before storing it.  For starters, you could do things like converting the time-based raw data to a space-based set of data using interpolation, which could save any stroke, at any speed, in the optimal amount of space, in a way that is consistent with the needs of the eraser.  At the same time, you could - just as an example - run a simple smoothing algorithm to eliminate pen tablet artifacts that do not appear in real handwriting (Inkscape's "simplify" command comes to mind).  I am mindful of the limited developer time available to this project, but establishing a framework in the code for this kind of idea - even an empty one at first - would give people a specific place to play with ideas.
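
    A minimal Python sketch of such a post-process, assuming plain linear interpolation along the stroke: `resample_by_distance` and `spacing` are hypothetical names, and nothing here is taken from the Xournal source. The output points are evenly spaced along the path no matter how densely the input was sampled.

```python
import math


def resample_by_distance(points, spacing):
    """Return points placed every `spacing` units of path length along the
    polyline, using linear interpolation. The final input point is kept."""
    if len(points) < 2 or spacing <= 0:
        return list(points)
    out = [points[0]]
    needed = spacing            # path length still to cover before the next output point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0               # length already consumed on this segment
        while seg - pos >= needed:
            pos += needed
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            needed = spacing
        needed -= seg - pos
    if out[-1] != points[-1]:
        out.append(points[-1])
    return out


slow = [(i * 0.25, 0.0) for i in range(41)]   # 41 samples: the pen moved slowly
fast = [(0.0, 0.0), (10.0, 0.0)]              # 2 samples: the pen moved quickly
print(len(resample_by_distance(slow, 2.0)), len(resample_by_distance(fast, 2.0)))
```

    Uniform spacing discards detail in tight curves just as readily as on straight runs, so a real implementation would probably want the spacing to depend on curvature, or to follow this step with a simplification pass.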

     
  • Denis Auroux

    Denis Auroux - 2011-05-21

    The sampling is not time-based. It is based on taking all the (x,y) coordinates sent by the input device, and discarding those that are too close to their predecessor to save space. My notion of "too close" is 0.2 pixels. What happens is that if you write fast then the wacom driver skips a lot of points.
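
    The filtering described here can be sketched as follows in Python. This is an illustration, not the Xournal source: the function name is hypothetical, "predecessor" is assumed to mean the last point that was kept, and the default of 0.2 is simply the figure quoted in this post. Coordinates are in pixels at the zoom level where the stroke is drawn.

```python
import math


def filter_close_points(points, threshold=0.2):
    """Keep a point only if it is at least `threshold` away from the
    most recently kept point. The first point is always kept."""
    kept = []
    for x, y in points:
        if not kept or math.hypot(x - kept[-1][0], y - kept[-1][1]) >= threshold:
            kept.append((x, y))
    return kept


events = [(0.0, 0.0), (0.1, 0.0), (0.15, 0.0), (0.25, 0.0), (0.3, 0.0), (0.5, 0.0)]
print(filter_close_points(events))
```

    Because the test is purely on distance, writing slowly does not by itself add points. What changes with speed is how many events the driver delivers in the first place.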

    Stroke smoothing is a good option to get things to work with fewer points, but requires a good amount of code rewriting (to deal with splines instead of polygonal lines), might affect performance, and might generally not be quite worth the trouble for tablet pen users (though it might be good for users with lower resolution devices…).

    Denis

     
