Menu

#70 Implementation of jpeg_skip_scanlines(JDIMENSION rows) API to optimize subset decodes

closed-integrated
nobody
None
1
2015-07-21
2015-06-16
Matt Sarett
No

Background:

I am a member of the Skia graphics team within Google. We feel that making Android a user of libjpeg-turbo will have significant benefits for jpeg decodes on Android. In particular, we will benefit from the fast and accurate Neon IDCT, optimized Huffman decodes, flexible color conversions, and flexible scaling.

A feature of our current Android jpeg decoder (libjpeg plus patches) is the ability to decode image subsets efficiently. The idea is that if we only need to display a subset of an image, we will improve performance by decoding only that subset (rather than decoding the entire image and then cropping).

Despite the optimal features of libjpeg-turbo, our current decoder outperforms libjpeg-turbo on a given set of subset decoding performance benchmarks. libjpeg-turbo suffers on subsets near the bottom of the image (because it must decode all preceding rows) and on thin vertical stripe subsets (because it must decode the each row in its entirety).

We are hoping to make a contribution to libjpeg-turbo to aid with subsets at the bottom of the image. Specifically we feel it would be beneficial to extend the standard API with the additional function:
GLOBAL(JDIMENSION) jpeg_skip_scanlines (j_decompress_ptr cinfo, JDIMENSION lines);

This function would advance the input source (and the row counters etc.) past the input number of lines without actually performing the work of decoding these lines. We feel that this would make libjpeg-turbo superior to our current jpeg decoder in all important cases and would nullify any opposition within Android to switching over to libjpeg-turbo.

Implementation:

It was my hope to minimize touch points in the existing libjpeg-turbo code and also maintain ABI compatibility. If I have failed in either of these aims, I would be happy to try to rework the patch in a more acceptable way.

Testing:

jpeg_skip_scanlines() is allowed to be called after any valid number of rows have been read (0 to height-1). Any rows that are read after a call to jpeg_skip_scanlines() should be identical to what they would have been had we read the preceding rows instead of skipping them.

This has been tested for -
Sequential and progressive images
Interleaved, partially interleaved, and non-interleaved images
Huffman and arithmetic entropy encoded images
All scaling options provided by libjpeg-turbo
Simple upsampling and context upsampling

If any correctness issues are discovered, I would be happy to find and integrate appropriate fixes into the patch.

Questions/Concerns:

In order to skip, we will be performing entropy decoding and not using the resulting coefficients. For that reason, I have implemented discard_mcu(), which is exactly like decode_mcu() except that it does not store coefficients to an output buffer.

The benefit of using discard_mcu() over decode_mcu() is about 5% performance improvement for large subsets and it saves us from needing to access the MCU_buffer (private in the coef struct). The drawback is the repeated code. I'm opened to suggestions on the best choice for this tradeoff (or maybe there is an option 3?).

Additionally, I added two fields to the cinfo struct (which I was trying to avoid). I think ABI compatibility should be fine since they are at the end of the struct and do not change the alignment.

The first is a dummy buffer that jpeg_skip_scanlines() uses when it needs to call jpeg_read_scanlines().

The second is a function pointer to discard_mcu(). This would fit better in the public part of the coef struct (but putting it there would break the ABI). We could avoid this by using decode_mcu() (or finding another option?).

If adding members to the cinfo struct is a problem, I think it may be doable to find alternatives to these approaches.

I'm looking forward to getting feedback on this patch! Thanks for your time and effort!

1 Attachments

Discussion

  • DRC

    DRC - 2015-06-16

    Cool. I'll take a look at this ASAP. Unfortunately, any extension to the jpeg_[de]compress_struct structures will break ABI compatibility, even if the new members are at the end of the structure. The reason is that jpeg_CreateCompress() and jpeg_CreateDecompress() check the size of the structure and will throw an error if there is a size mismatch between the library and the application. However, I recently added a framework to mozjpeg that works around this issue by stashing new structure members in the opaque master structures and adding a set of accessor functions to allow those new members to be read or modified. In this case, since the new structure members don't need to be read or modified by the application, then they can just be moved into the master structure without needing to add accessor functions for them.

    As far as the reused code in discard_mcu(), that should be easy to solve using macros. Will get back to you once I've evaluated the patch.

     
  • DRC

    DRC - 2015-06-16

    Two things I need you to do before I can look at this further:

    (1) For all of the source files that you modified, please add a copyright notice underneath the "libjpeg-turbo Modifications" header. It should say something like "Copyright (C) 2015, Google, Inc." (or whatever name your project team would normally use for open source code attributions.) This serves two purposes: it complies with the terms of the IJG license requiring that all modifications are clearly marked, and it serves as a written record of Google's intent to introduce this code into the open source.

    (2) Modify djpeg so that it can test the new jpeg_skip_scanlines() function.

     
  • Matt Sarett

    Matt Sarett - 2015-06-16

    (1) Added copyright headers.

    (2) Modified djpeg so it can test the new jpeg_skip_scanlines() function.

    There were a few possibilities for this. In the end, I went with an approach where the user specifies on the command line that he wants an image "stripe" and then we use jpeg_skip_scanlines() to provide this stripe. This is cool because it adds a new functionality to djpeg. The drawback is that jinit_write() always writes a header for the full image in the dest_mgr. I work around this with a hack that I'm not too proud of. Also the dest_mgr still allocates unnecessary space.

    This could be cleaner if we change the signature of the dest_mgr to allow it to take in the height as a parameter. I'm not sure if this is allowed.

    Another option is to test by decoding the full image in stripes. This would require making multiple passes on input data source and reading part of the image each time. This is actually how I've been testing during development - I then diff the output with the decode output of a regular decode. This would mean that "stripe" mode is just a test mode, but doesn't add any functionality to djpeg (except a way to decompress images slowly).

    I'm happy to go with whatever approach you think makes the most sense.

    (3) In order to accomplish (2), I had to learn to build libjpeg-turbo with the normal build system. This showed me that my code was causing a bunch of warnings. I added corrections to silence the warnings.

     
  • DRC

    DRC - 2015-06-25

    A heavily modified version of this has been checked into SVN trunk, but there is an issue that I can't track down. Doing the following causes a seg fault:

    ./djpeg -skip 1,144 -outfile skip.ppm ../testimages/testorig.jpg
    

    (NOTE: -skip is a new option that I added. It does the exact opposite of -stripe and is thus useful for regression testing, since it's the only way to exercise some of the code paths -- such as the path that skips lines within an iMCU row or the path that handles prefetched iMCU rows, etc.)

    It appears that the segfault is occurring within the upsampler, which shouldn't ever be invoked within the body of jpeg_skip_scanlines().

    I need to get this resolved before I can finish crafting the regression tests. I am attempting to design a test suite that will, as much as possible, reuse existing generated test images while still providing full coverage of the jpeg_skip_scanlines() code paths.

    Also please review the code in general and let me know if you spot any issues with it, in terms of formatting or factual errors in the comments, etc. A lot of stuff was moved around in order to avoid exposing any new global symbols (except for the new jpeg_skip_scanlines function), to avoid code duplication, and to improve readability.

     

    Last edit: DRC 2015-06-25
  • Matt Sarett

    Matt Sarett - 2015-06-25

    I think the issue is with:
    dummy_buffer_setup (j_decompress_ptr cinfo)

    In some cases it returns with the dummy_buffer unallocated. I will follow up once I know a little bit more.

     
  • Matt Sarett

    Matt Sarett - 2015-06-25

    Hmm never mind, I think that's a different bug.

     
  • DRC

    DRC - 2015-06-25

    Yeah, I think the dummy buffer needs to be initialized to NULL in jinit_master_decompress(), but doing that didn't fix the issue for me (although it made valgrind happier.)

     
  • Matt Sarett

    Matt Sarett - 2015-06-25

    Yeah that was me causing me some seg faults, but did not solve the first issue. I actually just figured out that the first seg fault is happening because set_wraparound_pointers() is never getting called. I don't have a fix yet, but I'm working on it.

     
  • Matt Sarett

    Matt Sarett - 2015-06-25

    As I mentioned above, the bug you described was caused by never calling get_wraparound_pointers() after skipping the first iMCU block. I am attaching a patch that corrects this behavior and that I expect to work in all cases. I also took a closer look at get_bottom_pointers() as well - as far as I can tell this will work properly for any usage of read() and skip(). Nice catch on this one! It slipped by my tests.

    Also, as you mentioned above, the dummy buffer needs to be initialized to NULL in jinit_master_decompress().

    Thanks so much for all your efforts in testing/integrating this patch! I haven't done a thorough read through yet, but from what I've seen while I've been debugging, it looks great. I'm excited about the impact this will have.

     
  • Matt Sarett

    Matt Sarett - 2015-06-25

    As a whole the patch looks great! I just finished reading through the diffs.

    I think the following comment doesn't make too much sense anymore since your reorganization gives us access to my_coef_struct.
    / Another
    * advantage of discarding coefficients is that it allows us to avoid
    * accessing the private field cinfo->coef->MCU_buffer (which would
    * normally be a parameter to decode_mcu().)
    /

    But I'm just nitpicking there.

     
  • DRC

    DRC - 2015-06-27

    This has been fully implemented and verified in SVN trunk. In addition to the initial integration/rewrite of the feature and testing, then subsequent integration of the bug fixes, I also had to design a regression test suite that fully covers all of the code paths in the jpeg_skip_scanlines() function with the least possible amount of tests (pretty tedious work.) This test suite additionally had to be ported to 12-bit samples and to the Windows build system, and building/running on Windows revealed some other issues that needed fixing. Overall, my work on this feature has required about the equivalent of $2000 in unpaid labor, and that was in addition to thousands of dollars more that I already donated to rewrite the RGB 565 feature and other stuff that Linaro initially contributed in the context of getting libjpeg-turbo to work with Android.

    For completeness, it would be nice if this feature was extended into the TurboJPEG API (which would allow for benchmarking it), but at this point, doing all of this Android work has actually created a financial hardship for me, so I simply cannot continue adding features without funding. The libjpeg-turbo Project, because it has no "General Fund" for me to draw from to fund this kind of work, ends up requiring me to mostly contribute my labor to it pro bono. I'm happy to do that for bug fixes, etc., but features are always difficult to integrate, even if someone else is writing the initial code. To put it another way-- there's no such thing as a free patch. :) Regardless of how good a patch is (and yours was good-- don't get me wrong), it always just takes a lot of time to integrate properly. There are just too many platforms and compilers to test, too many things to do to make the code conform to the existing (20-year-old) coding standards, too many tests to run to make sure everything is stable and performant. I wanted to go ahead and get libjpeg-turbo into Android so I could check that off the list, but implementing jpeg_skip_scanlines() pretty much broke the bank. In the future, I am simply not going to be able to accept any non-trivial patches at all without financial sponsorship, unless a company steps forward to sponsor a General Fund for maintaining libjpeg-turbo.

    In short, hopefully this feature is sufficient to get libjpeg-turbo into Android, and any subsequent patches will be minor bug fixes. If I do any more major work on Android stuff pro bono, I'm literally going to go broke. If you know anyone within Google who might be able to kick a grant my way, I would appreciate it if you could put in a good word with them.

     

    Last edit: DRC 2015-06-27
  • DRC

    DRC - 2015-06-27
    • status: open --> closed-integrated
     
  • Matt Sarett

    Matt Sarett - 2015-07-01

    It's good to hear where you're coming from. Thanks!

    On another note, I think I may have discovered a memory leak bug.

    We currently free the dummy_row_buffer in jpeg_finish_output, which I think it only called in buffered image mode. I think we might want to free it in jpeg_finish_decompress? Let me know what you think!

     
  • DRC

    DRC - 2015-07-01

    Fixed in trunk. Instead of using jpeg_get_small(), dummy_buffer_setup() now uses (*cinfo->mem->alloc_small) (..., JPOOL_IMAGE, ...) instead, which is consistent with other parts of the libjpeg code. This allocates the memory from the libjpeg image pool, so it will be freed automatically whenever jpeg_abort_decompress() or jpeg_destroy()/jpeg_destroy_decompress() is called. The master structure is freed at the same time, so it should not be necessary to explicitly set the dummy buffer pointer to NULL (before jpeg_skip_scanlines() can be called again, the calling program will have had to first call jpeg_start_decompress(), which will re-allocate the master struct.)

     
  • Matt Sarett

    Matt Sarett - 2015-07-21

    If you're interested, I'm attaching another patch to fix a valgrind error.

    Valgrind complains on color conversion to 565, when skip() calls read() to read into the dummy buffer. Valgrind is upset because the dummy read() uses table lookup on garbage data.

    This can be fixed by adding a noop_convert() routine that we use on dummy reads. This has additional positive consequences of eliminating the need for a dummy row buffer and eliminating a step on dummy reads.

     
  • DRC

    DRC - 2015-07-21

    Nice! I'll get that integrated as soon as the repositories are back online.

     

Log in to post a comment.