found bug in reading images

2009-02-20
2013-06-04
  • Hi,

    I've found a bug in wv2 0.3.0: it tries to decompress an image that is not actually compressed.

    I've written a small hack that tests for the gzip magic header bytes before inflating.

    Here are my modifications in a git patch. You'll probably find a better approach since you know the API.

    It's a great library, by the way. I was able to write a Python module that extracts text in a couple of hours.

    diff --git a/src/parser9x.cpp b/src/parser9x.cpp
    index 2163f8e..591aa09 100644
    --- a/src/parser9x.cpp
    +++ b/src/parser9x.cpp
    @@ -1002,9 +1002,16 @@ void Parser9x::parsePictureEscher( const PictureData& data, OLEStreamReader* str
                     blip.dump();
    #endif
                     //if it's these types, we need to decompress the data
    -                //TODO need to check that the data is actually compressed
    -                if( blipType.compare("EMF") == 0 || blipType.compare("WMF") == 0
    -                        || blipType.compare("PICT") == 0 )
    +                //check that the data is actually compressed
    +                U8 byte1, byte2;
    +                stream->read(&byte1, 1);
    +                stream->read(&byte2, 1);
    +                stream->seek(0, G_SEEK_SET);
    +                bool is_compressed = ( byte1 == gz_magic[0]
    +                        && byte2 == gz_magic[1] );
    +                if( ( is_compressed ) && ( blipType.compare("EMF") == 0
    +                        || blipType.compare("WMF") == 0
    +                        || blipType.compare("PICT") == 0 ) )
                     {
                         wvlog << "Decompressing image data at " << stream->tell() << "..." << std::endl;
                         ZCodec z( 0x8000, 0x8000 );
    diff --git a/src/zcodec.cxx b/src/zcodec.cxx
    index 81d3429..44e7b9b 100644
    --- a/src/zcodec.cxx
    +++ b/src/zcodec.cxx
    @@ -82,8 +82,6 @@
    #define GZ_COMMENT      0x10 /* bit 4 set: file comment present */
    #define GZ_RESERVED     0xE0 /* bits 5..7: reserved */

    -static int gz_magic[2] = { 0x1f, 0x8b }; /* gzip magic header */
    -

    // ----------
    // - ZCodec -
    diff --git a/src/zcodec.hxx b/src/zcodec.hxx
    index a099fc1..9169f0f 100644
    --- a/src/zcodec.hxx
    +++ b/src/zcodec.hxx
    @@ -92,6 +92,8 @@ typedef unsigned long ULONG;
    typedef bool BOOL;
    typedef U8 BYTE;

    +static int gz_magic[2] = { 0x1f, 0x8b }; /* gzip magic header */
    +
    class ZCodec
    {
    private:
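
For readers following along, the check described above (look at the first two bytes, then rewind before inflating) can be sketched outside wv2 with standard C++ streams. This is only an illustration under assumptions: `kGzipMagic`, `startsWithGzipMagic` and `peekGzipMagic` are hypothetical names, and `std::istream` stands in for wv2's `OLEStreamReader`, whose API is not used here.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// First two bytes of every gzip member (RFC 1952).
static const unsigned char kGzipMagic[2] = { 0x1f, 0x8b };

// Returns true if the buffer starts with the gzip magic bytes.
// (Hypothetical helper, not part of wv2.)
bool startsWithGzipMagic(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    return size >= 2 && data[0] == kGzipMagic[0] && data[1] == kGzipMagic[1];
}

// Peeks at the next two bytes of a seekable stream and restores the
// read position afterwards, so the caller can still read the payload.
// (Hypothetical helper; std::istream is a stand-in for the real reader.)
bool peekGzipMagic(std::istream& in)
{
    const std::istream::pos_type start = in.tellg();
    unsigned char header[2] = { 0, 0 };
    in.read(reinterpret_cast<char*>(header), 2);
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    in.clear();      // a short read sets eofbit/failbit; clear before seeking
    in.seekg(start); // return to the saved position, not to offset 0
    return startsWithGzipMagic(header, got);
}
```

Restoring the saved position, rather than seeking to offset 0, matters whenever the picture data does not begin at the very start of the stream.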

     
    • By the way, I'm sorry I cannot give you the faulty document. It's a private customer document.

       
    • Tuubaaku
      2009-02-21

      Thank you very much for the information and the patch. I'll look it over and try to get it committed today.

       
    • Tuubaaku
      2009-02-22

      OK, I tried your patch - it had problems with my test document here, even after I modified it a bit. I've tried a different approach and committed it to SVN. If you could compile the SVN version, test it on your document, and report back, that would be wonderful.

       
    • Great, but reading that wv2 development is back and that you've switched to SVN is even better news! (Though I prefer Git ;-)).

      The SVN checkout no longer crashes in the decompression routine, but it now crashes elsewhere:

      TEXT: [<65|A><99|c><99|c><101|e><115|s><115|s><111|o><105|i><114|r><101|e><115|s><32| ><115|s><111|o><117|u><115|s><32| ><112|p><114|r><101|e><115|s><115|s><105|i><111|o><110|n>]
        TEXT: paragraph end
      TEXT: paragraph start
        TEXT: paragraph end
      TEXT: paragraph start
        TEXT: paragraph end
      TEXT: paragraph start
        TEXT: picture found
      *** glibc detected *** /home/herve/Itaapy/Indexeurs/wv2/tests/.libs/lt-handlertest: malloc(): memory corruption: 0x09f35e68 ***
      ======= Backtrace: =========
      /lib/libc.so.6[0xb7a71ee4]
      /lib/libc.so.6[0xb7a74510]
      /lib/libc.so.6(__libc_malloc+0x9c)[0xb7a7618c]
      /usr/lib/libstdc++.so.6(_Znwj+0x29)[0xb7c3bbe9]
      /home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZNSt6vectorIhSaIhEE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPhS1_EERKh+0x9f)[0xb7f870af]
      /home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6ZCodec13ImplWriteBackEPSt6vectorIhSaIhEE+0x93)[0xb7f8ba83]
      /home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6ZCodec10DecompressERN6wvWare15OLEStreamReaderEPSt6vectorIhSaIhEE+0xbf)[0xb7f8bc1f]
      /home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6wvWare8Parser9x18parsePictureEscherERKNS_11PictureDataEPNS_15OLEStreamReaderEii+0Abandon

      I'll debug it later.
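
The frames in the backtrace above are mangled C++ symbols, which makes them hard to read. A minimal sketch for decoding them, assuming GCC or Clang, where `<cxxabi.h>` provides `abi::__cxa_demangle`; the helper name `demangle` is made up for this example:

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Turns a mangled symbol from a backtrace into a readable C++ name.
// If the input cannot be demangled, it is returned unchanged.
// (Hypothetical helper; relies on the GCC/Clang C++ ABI header.)
std::string demangle(const char* mangled)
{
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, the result is allocated with malloc().
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr)
        return mangled;
    std::string result(readable);
    std::free(readable);
    return result;
}
```

Feeding it the `_ZN6ZCodec10DecompressERN6wvWare15OLEStreamReaderEPSt6vectorIhSaIhEE` frame from the trace yields a name beginning with `ZCodec::Decompress(`. The `c++filt` tool from binutils does the same job on the command line.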

       
    • I was testing with the "handlertest" program.

      All I could deduce is that it crashes when it calls "picture()" in src/handlers.cpp line 213.

      From there it involves templates, and I only started learning C++ a week ago. ;-)

      Could I send you the document privately so you can analyze it with your tools? I'm 99% sure it was generated by a regular MS Word, not OpenOffice.org.

       
    • Tuubaaku
      2009-02-23

      Yes, that'd be great. My gmail address is cricketc. I need different documents to test these different cases. :)

       
    • Tuubaaku
      2009-02-26

      OK, I found the bug that caused the crash. I'll try to clean things up and commit the fix tonight, and then you'd be welcome to try any other documents you have, too. It looks like there's also some stuff in your document that will help me improve the koffice filter code, so your document is a big help. :) Thanks.

       
    • That bug is fixed, but I've found another one, apparently still in the "TextHandler::pictureFound" callback.

      I've sent a test case to you by e-mail.

       
    • Tuubaaku
      2009-02-28

      I was just able to convert your last test case document using the latest wv2 library with no crash. Thanks again for your test cases - they're great for testing different things and seeing where improvement is needed.

       
    • Tuubaaku
      2009-03-02

      I committed another change that stops that latest document from crashing. I'm not sure the fix is entirely correct, but at least there's no crash. :)

       
    • I don't know how you did it, since you just added a debug print, but I can't find any document crashing anymore.

      I consider this bug fixed. Congratulations!

      My text extractor is now ready, you can see it here:

      http://git.hforge.org/?p=itools.git;a=blob;f=office/doctotext.cc;h=850b1b46386994173442bb757098473b99db297f;hb=HEAD

      I'll wait for the next release to play with the "pictureFound" callback.

       
    • Tuubaaku
      2009-03-03

      Well, the bug must actually have been fixed in a previous commit, but I'm very glad you don't have crashing documents anymore. :) I'll have to look at your text extractor sometime, but right now I'm getting ready to release wv2-0.3.1 with the fixes for the bugs you found.