Hi,
I've found a bug in wv2 0.3.0 where it tries to uncompress an image that is not actually compressed.
I've written a little hack that tests for the gzip magic header before inflating.
Here are my modifications as a git patch. You'll probably find a better approach, since you know the API.
It's a great library, by the way. I was able to write a Python module that extracts text in a couple of hours.
diff --git a/src/parser9x.cpp b/src/parser9x.cpp
index 2163f8e..591aa09 100644
--- a/src/parser9x.cpp
+++ b/src/parser9x.cpp
@@ -1002,9 +1002,16 @@ void Parser9x::parsePictureEscher( const PictureData& data, OLEStreamReader* str
     blip.dump();
 #endif
     //if it's these types, we need to decompress the data
-    //TODO need to check that the data is actually compressed
-    if( blipType.compare("EMF") == 0 || blipType.compare("WMF") == 0
-        || blipType.compare("PICT") == 0 )
+    //check that the data is actually compressed
+    U8 byte1, byte2;
+    stream->read(&byte1, 1);
+    stream->read(&byte2, 1);
+    stream->seek(-2, G_SEEK_CUR); // rewind the two probe bytes
+    bool is_compressed = ( byte1 == gz_magic[0]
+                           && byte2 == gz_magic[1] );
+    if( is_compressed && ( blipType.compare("EMF") == 0
+        || blipType.compare("WMF") == 0
+        || blipType.compare("PICT") == 0 ) )
     {
         wvlog << "Decompressing image data at " << stream->tell() << "..." << std::endl;
         ZCodec z( 0x8000, 0x8000 );
diff --git a/src/zcodec.cxx b/src/zcodec.cxx
index 81d3429..44e7b9b 100644
--- a/src/zcodec.cxx
+++ b/src/zcodec.cxx
@@ -82,8 +82,6 @@
 #define GZ_COMMENT      0x10 /* bit 4 set: file comment present */
 #define GZ_RESERVED     0xE0 /* bits 5..7: reserved */
-static int gz_magic[2] = { 0x1f, 0x8b }; /* gzip magic header */
-
 // ----------
 // - ZCodec -
diff --git a/src/zcodec.hxx b/src/zcodec.hxx
index a099fc1..9169f0f 100644
--- a/src/zcodec.hxx
+++ b/src/zcodec.hxx
@@ -92,6 +92,8 @@ typedef unsigned long ULONG;
 typedef bool BOOL;
 typedef U8 BYTE;
+static int gz_magic[2] = { 0x1f, 0x8b }; /* gzip magic header */
+
 class ZCodec
 {
 private:
By the way, I'm sorry I cannot give you the faulty document; it's a private customer document.
Thank you very much for the information and the patch. I'll look it over and try to get it committed today.
OK, I tried your patch - it had problems with my test document here, even after I modified it a bit. I've tried a different approach and committed it to SVN. If you could compile the SVN version, test it on your document, and report back, that would be wonderful.
Great! And the news that wv2 development is back and that you've switched to SVN is even better. (Though I prefer Git ;-))
The SVN checkout no longer crashes in the decompression routine, but it crashes elsewhere:
TEXT: [<65|A><99|c><99|c><101|e><115|s><115|s><111|o><105|i><114|r><101|e><115|s><32| ><115|s><111|o><117|u><115|s><32| ><112|p><114|r><101|e><115|s><115|s><105|i><111|o><110|n>]
TEXT: paragraph end
TEXT: paragraph start
TEXT: paragraph end
TEXT: paragraph start
TEXT: paragraph end
TEXT: paragraph start
TEXT: picture found
*** glibc detected *** /home/herve/Itaapy/Indexeurs/wv2/tests/.libs/lt-handlertest: malloc(): memory corruption: 0x09f35e68 ***
======= Backtrace: =========
/lib/libc.so.6[0xb7a71ee4]
/lib/libc.so.6[0xb7a74510]
/lib/libc.so.6(__libc_malloc+0x9c)[0xb7a7618c]
/usr/lib/libstdc++.so.6(_Znwj+0x29)[0xb7c3bbe9]
/home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZNSt6vectorIhSaIhEE13_M_insert_auxEN9__gnu_cxx17__normal_iteratorIPhS1_EERKh+0x9f)[0xb7f870af]
/home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6ZCodec13ImplWriteBackEPSt6vectorIhSaIhEE+0x93)[0xb7f8ba83]
/home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6ZCodec10DecompressERN6wvWare15OLEStreamReaderEPSt6vectorIhSaIhEE+0xbf)[0xb7f8bc1f]
/home/herve/Itaapy/Indexeurs/wv2/src/.libs/libwv2.so.2(_ZN6wvWare8Parser9x18parsePictureEscherERKNS_11PictureDataEPNS_15OLEStreamReaderEii+0
Abandon
I'll debug it later.
I was testing with the "handlertest" program.
All I could deduce is that it crashes when it calls "picture()" in src/handlers.cpp line 213.
From there it gets into templates, and I only learned C++ a week ago. ;-)
Could I give you the document in private so you can analyze it with your tools? I'm 99% sure it was generated by regular MS Word, not OpenOffice.org.
Yes, that'd be great. My gmail address is cricketc. I need different documents to test these different cases. :)
OK, I found the bug that caused the crash. I'll try to clean things up and commit the fix tonight, and then you'd be welcome to try any other documents you have, too. It looks like there's also some stuff in your document that will help me improve the koffice filter code, so your document is a big help. :) Thanks.
This bug is fixed, but I've found another one, apparently in the "TextHandler::pictureFound" callback.
I've sent a test case to you by e-mail.
I was just able to convert your last test case document using the latest wv2 library with no crash. Thanks again for your test cases - they're great for testing different things and seeing where improvement is needed.
I committed another change that stops that latest document from crashing. I'm not sure the fix is entirely correct, but at least there's no crash. :)
I don't know how you did it, since you only added a debug print, but I can't find any crashing document anymore.
I consider this bug fixed. Congratulations!
My text extractor is now ready; you can see it here:
http://git.hforge.org/?p=itools.git;a=blob;f=office/doctotext.cc;h=850b1b46386994173442bb757098473b99db297f;hb=HEAD
I'll wait for the next release to play with the "pictureFound" callback.
Well, the bug must actually have been fixed in an earlier commit, but I'm very glad you don't have crashing documents anymore. :) I'll have to look at your text extractor sometime, but right now I'm getting ready to release wv2-0.3.1 with the fixes for the bugs you found.