MComix / Patches / #35 better support for solid archives

Benoit Pierre - 2014-05-26

Here is an updated series, which include:

BaseArchive.extract/BaseArchive.extract_all prototypes change

changes to mcomix.process to allow overriding were stdin/stderr are redirected and to default to redirecting stderr to /dev/null (or platform equivalent)

I have checked the changes to mcomix.process for Windows using Wine, but it would be better if someone with access to a real Windows machine gave it a try.

0001-improve-mcomix.process.patch

0002-better-support-for-solid-archives.patch
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-26

I haven't actually tested the patch yet (too tired to process changes of this impact at the moment), but I wonder if extracting archives this way wouldn't be a much better idea than extracting file by file, even on non-solid archives. My reasoning is this: You can never rely on 7z to output ``correct'' filenames on stdout, especially on Windows. For example, one compressed file could be あえうお.jpg, but when asking 7z to list the archive contents, it would turn out as ????.jpg. Thus, if you ask 7z to extract ????.jpg, it will of course do nothing, because it cannot find the file you asked for.

If you extracted everything at once, the file names would still be wrong, but I feel that wouldn't be that much of a problem if it at least could display the contents.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-26

You are right: I created a number of archives with non-ASCII filenames, and while everything works fine on Linux, I'm having troubles on Windows. But with my patch, a solid 7z archive with non-ASCII filenames is displayed okay (minus some characters conversion to _ in the names). Note that again I'm testing in Wine, so, maybe some of my issues are Wine issues.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ark - 2014-05-26

I didn't test this on Windows either but, in general, if there is a way to get the correct names and the actual file sizes in the same order the files will be extracted from a solid archive, this isn't that much of an issue. This is because we only need the file sizes to split the output stream, but we don't need the respective filenames.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

The pipe thingy works on Windows. I have to admit that the speedup is quite awesome on solid archives. However, between 0001-better-support-for-solid-archives.patch and 0002-better-support-for-solid-archives.patch, something seems to have gone wrong. MComix no longer extracts files from certain archives with errors akin to

16:11:52 [Thread-1-extract] ERROR: ! Extraction error: [Errno 2] No such file or directory: u'/tmp/mcomix.A3DgjR/dirname/00.jpg'

I believe the problem is that the archives in question contain directories - archives without any directories in them seem to work.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Right, I was aware of that, sorry for not updating the ticket, I'll fix it.

However, after thinking about it, I think it would be better for MComix to not use the archive name for the temporary path name, as it can be a security risk with tar archives:

:::shell > tar -cvf test.tar --xform='s,^,../,' *.jpeg > tar -tf test.tar | head tar: Removing leading `../' from member names ../0001.jpeg ../0002.jpeg ../0003.jpeg ../0004.jpeg ../0005.jpeg ../0006.jpeg ../0007.jpeg ../0008.jpeg ../0009.jpeg ../0010.jpeg

Viewing this archive will create files outside the temporary directory!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

This should only be a problem for tar files, if I understand correctly? In any case, you are right, file names should be sanitized first before extraction. Committed some code to do just that.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Well, I don't know if its possible to craft some malicious ZIP, RAR, or 7z archive. Another problem on Windows could be case sensitivity. Anyway, it works fine now with the example above.

Note: mcomix/archive/rarfile.py should use `_unicode_filename/_original_filename' too, no?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

Not necessarily, since everything that comes out of UnrarDll is already Unicode. I guess all names could be secured with _replace_invalid_filesystem_chars just in case, but I have yet to see a RAR archive that couldn't be opened with UnrarDll without problems, so it's probably not high priority.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Do you mean you usually see issues with UnrarDll?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Sorry, never mind about the last comment. But yeah, _unicode_filename/_original_filename' should be called so_replace_invalid_filesystem_chars` is called.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

My bad. In fact, I meant just the opposite. I had a 100% success rate with UnrarDll since its introduction, meaning every single RAR archive I obtained and opened in the past 2? 3? years worked.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Here is an updated patch that correctly handle directories in archives.

Edit: remove debug print.

Last edit: Benoit Pierre 2014-05-28

0001-better-support-for-solid-archives.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

And regarding Unicode handling when listing/extracting 7z archives, I think I found a way to do it. See the attached patch.

Note: again, tested with Wine. Python versions before 3 something are supposed to have issues because CreateProcess is used instead of CreateProcessW, but I had no issue when testing. If this in fact does not work on a real Windows machine (because the command line to ask 7z to extract an entry does not work with Unicode char), then I think a possible workaround would be to use a temporary file and the -i@tempfile switch to ask for a specific archive member.

0001-fix-unicode-names-support-with-7z-archives.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

Well, those certainly are interesting results. For one, because -scc isn't actually listed as available option when calling only "7z". For another, on my Linux machine, adding -scc always results in "Incorrect command line" error from 7z, even though it works on my Windows machine. The funny thing is, both are presumably "7-Zip 9.20", though it might make a difference that on my Linux machine, there is also a "p7zip 9.20" line.

That being said, the success rate on Windows is about 50% right now - some archives work, some don't at this point. I'll try to look into it further.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

D'oh! Same issue on Linux. Using p7zip too (9.20.1). So this flag should only be added on Windows (unicode names already work fine for me on Linux without this flag). I hope locale settings don't affect the output too much too...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Oh, and yeah, only found about it after searching the 7z forums since its indeed not listed as an available option.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

It would appear that using this -scc option does indeed fix the encoding of the file names in list_contents (on Windows), but apparently the option does nothing when added to the extract argument list - passing back file names exactly as received still mostly results in "No files to process" from 7z, indicating that the name somehow got corrupted (or maybe 7z simply ignores the switch for its own command line parameters). This whole encoding shit is fucked up on Windows, anyway.

Now, this problem could be avoided by unconditionally using "is_solid = True", which would be quite good - well, except for the use cases where the whole archive isn't needed (or wanted), such as when generating a thumbnail from File->Open or from the library view.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

OK, and how about with this patch?

0001-fix-unicode-names-support-with-7z-archives.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Also, what if the actual archive name contains Unicode chars? Does it work?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oddegamra - 2014-05-28

Yeah, that works. Thanks for figuring this out!

My only concern at this point is that 7z on Linux may or may not output lines encoded in UTF-8. Granted, most modern systems probably do, as you'd be hard-pressed to find a distribution that hasn't made the switch yet. But who knows?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Yeah, I tried the following:

LC_ALL=fr_FR.ISO-8859-1 7z l -slt a/test-7z-unicode.cbz

And archive names are not correctly listed (but header names are not changed).
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Benoit Pierre - 2014-05-28

Updated patch (I thought I had checked, but I missed a leftover call to _get_extract_arguments in sevenzip.py).

0001-fix-unicode-names-support-with-7z-archives.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ark - 2015-02-14

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

better support for solid archives

GTK+ comic book viewer.

Group

Searches

Help

#35 better support for solid archives

Related

Discussion