First pass at adding better support for solid archives: when an archive is solid, extract all files in one pass (in the order they are archived). The most improved use case is with 7z archives (e.g. from 7m33s to 13s to extract a 167Mo archive containing 146 images).
Areas of improvements:
The code in sevenzip.py access internal fields for the mcomix.process.Process object and there is the problem of stderr (redirected to subprocess.PIPE):
:::python
proc = process.Process([self._get_executable()] +
self._get_extract_arguments() +
[self.archive])
fd = proc.spawn()
if not fd:
return
# FIXME: what if process is overly chatty on stderr?
proc._proc.stdin.close()
I wanted to change the prototype to spawn (and _exec) like this:
:::python
def spawn(self, stdout=subprocess.PIPE,
stdin=subprocess.PIPE, stderr=subprocess.PIPE):
but I saw this comment in _exec:
:::python
# Cannot spawn processes with PythonW/Win32 unless stdin and
# stderr are redirected to a pipe as well.
self._proc = subprocess.Popen(self._args, stdout=subprocess.PIPE,
stdin=subprocess.PIPE, stderr=subprocess.PIPE,
startupinfo=self._startupinfo())
The message for the related commit is:
Fix MComix failing to start on Win32 using PythonW
Apparently, Python cannot spawn processes on Win32 when MComix is
started with PythonW instead of Python, since stdin/stdout/stderr
are closed and thus subprocess.Popen raises WindowsError (invalid
handle). Work around by piping both stdin as well as stderr.
What if stdin/stderr are redirected to os.devnull instead? Can someone with access to a Windows machine check?
Another area of improvement is the prototypes for BaseArchive.extract and BaseArchive.extract_all:
:::python
def extract(self, filename, destination_path):
[...]
def extract_all(self, entries, callback):
For consistency with extract, extract_all takes a list of (filename, destination_path), but I'd like to simplify those prototypes to:
:::python
def extract(self, filename, destination_directory):
[...]
def extract_all(self, entries, destination_directory, callback):
With entries a simple list of filenames. What do you think?
Oh, and there are few missing docstrings too!
Here is an updated series, which include:
BaseArchive.extract/BaseArchive.extract_allprototypes changemcomix.processto allow overriding werestdin/stderrare redirected and to default to redirectingstderrto/dev/null(or platform equivalent)I have checked the changes to
mcomix.processfor Windows using Wine, but it would be better if someone with access to a real Windows machine gave it a try.I haven't actually tested the patch yet (too tired to process changes of this impact at the moment), but I wonder if extracting archives this way wouldn't be a much better idea than extracting file by file, even on non-solid archives. My reasoning is this: You can never rely on 7z to output ``correct'' filenames on stdout, especially on Windows. For example, one compressed file could be あえうお.jpg, but when asking 7z to list the archive contents, it would turn out as ????.jpg. Thus, if you ask 7z to extract ????.jpg, it will of course do nothing, because it cannot find the file you asked for.
If you extracted everything at once, the file names would still be wrong, but I feel that wouldn't be that much of a problem if it at least could display the contents.
You are right: I created a number of archives with non-ASCII filenames, and while everything works fine on Linux, I'm having troubles on Windows. But with my patch, a solid 7z archive with non-ASCII filenames is displayed okay (minus some characters conversion to _ in the names). Note that again I'm testing in Wine, so, maybe some of my issues are Wine issues.
I didn't test this on Windows either but, in general, if there is a way to get the correct names and the actual file sizes in the same order the files will be extracted from a solid archive, this isn't that much of an issue. This is because we only need the file sizes to split the output stream, but we don't need the respective filenames.
The pipe thingy works on Windows. I have to admit that the speedup is quite awesome on solid archives. However, between 0001-better-support-for-solid-archives.patch and 0002-better-support-for-solid-archives.patch, something seems to have gone wrong. MComix no longer extracts files from certain archives with errors akin to
16:11:52 [Thread-1-extract] ERROR: ! Extraction error: [Errno 2] No such file or directory: u'/tmp/mcomix.A3DgjR/dirname/00.jpg'
I believe the problem is that the archives in question contain directories - archives without any directories in them seem to work.
Right, I was aware of that, sorry for not updating the ticket, I'll fix it.
However, after thinking about it, I think it would be better for MComix to not use the archive name for the temporary path name, as it can be a security risk with tar archives:
Viewing this archive will create files outside the temporary directory!
This should only be a problem for tar files, if I understand correctly? In any case, you are right, file names should be sanitized first before extraction. Committed some code to do just that.
Well, I don't know if its possible to craft some malicious ZIP, RAR, or 7z archive. Another problem on Windows could be case sensitivity. Anyway, it works fine now with the example above.
Note:
mcomix/archive/rarfile.pyshould use `_unicode_filename/_original_filename' too, no?Not necessarily, since everything that comes out of UnrarDll is already Unicode. I guess all names could be secured with
_replace_invalid_filesystem_charsjust in case, but I have yet to see a RAR archive that couldn't be opened with UnrarDll without problems, so it's probably not high priority.Do you mean you usually see issues with UnrarDll?
Sorry, never mind about the last comment. But yeah,
_unicode_filename/_original_filename' should be called so_replace_invalid_filesystem_chars` is called.My bad. In fact, I meant just the opposite. I had a 100% success rate with UnrarDll since its introduction, meaning every single RAR archive I obtained and opened in the past 2? 3? years worked.
Here is an updated patch that correctly handle directories in archives.
Edit: remove debug print.
Last edit: Benoit Pierre 2014-05-28
And regarding Unicode handling when listing/extracting 7z archives, I think I found a way to do it. See the attached patch.
Note: again, tested with Wine. Python versions before 3 something are supposed to have issues because CreateProcess is used instead of CreateProcessW, but I had no issue when testing. If this in fact does not work on a real Windows machine (because the command line to ask 7z to extract an entry does not work with Unicode char), then I think a possible workaround would be to use a temporary file and the
-i@tempfileswitch to ask for a specific archive member.Well, those certainly are interesting results. For one, because -scc isn't actually listed as available option when calling only "7z". For another, on my Linux machine, adding -scc always results in "Incorrect command line" error from 7z, even though it works on my Windows machine. The funny thing is, both are presumably "7-Zip 9.20", though it might make a difference that on my Linux machine, there is also a "p7zip 9.20" line.
That being said, the success rate on Windows is about 50% right now - some archives work, some don't at this point. I'll try to look into it further.
D'oh! Same issue on Linux. Using p7zip too (9.20.1). So this flag should only be added on Windows (unicode names already work fine for me on Linux without this flag). I hope locale settings don't affect the output too much too...
Oh, and yeah, only found about it after searching the 7z forums since its indeed not listed as an available option.
It would appear that using this -scc option does indeed fix the encoding of the file names in list_contents (on Windows), but apparently the option does nothing when added to the extract argument list - passing back file names exactly as received still mostly results in "No files to process" from 7z, indicating that the name somehow got corrupted (or maybe 7z simply ignores the switch for its own command line parameters). This whole encoding shit is fucked up on Windows, anyway.
Now, this problem could be avoided by unconditionally using "is_solid = True", which would be quite good - well, except for the use cases where the whole archive isn't needed (or wanted), such as when generating a thumbnail from File->Open or from the library view.
OK, and how about with this patch?
Also, what if the actual archive name contains Unicode chars? Does it work?
Yeah, that works. Thanks for figuring this out!
My only concern at this point is that 7z on Linux may or may not output lines encoded in UTF-8. Granted, most modern systems probably do, as you'd be hard-pressed to find a distribution that hasn't made the switch yet. But who knows?
Yeah, I tried the following:
And archive names are not correctly listed (but header names are not changed).
Updated patch (I thought I had checked, but I missed a leftover call to
_get_extract_argumentsinsevenzip.py).