Menu

#35 better support for solid archives

SVN
closed-fixed
nobody
None
5
2015-02-14
2014-05-26
No

First pass at adding better support for solid archives: when an archive is solid, extract all files in one pass (in the order they are archived). The most improved use case is with 7z archives (e.g. from 7m33s to 13s to extract a 167Mo archive containing 146 images).

Areas of improvements:

The code in sevenzip.py access internal fields for the mcomix.process.Process object and there is the problem of stderr (redirected to subprocess.PIPE):

:::python
        proc = process.Process([self._get_executable()] +
            self._get_extract_arguments() +
            [self.archive])
        fd = proc.spawn()
        if not fd:
            return

        # FIXME: what if process is overly chatty on stderr?
        proc._proc.stdin.close()

I wanted to change the prototype to spawn (and _exec) like this:

:::python
    def spawn(self, stdout=subprocess.PIPE,
              stdin=subprocess.PIPE, stderr=subprocess.PIPE):

but I saw this comment in _exec:

:::python
            # Cannot spawn processes with PythonW/Win32 unless stdin and
            # stderr are redirected to a pipe as well.
            self._proc = subprocess.Popen(self._args, stdout=subprocess.PIPE,
                    stdin=subprocess.PIPE, stderr=subprocess.PIPE,
                    startupinfo=self._startupinfo())

The message for the related commit is:

Fix MComix failing to start on Win32 using PythonW

Apparently, Python cannot spawn processes on Win32 when MComix is
started with PythonW instead of Python, since stdin/stdout/stderr
are closed and thus subprocess.Popen raises WindowsError (invalid
handle). Work around by piping both stdin as well as stderr.

What if stdin/stderr are redirected to os.devnull instead? Can someone with access to a Windows machine check?

Another area of improvement is the prototypes for BaseArchive.extract and BaseArchive.extract_all:

:::python
    def extract(self, filename, destination_path):

[...]

    def extract_all(self, entries, callback):

For consistency with extract, extract_all takes a list of (filename, destination_path), but I'd like to simplify those prototypes to:

:::python
    def extract(self, filename, destination_directory):

[...]

    def extract_all(self, entries, destination_directory, callback):

With entries a simple list of filenames. What do you think?

Oh, and there are few missing docstrings too!

1 Attachments

Related

Bugs: #73

Discussion

  • Benoit Pierre

    Benoit Pierre - 2014-05-26

    Here is an updated series, which include:

    • BaseArchive.extract/BaseArchive.extract_all prototypes change
    • changes to mcomix.process to allow overriding were stdin/stderr are redirected and to default to redirecting stderr to /dev/null (or platform equivalent)

    I have checked the changes to mcomix.process for Windows using Wine, but it would be better if someone with access to a real Windows machine gave it a try.

     
  • Oddegamra

    Oddegamra - 2014-05-26

    I haven't actually tested the patch yet (too tired to process changes of this impact at the moment), but I wonder if extracting archives this way wouldn't be a much better idea than extracting file by file, even on non-solid archives. My reasoning is this: You can never rely on 7z to output ``correct'' filenames on stdout, especially on Windows. For example, one compressed file could be あえうお.jpg, but when asking 7z to list the archive contents, it would turn out as ????.jpg. Thus, if you ask 7z to extract ????.jpg, it will of course do nothing, because it cannot find the file you asked for.

    If you extracted everything at once, the file names would still be wrong, but I feel that wouldn't be that much of a problem if it at least could display the contents.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-26

    You are right: I created a number of archives with non-ASCII filenames, and while everything works fine on Linux, I'm having troubles on Windows. But with my patch, a solid 7z archive with non-ASCII filenames is displayed okay (minus some characters conversion to _ in the names). Note that again I'm testing in Wine, so, maybe some of my issues are Wine issues.

     
  • Ark

    Ark - 2014-05-26

    I didn't test this on Windows either but, in general, if there is a way to get the correct names and the actual file sizes in the same order the files will be extracted from a solid archive, this isn't that much of an issue. This is because we only need the file sizes to split the output stream, but we don't need the respective filenames.

     
  • Oddegamra

    Oddegamra - 2014-05-28

    The pipe thingy works on Windows. I have to admit that the speedup is quite awesome on solid archives. However, between 0001-better-support-for-solid-archives.patch and 0002-better-support-for-solid-archives.patch, something seems to have gone wrong. MComix no longer extracts files from certain archives with errors akin to

    16:11:52 [Thread-1-extract] ERROR: ! Extraction error: [Errno 2] No such file or directory: u'/tmp/mcomix.A3DgjR/dirname/00.jpg'

    I believe the problem is that the archives in question contain directories - archives without any directories in them seem to work.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Right, I was aware of that, sorry for not updating the ticket, I'll fix it.

    However, after thinking about it, I think it would be better for MComix to not use the archive name for the temporary path name, as it can be a security risk with tar archives:

    :::shell
    > tar -cvf test.tar --xform='s,^,../,' *.jpeg
    > tar -tf test.tar | head
    tar: Removing leading `../' from member names
    ../0001.jpeg
    ../0002.jpeg
    ../0003.jpeg
    ../0004.jpeg
    ../0005.jpeg
    ../0006.jpeg
    ../0007.jpeg
    ../0008.jpeg
    ../0009.jpeg
    ../0010.jpeg
    

    Viewing this archive will create files outside the temporary directory!

     
  • Oddegamra

    Oddegamra - 2014-05-28

    This should only be a problem for tar files, if I understand correctly? In any case, you are right, file names should be sanitized first before extraction. Committed some code to do just that.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Well, I don't know if its possible to craft some malicious ZIP, RAR, or 7z archive. Another problem on Windows could be case sensitivity. Anyway, it works fine now with the example above.

    Note: mcomix/archive/rarfile.py should use `_unicode_filename/_original_filename' too, no?

     
  • Oddegamra

    Oddegamra - 2014-05-28

    Not necessarily, since everything that comes out of UnrarDll is already Unicode. I guess all names could be secured with _replace_invalid_filesystem_chars just in case, but I have yet to see a RAR archive that couldn't be opened with UnrarDll without problems, so it's probably not high priority.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Do you mean you usually see issues with UnrarDll?

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Sorry, never mind about the last comment. But yeah, _unicode_filename/_original_filename' should be called so_replace_invalid_filesystem_chars` is called.

     
  • Oddegamra

    Oddegamra - 2014-05-28

    My bad. In fact, I meant just the opposite. I had a 100% success rate with UnrarDll since its introduction, meaning every single RAR archive I obtained and opened in the past 2? 3? years worked.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Here is an updated patch that correctly handle directories in archives.

    Edit: remove debug print.

     

    Last edit: Benoit Pierre 2014-05-28
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    And regarding Unicode handling when listing/extracting 7z archives, I think I found a way to do it. See the attached patch.

    Note: again, tested with Wine. Python versions before 3 something are supposed to have issues because CreateProcess is used instead of CreateProcessW, but I had no issue when testing. If this in fact does not work on a real Windows machine (because the command line to ask 7z to extract an entry does not work with Unicode char), then I think a possible workaround would be to use a temporary file and the -i@tempfile switch to ask for a specific archive member.

     
  • Oddegamra

    Oddegamra - 2014-05-28

    Well, those certainly are interesting results. For one, because -scc isn't actually listed as available option when calling only "7z". For another, on my Linux machine, adding -scc always results in "Incorrect command line" error from 7z, even though it works on my Windows machine. The funny thing is, both are presumably "7-Zip 9.20", though it might make a difference that on my Linux machine, there is also a "p7zip 9.20" line.

    That being said, the success rate on Windows is about 50% right now - some archives work, some don't at this point. I'll try to look into it further.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    D'oh! Same issue on Linux. Using p7zip too (9.20.1). So this flag should only be added on Windows (unicode names already work fine for me on Linux without this flag). I hope locale settings don't affect the output too much too...

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Oh, and yeah, only found about it after searching the 7z forums since its indeed not listed as an available option.

     
  • Oddegamra

    Oddegamra - 2014-05-28

    It would appear that using this -scc option does indeed fix the encoding of the file names in list_contents (on Windows), but apparently the option does nothing when added to the extract argument list - passing back file names exactly as received still mostly results in "No files to process" from 7z, indicating that the name somehow got corrupted (or maybe 7z simply ignores the switch for its own command line parameters). This whole encoding shit is fucked up on Windows, anyway.

    Now, this problem could be avoided by unconditionally using "is_solid = True", which would be quite good - well, except for the use cases where the whole archive isn't needed (or wanted), such as when generating a thumbnail from File->Open or from the library view.

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Also, what if the actual archive name contains Unicode chars? Does it work?

     
  • Oddegamra

    Oddegamra - 2014-05-28

    Yeah, that works. Thanks for figuring this out!

    My only concern at this point is that 7z on Linux may or may not output lines encoded in UTF-8. Granted, most modern systems probably do, as you'd be hard-pressed to find a distribution that hasn't made the switch yet. But who knows?

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Yeah, I tried the following:

    LC_ALL=fr_FR.ISO-8859-1 7z l -slt a/test-7z-unicode.cbz
    

    And archive names are not correctly listed (but header names are not changed).

     
  • Benoit Pierre

    Benoit Pierre - 2014-05-28

    Updated patch (I thought I had checked, but I missed a leftover call to _get_extract_arguments in sevenzip.py).

     
  • Ark

    Ark - 2015-02-14
    • status: open --> closed-fixed
     

Log in to post a comment.

MongoDB Logo MongoDB