#38 QuaZip is extremely slow when selecting a file in ZIPs with many files

v1.0_(example)
closed-fixed
nobody
None
5
2014-01-26
2013-04-27
Richard
No

The function:
bool QuaZip::setCurrentFile(const QString& fileName, CaseSensitivity cs)

is extremely slow when processing zips containing many files.
In my application this function may take upwards of 300ms to complete if the file is located near the end of the directory.

Discussion

  • Richard
    Richard
    2013-04-27

    I'm starting some work on this and hope to offer a patch.

    The cause is this section of the function:

    p->hasCurrentFile_f=false;
    for(bool more=goToFirstFile(); more; more=goToNextFile()) {
        current=getCurrentFileName();
        if(current.isEmpty()) return false;
        if(sens) {
            if(current==fileName) break;
        } else {
            if(current.toLower()==lower) break;
        }
    }
    return p->hasCurrentFile_f;
    

    My current intention is that QuaZip will make a quickly-searchable map of the ZIP directory the first time it's asked for it, in both these two functions:

    bool QuaZip::setCurrentFile(const QString& fileName, CaseSensitivity cs)
    bool QuaZipPrivate::getFileInfoList(QList<TFileInfo> *result) const
    

    I'm intending that setCurrentFile() will result in a partial map, only including the parts of the ZIP directory it actually searched. This should not noticeably affect its standalone use in QuaZipFile.

    I think it should be possible to use these two functions in unzip.h to generate that map:

    extern int ZEXPORT unzGetFilePos(
    unzFile file,
    unz_file_pos* file_pos);
    extern int ZEXPORT unzGoToFilePos(
    unzFile file,
    unz_file_pos* file_pos);
    

    Questions:

    • I'm only intending to include the filename in the map.
      This will make repeated calls to QuaZip::getFileNameList() very fast, but have no effect on repeated calls to QuaZip::getFileInfoList().
      Does that matter to anyone?

    • When should the map be invalidated?
      Clearly it should invalidate whenever any overload of QuaZip::Open is called, but as I don't write to a ZIP file in my application I'm not sure where to catch appending a file.

     
    Last edit: Richard 2013-04-27
  • I think anyone abusing getFileInfoList() should just call it once instead, so no need to optimize anything there. The same goes for getFileNameList(), really, but if it becomes fast, it's just as good. Although it can break ordering, which isn't really important, but who knows? I'd not touch getFileNameList() just in case. But for setCurrentFile() it would be very useful!

    Reading and writing archives at the same time isn't supported anyway, so the map should be invalidated on call to close(). It doesn't make much sense to do it on open() as you can't call open() twice unless you call close() in between - that's checked.

     
    • Richard
      Richard
      2013-04-29

      I hope the attached patch file is usable as I've not created one before.

      This greatly speeds up the usage of QuaZip::setCurrentFile().

      In my tests on a ZIP containing 27,400 files, extracting one subfolder of the ZIP 'by filename' sped up from 33 minutes (0.5.1) to 15 seconds (patched).
      Runtime of QuaZip::setCurrentFilename() reduced from 250-300ms to 0ms (unmeasurable).

      This passes the unit tests and 'works for me' in my application, however I have not done extensive testing.

      Patch summary:
      The map is created lazily and each file is added to it whenever getCurrentFileName() is called.

      The first call to setCurrentFile() after Opening the ZIP will be almost unchanged.
      Later calls will either immediately use the map or continue from 'where it got to last time', and therefore go much faster as more and more of the ZIP directory is added to the map.

      Alternatively, calling getFileNameList() or getFileInfoList() will always completely fill the map, greatly speeding up later calls to setCurrentFile().

       
      Last edit: Richard 2013-04-29
  • OK, committed (r172). I replaced QMap with QHash (it is recommended to use for most purposes), removed "mutable" (I've heard it has portability problems) and made some other minor changes. Looks great to me.

    Now I think I'll add some tests for it.

     
    Last edit: Sergey A. Tachenov 2014-01-26
    • status: open --> closed-fixed