Slow extraction of archives with a lot of files.

Hui Buuh
2016-12-31
2016-12-31
  • Hui Buuh

    Hui Buuh - 2016-12-31

    Hey

    I had to extract .7z archives with a lot of files (>200'000) in one directory. This took about 90 minutes on my machine. While it was extracting, I put all those files into a .tar, compressed that tar with 7za, downloaded it, and extracted it in <15 minutes.

    I tried to change the "overwrite mode" to always overwrite/skip so there wouldn't be a check anymore, but that had no effect on extraction time.

    So I checked the source to see where the extraction gets slow. The part that slows extraction down considerably after ~10'000 files is here:

    CPP/7zip/UI/Common/ArchiveExtractCallback.cpp

    The function GetStream is called for every file in the archive to be extracted.

    CArchiveExtractCallback::GetStream (starts at line 599; the check is at lines 1078-1082):

      // ----- Is file (not split) -----
      NFind::CFileInfo fileInfo;
      if (fileInfo.Find(fullProcessedPath))
      {
        switch (_overwriteMode)
    

    To me this looks like a check whether the file already exists, so that the user can be asked whether to overwrite it, skip it, and so on.

    I changed the slow file exists code to this:

    AString aFullProcessedPath = UnicodeStringToMultiByte(fullProcessedPath, CP_ACP);
    const char * cpFullProcessedPath = (const char *)aFullProcessedPath;
    bool bfileInfoFindFullProcessedPath = ( access( cpFullProcessedPath, F_OK ) != -1 );
    

    Which cut the 90 minute extraction time down to <3 minutes.

    I just copied the string conversions from other parts of the source, as I have no clue how those are handled. The important part is using the "access" function to check if the file exists instead of "fileInfo.Find".
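
For readers who want to try the idea outside the 7-Zip source tree, here is a minimal standalone sketch of the same existence check. The helper name PathExists and the use of std::string are assumptions made for illustration; they are not part of p7zip, which has its own string classes.

```cpp
#include <string>
#include <unistd.h>   // access, F_OK

// Hypothetical helper (not from p7zip): returns true if something exists
// at `path`. access() with F_OK asks the kernel to resolve the name
// directly instead of enumerating the parent directory.
bool PathExists(const std::string &path)
{
  return access(path.c_str(), F_OK) == 0;
}
```

Note that access() follows symbolic links, so a dangling symlink is reported as not existing. If that matters for the overwrite check, lstat() would be the call to look at instead.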

    My question is: "Is this very expensive way of checking if the file exists necessary?"

     
  • Igor Pavlov

    Igor Pavlov - 2016-12-31

    What version of Windows?
    What filesystem in output folder (NTFS / FAT)?
    Is it local computer volume or network folder?
    Do you have any antivirus software?

     
  • Hui Buuh

    Hui Buuh - 2016-12-31

    I think I didn't express myself very well lol.

    I did this on a Linux machine with ext4 file systems. The slowness is in the way FindFirst is done in p7zip.
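
To illustrate why this kind of lookup can get so slow, here is a sketch that contrasts the two approaches. It assumes, without having been checked against the actual p7zip code, that the FindFirst emulation walks the directory entry by entry until it finds a matching name. Both function names are made up for this example.

```cpp
#include <cstring>
#include <string>
#include <dirent.h>    // opendir, readdir, closedir
#include <sys/stat.h>  // lstat

// Assumed shape of a FindFirst-style lookup: read directory entries one by
// one until the name matches. The cost grows with the number of entries.
bool ExistsByScan(const std::string &dir, const std::string &name)
{
  DIR *d = opendir(dir.c_str());
  if (!d)
    return false;
  bool found = false;
  while (struct dirent *e = readdir(d))
  {
    if (std::strcmp(e->d_name, name.c_str()) == 0)
    {
      found = true;
      break;
    }
  }
  closedir(d);
  return found;
}

// Direct lookup: a single lstat() call, the kernel resolves the name itself.
bool ExistsByStat(const std::string &dir, const std::string &name)
{
  struct stat st;
  return lstat((dir + "/" + name).c_str(), &st) == 0;
}
```

Under that assumption, extracting n new files into one directory means each check scans everything extracted so far without finding a match, which adds up to roughly n*n/2 entry reads. For 200'000 files that is about 2 * 10^10 reads, compared to 200'000 lstat() calls for the direct lookup.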

    On Windows the difference is very small. I ran a test with 200'000 files in a directory on Windows:
    10'000 calls to FindFirstFile took ~240 ms.
    10'000 calls to PathFileExists took ~190 ms.
    10'000 calls to GetFileAttributes took ~190 ms.
    So on Windows, using FindFirstFile doesn't slow down archive extraction significantly.

    I want to fix this problem in p7zip, but I don't want to spend time making FindFirst faster if I can simply check for the existence of the file in a MUCH faster way with almost no effort.

    My problem is that I don't know whether the pattern matching of FindFirstFile is necessary in that context, or whether those paths never contain wildcards (* or ?).

     
  • Igor Pavlov

    Igor Pavlov - 2016-12-31

    The path doesn't contain wildcards there.
    Probably the p7zip developers should try to optimize it.
    Maybe it's simpler to optimize the code of
    NFind::CFileInfo::Find()
    in
    Windows/FileFind.cpp
    As far as I remember, CFileInfo::Find() doesn't need the wildcard check for the other calls in 7-Zip either.

    You can try to change Find() code.
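
One possible shape for such a change is sketched below: take a direct lookup when the path has no wildcard characters, and only fall back to the pattern scan otherwise. All names here (HasWildcard, FindResult, FindFast) are hypothetical and do not exist in 7-Zip or p7zip. The real Find() also fills in a CFileInfo structure, which this sketch replaces with a plain struct stat.

```cpp
#include <string>
#include <sys/stat.h>  // lstat

// Hypothetical helper: true if the path contains a wildcard character.
bool HasWildcard(const std::string &path)
{
  return path.find_first_of("*?") != std::string::npos;
}

enum class FindResult { Found, NotFound, NeedsPatternScan };

// Hypothetical fast path: if there is nothing to pattern-match, ask the
// kernel directly with lstat(). Otherwise tell the caller to run the
// existing directory scan.
FindResult FindFast(const std::string &path, struct stat &st)
{
  if (HasWildcard(path))
    return FindResult::NeedsPatternScan;
  return lstat(path.c_str(), &st) == 0 ? FindResult::Found
                                       : FindResult::NotFound;
}
```

Keeping the fallback means that callers which do pass wildcards, if there are any, would behave as before. On POSIX file systems * and ? are legal in file names, so a file literally named with one of those characters would simply take the slower path.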

    And you should probably post a message about this problem on the p7zip forum.

     
