
7-Zip is creating highly fragmented files when decompressing big files

  • Iván Pérez

    Iván Pérez - 2015-10-17

    Hi!

    I've just decompressed a ZIP file (3.5 GB) containing a 12 GB file, and I've seen that the decompressed file has become highly fragmented (10,000+ fragments!), despite there being enough room on the partition to store it in a single chunk. The partition had about 50 GB free before decompression, and the biggest free chunk was around 18 GB.

    So I removed the uncompressed file and decompressed the archive again, this time using WinRAR. The resulting file has only 2 fragments, so I think 7-Zip could do something to avoid such fragmentation, which doesn't occur with other tools.

    The partition where I did the test is not the system partition, and between the tests no file has been added, modified or deleted at all.

    I'm using 7-Zip 15.09 and Windows 7 x64. To see the file fragments and the free chunks available, I used MyDefrag.

    Thanks!

     
  • Iván Pérez

    Iván Pérez - 2015-10-17

    I've been investigating, while typing this post, how the problem could be solved, and I've found that the answer is to pre-allocate the space for the file to be decompressed - and that's surely what WinRAR is doing: http://stackoverflow.com/questions/7970333/how-do-you-pre-allocate-space-for-a-file-in-c-c-on-windows

    It looks pretty simple, so I think it would be a really good idea to add that to the decompression code.
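
    For reference, something like this minimal sketch is what I have in mind (just an illustration based on the Stack Overflow answer above; the helper name is made up, and it assumes the uncompressed size is known from the archive headers):

    /* Minimal pre-allocation sketch: set the file pointer to the final size and
       call SetEndOfFile() before writing any data, so the file system can try to
       reserve one contiguous run up front. Illustrative only. */
    #include <windows.h>

    HANDLE CreatePreallocated(const wchar_t *path, LONGLONG finalSize)
    {
        HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return h;

        LARGE_INTEGER size, zero = { 0 };
        size.QuadPart = finalSize;              /* known from the archive headers */

        if (!SetFilePointerEx(h, size, NULL, FILE_BEGIN) || !SetEndOfFile(h))
        {
            CloseHandle(h);                     /* could not reserve the space */
            DeleteFileW(path);
            return INVALID_HANDLE_VALUE;
        }

        /* Rewind; the caller then writes the decompressed data sequentially. */
        SetFilePointerEx(h, zero, NULL, FILE_BEGIN);
        return h;
    }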

    Thanks again!

     
  • Igor Pavlov

    Igor Pavlov - 2015-10-17

    I suppose there are some pros and cons to that solution.
    It would be better if Windows could select an optimal strategy for storing the file without an additional call.

     
  • Iván Pérez

    Iván Pérez - 2015-10-17

    Yeah, but Windows has no way to determine the file size before the decompression finishes.

    Pre-allocation would also solve another problem. Imagine you start decompressing a 10 GB file while you have 10.5 GB of free space, and at the same time you start copying a 1 GB file to that partition. If the copy finishes before the decompression, the decompression will fail because there's no space left, and that won't be detected until the disk is full. With pre-allocation, the whole 10 GB would be reserved up front, so it would be the copy operation that fails right at the start, instead of the decompression failing halfway through.

    I can't find any cons to pre-allocation. Windows Explorer also pre-allocates when copying files.

     
  • Igor Pavlov

    Igor Pavlov - 2015-10-17

    I'm not sure these solutions work well across different versions of Windows.
    Are you sure it will not write zeros on a FAT file system under Windows XP or Windows 2000?

    Windows can cache writes for 10-20 seconds (that's 100 MB or more for zip). So it can see that the fragment is big.
    Maybe a smart file system driver in Windows 10, or in some future Windows, can guess that we are writing a big file, so it can use big fragments.

    Did you test it in Windows 10?

     
    • enter name here

      enter name here - 2016-12-12

      "It would be better if Windows could select an optimal strategy for storing the file without an additional call."
      Yes! But you're not giving it any hint about how big the file will be. Creating a file, seeking to the desired position, and then writing a single byte can be thought of as such a hint.

      "Are you sure it will not write zeros on a FAT file system under Windows XP or Windows 2000?"
      Yes, it does... and it doesn't depend on the OS. That I had not tested before. Gosh :-/
      Attached is a screenshot of my test code and the results from ProcMon.

      "Did you test it in Windows 10?"
      Yes, I did, on Windows 10 with NTFS. No difference from my experience with NTFS on older systems.
      It's not guaranteed to produce a file with strictly 1 fragment every time, but most of the time the number of fragments is 1, and otherwise it is far lower than with a sequential stream write.

       
  • Iván Pérez

    Iván Pérez - 2015-10-17

    I can test it on Windows 10, but not at the moment. When I get the chance, I'll post back what I see on W10.

    It looks like such a feature would only work on NTFS volumes. On FAT/FAT32, SetEndOfFile either doesn't write zeroes or, if it does, it takes a very long time (it has to actually write them to disk, unlike on NTFS).

    I don't know how Windows does pre-allocation on FAT volumes, but the entire file size does get pre-allocated when copying a big file to a FAT partition (it probably doesn't need to zero-fill first; it just writes into the allocated sectors, which are overwritten immediately).

    So maybe you could detect the file system and use that feature only on NTFS, if zero-filling is required first. I don't know whether not writing zeroes on FAT beforehand would actually be a problem, though...
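
    A rough sketch of such a check (untested; the helper name is made up, and the root path would come from the extraction target):

    /* Rough sketch: enable pre-allocation only when the target volume is NTFS.
       rootPath is the root of the drive being extracted to, e.g. L"C:\\". */
    #include <windows.h>
    #include <wchar.h>

    BOOL VolumeIsNtfs(const wchar_t *rootPath)
    {
        wchar_t fsName[MAX_PATH + 1] = L"";
        if (!GetVolumeInformationW(rootPath, NULL, 0, NULL, NULL, NULL,
                                   fsName, MAX_PATH + 1))
            return FALSE;                   /* unknown: play it safe, no prealloc */
        return _wcsicmp(fsName, L"NTFS") == 0;
    }

    The extraction code would then take the pre-allocation path only when this returns TRUE, and fall back to plain sequential writing otherwise.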

     

    Last edit: Iván Pérez 2015-10-17
  • Shell

    Shell - 2015-10-17

    I would like to mention that pre-allocation has already been suggested in previous topics and I have commented on that idea. I also see no drawbacks to pre-allocation and I vote again for it to be implemented in 7-Zip; however, the case when there is not enough space for the extracted file requires some attention.

    To Igor Pavlov: the NTFS allocation strategy tries to find the optimal gap for each written fragment, so if you write 128K chunks, you can end up with fragments of that size. This strategy may not depend on the cache; I don't know for sure, but my previous tests resulted in fragments of several megabytes each. I also do not share your opinion on the OS selecting the optimal strategy - I think it is the responsibility of the software to give appropriate hints to the OS.

    I do not remember whether Windows zeroes out the pre-allocated space upon reading, but SetFilePointer() itself executes very quickly. WinRAR relies on SetFilePointer()+SetEndOfFile(), and its users do not complain about extraction delays. By the way, NtCreateFile() has a parameter that can specify the initial file size; it is a pity the Win32 API does not expose it. The only drawback on FAT is that pre-allocation would (I guess) have to initialize the FAT chain, which may take some time. On the other hand, this is still faster than appending clusters one by one (and searching for yet another free cluster takes even more time).
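
    For what it's worth, here is a hedged sketch of how that initial size could be passed by calling ntdll's NtCreateFile() directly (illustrative only: the helper name is made up, the path must be given in NT form such as \??\C:\..., and I have not verified it on every Windows version):

    /* Hedged sketch: create the output file with an up-front AllocationSize
       hint via ntdll's NtCreateFile. Illustrative only. */
    #include <windows.h>
    #include <winternl.h>
    #include <wchar.h>
    #include <string.h>

    /* Native API values, in case the SDK header does not define them. */
    #ifndef FILE_OVERWRITE_IF
    #define FILE_OVERWRITE_IF            0x00000005
    #endif
    #ifndef FILE_SYNCHRONOUS_IO_NONALERT
    #define FILE_SYNCHRONOUS_IO_NONALERT 0x00000020
    #endif
    #ifndef FILE_NON_DIRECTORY_FILE
    #define FILE_NON_DIRECTORY_FILE      0x00000040
    #endif
    #ifndef OBJ_CASE_INSENSITIVE
    #define OBJ_CASE_INSENSITIVE         0x00000040
    #endif

    typedef NTSTATUS (NTAPI *NtCreateFile_t)(
        PHANDLE, ACCESS_MASK, POBJECT_ATTRIBUTES, PIO_STATUS_BLOCK,
        PLARGE_INTEGER, ULONG, ULONG, ULONG, ULONG, PVOID, ULONG);

    HANDLE CreateWithAllocationSize(const wchar_t *ntPath, LONGLONG size)
    {
        NtCreateFile_t pNtCreateFile = (NtCreateFile_t)
            GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtCreateFile");
        if (!pNtCreateFile)
            return INVALID_HANDLE_VALUE;

        UNICODE_STRING name;                         /* e.g. L"\\??\\C:\\tmp\\out.bin" */
        name.Buffer = (PWSTR)ntPath;
        name.Length = (USHORT)(wcslen(ntPath) * sizeof(wchar_t));
        name.MaximumLength = name.Length;

        OBJECT_ATTRIBUTES oa;
        memset(&oa, 0, sizeof(oa));
        oa.Length = sizeof(oa);
        oa.ObjectName = &name;
        oa.Attributes = OBJ_CASE_INSENSITIVE;

        IO_STATUS_BLOCK iosb;
        LARGE_INTEGER alloc;
        alloc.QuadPart = size;                       /* the pre-allocation hint */

        HANDLE h = NULL;
        NTSTATUS st = pNtCreateFile(&h, GENERIC_WRITE | SYNCHRONIZE, &oa, &iosb,
                                    &alloc, FILE_ATTRIBUTE_NORMAL, 0,
                                    FILE_OVERWRITE_IF,
                                    FILE_NON_DIRECTORY_FILE | FILE_SYNCHRONOUS_IO_NONALERT,
                                    NULL, 0);
        return (st >= 0) ? h : INVALID_HANDLE_VALUE; /* NT_SUCCESS means status >= 0 */
    }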

     
  • Iván Pérez

    Iván Pérez - 2015-10-21

    I've just checked on Windows 10 and the problem is the same. After decompressing a different ZIP file, the resulting 1.2 GB file has 170 fragments, even though the partition has 120 GB free.

    While decompressing, the write cache went up to 300-320 MB. I got that figure from Task Manager, under Performance > Memory (the "Modified" value) - I don't know if that's the right place to look.

     

    Last edit: Iván Pérez 2015-10-21
  • enter name here

    enter name here - 2016-11-12

    I'm reviving this old thread because I've also seen high fragmentation caused by 7-Zip extraction, and I think there's a simple solution to it.

    First, a word of explanation. If you treat a file as a stream of unknown size, the operating system has no way to allocate contiguous space for it. It just takes the first free block, and once that is full, looks for the next one wherever it happens to be, and so on. You can't do much about that.

    I think you can definitely prevent it for extraction; I'm not sure about archive creation, but maybe once extraction is fixed, something can be worked out for creating archives too.

    You just need to create the file, immediately seek to the position equal to the file length (and you do know the uncompressed length, don't you?), write a byte, then seek back to the beginning. What you get is a non-fragmented file (as long as there's enough contiguous space on the disk). Simple, isn't it? I've seen it work well on different platforms (a rough sketch also appears near the end of this post, in case the pastebin link below ever goes away).

    Here's the working code to illustrate what I mean:
    http://pastebin.com/GtqAvSrp

    Here's the fragmentation report for this file:
    c:\temp>contig -a extracted-file

    Contig v1.7 - Makes files contiguous
    Copyright (C) 1998-2012 Mark Russinovich
    Sysinternals - www.sysinternals.com

    c:\temp\extracted-file is defragmented

    Summary:
    Number of files processed : 1
    Average fragmentation : 1 frags/file

    I'd like all software vendors to use this method; it would improve computers' performance a lot in many cases...
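
    In case the pastebin link ever stops working, here is a rough, self-contained sketch of the same idea (this is not the original pastebin code; the file name and size are placeholders):

    /* Rough sketch of the "seek to the end, write one byte, seek back" hint.
       The file name and size are placeholders. */
    #include <windows.h>

    int main(void)
    {
        const LONGLONG fileSize = 1000000;   /* the known uncompressed size */
        HANDLE h = CreateFileW(L"extracted-file", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* Seek to (size - 1) and write a single byte: the file system now knows
           the final length and can try to allocate one contiguous run. */
        LARGE_INTEGER pos;
        pos.QuadPart = fileSize - 1;
        SetFilePointerEx(h, pos, NULL, FILE_BEGIN);

        BYTE b = 0;
        DWORD written = 0;                   /* must not be NULL for non-overlapped
                                                writes, per the WriteFile docs */
        WriteFile(h, &b, 1, &written, NULL);

        /* Seek back to the beginning and stream the real data as usual. */
        pos.QuadPart = 0;
        SetFilePointerEx(h, pos, NULL, FILE_BEGIN);

        CloseHandle(h);
        return 0;
    }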

    Thank you!

     
  • v77

    v77 - 2016-11-13

    I use SetFilePointer()+SetEndOfFile() for creating large empty files, and when it is applied to a FAT volume, zeros are written for the whole file. This is because unlike NTFS, FAT does not support invalid data areas.

    But a compromise could be to detect the file system on which the archive is decompressed: if it is NTFS, the space is preallocated. Otherwise, no preallocation.

    I never tried NtCreateFile, but I am not very optimistic about that. This kind of rule belongs to the file system, not the API, and SetEndOfFile tends to confirm that. So checking the file system seems to me the best solution.

    @enter name here:
    Your code is invalid. It will write 1,000,000 bytes, even on NTFS (because, except for sparse files, invalid data areas are only possible at the end of a file, not at the beginning), and your call to WriteFile is not correct: lpNumberOfBytesWritten can be NULL only when the lpOverlapped parameter is not NULL.

     
  • enter name here

    enter name here - 2016-11-15

    @v77

    "Your code is invalid. It will write 1,000,000 bytes"
    That's what I wanted to do :))) Write 1,000,000 zeros into a file that is not fragmented. Using streams (which 7-Zip currently does) makes the resulting file very fragmented. This method reduces that problem to a minimum.
    "your call to WriteFile is not correct: lpNumberOfBytesWritten"
    I know; that was not the main point of my example, and you've missed that point completely. The code does what I wanted - it creates a file which is not fragmented most of the time.

    Hopefully Igor will be able to implement something like this. While it's not always perfect, most of the time it creates files which are contiguous.

     
    • v77

      v77 - 2016-11-21

      So you are suggesting writing the file twice, for the sole purpose of keeping it non-fragmented? I doubt that "it would improve computers' performance a lot in many cases..."

      Do you at least have some tests showing that this method produces a less fragmented file on the FAT file systems? I would not be surprised if the result were identical...

      If your code does what you wanted, it's only by luck. For software used by millions of people, you should at least read the documentation of the functions you are using.

       
      • enter name here

        enter name here - 2016-12-12

        You're committed to trolling, I'm not going to waste my time arguing with that. End of topic.

         
  • AmigoJack

    AmigoJack - 2016-11-18

    +1

    Just implement it along with an option in the settings, so each 7-Zip user can choose for themselves whether to use pre-allocation or not.

     
