ISSUE:
Especially large files decompressed by 7-ZIP are created with significant file fragmentation on disk right from the start.
CAUSE:
7-ZIP seems to write decompressed file content out to disk incrementally, as it becomes available from the decompression engine.
Therefore a 100MB file will initially start with a length of zero bytes and then grow incrementally in small steps.
Since the operating system never knows up front how large the file will eventually become, it has to reserve new disk space for each of these growth steps.
-> In most cases this cannot be done contiguously on the hard disk, which creates many file fragments all over the disk at the very moment of file creation.
FIX/IMPROVEMENT:
=> Tell the OS how large the file will become in the end!
-> And since this happens during decompression, 7-ZIP knows exactly how large each uncompressed file will be.
HOWTO:
Just set the expected file size directly after initially creating the file.
On Windows using the Win32 API this would look something like this (without any error checking):
#include <windows.h>

HANDLE hFile;
LARGE_INTEGER liSize;

hFile = CreateFileA("somelargefile.bin", GENERIC_WRITE, FILE_SHARE_READ,
                    NULL, CREATE_NEW, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
liSize.QuadPart = 1234567890;                      /* the expected final size */
SetFilePointerEx(hFile, liSize, NULL, FILE_BEGIN);
SetEndOfFile(hFile);                               /* pre-allocates the disk space */
liSize.QuadPart = 0;
SetFilePointerEx(hFile, liSize, NULL, FILE_BEGIN); /* rewind before writing data */
The SetEndOfFile call is where it all happens in the OS.
On NTFS this call is very fast and still pre-allocates the full physical disk space for the file (see the documentation for SetEndOfFile and how it works together with SetFileValidData to avoid having to zero-fill the complete file, which would slow things down).
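For reference, a minimal sketch of that SetFileValidData variant (my own illustration, not 7-ZIP code); it assumes the hFile from the snippet above, with liSize set back to the expected final size, and that the process has the SE_MANAGE_VOLUME_NAME privilege enabled. Note that it exposes whatever stale data is in the allocated clusters, so it is only safe when the file will be completely overwritten anyway:
SetFilePointerEx(hFile, liSize, NULL, FILE_BEGIN); /* liSize = expected final size */
SetEndOfFile(hFile);                               /* allocate the clusters        */
SetFileValidData(hFile, liSize.QuadPart);          /* skip the lazy zero-fill      */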
CAVEAT: This method only works (i.e. prevents fragmentation) if the file is not configured as a SPARSE file (supported by NTFS, ext3/4, and many others).
On Windows, CreateFile never creates a SPARSE file; that would require an additional DeviceIoControl call (FSCTL_SET_SPARSE).
However, I've no idea if this is also true for other OS/other FS - I think so, but I'm not 100% sure.
On other OSes the same principle applies, as long as the OS and the file-system drivers support using this size hint to create less fragmented files.
For example, AFAIK, Linux works partially like this as well (e.g. see the "shake" defragmenter for Linux)...
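On Linux the analogous size hint would presumably be posix_fallocate(); here is a minimal sketch (the helper name is my own, not anything from 7-ZIP):
#include <fcntl.h>

/* Reserve 'size' bytes up front so ext3/ext4 can try to pick a
   contiguous extent. Returns 0 on success or an errno value.
   Assumes 'fd' was opened for writing. */
int preallocate(int fd, long long size)
{
    return posix_fallocate(fd, 0, (off_t)size);
}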
Thanks for creating the best archiving tool out there!
This highly technical point really is my only issue with 7-ZIP...
Cheers,
Marble.
- Therefore a 100MB file will initially start with a length of zero bytes and then grow incrementally in small steps.
Most programs work that way. The OS must optimize these things.
I suppose Windows doesn't write data immediately. It writes to the cache and, several seconds later, from the cache to disk.
If you think that Windows doesn't optimize this, post exact experiment results.
- Most programs work that way.
True, it is a simpler approach, but it really does lead to fragmentation. My experiments show that Windows does not merge cached write operations. On my machine (Win2k, NTFS volume with 0.5K clusters), 16 writes of 4K land in one fragment, but 32 writes of 2K land in two, increasing to 9 fragments for 512-byte pieces. SetEndOfFile(), on the other hand, makes Windows search for a free cluster chain that the file would fit into (if one exists). So, Igor, it would be great if you implemented this behavior in File_Open().
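Something along these lines could do it; this is only a sketch under my own assumptions (the WRes type and the CSzFile handle member as declared in 7zFile.h, plus a hypothetical helper name), not actual 7-Zip code:
#include <windows.h>
#include "7zFile.h"

/* Hypothetical helper to be called once the uncompressed size is
   known: seek to the final size, set EOF to allocate the clusters,
   then seek back to the start for writing. */
static WRes File_PreAllocate(CSzFile *p, UInt64 size)
{
  LARGE_INTEGER li, zero;
  li.QuadPart = (LONGLONG)size;
  zero.QuadPart = 0;
  if (!SetFilePointerEx(p->handle, li, NULL, FILE_BEGIN) ||
      !SetEndOfFile(p->handle) ||
      !SetFilePointerEx(p->handle, zero, NULL, FILE_BEGIN))
    return GetLastError();
  return 0;
}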
By the way, the bug you document in 7zFile.c is not that strict. I managed to write 123 Mbytes across the network in a single piece. 124 Mbytes failed (though the target file seemed to be written completely), but after some time even 183 Mbytes could be written without errors. I suggest starting with kChunkSize = kChunkSizeMax, halving it on error, and retrying the write with the smaller chunk; see the sketch below.
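A sketch of that retry idea (the constant's value and the function name are illustrative only, not the actual 7zFile.c code):
#include <windows.h>

#define kChunkSizeMax (1 << 22) /* illustrative value */

/* Halve the chunk size and retry when a large WriteFile fails
   (e.g. on a network share), instead of failing the whole write. */
static DWORD WriteWithRetry(HANDLE h, const BYTE *data, size_t size)
{
  size_t chunk = kChunkSizeMax;
  while (size != 0)
  {
    DWORD written = 0;
    DWORD cur = (DWORD)(size < chunk ? size : chunk);
    if (!WriteFile(h, data, cur, &written, NULL))
    {
      if (chunk <= 512)
        return GetLastError(); /* give up below a sane minimum */
      chunk /= 2;              /* decimate by 2 and retry      */
      continue;
    }
    data += written;
    size -= written;
  }
  return 0;
}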
Also, please consider the following for File_Open() (http://msdn.microsoft.com/en-us/library/aa363858(v=VS.85).aspx):
"When an application creates a file across a network, it is better to use GENERIC_READ | GENERIC_WRITE for dwDesiredAccess than to use GENERIC_WRITE alone. The resulting code is faster, because the redirector can use the cache manager and send fewer SMBs with more data. This combination also avoids an issue where writing to a file across a network can occasionally return ERROR_ACCESS_DENIED".
1) 7-zip usually writes big blocks (much larger than 512 bytes or 4K).
2) Please check any problem you see in 7-Zip on XP as well (not only Win2000). I don't want to add special optimizations for Win2000.
3) I don't see any difference between cases when we write 16 MB or 4 * 4 MB.
4) I don't understand all the details of the network case (GENERIC_READ | GENERIC_WRITE) across all systems. Maybe it's good for network files but bad for local HDD files. I don't want to make the code more complex. All these things should be optimized by Windows, if it knows a fast way to write files.
Vista Home Premium, NTFS volume, 4K clusters, 3% free space, no compression/sparse flags/etc.: 2 fragments for a one-piece 128M file, 14 fragments for 8M chunks. So it is common Windows behavior. Your 4 * 4 case was probably just good luck: here there are no free blocks of size 4M..15M. So I think it's worth adding 6 extra syscalls to File_Open(). Of course, the decision is yours.
As for the network optimizations, your way turned out to be better than Microsoft's! I observed no difference in local writes. For network files, specifying GENERIC_WRITE (on Win2k):
1) raises the 32M limit for write operations, but
2) reduces write latency.
Igor, I hope you will integrate this in the near future. Marble gave a great explanation, but I see you didn't quite understand the concept behind it.
It doesn't matter what block size you write; what matters is that you must "tell" Windows what the final file size will be before you start writing, because without that hint, even if 7-Zip writes in 1MB blocks, Windows will put each fragment wherever there is a free block of at least 1MB. I made a visual explanation, take a look at the attached picture. It's a drive map of one of my drives from Defraggler. If you unpack a big file with 7-Zip, Windows will write that file to the red blocks, so it will get fragmented. If 7-Zip told Windows what the file size will be, Windows would write that file where the blue blocks are. It's that simple :)
EDIT: It's fixed in 17.00 :)
WOW!
After 7 years...
...I might finally be able to stop using WinRAR and switch completely to 7-ZIP
Even if I found your initial response quite unthinking and a bit abrasive (which is why I never invested any further argument into this)...
...better late than never!
Much appreciated!
Cheers,
Marble.