7-Zip 18.02 benchmark

2018-03-04
2018-04-22
  • LBP

    LBP - 2018-03-09

    An older Ivy Bridge with HT enabled:
    Xeon E3-1230v2, Windows [Version 10.0.16299.248]

     
    • Igor Pavlov

      Igor Pavlov - 2018-03-09

      CMOV instruction execution speed and rate can be important for decompression speed in 18.03, especially in hyper-threading mode.
      Probably all Intel processors before Haswell/Broadwell are not so good at CMOV instructions.
      So we can see the best hyper-threading results for 18.03 decompression on newer processors: Broadwell / Skylake / Kaby Lake / Coffee Lake / Ryzen.

       
      • NoAngel

        NoAngel - 2018-03-13

        My bench results for Ryzen 5 2400g (stock clocks) for 18.03 and 18.01 64-bit.

         
      • NoAngel

        NoAngel - 2018-03-14

        Bench results for i7 8700k (Coffee Lake) (no overclock) for 18.03 vs 18.00 64-bit.

         
        • Igor Pavlov

          Igor Pavlov - 2018-03-14

          So Intel (Coffee Lake) is still 5-7% faster than Ryzen at the same frequency for single-thread LZMA decompression.
          But Ryzen is probably slightly better for multi-thread.

           

          Last edit: Igor Pavlov 2018-03-14
          • Jarred Walton

            Jarred Walton - 2018-04-20

            Hi Igor. I don't want to start a new thread on this, so I figure this is a good place to start.

            I've been using 7-zip as a CPU benchmark in my reviews for a while (eg, https://www.pcgamer.com/amd-ryzen-5-2400g-review/ ), but I started to think maybe the benchmark results are meaningless as the actual program doesn't seem to come close to the performance shown in the benchmarks. So I started trying to figure things out, but the documentation is lacking and there's all sorts of not-useful hits out there.

            What I've determined is that 7-zip with the LZMA2 compression algorithm runs into severe performance bottlenecks with larger dictionary sizes. This in turn kills the multi-threaded scaling. So for example, if I run:
            "\Program Files\7-Zip\7z.exe" a -bt test.7z "Cinebench R15"

            That reports the time taken to do the compression of the 129 folders and 2533 files, totaling 208798847 bytes. Fine. Problem is, that uses the default 7-zip compression method, which is LZMA2 but with the "normal" preset and a 16MB dictionary size. Why is that a problem? Because, AFAICT, you become completely memory bandwidth limited.

            As an example, on my i7-4702QM laptop that I'm sitting at right now, it takes 25.491 seconds to do the compression and CPU utilization is around 40 percent (give or take). The resulting archive is 57230006 bytes in size. Now if I change the command to ultra compression with a 64MB dictionary:
            "\Program Files\7-Zip\7z.exe" a -bt -mx9 test.7z "Cinebench R15"

            This takes 47.109 seconds, and CPU utilization is only 20 percent. And the ultra archive size is 55687746. Yeah, I saved about 1.6MB for the nearly twice as long compression time. Yuck. Is this a memory bandwidth bottleneck, or something else?

            My new test that I'm running uses the following:
            "C:\Program Files\7-Zip\7z.exe" a -m0=lzma2:d4m:fb16 -t7z -mx9 -slp -bt test.7z "Cinebench R15"

            Same laptop, this now takes 15.534 seconds, CPU utilization is much closer to 100 percent, and the archive is 61752571 bytes -- so now I've increased the archive size by 4.5MB while cutting the compression time in half. And actually, I'm going one step further, and using twice as many threads as the system shows (because it's a bit faster in practice), so:
            "C:\Program Files\7-Zip\7z.exe" a -m0=lzma2:d4m:fb16 -t7z -mx9 -slp -mmt16 -bt test.7z "Cinebench R15"

            This gives slightly higher CPU utilization with a time of 15.250 seconds.

            So far so good, but then I get to the decompression side. Even using fast NVMe SSD storage, decompression is at best maybe twice as fast as compression on a quad-core processor. Move to a 6-core or 8-core processor and it's even worse. Everything I'm seeing in decompression suggests that 7-zip is not using multi-threaded decompression.

            Is that because of the LZMA2 algorithm? If so, is the regular Zip algorithm better, or is there something else I can use? Ideally, when I'm testing on a chip like Ryzen 7 2700X or Core i9-7900X, I want to see real-world performance at least somewhat close to the benchmark results. Right now, here's what I get with an i7-5930K (running stock)

            Benchmark (32MB dictionary, 12 CPU threads):
            Compress: 28524 KB/s and 32568 MIPS
            Decompress: 358167 KB/s and 31875 MIPS

            Manual test of 704,420,864 bytes, 268 folders, 5267 files via command line:
            7z.exe a -m0=lzma2:d4m:fb16 -t7z -mmt12 -mx9 -slp -bt test.7z ".\CompressionTest"
            7z.exe x -mmt12 -bt -y test.7z
            Compress: 20.701 seconds = 33198 KB/s
            Decompress: 14.871 seconds = 46259 KB/s
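The KB/s figures above come from dividing the dataset size by the wall-clock time. A minimal sketch of that arithmetic, assuming 1 KB = 1024 bytes (the function name is made up for this example):

```python
def throughput_kb_per_s(total_bytes: int, seconds: float) -> float:
    """Convert a byte count and an elapsed time into KB/s (1 KB = 1024 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_bytes / 1024 / seconds


# Dataset size and extraction time quoted in the manual test above.
dataset_bytes = 704_420_864
print(f"Decompress: {throughput_kb_per_s(dataset_bytes, 14.871):.0f} KB/s")
```

The same function can be applied to the compression time to compare real-world throughput against the built-in benchmark numbers.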

            Furthermore, "7z.exe x -mmt1 -bt -y test.7z" (for single-threaded testing) shows basically the same result, confirming it's not multi-threaded decompression (or it's not working at least).

            So basically, the built-in benchmark looks cool and provides nice numbers, but I'm not convinced they're useful other than as a theoretical reference. And I'm more of a real-world kind of guy so I'm trying to find a way to make this a real-world test of compressing actual files that still scales well with multi-core.

            Cheers!

             
  • Igor Pavlov

    Igor Pavlov - 2018-04-21

    1) compression multithreading works when total size of files is much larger than dictionary size (8-100 times larger).
    2) decompression multithreading works in new 7-Zip 18.03.
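A rough way to check point 1 against the numbers in this thread is to divide the total input size by the dictionary size. The sketch below is illustrative only: the function names and the threshold of 8 (the low end of the 8-100 range above) are assumptions for this example, not 7-Zip's internal logic.

```python
MIB = 1024 * 1024


def size_to_dict_ratio(total_bytes: int, dict_bytes: int) -> float:
    """Return how many times larger the input is than the dictionary."""
    return total_bytes / dict_bytes


def likely_scales(total_bytes: int, dict_bytes: int, min_ratio: float = 8.0) -> bool:
    """Heuristic: True if the input is at least min_ratio times the dictionary size."""
    return size_to_dict_ratio(total_bytes, dict_bytes) >= min_ratio


# Dataset size quoted earlier in the thread (the Cinebench R15 folder).
cinebench_bytes = 208_798_847
for dict_mib in (4, 16, 64):
    ratio = size_to_dict_ratio(cinebench_bytes, dict_mib * MIB)
    verdict = "ok" if likely_scales(cinebench_bytes, dict_mib * MIB) else "too small"
    print(f"{dict_mib:>2} MiB dictionary: input is {ratio:.1f}x larger ({verdict})")
```

With the 64 MB dictionary the 200 MiB dataset is only about 3 times larger than the dictionary, which is consistent with the poor scaling reported for -mx9.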

     
    • Jarred Walton

      Jarred Walton - 2018-04-22

      Hi Igor, thanks for the response.

      So multithreaded decompression didn't work prior to the 18.03 beta? That confirms the single-threaded extraction, since I was using 18.01 before. With it now enabled, what sort of speedup do you expect, and will it only occur with larger archives? Does it matter what type of compression is used -- is there a preferred algorithm (eg, LZMA2) for multithreaded extraction to work best? It would appear so, as I'll illustrate below.

      Regarding compression needing total size of files larger than dictionary size, that's still not quite correct. Here are results using the command, with the "-mx?" ranging from 1 to 9 (in steps of 2), on a 4-core/8-thread i7-4702QM:
      7z.exe a -t7z -mx? -mmt16 -slp -bt test.7z "CINEBENCH R15.038_RC184115"


      -mx1:
      Add new data to archive: 129 folders, 2533 files, 208798847 bytes (200 MiB)
      Archive size: 77036286 bytes (74 MiB)
      Kernel Time = 0.296 = 6% 57433 MCycles
      User Time = 26.296 = 576%
      Process Time = 26.593 = 582% Virtual Memory = 61 MB
      Global Time = 4.563 = 100% Physical Memory = 49 MB

      -mx3:
      Archive size: 75017438 bytes (72 MiB)
      Kernel Time = 1.500 = 24% 86053 MCycles
      User Time = 38.171 = 615%
      Process Time = 39.671 = 639% Virtual Memory = 260 MB
      Global Time = 6.206 = 100% Physical Memory = 216 MB

      -mx5:
      Archive size: 57230006 bytes (55 MiB)
      Kernel Time = 0.968 = 3% 168888 MCycles
      User Time = 75.906 = 306%
      Process Time = 76.875 = 310% Virtual Memory = 886 MB
      Global Time = 24.727 = 100% Physical Memory = 716 MB

      -mx7:
      Archive size: 56189076 bytes (54 MiB)
      Kernel Time = 0.859 = 2% 171954 MCycles
      User Time = 77.937 = 210%
      Process Time = 78.796 = 213% Virtual Memory = 1170 MB
      Global Time = 36.939 = 100% Physical Memory = 870 MB

      -mx9:
      Archive size: 55687746 bytes (54 MiB)
      Kernel Time = 1.078 = 2% 175565 MCycles
      User Time = 79.078 = 168%
      Process Time = 80.156 = 170% Virtual Memory = 719 MB
      Global Time = 46.878 = 100% Physical Memory = 688 MB


      Basically, the ideal compression level for multi-core scaling appears to be around -mx3. Less than that doesn't scale quite as well, while more than that severely limits scalability. -mx5 shows "310%" CPU utilization, so about three full cores, but -mx7 is "213%" on the process time (about two cores) and -mx9 is down to 170% (~1.7 cores). Watching CPU use in Windows Task Manager confirms these are pretty accurate values.
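The percentages in these reports are the process time divided by the global (wall-clock) time. A small sketch that pulls those two values out of pasted -bt output and reports the average number of busy cores; the regular expression and function names are assumptions based only on the report layout shown above:

```python
import re

# Matches lines such as "Process Time = 26.593 = 582% Virtual Memory = 61 MB".
TIME_LINE = re.compile(r"^\s*(Kernel|User|Process|Global) Time\s*=\s*([0-9.]+)")


def parse_times(report: str) -> dict:
    """Map each timing label (Kernel, User, Process, Global) to seconds."""
    times = {}
    for line in report.splitlines():
        match = TIME_LINE.match(line)
        if match:
            times[match.group(1)] = float(match.group(2))
    return times


def effective_cores(report: str) -> float:
    """Process time divided by wall-clock time: average number of busy cores."""
    times = parse_times(report)
    return times["Process"] / times["Global"]


# The -mx1 compression report from above.
sample = """Kernel Time = 0.296 = 6% 57433 MCycles
User Time = 26.296 = 576%
Process Time = 26.593 = 582% Virtual Memory = 61 MB
Global Time = 4.563 = 100% Physical Memory = 49 MB"""
print(f"{effective_cores(sample):.2f} cores busy on average")
```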

      Decompression seems to follow a somewhat similar pattern, but with worse multi-core scaling across all compression ratios. Extracting them gives the following (all with the same command of "7z.exe x -mmt16 -slp -bt -y test.7z"):


      -mx1:
      Size: 208798847
      Compressed: 77036286
      Kernel Time = 1.437 = 58% 14255 MCycles
      User Time = 4.890 = 200%
      Process Time = 6.328 = 258% Virtual Memory = 36 MB
      Global Time = 2.443 = 100% Physical Memory = 39 MB

      -mx3:
      Size: 208798847
      Compressed: 75017438
      Kernel Time = 1.531 = 69% 13618 MCycles
      User Time = 4.875 = 220%
      Process Time = 6.406 = 289% Virtual Memory = 108 MB
      Global Time = 2.213 = 100% Physical Memory = 110 MB

      -mx5:
      Size: 208798847
      Compressed: 57230006
      Kernel Time = 1.437 = 47% 8996 MCycles
      User Time = 2.687 = 89%
      Process Time = 4.125 = 137% Virtual Memory = 216 MB
      Global Time = 2.999 = 100% Physical Memory = 218 MB

      -mx7:
      Size: 208798847
      Compressed: 56189076
      Kernel Time = 1.421 = 38% 8262 MCycles
      User Time = 2.375 = 64%
      Process Time = 3.796 = 102% Virtual Memory = 215 MB
      Global Time = 3.699 = 100% Physical Memory = 217 MB

      -mx9:
      Size: 208798847
      Compressed: 55687746

      Kernel Time = 1.359 = 34% 8263 MCycles
      User Time = 2.437 = 62%
      Process Time = 3.796 = 97% Virtual Memory = 214 MB
      Global Time = 3.902 = 100% Physical Memory = 216 MB


      So the range is from 97% (a single core) for an 'ultra' compressed archive, to as much as 289% (three cores) for a 'fast' compressed archive. I plan on testing this with multiple CPUs, ranging from as low as 2-core/4-thread models up to 18-core/36-thread models, and everything in between. Unless you have other suggestions, I'm going to stick with the -mx3 'fast' compression as the best multi-core scaling option. If you think Zip or some other format would work better, please let me know. :-)

      Cheers,
      Jarred Walton
      Senior Hardware Editor
      PCGamer.com

       
      • Jarred Walton

        Jarred Walton - 2018-04-22

        Attached is the testing on Core i7-5930K (3.7GHz pseudo-stock -- the mobo always runs all cores at max turbo). It seems the -mx2 option is even better than -mx3, in multi-core use. I have no idea what settings this corresponds to, as it's undocumented, but it clearly works and creates an in-between option that's more compressed than mx1 and less than mx3. mx4 also works but ends up being worse than mx3. This compression test is a different folder, with more data, so maybe that helps.

         
      • Igor Pavlov

        Igor Pavlov - 2018-04-22

        1) I suppose you don't need the -mmt switch.
        7-Zip selects a good number of threads by default.
        2) You need a big dataset if you want multithreading with -mx5 and above with a big dictionary size.
        3) Extraction can be limited by IO speed - it reads and writes big files.
        4) -mx4 can be better than -mx3 for compression ratio.

         
        • Jarred Walton

          Jarred Walton - 2018-04-22

          1 - I've found that doubling the number of threads (relative to the CPU thread count) gives a small but consistent boost over just using the default number of threads, so I decided to go for the 'best-case' performance.
          2 - My current dataset (used in the 5930K results) is nearly 1GB. I could go much larger, but I still see greatly reduced multi-core scaling beyond -mx3. For example, with a 3.7GiB dataset, -mx2 shows 1101% process time (excellent!), while -mx5 is 966%. Also, it takes much longer to run (69 seconds vs. 232 seconds).
          3 - Interestingly, I've found that I get notably better compress and extract performance on a good SATA SSD (Samsung 850 Pro) than on even the fastest NVMe SSDs (Intel SSD 750, Samsung 950 Pro, Samsung 960 Evo, and more). I might try an Optane 900P for kicks, though. I'm not sure why NVMe drives with substantially higher read/write performance would do worse than SATA, but it's consistent across multiple platforms.
          4 - Yes, I meant 'worse' in that it doesn't show better multi-core scaling. What settings are used on the 'even' -mx values? This page https://sevenzip.osdn.jp/chm/cmdline/switches/method.htm#7Z is the only somewhat clear documentation of some of the compression options for command line, and it only covers the odd values of 1,3,5,7,9. I assume the evens are just in between dictionary sizes, but it would be nice to have the exact values noted somewhere.

          Thanks again. I'm probably going to modify my current dataset to be close to 4GiB in size but stick with the -mx2 setting for pure speed/core scaling purposes.

           
  • Igor Pavlov

    Igor Pavlov - 2018-04-22

    -mx1 ... -mx4 - are "fast" methods.
    -mx5 ... -mx9 - are "slow" methods. They use different encoder code that doesn't load the CPU to 100%, so you can slightly increase the number of threads.

    Don't look at CPU load percentages.
    Look at elapsed time instead.
    I suppose that the default value for -mmt is good enough in most cases.
    But you can use +2 threads for -mx5 ... -mx9.
    For example, if the CPU has 8 threads, try -mmt10 for -mx5.
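The +2 suggestion can be written as a small helper for a benchmark script. This is only a sketch of the advice in this post: the function name is hypothetical, and falling back to os.cpu_count() when no thread count is given is an assumption of the example.

```python
import os


def suggested_threads(level: int, cpu_threads=None) -> int:
    """Thread count to pass to -mmt: CPU threads, plus 2 for the slow levels 5..9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    if cpu_threads is None:
        # Assumption: use the logical CPU count reported by the OS.
        cpu_threads = os.cpu_count() or 1
    return cpu_threads + 2 if level >= 5 else cpu_threads


# The example from the post: 8 CPU threads at -mx5 gives -mmt10.
print(f"-mx5 -mmt{suggested_threads(5, cpu_threads=8)}")
```

Since the advice is to judge by elapsed time, any value chosen this way is still worth comparing against the default by timing both runs.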

     

    Last edit: Igor Pavlov 2018-04-22
  • Igor Pavlov

    Igor Pavlov - 2018-04-22

    Each additional -mx level (up to -mx5) increases the dictionary size by 4 times:
    -mx1 - 64 KB
    -mx2 - 256 KB
    -mx3 - 1 MB
    -mx4 - 4 MB
    -mx5 - 16 MB
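The list above follows a simple rule: 64 KB at -mx1, multiplied by 4 for each further level up to -mx5. A minimal sketch of that rule (the function name is made up for this example, and levels above 5 are deliberately rejected because the post does not list their sizes):

```python
def default_dict_bytes(level: int) -> int:
    """Dictionary size in bytes for -mx1..-mx5, following the 4x-per-level rule."""
    if not 1 <= level <= 5:
        raise ValueError("only -mx1 through -mx5 are covered by the list")
    return 64 * 1024 * 4 ** (level - 1)


for level in range(1, 6):
    print(f"-mx{level}: {default_dict_bytes(level) // 1024} KB")
```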

     

    Last edit: Igor Pavlov 2018-04-22