Hello,
From http://j-o.users.sourceforge.net/download/7-zip/ you can download a patch that adds hugetlbfs support to p7zip. As with large pages on Windows, this gives a nice speedup for memory-intensive operations. Testing on x86_64 showed that run-time savings of up to 18% are possible (e.g. x=3 d=27 fb=273).
To apply my patch to p7zip, you additionally need Alloc.c from the original 7-Zip sources. I decided to move the entire platform handling into a single file to simplify this and future ports (FreeBSD might be a good candidate). Here is how everything fits together:
~$ cd p7zip_9.04
~/p7zip_9.04$ sed 's/\r$//' ../7-zip/C/Alloc.c >C/Alloc.c
~/p7zip_9.04$ zcat ../p7zip_9.04-linux_huge_pages.diff.gz | patch -p0
~/p7zip_9.04$ make OPTFLAGS=-O2
Using huge pages in Linux requires some preparation. First, make sure your running kernel has hugetlbfs support compiled in:
~$ grep hugetlbfs /proc/filesystems
nodev hugetlbfs
You can view your current huge page configuration like this:
~$ grep Huge /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
In this case the size of a huge page is 2 MiB. So if you have 2 GiB of RAM and want to reserve 512 MiB for huge pages, you need 256 pages (512 MiB / 2 MiB = 256). Do the following as root:
~# echo 256 >/proc/sys/vm/nr_hugepages
~# grep Huge /proc/meminfo
HugePages_Total: 256
HugePages_Free: 256
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Finally, make access from user space possible:
~# mkdir /dev/hugepages
~# mount -t hugetlbfs -o rw,nosuid,nodev,noexec,noatime none /dev/hugepages
~# chmod 1777 /dev/hugepages
Now huge pages are configured. In your shell, set the environment variable HUGETLB_PATH to the mount point:
~/p7zip_9.04$ export HUGETLB_PATH=/dev/hugepages
To enable huge page use in p7zip, pass the '-slp' switch to it.
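For the curious: the classic way to get huge-page-backed memory from user space is to create a file in the hugetlbfs mount and mmap it. Here is a minimal sketch of that pattern in C - illustrative only, with error handling trimmed, and not the literal code from my patch:

/* Minimal sketch of the classic hugetlbfs allocation pattern.
 * Illustrative only - not the literal code from the patch. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void *huge_alloc(size_t size)
{
    /* size must be rounded up to a multiple of Hugepagesize */
    const char *dir = getenv("HUGETLB_PATH");    /* e.g. /dev/hugepages */
    char path[4096];
    int fd;
    void *p;

    if (dir == NULL)
        return NULL;                 /* no mount point configured */
    snprintf(path, sizeof path, "%s/p7zip.XXXXXX", dir);
    fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                    /* file disappears once fd is closed */
    if (ftruncate(fd, size) != 0) {  /* set the (huge-page aligned) size */
        close(fd);
        return NULL;
    }
    /* fails with ENOMEM if not enough huge pages are reserved */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid */
    return p == MAP_FAILED ? NULL : p;
}

A real allocator would fall back to plain malloc when any of these steps fails, so -slp can degrade gracefully if no huge pages are available.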
Just in case someone is interested - here are the benchmark results from a PowerPC 970FX system:
~/p7zip_9.04$ grep Huge /proc/meminfo
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB
~/p7zip_9.04$ bin/7za b
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 419 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 690 100 673 671 | 15368 100 1388 1387
23: 708 100 723 722 | 14987 100 1374 1372
24: 681 100 733 732 | 14547 100 1350 1350
25: 663 100 758 757 | 14095 100 1327 1325
-------------------------------------------
Avr: 100 722 720 100 1360 1358
Tot: 100 1041 1039
~/p7zip_9.04$ bin/7za b -slp
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 419 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 896 100 875 872 | 15377 100 1391 1388
23: 952 100 971 970 | 14994 100 1374 1372
24: 938 100 1011 1009 | 14546 100 1350 1349
25: 927 100 1060 1058 | 14113 100 1331 1327
-------------------------------------------
Avr: 100 979 977 100 1361 1359
Tot: 100 1170 1168
As you can see, using huge pages can give a nice performance improvement.
Best regards,
Joachim Henke
What are the CPU frequency / L2 cache size / RAM speed of that PowerPC CPU?
As I remember, PowerPC uses a hash function to translate virtual addresses to physical addresses. So PowerPC can provide good speed with small pages only when it can fit the whole translation table into the L2 cache. But most PowerPCs contain only a small L2 cache. So you get a good gain from -slp.
Also, you can try the -mmt2 switch in the benchmark.
It can provide faster compression with small pages. -mmt splits the task into two parts, and each part can work faster, since it uses a smaller memory block and the TLB hit rate is higher.
x86 doesn't use a hash for translation. For example, 1 MB of L2 can provide good speed for a block of 256 MB of RAM, which requires 512 KB for translation tables (256 MB / 4 KB pages = 65536 entries of 8 bytes each). That is why the -slp gain on x86 is not big. But you still get a good gain on some CPUs, like the Pentium 4 (about 15%), or on CPUs with a very small L2 cache, like Celerons.
BTW, Windows 7 and Windows Server 2008 R2 provide faster allocation of large pages than previous versions of Windows. So it's probably OK to enable "large pages" mode in 7-Zip there. Previous versions of Windows could hang your system for 10-15 seconds during large-page allocation. That was not good.
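To make the latency argument concrete: what a tool like MemLat measures is essentially the time per dependent load while chasing pointers through a randomly permuted cycle. Below is a minimal sketch of that technique in C - an illustration only, not code from MemLat:

/* Pointer-chasing sketch: walk a randomly permuted cycle so that each
 * load depends on the previous one. The average time per step is the
 * load-to-use latency for the chosen working set; with small pages,
 * TLB misses (and, on PowerPC, hash-table lookups) are included in
 * that number. Illustration only - not code from MemLat.
 * Build: gcc -O2 chase.c -o chase -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = (256u << 20) / sizeof(void *);   /* 256 MiB working set */
    void **a = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));
    struct timespec t0, t1;
    void **p;
    double ns;
    size_t i;

    /* random permutation (Fisher-Yates) defining one big cycle */
    for (i = 0; i < n; i++)
        idx[i] = i;
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t)((double)rand() / ((double)RAND_MAX + 1) * (i + 1));
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (i = 0; i < n; i++)
        a[idx[i]] = &a[idx[(i + 1) % n]];

    /* chase the pointers and measure the time per dependent load */
    p = &a[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per dependent load (end=%p)\n", ns / n, (void *)p);
    free(a);
    free(idx);
    return 0;
}

With a working set far larger than what the TLB covers, the time per load is dominated by address translation overhead - which is exactly what large pages reduce.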
cpu: PPC970FX, altivec supported
clock: 1800 MHz
L2 cache: 512K unified
memory: PC3200, unbuffered, 400 MHz DDR SDRAM
The benchmarks above were not done correctly, as I forgot to disable CPU frequency scaling. Here are the new results:
~/p7zip_9.04$ bin/7za b
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 419 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 746 100 726 726 | 15420 100 1394 1392
23: 703 100 717 716 | 14997 100 1374 1373
24: 678 100 730 729 | 14556 100 1348 1350
25: 661 100 755 754 | 14104 100 1329 1326
-------------------------------------------
Avr: 100 732 731 100 1361 1360
Tot: 100 1047 1046
~/p7zip_9.04$ bin/7za b -mmt2
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 425 MB, # Benchmark threads: 2
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 773 100 752 752 | 15144 100 1370 1367
23: 712 100 727 725 | 14706 100 1347 1346
24: 672 100 724 723 | 14302 100 1328 1327
25: 654 100 747 746 | 13895 100 1307 1307
-------------------------------------------
Avr: 100 737 736 100 1338 1337
Tot: 100 1038 1037
~/p7zip_9.04$ bin/7za b -slp
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 419 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 969 100 943 943 | 15426 99 1400 1392
23: 955 100 974 973 | 14987 100 1371 1372
24: 942 100 1014 1012 | 14546 100 1350 1349
25: 930 100 1063 1061 | 14099 100 1327 1326
-------------------------------------------
Avr: 100 999 998 100 1362 1360
Tot: 100 1180 1179
To show the relevance of my patch even on Intel processors, here are benchmark results I got on Linux x86_64 (Core 2 Duo T7500 2.20 GHz, 32k L1, 4096k L2; DDR2 667 MHz):
~/p7zip_9.04-jo$ bin/7za b -md27 -mmt1
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
RAM size: 2000 MB, # CPU hardware threads: 2
RAM usage: 1539 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 2295 100 2235 2233 | 24071 100 2171 2173
23: 2193 100 2231 2234 | 23750 100 2180 2174
24: 2108 100 2267 2266 | 23421 100 2174 2173
25: 2022 100 2309 2308 | 23010 100 2164 2164
26: 1734 100 2113 2113 | 22613 100 2155 2154
27: 1616 100 2111 2110 | 22304 100 2152 2152
-------------------------------------------
Avr: 100 2211 2211 100 2166 2165
Tot: 100 2188 2188
~/p7zip_9.04-jo$ bin/7za b -md27 -mmt1 -slp
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
RAM size: 2000 MB, # CPU hardware threads: 2
RAM usage: 1539 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
22: 2423 100 2366 2357 | 24061 100 2171 2172
23: 2334 100 2382 2378 | 23730 100 2173 2172
24: 2267 100 2439 2438 | 23371 100 2168 2168
25: 2210 100 2522 2523 | 22997 100 2164 2163
26: 1974 100 2406 2406 | 22638 100 2159 2157
27: 1930 100 2520 2520 | 22320 100 2154 2154
-------------------------------------------
Avr: 100 2439 2437 100 2165 2164
Tot: 100 2302 2301
I don't like the poor results of the PPC970FX.
Please also check
7z b -md21
Probably the long pipeline of the PPC970FX (and the big misprediction penalty) is the reason for the bad decompression performance.
Also try compressing some really big file (for example, a .tar of the Linux source code) with the -mx switch and compare the time with / without -slp.
You need about 700 MB for large pages (-mx selects a 64 MB dictionary, and LZMA compression needs roughly 10.5 times the dictionary size).
Here are the requested benchmark results for PPC970FX:
~/p7zip_9.04$ bin/7za b -md21
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
RAM size: 1965 MB, # CPU hardware threads: 1
RAM usage: 29 MB, # Benchmark threads: 1
Dict Compressing | Decompressing
Speed Usage R/U Rating | Speed Usage R/U Rating
KB/s % MIPS MIPS | KB/s % MIPS MIPS
18: 1128 99 1018 1005 | 16657 100 1419 1420
19: 1031 100 921 924 | 16437 100 1419 1419
20: 922 100 840 840 | 16218 100 1418 1418
21: 820 100 766 769 | 15846 100 1409 1410
-------------------------------------------
Avr: 100 886 884 100 1416 1417
Tot: 100 1151 1150
Compressing a cached file to memory:
~/p7zip_9.04$ time bin/7za a -mx /dev/shm/t.7z linux-2.6.31.tar
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
Scanning
Creating archive /dev/shm/t.7z
Compressing linux-2.6.31.tar
Everything is Ok
real 16m48.090s
user 16m38.198s
sys 0m9.700s
The same with huge pages enabled:
~/p7zip_9.04$ time bin/7za a -mx -slp /dev/shm/t.7z linux-2.6.31.tar
7-Zip (A) 9.04 beta Copyright (C) 1999-2009 Igor Pavlov 2009-05-30
p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
Scanning
Creating archive /dev/shm/t.7z
Compressing linux-2.6.31.tar
Everything is Ok
real 11m48.874s
user 11m42.048s
sys 0m6.677s
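That cuts the run time from roughly 1008 to 709 seconds - a saving of about 30%, even more than I measured on x86_64.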
If there is such a big gain from large pages on PowerPC, why was it not enabled in Linux before?
Also, I have a memory benchmark program:
http://www.7-cpu.com/
If you can port/run it on PowerPC, we can see exact numbers for memory latency with/without large pages. If a CPU tick counter is not available on PowerPC, it must be called in ns mode:
memlat 1 n
-> If there is such a big gain from large pages on PowerPC, why was it not enabled in Linux before?
Do you mean in Linux generally or in the Linux version of p7zip?
First I did a port of memlat to Linux x86 and verified that it works correctly by comparing the results to those from MemLat32.exe on Windows XP. Then I added support for PowerPC. If you are interested, you can download the modified sources and the patch from http://j-o.users.sourceforge.net/download/7-zip/ - you might like to include it in the next official release.
On the PPC970FX the CPU tick counter doesn't run at the full CPU speed, but you still get an idea:
~/7bench/CPP/Utils/CPUTest/MemLat$ ./memlat 1 n
MemLat 9.02 : Igor Pavlov : Public domain : 2009-08-25
Size 1 2 3 4 5
4-K 2.77 1.38 0.92 0.69 0.58
5-K 2.77 1.38 0.92 0.69 0.57
6-K 2.77 1.38 0.92 0.69 0.55
7-K 2.77 1.38 0.92 0.69 0.55
8-K 2.77 1.38 0.92 0.69 0.58
10-K 2.77 1.38 0.92 0.69 0.57
12-K 2.77 1.38 0.92 0.69 0.57
14-K 2.77 1.38 0.92 0.69 0.58
16-K 2.77 1.38 0.92 0.69 0.58
20-K 2.77 1.38 0.92 0.69 0.57
24-K 2.77 1.39 0.92 0.69 0.55
28-K 2.77 1.39 0.92 0.69 0.57
32-K 2.77 1.39 0.92 0.69 0.58
40-K 3.86 2.06 1.52 1.26 1.12
48-K 4.50 2.48 1.84 1.54 1.34
56-K 5.04 2.73 2.04 1.70 1.51
64-K 5.34 2.93 2.23 1.82 1.61
80-K 5.86 3.20 2.40 1.98 1.75
96-K 6.17 3.37 2.51 2.08 1.83
112-K 6.42 3.47 2.58 2.13 1.90
128-K 6.60 3.55 2.62 2.18 1.95
160-K 6.82 3.66 2.67 2.25 2.02
192-K 7.02 3.73 2.71 2.29 2.07
224-K 7.12 3.77 2.73 2.32 2.10
256-K 7.20 3.81 2.74 2.35 2.13
320-K 7.32 3.86 2.75 2.38 2.16
384-K 14.02 8.89 7.74 7.15 6.71
448-K 26.49 16.99 14.31 13.12 12.37
512-K 43.66 26.92 21.82 19.82 18.24
640-K 62.75 36.83 29.52 26.36 24.16
768-K 81.87 46.31 37.44 33.21 31.00
896-K 101.68 55.66 46.28 41.39 38.74
1024-K 116.51 63.20 53.78 49.27 45.73
BW- 32 B 274 506 594 649 699
BW- 64 B 549 1012 1189 1299 1399
BW-128 B 1098 2025 2379 2598 2799
Cache latency = 2.77 ns = 0.09 cycles
Memory latency = 116.51 ns = 3.90 cycles
ns PageSize=4
Timer frequency = 1000000 Hz
CPU frequency = 33.45 MHz
I hope to find some spare time to add huge pages support to memlat in the next few days.
Thanks!
I mean large pages support in Linux for any software (maybe even by default, when the software requests a big amount of memory with malloc).
If you are able to run any tests from memlat, try to call the same tests as in test.bat (or test64.bat).
Probably it's better to post messages and results about MemLat to http://sourceforge.net/projects/sevenmax/forums/forum/399008
Currently, handling huge pages in Linux is somewhat cumbersome, and they are used almost exclusively by enterprise software like database systems.
With the upcoming kernel 2.6.32, allocating memory backed by huge pages will be just as easy as doing an mmap (from the application side) - but the administrator still has to pre-configure how much memory is reserved for huge pages. (A sketch of that new interface follows at the end of this post.)
Well, you can already make use of an "over-commit" feature (/proc/sys/vm/nr_overcommit_hugepages), which lets the kernel automatically reserve more huge pages on demand. Of course, this works only if physical memory has not become too fragmented - especially on PowerPC, where a huge page is 16384k (at least on the PPC970)! One day a memory defragmenter (or even "compressor") might be added to Linux to overcome these issues.
Yes, having the whole user-space side handled by malloc would be perfect. But I wouldn't expect that too soon. However, interested users can get around this today by pre-loading a library like libhugetlbfs.
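For reference, here is roughly what the application side could look like once 2.6.32 is out - this sketch assumes the MAP_HUGETLB mmap flag that is slated for that release, so treat it as untested against the final kernel:

/* Sketch of anonymous huge-page allocation via mmap, assuming the
 * MAP_HUGETLB flag expected in kernel 2.6.32. No hugetlbfs file is
 * needed, but the administrator must still reserve pages beforehand. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000   /* value on x86; not yet in older headers */
#endif

int main(void)
{
    size_t size = 512 << 20;  /* 512 MiB; a multiple of the huge page size */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");       /* e.g. ENOMEM if too few pages are reserved */
        return 1;
    }
    /* ... use the memory ... */
    munmap(p, size);
    return 0;
}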