[PATCH] Linux huge pages support

2009-10-25
2013-05-28
  • Joachim Henke

    Joachim Henke - 2009-10-25

    Hello,

    from http://j-o.users.sourceforge.net/download/7-zip/ you can download a patch that adds support for hugetlbfs to p7zip. As with large pages on Windows, this gives a nice speedup for memory-intensive operations. Testing on x86_64 showed that run-time savings of up to 18% are possible (e.g. x=3 d=27 fb=273).

    To apply my patch to p7zip, you additionally need Alloc.c from the original 7-Zip sources. I decided to move the entire platform handling into a single file to simplify this and further ports (FreeBSD might be a good candidate). Here is how everything fits together:

        ~$ cd p7zip_9.04
        ~/p7zip_9.04$ sed 's/\r$//' ../7-zip/C/Alloc.c >C/Alloc.c
        ~/p7zip_9.04$ zcat ../p7zip_9.04-linux_huge_pages.diff.gz | patch -p0
        ~/p7zip_9.04$ make OPTFLAGS=-O2

    Using huge pages on Linux requires some preparation. First, make sure your running kernel has hugetlbfs support compiled in:

        ~$ grep hugetlbfs /proc/filesystems
        nodev hugetlbfs

    You can view your current huge page configuration like this:

        ~$ grep Huge /proc/meminfo
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB

    In this case the size of a huge page is 2 MiB. So, if you have 2 GiB of RAM and want to reserve 512 MiB for huge pages, you would need 256 pages. Do the following as root:

        ~# echo 256 >/proc/sys/vm/nr_hugepages
        ~# grep Huge /proc/meminfo
        HugePages_Total:     256
        HugePages_Free:      256
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB

    Finally, make access from user space possible:

        ~# mkdir /dev/hugepages
        ~# mount -t hugetlbfs -o rw,nosuid,nodev,noexec,noatime none /dev/hugepages
        ~# chmod 1777 /dev/hugepages

    Now huge pages are configured. In your shell, set the environment variable HUGETLB_PATH to the mount point:

        ~/p7zip_9.04$ export HUGETLB_PATH=/dev/hugepages

    To enable huge page use in p7zip, pass the '-slp' switch to it.

    In case someone is interested - here are the benchmark results from a PowerPC 970FX system:

        ~/p7zip_9.04$ grep Huge /proc/meminfo
        HugePages_Total:      32
        HugePages_Free:       32
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:      16384 kB
        ~/p7zip_9.04$ bin/7za b
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:    419 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:     690   100    673    671  |    15368   100   1388   1387
        23:     708   100    723    722  |    14987   100   1374   1372
        24:     681   100    733    732  |    14547   100   1350   1350
        25:     663   100    758    757  |    14095   100   1327   1325
        -------------------------------------------
        Avr:          100    722    720               100   1360   1358
        Tot:          100   1041   1039
        ~/p7zip_9.04$ bin/7za b -slp
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:    419 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:     896   100    875    872  |    15377   100   1391   1388
        23:     952   100    971    970  |    14994   100   1374   1372
        24:     938   100   1011   1009  |    14546   100   1350   1349
        25:     927   100   1060   1058  |    14113   100   1331   1327
        -------------------------------------------
        Avr:          100    979    977               100   1361   1359
        Tot:          100   1170   1168

    As you can see, using huge pages can give a nice performance improvement.

    Best regards,

    Joachim Henke

     
  • Igor Pavlov

    Igor Pavlov - 2009-10-25

    What is the CPU frequency / L2 cache size / RAM speed of that PowerPC CPU?

    As I remember, PowerPC uses a hash function to translate virtual addresses to physical addresses. So PowerPC can provide good speed with small pages only when it can place the whole translation table in the L2 cache. But most PowerPCs contain only a small L2 cache. So you get a good gain from -slp.

    Also, you can try the -mmt2 switch in the benchmark.
    It can provide faster compression with small pages. -mmt splits the task into two parts, and each part can work faster, since it uses a smaller memory block and the TLB hit rate is higher.

    x86 doesn't use a hash for translation. For example, 1 MB of L2 can provide good speed for a block of 256 MB of RAM (which requires 512 KB for translation tables). That is why the -slp gain on x86 is not big. But you still get a good gain on some CPUs like the Pentium 4 (about 15%) or on CPUs with a very small L2 cache, like Celerons.

    BTW, Windows 7 and Windows Server 2008 R2 provide faster allocation of large pages than previous versions of Windows. So it's probably OK to enable "large pages" mode in 7-Zip now. Previous versions of Windows could hang your system for 10-15 seconds during large page allocation. That was not good.

     
  • Joachim Henke

    Joachim Henke - 2009-10-25

        cpu: PPC970FX, altivec supported
        clock: 1800 MHz
        L2 cache: 512K unified
        memory: PC3200, unbuffered, 400 MHz DDR SDRAM

    The benchmarks above were not done correctly, as I had forgotten to disable CPU frequency scaling. Here are the new results:

        ~/p7zip_9.04$ bin/7za b
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:    419 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:     746   100    726    726  |    15420   100   1394   1392
        23:     703   100    717    716  |    14997   100   1374   1373
        24:     678   100    730    729  |    14556   100   1348   1350
        25:     661   100    755    754  |    14104   100   1329   1326
        -------------------------------------------
        Avr:          100    732    731               100   1361   1360
        Tot:          100   1047   1046
       
       
        ~/p7zip_9.04$ bin/7za b -mmt2
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:    425 MB,  # Benchmark threads:      2
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:     773   100    752    752  |    15144   100   1370   1367
        23:     712   100    727    725  |    14706   100   1347   1346
        24:     672   100    724    723  |    14302   100   1328   1327
        25:     654   100    747    746  |    13895   100   1307   1307
        -------------------------------------------
        Avr:          100    737    736               100   1338   1337
        Tot:          100   1038   1037
       
       
        ~/p7zip_9.04$ bin/7za b -slp
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:    419 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:     969   100    943    943  |    15426    99   1400   1392
        23:     955   100    974    973  |    14987   100   1371   1372
        24:     942   100   1014   1012  |    14546   100   1350   1349
        25:     930   100   1063   1061  |    14099   100   1327   1326
        -------------------------------------------
        Avr:          100    999    998               100   1362   1360
        Tot:          100   1180   1179

     
  • Joachim Henke

    Joachim Henke - 2009-10-26

    To show that my patch is relevant even on Intel processors, here are benchmark results I got with Linux x86_64 (Core 2 Duo T7500 2.20 GHz, 32k L1, 4096k L2; DDR2 667 MHz):

        ~/p7zip_9.04-jo$ bin/7za b -md27 -mmt1
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
       
        RAM size:    2000 MB,  # CPU hardware threads:   2
        RAM usage:   1539 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:    2295   100   2235   2233  |    24071   100   2171   2173
        23:    2193   100   2231   2234  |    23750   100   2180   2174
        24:    2108   100   2267   2266  |    23421   100   2174   2173
        25:    2022   100   2309   2308  |    23010   100   2164   2164
        26:    1734   100   2113   2113  |    22613   100   2155   2154
        27:    1616   100   2111   2110  |    22304   100   2152   2152
        -------------------------------------------
        Avr:          100   2211   2211               100   2166   2165
        Tot:          100   2188   2188
       
       
        ~/p7zip_9.04-jo$ bin/7za b -md27 -mmt1 -slp
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
       
        RAM size:    2000 MB,  # CPU hardware threads:   2
        RAM usage:   1539 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        22:    2423   100   2366   2357  |    24061   100   2171   2172
        23:    2334   100   2382   2378  |    23730   100   2173   2172
        24:    2267   100   2439   2438  |    23371   100   2168   2168
        25:    2210   100   2522   2523  |    22997   100   2164   2163
        26:    1974   100   2406   2406  |    22638   100   2159   2157
        27:    1930   100   2520   2520  |    22320   100   2154   2154
        -------------------------------------------
        Avr:          100   2439   2437               100   2165   2164
        Tot:          100   2302   2301

     
  • Igor Pavlov

    Igor Pavlov - 2009-10-27

    I don't like the poor results of the PPC970FX.
    Please also check

    7z b -md21

    Probably the long pipeline of the PPC970FX (and its big misprediction penalty) is the reason for the bad decompression performance.

     
  • Igor Pavlov

    Igor Pavlov - 2009-10-27

    Also try compressing some really big file (for example, a .tar of the Linux source code) with the -mx switch and compare the time with and without -slp.
    You need about 700 MB of large pages.

     
  • Joachim Henke

    Joachim Henke - 2009-10-28

    Here are the requested benchmark results for PPC970FX:

        ~/p7zip_9.04$ bin/7za b -md21
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
       
        RAM size:    1965 MB,  # CPU hardware threads:   1
        RAM usage:     29 MB,  # Benchmark threads:      1
       
        Dict        Compressing          |        Decompressing
              Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
               KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS
       
        18:    1128    99   1018   1005  |    16657   100   1419   1420
        19:    1031   100    921    924  |    16437   100   1419   1419
        20:     922   100    840    840  |    16218   100   1418   1418
        21:     820   100    766    769  |    15846   100   1409   1410
        -------------------------------------------
        Avr:          100    886    884               100   1416   1417
        Tot:          100   1151   1150

    Compressing a cached file to memory:

        ~/p7zip_9.04$ time bin/7za a -mx /dev/shm/t.7z linux-2.6.31.tar
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
        Scanning
       
        Creating archive /dev/shm/t.7z
       
        Compressing  linux-2.6.31.tar     
       
        Everything is Ok
       
        real 16m48.090s
        user 16m38.198s
        sys 0m9.700s

    The same with huge pages enabled:

        ~/p7zip_9.04$ time bin/7za a -mx -slp /dev/shm/t.7z linux-2.6.31.tar
       
        7-Zip (A) 9.04 beta  Copyright (C) 1999-2009 Igor Pavlov  2009-05-30
        p7zip Version 9.04 (locale=C,Utf16=off,HugeFiles=on,1 CPU)
        Scanning
       
        Creating archive /dev/shm/t.7z
       
        Compressing  linux-2.6.31.tar     
       
        Everything is Ok
       
        real 11m48.874s
        user 11m42.048s
        sys 0m6.677s

     
  • Igor Pavlov

    Igor Pavlov - 2009-10-29

    If there is such a big gain from large pages on PowerPC, why was it not enabled in Linux before?

    Also I have memory benchmark program:

    http://www.7-cpu.com/

    If you can port/run it on PowerPC, we can see exact numbers for memory latency with and without large pages. If the CPU tick counter is not available on PowerPC, it must be called in ns mode:
    memlat 1 n

     
  • Joachim Henke

    Joachim Henke - 2009-10-29

    -> If there is such big gain on PowerPC from large pages, why it was not enabled in Linux before?

    Do you mean in Linux generally or in the Linux version of p7zip?

    First I did a port of memlat to Linux x86 and verified that it works correctly by comparing the results to those from MemLat32.exe on Windows XP. Then I added support for PowerPC. If you are interested, you can download the modified sources and the patch from http://j-o.users.sourceforge.net/download/7-zip/ - you might like to include it in the next official release.

    On the PPC970FX the CPU tick counter doesn't run at full CPU speed, but you still get an idea:

        ~/7bench/CPP/Utils/CPUTest/MemLat$ ./memlat 1 n
        MemLat 9.02 : Igor Pavlov : Public domain : 2009-08-25
            Size      1      2      3      4      5
       
             4-K   2.77   1.38   0.92   0.69   0.58
             5-K   2.77   1.38   0.92   0.69   0.57
             6-K   2.77   1.38   0.92   0.69   0.55
             7-K   2.77   1.38   0.92   0.69   0.55
             8-K   2.77   1.38   0.92   0.69   0.58
            10-K   2.77   1.38   0.92   0.69   0.57
            12-K   2.77   1.38   0.92   0.69   0.57
            14-K   2.77   1.38   0.92   0.69   0.58
            16-K   2.77   1.38   0.92   0.69   0.58
            20-K   2.77   1.38   0.92   0.69   0.57
            24-K   2.77   1.39   0.92   0.69   0.55
            28-K   2.77   1.39   0.92   0.69   0.57
            32-K   2.77   1.39   0.92   0.69   0.58
            40-K   3.86   2.06   1.52   1.26   1.12
            48-K   4.50   2.48   1.84   1.54   1.34
            56-K   5.04   2.73   2.04   1.70   1.51
            64-K   5.34   2.93   2.23   1.82   1.61
            80-K   5.86   3.20   2.40   1.98   1.75
            96-K   6.17   3.37   2.51   2.08   1.83
           112-K   6.42   3.47   2.58   2.13   1.90
           128-K   6.60   3.55   2.62   2.18   1.95
           160-K   6.82   3.66   2.67   2.25   2.02
           192-K   7.02   3.73   2.71   2.29   2.07
           224-K   7.12   3.77   2.73   2.32   2.10
           256-K   7.20   3.81   2.74   2.35   2.13
           320-K   7.32   3.86   2.75   2.38   2.16
           384-K  14.02   8.89   7.74   7.15   6.71
           448-K  26.49  16.99  14.31  13.12  12.37
           512-K  43.66  26.92  21.82  19.82  18.24
           640-K  62.75  36.83  29.52  26.36  24.16
           768-K  81.87  46.31  37.44  33.21  31.00
           896-K 101.68  55.66  46.28  41.39  38.74
          1024-K 116.51  63.20  53.78  49.27  45.73
       
        BW- 32 B    274    506    594    649    699
        BW- 64 B    549   1012   1189   1299   1399
        BW-128 B   1098   2025   2379   2598   2799
       
        Cache latency    =       2.77 ns     =       0.09 cycles
        Memory latency   =     116.51 ns     =       3.90 cycles
        ns PageSize=4
       
       
        Timer frequency  =    1000000 Hz
        CPU frequency    =      33.45 MHz

    I hope to find some spare time in the next days to add huge pages support to memlat.
     
  • Igor Pavlov

    Igor Pavlov - 2009-10-30

    Thanks!

    I mean large pages support in Linux for any software (maybe even by default, when software requests a big amount of memory with malloc).

    If you are able to run any tests from memlat, try to call the same tests as in test.bat (or test64.bat).

    Probably it's better to post messages and results about MemLat to http://sourceforge.net/projects/sevenmax/forums/forum/399008

     
  • Joachim Henke

    Joachim Henke - 2009-10-30

    Currently, handling huge pages in Linux is somewhat cumbersome, and they are used almost exclusively by enterprise software like database systems.

    With the upcoming kernel 2.6.32, allocating memory on huge pages will be just as easy as doing an mmap (from the application side) - but the administrator still has to pre-configure how much memory is reserved for huge pages.

    Actually, you can already make use of an "over-commit" feature, which lets the kernel automatically reserve more huge pages on demand. Of course, this only works if physical memory has not become too fragmented - especially on PowerPC, where a huge page is 16384k (at least on the PPC970)! One day a memory defragmenter (or even "compressor") might be added to Linux to overcome these issues.

    Yes, having the whole user-space side handled by malloc would be perfect. But I wouldn't expect that too soon. However, interested users can work around it by pre-loading a library like libhugetlbfs.

     
