
#18 Dual-Core/SMP parallelization support

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2006-03-19
Created: 2006-01-25
Creator: Anonymous
Private: No

Hi,

isn't it possible to spread the en/decrypting work over
more than one processor? I am not familiar with the
low-level consequences at the kernel level, but in theory
a block-wise cipher should be easily parallelizable. In my
opinion, this would make a lot of sense given the current
spread of dual-core PCs.

Yours,
Holger

Discussion

  • Jari Ruusu

    Jari Ruusu - 2006-01-26

    Logged In: YES
    user_id=238645

    The following applies to both 2.4 and 2.6 kernel
    versions: writes are already parallelized to some
    extent, reads are not. If the writing process is able
    to allocate an internal buffer in which to store the
    ciphertext, then encryption is done in the context of
    the writing process, regardless of which CPU it
    happens to run on. If the writing process is not able
    to allocate an internal buffer, the encryption work is
    pushed to the loop helper thread. Decryption is always
    handled by the loop helper thread. Currently there is
    only one helper thread per loop device.
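
    In rough terms, the dispatch policy is something like
    the following sketch (a simplified, hypothetical model
    of what is described above, not the actual loop driver
    code; all names are illustrative):

        #include <stdio.h>
        #include <stdbool.h>

        /* Hypothetical model of the dispatch policy described above. */
        typedef enum { READ_REQ, WRITE_REQ } req_type;

        static const char *dispatch(req_type type, bool buffer_available)
        {
            if (type == WRITE_REQ && buffer_available)
                /* Encrypt in the writing process's own context, on
                   whatever CPU that process happens to run on. */
                return "encrypt in caller context";
            /* Reads, and writes that could not get a buffer, go to the
               single per-device helper thread. */
            return "queue for the one helper thread";
        }

        int main(void)
        {
            printf("write, buffer ok : %s\n", dispatch(WRITE_REQ, true));
            printf("write, no buffer : %s\n", dispatch(WRITE_REQ, false));
            printf("read             : %s\n", dispatch(READ_REQ, false));
            return 0;
        }

    Because reads always funnel through that one helper
    thread, read-side decryption cannot use a second CPU.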

    In many cases with modern processors, the AES
    implementation, especially the AMD64-optimized
    assembler implementation, is already fast enough to
    exceed disk data transfer speed even on one processor
    core.

    Do you have a disk system that is fast enough to fully
    utilize one core of a modern processor running
    loop-AES?

     
  • Nobody/Anonymous

    Logged In: NO

    Hi jariruusu,

    I'm running loop-aes on a small server with two Opteron 244
    processors and a 6-disk hardware RAID. The plain RAID can
    deliver 60-100 MB/s during a sustained read of a 16G file
    and can write about 40 MB/s (same file size). A loop-aes
    partition on the same array reads and writes at around
    25-30 MB/s. I've rarely seen more than one busy processor
    during I/O-intensive tasks, even if I start multiple
    concurrent disk transfers.

    I'm currently running kernel 2.6.14 (x86_64) and loop-aes 3.1b.

    Yours,
    Holger

     
  • Jari Ruusu

    Jari Ruusu - 2006-01-28

    Logged In: YES
    user_id=238645

    Speed of assembler AES implementation on 1.6 GHz Opteron:
    key length 128 bits, encrypt speed 1106.6 Mbits/s
    key length 128 bits, decrypt speed 1107.0 Mbits/s
    key length 192 bits, encrypt speed 932.3 Mbits/s
    key length 192 bits, decrypt speed 933.3 Mbits/s
    key length 256 bits, encrypt speed 807.8 Mbits/s
    key length 256 bits, decrypt speed 813.7 Mbits/s

    Speed of assembler MD5 implementation on 1.6 GHz Opteron:
    md5 IV speed 2367.1 Mbits/s

    Combining the above 128 bit AES and MD5 figures, one
    1.6 GHz Opteron core should be able to handle about
    89 MB/s (see the rough calculation below). The numbers
    above are for crypto operations in userspace and do
    not include any file system, loop driver, block layer,
    or disk waiting overhead.
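
    The 89 MB/s figure follows from treating the MD5 IV
    computation and AES encryption as back-to-back steps
    on the same data; a minimal sketch of that arithmetic,
    assuming that reading and that MB here means MiB:

        #include <stdio.h>

        int main(void)
        {
            double aes_mbit_s = 1106.6;  /* 128-bit AES encrypt, from above */
            double md5_mbit_s = 2367.1;  /* MD5 IV computation, from above  */

            /* Per-bit times add, so the combined throughput is the
               harmonic combination of the two rates. */
            double combined = 1.0 / (1.0 / aes_mbit_s + 1.0 / md5_mbit_s);
            double mib_s = combined * 1e6 / 8.0 / (1024.0 * 1024.0);

            /* Prints roughly 754.1 Mbit/s = 89.9 MiB/s. */
            printf("combined: %.1f Mbit/s = %.1f MiB/s\n", combined, mib_s);
            return 0;
        }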

    If you want to improve loop-AES performance, you can
    try these optimizations (a combined example is
    sketched after the list):

    1) Try using the deadline I/O scheduler (boot with the
    elevator=deadline kernel parameter). The deadline I/O
    scheduler may reduce situations where the loop driver
    has to wait for I/O to complete on the underlying
    device.

    2) Try using the built-in loop driver by applying the
    kernel patch that is included in the loop-AES tarball.
    This should reduce TLB cache misses and improve
    performance a little bit.

    3) Try using larger page pre-allocation. For the
    module version, add an "options loop lo_prealloc=512"
    line to /etc/modprobe.conf, or alternatively, if you
    are using the built-in loop driver via the kernel
    patch, add the "lo_prealloc=512" kernel parameter.
    Larger pre-allocation may reduce situations where the
    loop driver has to wait for I/O to complete on the
    underlying device.

    4) Try using 128 bit AES keys instead of 256 bit. 128
    bit keys use a smaller number of rounds and are a
    little bit faster.
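
    For example, combining 1) and 3) in the built-in
    driver case, the kernel boot command line could look
    something like this (illustrative only; the root=
    entry and any other parameters depend on the system):

        root=/dev/sda1 ro elevator=deadline lo_prealloc=512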

     
  • Jari Ruusu

    Jari Ruusu - 2006-03-19
    • status: open --> closed
     
  • Jari Ruusu

    Jari Ruusu - 2006-03-19

    Logged In: YES
    user_id=238645

    Adding support for multiple worker threads would mean
    rewriting much of the loop code; the nasty problem is
    getting barrier ordering right. As of this writing, I
    do not plan to make such changes myself. If someone
    sends a patch for that, I will seriously consider
    merging it. But for now... I am closing this feature
    request.