Dual-Core/SMP parallelization support
Brought to you by:
jariruusu
Hi,
Isn't it possible to spread the encryption/decryption work
across more than one processor? I am not familiar with the
low-level consequences at the kernel level, but in theory a
block cipher should be easy to parallelize. In my opinion,
this would make a lot of sense given how widespread
dual-core PCs are becoming.
Yours,
Holger
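Here is a minimal sketch of why this should be possible in principle. It assumes that each 512-byte sector gets its own IV and is encrypted independently of its neighbours (an assumption of this sketch). The function `toy_encrypt_sector` is a hypothetical, insecure stand-in built from `hashlib.sha256`, not AES. The only point is that splitting the sectors across workers gives exactly the same output as processing them one by one. Python threads are used for brevity. Because of the GIL, they would not actually speed up this pure-Python work.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

SECTOR_SIZE = 512  # assumption: each 512-byte sector has its own IV


def toy_encrypt_sector(key: bytes, sector_no: int, data: bytes) -> bytes:
    """Hypothetical, insecure stand-in for a per-sector cipher (NOT AES).

    The keystream depends only on the key and the sector number, so a
    sector can be processed without looking at any other sector.
    XOR is its own inverse, so the same function also "decrypts".
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        seed = key + sector_no.to_bytes(8, "little") + counter.to_bytes(4, "little")
        stream += hashlib.sha256(seed).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def split_sectors(data: bytes):
    """Yield (sector number, sector bytes) pairs."""
    for sector_no, offset in enumerate(range(0, len(data), SECTOR_SIZE)):
        yield sector_no, data[offset:offset + SECTOR_SIZE]


def encrypt_serial(key: bytes, data: bytes) -> bytes:
    return b"".join(toy_encrypt_sector(key, n, s) for n, s in split_sectors(data))


def encrypt_parallel(key: bytes, data: bytes, workers: int = 2) -> bytes:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, so sectors are
        # reassembled in the right place no matter which worker ran them.
        parts = pool.map(lambda item: toy_encrypt_sector(key, *item), split_sectors(data))
        return b"".join(parts)


if __name__ == "__main__":
    key = b"example key 0123"
    data = bytes(range(256)) * 16  # 4096 bytes = 8 sectors
    ciphertext = encrypt_parallel(key, data, workers=4)
    assert ciphertext == encrypt_serial(key, data)
    assert encrypt_serial(key, ciphertext) == data
    print("parallel and serial results match")
```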
Logged In: YES
user_id=238645
The following applies to 2.4 and 2.6 kernels:
Writes are already parallelized to some extent; reads
are not. If the writing process is able to allocate an
internal buffer in which to store the ciphertext, then
encryption is done in the context of the writing process,
on whatever CPU it happens to be running on. If the
writing process cannot allocate an internal buffer, the
encryption work is pushed to the loop helper thread.
Decryption is always handled by the loop helper thread.
Currently there is only one helper thread per loop
device.
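The dispatch rules above can be modeled in user space as a rough sketch. Every name here is hypothetical (`LoopDeviceModel`, `buffers_available`, the thread name "loop-helper"), and the model is far simpler than the real loop driver. The buffer pool is just a counter that is never refilled. The "crypto work" is only recorded in a log, together with the name of the thread that did it.

```python
import queue
import threading


class LoopDeviceModel:
    """Toy model of the dispatch rules described above (hypothetical names).

    - write: encrypt in the caller's thread if a buffer can be allocated,
      otherwise hand the work to the single helper thread;
    - read: always decrypted by the helper thread.
    """

    def __init__(self, buffers_available: int):
        self.buffers_available = buffers_available  # never refilled in this toy
        self.lock = threading.Lock()
        self.work = queue.Queue()
        self.log = []  # (operation, thread name) pairs
        self.helper = threading.Thread(
            target=self._helper_loop, name="loop-helper", daemon=True)
        self.helper.start()

    def _try_alloc(self) -> bool:
        with self.lock:
            if self.buffers_available > 0:
                self.buffers_available -= 1
                return True
            return False

    def _record(self, op: str) -> None:
        with self.lock:
            self.log.append((op, threading.current_thread().name))

    def _helper_loop(self) -> None:
        while True:
            op, done = self.work.get()
            self._record(op)  # the work happens on the helper thread
            done.set()

    def _to_helper(self, op: str) -> None:
        done = threading.Event()
        self.work.put((op, done))
        done.wait()  # caller sleeps until the helper has finished

    def write(self) -> None:
        if self._try_alloc():
            self._record("encrypt")  # runs on whatever CPU the writer is on
        else:
            self._to_helper("encrypt")

    def read(self) -> None:
        self._to_helper("decrypt")
```

With one buffer available, a writer's first write is encrypted on the writer's own thread. The second write falls back to the helper thread, and every read goes through the helper thread. That is why reads cannot use a second core in this scheme.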
In many cases involving modern processors, the AES
implementation, especially the AMD64-optimized
assembler implementation, is already fast enough to
exceed the disk data transfer rate even on a single
processor core.
Do you have a disk system that is fast enough to fully
utilize one core of a modern processor running loop-AES?
Logged In: NO
Hi jariruusu,
I'm running loop-aes on a small server with two Opteron 244
processors and a 6-disk hardware RAID. The plain RAID can
deliver 60-100 MB/s during a sustained read of a 16G file and
can write 40 MB/s (same file size). A loop-aes partition on
the same array reads and writes at around 25-30 MB/s. I've
rarely seen more than one busy processor during I/O-intensive
tasks, even when I start multiple concurrent disk transfers.
I'm currently running kernel 2.6.14 (x86_64) and loop-aes 3.1b.
Yours,
Holger
Logged In: YES
user_id=238645
Speed of assembler AES implementation on 1.6 GHz Opteron
key length 128 bits, encrypt speed 1106.6 Mbits/s
key length 128 bits, decrypt speed 1107.0 Mbits/s
key length 192 bits, encrypt speed 932.3 Mbits/s
key length 192 bits, decrypt speed 933.3 Mbits/s
key length 256 bits, encrypt speed 807.8 Mbits/s
key length 256 bits, decrypt speed 813.7 Mbits/s
Speed of assembler MD5 implementation on 1.6 GHz Opteron
md5 IV speed 2367.1 Mbits/sec
Combining the above 128-bit AES and MD5 figures, one
1.6 GHz Opteron should be able to handle about 89 MB/s.
These numbers are for crypto operations in userspace and
do not include any file system, loop driver, block layer,
or disk waiting overhead.
If you want to attempt to optimize loop-AES
performance, you can try these optimizations:
1) Try using the deadline I/O scheduler (boot with the
   elevator=deadline kernel parameter). The deadline I/O
   scheduler may reduce situations where the loop driver
   has to wait for I/O to complete on the underlying
   device.
2) Try using the built-in loop driver by applying the
   kernel patch included in the loop-AES tarball. This
   should reduce TLB misses and improve performance a
   little.
3) Try using a larger page pre-allocation. For the module
   version, add an "options loop lo_prealloc=512" line to
   /etc/modprobe.conf. Alternatively, if you are using the
   built-in loop driver from the kernel patch, add the
   "lo_prealloc=512" kernel parameter. A larger
   pre-allocation may reduce situations where the loop
   driver has to wait for I/O to complete on the
   underlying device.
4) Try using 128-bit AES keys instead of 256-bit ones.
   128-bit keys use fewer rounds and are a little
   faster.
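Tip 4 can be checked against the benchmark numbers posted earlier. AES uses 10, 12 and 14 rounds for 128-, 192- and 256-bit keys (FIPS-197). The sketch below assumes cost is simply proportional to the number of rounds. Under that assumption, it predicts the 192- and 256-bit speeds from the 128-bit one and prints them next to the measured values.

```python
# AES round counts per key length, as specified in FIPS-197.
ROUNDS = {128: 10, 192: 12, 256: 14}

# Encrypt speeds in Mbit/s from the 1.6 GHz Opteron benchmark above.
MEASURED = {128: 1106.6, 192: 932.3, 256: 807.8}


def predicted_speed(bits: int) -> float:
    """Scale the 128-bit speed, assuming cost is proportional to rounds."""
    return MEASURED[128] * ROUNDS[128] / ROUNDS[bits]


for bits in (128, 192, 256):
    print(f"AES-{bits}: {ROUNDS[bits]} rounds, "
          f"measured {MEASURED[bits]:.1f}, predicted {predicted_speed(bits):.1f} Mbit/s")

speedup = MEASURED[128] / MEASURED[256]
print(f"128-bit vs 256-bit keys: {speedup:.2f}x measured, "
      f"{ROUNDS[256] / ROUNDS[128]:.2f}x by round count")
```

The measured 128/256-bit ratio is about 1.37, against 1.4 predicted by round count. In that benchmark, 128-bit keys therefore give roughly 37% more raw AES throughput. The small gap presumably comes from per-block work that does not depend on the number of rounds.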
Logged In: YES
user_id=238645
Adding support for multiple worker threads would require
rewriting much of the loop code. The nasty problem is
getting barrier ordering right. As of this writing, I
do not plan to make such changes myself. If someone
sends a patch for that, I will seriously consider
merging it. But for now... I am closing this feature
request.
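The ordering constraint mentioned above can be illustrated with a simplified user-space sketch of a multi-worker dispatcher. All names are hypothetical. The sketch assumes a single thread submits requests and uses a simplified barrier rule. A barrier must not start until every earlier request has finished, and no later request may start until the barrier has finished. It does not model how the real 2.4/2.6 block layer marks barriers or flushes device caches.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class BarrierOrderedWorkers:
    """Hypothetical multi-worker dispatcher with a simplified barrier rule.

    Ordinary requests may run on any worker and finish in any order.
    A barrier request waits until every earlier request has finished,
    runs alone, and only then lets later requests be started.
    Assumes submit() is always called from a single thread.
    """

    def __init__(self, workers: int = 2):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.in_flight = []  # futures for requests issued since the last barrier

    def submit(self, fn, *args, barrier: bool = False):
        if not barrier:
            self.in_flight.append(self.pool.submit(fn, *args))
            return
        wait(self.in_flight)  # drain every request issued before the barrier
        self.in_flight = []
        fn(*args)             # nothing else is in flight while the barrier runs

    def close(self):
        wait(self.in_flight)
        self.pool.shutdown()


if __name__ == "__main__":
    log = []  # list.append is thread-safe in CPython
    w = BarrierOrderedWorkers(workers=3)
    for i in range(3):
        w.submit(log.append, f"write {i}")
    w.submit(log.append, "barrier", barrier=True)
    for i in range(3):
        w.submit(log.append, f"write {i + 3}")
    w.close()
    print(log)
```

Draining all workers at every barrier, as done here, is easy to get right but stalls every worker. Keeping the workers busy while still preserving the ordering is where the real difficulty lies.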