From: Bart V. A. <bva...@ac...> - 2011-02-26 12:46:09
On Fri, Feb 25, 2011 at 8:49 AM, Hiroyuki Sato <hir...@gm...> wrote:
> I would like to know benchmark best practices for SRP.
>
> I tested the following environment.
>
> Result (average rate over multiple test runs):
>
> 1) buffer cache enabled
>    read  456 MB/sec
>    write 243 MB/sec
>
> 2) buffer cache disabled
>    read  454 MB/sec
>    write 246 MB/sec
>
> I think SRP should be faster, so something may be wrong with my test.
>
> I would like to know the following (I'll test again and report back):
>
> 1) benchmark tools (freeware only)
> 2) SCST and kernel tuning parameters
> 3) what to observe (CPU load, memory usage and so on)
> 4) best-practice websites or books

Hello Hiroyuki,

A few general comments about IB performance:

- The bandwidths you report are significantly below what is possible with
  an SRP HCA and large block sizes. You should verify with ib_rdma_bw and
  a block size of 64 KB whether the bandwidth of your setup is close to
  what the HCA vendor advertises. If not, check whether your HCA has been
  inserted in an I/O slot with sufficient bandwidth and whether the PCI
  MaxReadRequest parameter is at least 512 bytes (see also the output of
  lspci -vvv). Updating the BIOS may help.
- IB settings like "connected mode" and MTU are only relevant for IPoIB,
  not for SRP.
- Recent HCAs have significantly lower latency than the SDR HCAs I know
  of - keep that in mind when running tests that measure I/O latency.

General performance measurement advice:

- Make sure that frequency scaling has been disabled in the BIOS, in the
  kernel or via /sys.
- For maximum reproducibility, make sure that the X server is not
  running, e.g. by switching to a runlevel in which the X server is not
  started.

About the initiator configuration:

- There are two possible ways to perform I/O at the initiator:
  asynchronous (buffered) I/O and direct (non-buffered) I/O. If you want
  to measure the performance of SRP itself, avoid asynchronous I/O and
  use direct I/O instead. Since bonnie++ uses buffered I/O it is not
  suited for such tests -- please use fio
  <http://git.kernel.dk/?p=fio.git;a=summary> instead.
- Use the NOOP I/O scheduler instead of e.g. CFQ. While a scheduler like
  CFQ can give better performance than NOOP when running asynchronous I/O
  on top of SRP, the influence of the I/O scheduler makes it impossible
  to interpret benchmark results properly.
- Enlarging the block-layer parameter nr_requests may be necessary before
  running IOPS tests with a large I/O depth.
- Increasing the ib_srp parameter srp_sg_tablesize to 255 may help.

About the target setup:

- It is more convenient to create a file on tmpfs (/dev/shm/...) with dd
  than to use a RAM disk (/dev/ram...) for benchmarking. And if I'm not
  mistaken, the overhead of a tmpfs file is smaller than that of a RAM
  disk when using the SCST handler vdisk_fileio.
- Set the threads_num parameter in scst.conf for disk01 etc. The default
  value of that parameter is too large for low-latency media like a RAM
  disk; a threads_num of 1 or 2 should be optimal. Configuring the SCST
  sysfs parameter cpu_mask may also help.
- Increasing the ib_srpt kernel module parameter srp_max_req_size may
  help, especially for asynchronous I/O throughput.
- When multiple HCAs are present in the target, make sure that the system
  has been configured such that interrupts of different HCAs are
  processed by different CPUs.

A rough sketch of these settings follows below.
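The sketch assumes a recent kernel with sysfs and an SCST 2.x-style sysfs
tree; the device name (sdc), the backing file size, the module parameter
values and the IRQ number are examples only and have to be adapted to your
setup:

  # Initiator side: disable frequency scaling and switch to the noop
  # scheduler (sdc is a placeholder for the SRP-attached disk).
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $g
  done
  echo noop > /sys/block/sdc/queue/scheduler
  echo 512  > /sys/block/sdc/queue/nr_requests   # for large I/O depth tests

  # Load ib_srp with a larger scatter/gather table.
  modprobe ib_srp srp_sg_tablesize=255

  # Target side: create a 1 GB backing file on tmpfs instead of /dev/ram*.
  dd if=/dev/zero of=/dev/shm/disk01 bs=1M count=1024

  # Load ib_srpt with a larger maximum request size (value is an example).
  modprobe ib_srpt srp_max_req_size=8192

  # Reduce the thread count for the low-latency vdisk_fileio device
  # (sysfs path as used by SCST 2.x; adjust if your version differs).
  echo 2 > /sys/kernel/scst_tgt/devices/disk01/threads_num

  # With multiple HCAs, spread their interrupts over different CPUs:
  grep mlx /proc/interrupts               # find the HCA IRQ numbers
  echo 2 > /proc/irq/$IRQ/smp_affinity    # set $IRQ to one of those numbers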
About which tests to run - interesting tests are (example fio invocations
in the P.S. below):

- Single I/O depth random reads and writes for varying block sizes. Use
  fio's I/O engine psync for this test.
- Large I/O depth reads and writes for small block sizes. Use fio's I/O
  engine libaio for this test, with the number of jobs equal to the number
  of CPU cores in the initiator system.

About which parameters to monitor - interesting parameters to monitor
while I/O is ongoing are:

- CPU load of each CPU core in each system involved.
- Interrupt frequency per CPU core - both total frequency and frequency
  per IB HCA.
- I/O bandwidth to/from each storage device involved.
- Cache miss frequency per CPU.

Because of the high rate at which threads communicate on an SRP target
system, misconfigured CPU affinity on a multiprocessor target can cause a
slowdown due to excessive inter-processor communication. Running a recent
kernel and using the perf tool may help here.

Regarding books: sorry, I do not know of any SRP-specific books. Making
SRP perform optimally requires an understanding of system behavior at
several levels: InfiniBand, the I/O chipset and PCI bus, multiprocessor
caching and the memory hierarchy, and Linux performance in general.

As you can see, a lot of advice - I hope I haven't forgotten anything.

Bart.
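P.S. In case it helps, the two fio tests mentioned above could be run
roughly as follows. The device name, block size, run time and job count
are examples only, and the monitoring commands are just suggestions for
observing the parameters listed above:

  # Test 1: single I/O depth random reads, direct I/O, psync engine.
  fio --name=qd1-randread --filename=/dev/sdc --rw=randread --bs=4k \
      --ioengine=psync --direct=1 --runtime=60 --time_based \
      --group_reporting

  # Test 2: large I/O depth small-block reads, libaio engine, one job per
  # CPU core of the initiator (8 is an example).
  fio --name=qd64-read --filename=/dev/sdc --rw=randread --bs=4k \
      --ioengine=libaio --iodepth=64 --numjobs=8 --direct=1 \
      --runtime=60 --time_based --group_reporting

  # While I/O is ongoing, monitor CPU load, interrupts and bandwidth:
  mpstat -P ALL 1
  watch -n 1 cat /proc/interrupts
  iostat -x 1
  perf top        # cache misses / hot spots (recent kernels)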