From: Seth, R. <roh...@in...> - 2001-09-21 19:51:11
Steve,

Thanks for forwarding this mail. Recently I went over the kernel's handling of iobufs in great detail with Andrea (the maintainer). That was one issue resulting in a severe loss of performance starting with the 2.4.5 kernel. With Andrea's patch or mine applied, the result on IA-32 can go as much as 40% higher, and on IA-64 about 20% higher, depending on the workload. Most of this performance benefit comes directly or indirectly from better iobuf (and associated buffer head) management.

A couple of points about the mail below:

On the 2.4.9 kernel, 128KB blocks should go down as-is (without being broken into smaller 64K chunks). The limit is 512K: all I/O requests <=512K get submitted in one iteration, but for requests greater than 512K the kernel first waits for the first 512K to complete before submitting the next. The data below indicates this -- look at the numbers against wait_kio, for instance. Anyway, the limitation exists mainly because of the current handling of the bhs associated with iobufs. With my patch, this limitation could be removed completely without any problems (I haven't tried that, though).

The other part this mail refers to is the breaking of requests into 512-byte sector quantities. I have this on my todo list, to try and see its effect in a real-world situation. But currently I'm on AIO... that is the top priority, so it is possible that I may not touch this for another couple of weeks. Having said that, I also think that the lower layer is combining the requests broken up by the raw I/O layer, and the data below kind of indicates that too. But this needs to be properly investigated.

Could you also please explain the names of all the columns in the following data?
rohit

-----Original Message-----
From: Carbonari, Steven
Sent: Friday, September 21, 2001 10:05 AM
To: Seth, Rohit
Cc: Mallick, Asit K; Prickett, Terry O
Subject: FW: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

Rohit,

I thought you might be interested in seeing this, since you have looked closely at raw I/O with the TPC-C testing. Does your patch address any of the issues brought up below?

Steve

-----Original Message-----
From: Bill Hartner [mailto:ha...@au...]
Sent: Friday, September 21, 2001 9:53 AM
To: lse...@li...
Subject: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

I created a simple program (rawread.c) that does reads from a raw device in order to understand the code path [in support of a database benchmarking effort that is using raw I/O]. A link to the program:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread.c

I collected a kernprof ACG for one instance of the rawread program that did (128) reads of 128KB from the raw device. The system under test was a 4-way 200 MHz box running a 2.4.9 kernel with the SGI profile patch.
Here is a link to the ACG:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread249.acg.txt

Here is the raw read code path down to the block device layer (standard gprof call-graph columns: index, % time, self, children, called, name):

-----------------------------------------------
               0.00    0.31     150/150         system_call [3]
[4]     1.4    0.00    0.31     150         sys_read [4]
               0.00    0.29     128/128             raw_read [5]
               0.00    0.02      18/18              proc_file_read [22]
               0.00    0.00       4/14              generic_file_read [78]
               0.00    0.00     151/253             fput [191]
               0.00    0.00     150/177             fget [194]
-----------------------------------------------
               0.00    0.29     128/128         sys_read [4]
[5]     1.3    0.00    0.29     128         raw_read [5]
               0.00    0.29     128/128             rw_raw_dev [6]
-----------------------------------------------
               0.00    0.29     128/128         raw_read [5]
[6]     1.3    0.00    0.29     128         rw_raw_dev [6]
               0.03    0.25     128/128             brw_kiovec [7]
               0.00    0.01     128/128             map_user_kiobuf [35]
               0.00    0.00     128/128             mark_dirty_kiobuf [205]
               0.00    0.00     128/128             unmap_kiobuf [206]
-----------------------------------------------
               0.03    0.25     128/128         rw_raw_dev [6]
[7]     1.2    0.03    0.25     128         brw_kiovec [7]
               0.03    0.17   32768/32795           submit_bh [8]
               0.03    0.00   32768/32771           set_bh_page [13]
               0.01    0.00     128/128             wait_kio [36]
               0.00    0.01     128/128             kiobuf_wait_for_io [44]
               0.00    0.00   32768/32770           init_buffer [129]
-----------------------------------------------
               0.00    0.00       1/32795       block_read_full_page [118]
               0.00    0.00       2/32795       ll_rw_block [111]
               0.00    0.00      24/32795       write_locked_buffers [61]
               0.03    0.17   32768/32795       brw_kiovec [7]
[8]     0.9    0.03    0.17   32795         submit_bh [8]
               0.01    0.16   32795/32795           generic_make_request [9]
-----------------------------------------------
               0.01    0.16   32795/32795       submit_bh [8]
[9]     0.8    0.01    0.16   32795         generic_make_request [9]
               0.15    0.00   32795/32795           _make_request [10]
               0.01    0.00   32795/32795           blk_get_queue [40]
-----------------------------------------------
               0.15    0.00   32795/32795       generic_make_request [9]
[10]    0.7    0.15    0.00   32795         _make_request [10]
               0.00    0.00   32661/32661           elevator_linus_merge [130]
               0.00    0.00   32640/32640           scsi_back_merge_fn_c [131]
               0.00    0.00   32514/32514           elevator_linus_merge_cleanup [132]
               0.00    0.00     134/134             generic_plug_device [203]
               0.00    0.00       3/3               attempt_merge [345]
               0.00    0.00       2/2               scsi_front_merge_fn_c [426]
-----------------------------------------------

A couple of observations from the ACG:

(1) The (128) raw reads of 128KB result in 32768 calls to submit_bh, which acquires the io_request_lock at least (2) times for each call.

(2) Each 128KB raw read appears to be broken down into (2) I/Os of 64KB each. I think the 1st one completes and then the 2nd one is initiated.

It seems there may be some room for improvement if we were able to reduce the calls to submit_bh by sending down 1 or 2 buffer heads instead of 256 for each 128KB read.

Also, I did a quick test on SMP running (8) instances of the rawread program. The observation was that some of the I/O is possibly driven down to the SCSI layer before all of the buffer heads for one of the 128KB reads have made it down to the block device layer. This may result in more I/Os, which could impact performance. We will be studying this code path more.

Comments ?

Bill Hartner
bha...@us...
IBM Linux Technology Center