From: Seth, R. <roh...@in...> - 2001-09-21 19:51:11
Steve,

Thanks for forwarding this mail. Recently I went over the kernel's handling of iobufs in great detail with Andrea (the maintainer). That was one issue resulting in a severe loss of performance starting with the 2.4.5 kernel. With Andrea's patch or mine applied, the result on IA-32 can go as much as 40% higher, and on IA-64 about 20% higher, depending on the workload. Most of this performance benefit comes directly or indirectly from better iobuf (and associated buffer head) management.

A couple of points about the mail below:

On the 2.4.9 kernel, 128KB blocks should go down as-is (without being broken into smaller 64K chunks). The limit is 512K: all I/O requests <=512K get submitted in one iteration, but for requests greater than 512K the kernel first waits for the first 512K to complete before submitting the next. The data below indicates this -- look at the numbers against wait_kio, for instance. Anyway, the limitation exists mainly because of the current handling of the bhs associated with iobufs. With my patch, this limitation could be removed completely without any problems (I haven't tried that, though).

The other part this mail refers to is the breaking of requests into 512-byte sector quantities. I have this on my todo list, to try and see its effect in a real-world situation. But currently I'm on AIO... that is the top priority, so it is possible that I may not touch this for another couple of weeks. Having said that, I also think that the lower layer is combining the requests broken up by the raw I/O layer, and the data below kind of indicates that too. But this needs to be properly investigated.

Could you also please explain the names of all the columns in the following data?
rohit

-----Original Message-----
From: Carbonari, Steven
Sent: Friday, September 21, 2001 10:05 AM
To: Seth, Rohit
Cc: Mallick, Asit K; Prickett, Terry O
Subject: FW: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

Rohit,

I thought you might be interested in seeing this, since you have looked closely at raw I/O with the TPC-C testing. Does your patch address any of the issues brought up below?

Steve

-----Original Message-----
From: Bill Hartner [mailto:ha...@au...]
Sent: Friday, September 21, 2001 9:53 AM
To: lse...@li...
Subject: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

I created a simple program (rawread.c) that does reads from a raw device in order to understand the code path [in support of a database benchmarking effort that is using raw I/O]. A link to the program:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread.c

I collected a kernprof ACG for one instance of the rawread program that did (128) reads of 128KB from the raw device. The system under test was a 4-way 200 MHz box running a 2.4.9 kernel with the SGI profile patch.
Here is a link to the ACG:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread249.acg.txt

Here is the raw read code path down to the block device layer (standard gprof call-graph columns: index, % time, self, children, called, name):

-----------------------------------------------
               0.00    0.31     150/150         system_call [3]
[4]     1.4    0.00    0.31     150         sys_read [4]
               0.00    0.29     128/128             raw_read [5]
               0.00    0.02      18/18              proc_file_read [22]
               0.00    0.00       4/14              generic_file_read [78]
               0.00    0.00     151/253             fput [191]
               0.00    0.00     150/177             fget [194]
-----------------------------------------------
               0.00    0.29     128/128         sys_read [4]
[5]     1.3    0.00    0.29     128         raw_read [5]
               0.00    0.29     128/128             rw_raw_dev [6]
-----------------------------------------------
               0.00    0.29     128/128         raw_read [5]
[6]     1.3    0.00    0.29     128         rw_raw_dev [6]
               0.03    0.25     128/128             brw_kiovec [7]
               0.00    0.01     128/128             map_user_kiobuf [35]
               0.00    0.00     128/128             mark_dirty_kiobuf [205]
               0.00    0.00     128/128             unmap_kiobuf [206]
-----------------------------------------------
               0.03    0.25     128/128         rw_raw_dev [6]
[7]     1.2    0.03    0.25     128         brw_kiovec [7]
               0.03    0.17   32768/32795           submit_bh [8]
               0.03    0.00   32768/32771           set_bh_page [13]
               0.01    0.00     128/128             wait_kio [36]
               0.00    0.01     128/128             kiobuf_wait_for_io [44]
               0.00    0.00   32768/32770           init_buffer [129]
-----------------------------------------------
               0.00    0.00       1/32795       block_read_full_page [118]
               0.00    0.00       2/32795       ll_rw_block [111]
               0.00    0.00      24/32795       write_locked_buffers [61]
               0.03    0.17   32768/32795       brw_kiovec [7]
[8]     0.9    0.03    0.17   32795         submit_bh [8]
               0.01    0.16   32795/32795           generic_make_request [9]
-----------------------------------------------
               0.01    0.16   32795/32795       submit_bh [8]
[9]     0.8    0.01    0.16   32795         generic_make_request [9]
               0.15    0.00   32795/32795           _make_request [10]
               0.01    0.00   32795/32795           blk_get_queue [40]
-----------------------------------------------
               0.15    0.00   32795/32795       generic_make_request [9]
[10]    0.7    0.15    0.00   32795         _make_request [10]
               0.00    0.00   32661/32661           elevator_linus_merge [130]
               0.00    0.00   32640/32640           scsi_back_merge_fn_c [131]
               0.00    0.00   32514/32514           elevator_linus_merge_cleanup [132]
               0.00    0.00     134/134             generic_plug_device [203]
               0.00    0.00       3/3               attempt_merge [345]
               0.00    0.00       2/2               scsi_front_merge_fn_c [426]
-----------------------------------------------

A couple of observations from the ACG:

(1) The (128) raw reads of 128KB result in 32768 calls to submit_bh, which acquires the io_request_lock at least (2) times for each call.

(2) Each 128KB raw read appears to be broken down into (2) I/Os of 64KB each. I think the 1st one completes and then the 2nd one is initiated.

It seems there may be some room for improvement if we were able to reduce the calls to submit_bh by sending down 1 or 2 buffer heads instead of 256 for each 128KB read.

Also, I did a quick test on SMP running (8) instances of the rawread program. The observation was that some of the I/O is possibly driven down to the SCSI layer before all of the buffer heads for one of the 128KB reads have made it down to the block device layer. This may result in more I/Os, which could impact performance. We will be studying this code path more.

Comments ?

Bill Hartner
bha...@us...
IBM Linux Technology Center