Steve, thanks for forwarding this mail.  I have recently gone over the kernel's handling of iobufs in great detail with Andrea (the maintainer).  That was one issue resulting in a severe loss of performance starting with the 2.4.5 kernel.  With Andrea's patch or mine applied, the result on IA-32 could go as much as 40% higher, and on IA-64 about 20% higher, depending on the workload.  Most of this performance benefit comes directly or indirectly from better iobuf (and associated buffer head) management.
A couple of points about the mail below: on a 2.4.9 kernel the 128KB blocks should go through as is (without being broken into smaller 64K chunks).  The limit is 512K, so all I/O requests <=512K get submitted in one iteration.  For requests greater than 512K, the kernel first waits for the first 512K to complete before submitting the next.  The data below indicates this; look at the numbers against wait_kio, for instance.  In any case, the limitation comes mainly from the current handling of the bhs associated with iobufs.  With my patch in place this limitation could be removed completely without any problems (I haven't tried that, though).

The other part this mail refers to is the breaking of requests into 512-byte sector quantities.  I have this on my todo list, to try and see its effect in a real-world situation.  But currently I'm on AIO, and that is the top priority, so it is possible that I may not touch this for another couple of weeks.  Having said that, I also think that the lower layer is combining the requests broken up by the raw I/O layer, and the data below kind of indicates that too.  But this needs to be properly investigated.
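
To make the arithmetic concrete, here is a rough user-space model of the submission pattern described above (illustrative only, not kernel code; the 512-byte sector size and the 512KB per-iteration cap are the assumptions here, and the exact constants in the 2.4.9 source may be named and sized differently):

/*
 * Back-of-the-envelope model, NOT kernel code.  Assumes the 2.4.9-era
 * brw_kiovec() behaviour described above: the request is chopped into
 * 512-byte buffer heads and at most 512KB worth of them are submitted
 * before waiting for completion.
 */
#include <stdio.h>
#include <stddef.h>

#define SECTOR_SIZE   512
#define INFLIGHT_MAX  (512 * 1024)        /* assumed per-iteration cap */

static void model_request(size_t bytes)
{
    size_t bhs    = bytes / SECTOR_SIZE;                       /* buffer heads */
    size_t cycles = (bytes + INFLIGHT_MAX - 1) / INFLIGHT_MAX; /* submit+wait  */

    printf("%7zuKB read -> %5zu submit_bh calls, %zu submit/wait cycle(s)\n",
           bytes / 1024, bhs, cycles);
}

int main(void)
{
    model_request(128 * 1024);    /* the case profiled below             */
    model_request(512 * 1024);    /* still a single iteration            */
    model_request(1024 * 1024);   /* >512K: second batch waits for first */
    return 0;
}

For a 128KB read this predicts 256 buffer heads per read and a single submit/wait cycle, which lines up with the counts in the ACG data below (32768 submit_bh calls and 128 wait_kio calls for 128 reads).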
 
Could you also please explain what each of the columns in the data below means?
 
rohit
-----Original Message-----
From: Carbonari, Steven
Sent: Friday, September 21, 2001 10:05 AM
To: Seth, Rohit
Cc: Mallick, Asit K; Prickett, Terry O
Subject: FW: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

Rohit,
 
I thought you might be interested in seeing this, since you have looked closely at raw IO with the TPC-C testing.  Does your patch address any of the issues brought up below?
 
Steve
 
 
-----Original Message-----
From: Bill Hartner [mailto:hartner@austin.ibm.com]
Sent: Friday, September 21, 2001 9:53 AM
To: lse-tech@lists.sourceforge.net
Subject: [Lse-tech] code path of 128KB read() from raw device (kernprof acg)

 
I created a simple program (rawread.c) that does reads from a raw device
in order to understand the code path [in support of a database benchmarking
effort that is using raw i/o].  A link to the program :

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread.c
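
(For reference, a minimal sketch of what a program like rawread.c presumably looks like; the real source is at the link above, and the device path, buffer alignment, and error handling below are assumptions.)

/*
 * Minimal sketch of a rawread-style test program; the real rawread.c is
 * at the URL above.  Assumes a raw device already bound with raw(8)
 * (e.g. /dev/raw/raw1) and that the buffer address and transfer size
 * must be 512-byte aligned, as the 2.4 raw driver requires.
 */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define READ_SIZE  (128 * 1024)    /* 128KB per read()          */
#define NUM_READS  128             /* 128 reads, as in the ACG  */

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/raw/raw1";  /* placeholder */
    void *buf;
    int fd, i;

    if (posix_memalign(&buf, 512, READ_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    fd = open(dev, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (i = 0; i < NUM_READS; i++) {
        ssize_t n = read(fd, buf, READ_SIZE);
        if (n != READ_SIZE) {
            perror("read");
            return 1;
        }
    }

    close(fd);
    free(buf);
    return 0;
}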

I collected a kernprof ACG for one instance of the rawread program that
did (128) reads of 128KB from the raw device.  The system under test was
a 4-way 200 MHz machine running a 2.4.9 kernel w/ the SGI profile patch.

Here is a link to the ACG :

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread249.acg.txt

Here is the raw read code down to the block device layer :

-----------------------------------------------
                0.00    0.31     150/150         system_call [3]
[4]      1.4    0.00    0.31     150         sys_read [4]
                0.00    0.29     128/128         raw_read [5]
                0.00    0.02      18/18          proc_file_read [22]
                0.00    0.00       4/14          generic_file_read [78]
                0.00    0.00     151/253         fput [191]
                0.00    0.00     150/177         fget [194]
-----------------------------------------------
                0.00    0.29     128/128         sys_read [4]
[5]      1.3    0.00    0.29     128         raw_read [5]
                0.00    0.29     128/128         rw_raw_dev [6]
-----------------------------------------------
                0.00    0.29     128/128         raw_read [5]
[6]      1.3    0.00    0.29     128         rw_raw_dev [6]
                0.03    0.25     128/128         brw_kiovec [7]
                0.00    0.01     128/128         map_user_kiobuf [35]
                0.00    0.00     128/128         mark_dirty_kiobuf [205]
                0.00    0.00     128/128         unmap_kiobuf [206]
-----------------------------------------------
                0.03    0.25     128/128         rw_raw_dev [6]
[7]      1.2    0.03    0.25     128         brw_kiovec [7]
                0.03    0.17   32768/32795       submit_bh [8]
                0.03    0.00   32768/32771       set_bh_page [13]
                0.01    0.00     128/128         wait_kio [36]
                0.00    0.01     128/128         kiobuf_wait_for_io [44]
                0.00    0.00   32768/32770       init_buffer [129]
-----------------------------------------------
                0.00    0.00       1/32795       block_read_full_page [118]
                0.00    0.00       2/32795       ll_rw_block [111]
                0.00    0.00      24/32795       write_locked_buffers [61]
                0.03    0.17   32768/32795       brw_kiovec [7]
[8]      0.9    0.03    0.17   32795         submit_bh [8]
                0.01    0.16   32795/32795       generic_make_request [9]
-----------------------------------------------
                0.01    0.16   32795/32795       submit_bh [8]
[9]      0.8    0.01    0.16   32795         generic_make_request [9]
                0.15    0.00   32795/32795       _make_request [10]
                0.01    0.00   32795/32795       blk_get_queue [40]
-----------------------------------------------
                0.15    0.00   32795/32795       generic_make_request [9]
[10]     0.7    0.15    0.00   32795         _make_request [10]
                0.00    0.00   32661/32661       elevator_linus_merge [130]
                0.00    0.00   32640/32640       scsi_back_merge_fn_c [131]
                0.00    0.00   32514/32514       elevator_linus_merge_cleanup [132]
                0.00    0.00     134/134         generic_plug_device [203]
                0.00    0.00       3/3           attempt_merge [345]
                0.00    0.00       2/2           scsi_front_merge_fn_c [426]
-----------------------------------------------

A couple of observations from the ACG :

(1) The (128) raw reads of 128KB result in 32768 calls to submit_bh, which
    acquires the io_request_lock at least (2) times for each call.

(2) The 128KB raw read appears to be broken down into (2) I/Os of 64KB each.
    I think the 1st one completes and then the 2nd one is initiated.

It seems there may be some room for improvement if we could reduce the
number of calls to submit_bh by sending down 1 or 2 buffer heads instead
of 256 for each 128KB read.
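
(A quick sanity check of the numbers above, assuming the raw I/O path builds one 512-byte buffer head per sector:)

/*
 * Sanity check of the submit_bh counts in the ACG, assuming one
 * 512-byte buffer head per sector on the raw I/O path.
 */
#include <stdio.h>

int main(void)
{
    int reads       = 128;          /* read() calls issued by rawread    */
    int read_size   = 128 * 1024;   /* 128KB per read                    */
    int sector_size = 512;          /* one buffer head per 512B sector   */

    int bh_per_read = read_size / sector_size;    /* 256   */
    int total_bh    = reads * bh_per_read;        /* 32768 */

    printf("%d buffer heads per 128KB read, %d submit_bh calls in total\n",
           bh_per_read, total_bh);
    return 0;
}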

Also, I did a quick test on SMP running (8) instances of the rawread
program.  The observation was that some of the I/O is possibly driven
down to the SCSI layer before all of the buffer heads for one of the
128KB reads have made it down to the block device layer.  This may
result in more I/Os, which could impact performance.

We will be studying this code path more.  Comments ?

Bill Hartner
bhartner@us.ibm.com
IBM Linux Technology Center