I created a simple program (rawread.c) that reads from a raw device
in order to understand the code path [in support of a database
benchmarking effort that is using raw I/O].  A link to the program:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread.c

I collected a kernprof ACG for one instance of the rawread program that
did (128) reads of 128KB from the raw device.  The system under test was
a 4-way 200 MHz machine running a 2.4.9 kernel with the SGI profile patch.

Here is a link to the ACG:

http://lse.sourceforge.net/benchmarks/ltc/rawread/20_sep_2001/rawread249.acg.txt

Here is the raw read call chain down to the block device layer:

-----------------------------------------------
                0.00    0.31     150/150         system_call [3]
[4]      1.4    0.00    0.31     150         sys_read [4]
                0.00    0.29     128/128         raw_read [5]
                0.00    0.02      18/18          proc_file_read [22]
                0.00    0.00       4/14          generic_file_read [78]
                0.00    0.00     151/253         fput [191]
                0.00    0.00     150/177         fget [194]
-----------------------------------------------
                0.00    0.29     128/128         sys_read [4]
[5]      1.3    0.00    0.29     128         raw_read [5]
                0.00    0.29     128/128         rw_raw_dev [6]
-----------------------------------------------
                0.00    0.29     128/128         raw_read [5]
[6]      1.3    0.00    0.29     128         rw_raw_dev [6]
                0.03    0.25     128/128         brw_kiovec [7]
                0.00    0.01     128/128         map_user_kiobuf [35]
                0.00    0.00     128/128         mark_dirty_kiobuf [205]
                0.00    0.00     128/128         unmap_kiobuf [206]
-----------------------------------------------
                0.03    0.25     128/128         rw_raw_dev [6]
[7]      1.2    0.03    0.25     128         brw_kiovec [7]
                0.03    0.17   32768/32795       submit_bh [8]
                0.03    0.00   32768/32771       set_bh_page [13]
                0.01    0.00     128/128         wait_kio [36]
                0.00    0.01     128/128         kiobuf_wait_for_io [44]
                0.00    0.00   32768/32770       init_buffer [129]
-----------------------------------------------
                0.00    0.00       1/32795       block_read_full_page [118]
                0.00    0.00       2/32795       ll_rw_block [111]
                0.00    0.00      24/32795       write_locked_buffers [61]
                0.03    0.17   32768/32795       brw_kiovec [7]
[8]      0.9    0.03    0.17   32795         submit_bh [8]
                0.01    0.16   32795/32795       generic_make_request [9]
-----------------------------------------------
                0.01    0.16   32795/32795       submit_bh [8]
[9]      0.8    0.01    0.16   32795         generic_make_request [9]
                0.15    0.00   32795/32795       _make_request [10]
                0.01    0.00   32795/32795       blk_get_queue [40]
-----------------------------------------------
                0.15    0.00   32795/32795       generic_make_request [9]
[10]     0.7    0.15    0.00   32795         _make_request [10]
                0.00    0.00   32661/32661       elevator_linus_merge [130]
                0.00    0.00   32640/32640       scsi_back_merge_fn_c [131]
                0.00    0.00   32514/32514       elevator_linus_merge_cleanup [132]
                0.00    0.00     134/134         generic_plug_device [203]
                0.00    0.00       3/3           attempt_merge [345]
                0.00    0.00       2/2           scsi_front_merge_fn_c [426]
-----------------------------------------------

A couple of observations from the ACG:

(1) The (128) raw reads of 128KB result in 32768 calls to submit_bh,
    which acquires the io_request_lock at least (2) times for each call.

(2) Each 128KB raw read appears to be broken down into (2) I/Os of 64KB
    each.  I think the 1st one completes and then the 2nd one is initiated.

It seems there may be some room for improvement if we were able to reduce
the number of calls to submit_bh by sending down 1 or 2 buffer heads
instead of 256 for each 128KB read.

Also, I did a quick test on SMP running (8) instances of the rawread
program.  The observation was that some of the I/O may be driven down
to the SCSI layer before all of the buffer heads for one of the 128KB
reads have made it down to the block device layer.  This may result in
more I/O operations, which could impact performance.

We will be studying this code path more.  Comments?

Bill Hartner
bhartner@us.ibm.com
IBM Linux Technology Center