|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-03 14:14:20
|
Hello,

I am in the process of building a SCSI target system using SCST. I have it all working fine, but I encounter some strange performance issues. The setup is as follows:

- 1 target machine (HP DL360 G6, Intel Xeon quad-core E5520, 4 GB RAM, 4 SAS 72G 10KRPM disks on an HP P410i CCISS RAID controller with 512 KB BBWC)
- 1 initiator (for now), using open-iscsi (version 2.0.870~rc3-0.4). Hardware is identical to that of the target.

Both machines run Debian Lenny. The SCST version I used is 1.0.2 (from subversion), iscsi-scst version 1.0.2/0.4.17r212. The 4 disks on the target are configured in RAID5 mode, 1 volume, 64k chunk size. This volume is then split into several volumes using LVM. Some of these volumes are then 'exported' using SCST's vdisk handler like so:

DEVICE db-master,/dev/vg/db-master,,4096

As you can see I use a 4K blocksize; this seems to give the best performance for me. When I create an ext3 filesystem on this disk on the initiator and run some bonnie++ tests, I get the following:

Version 1.03d      ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
initiator     8G          119286 17 29402  3          65576  3 536.1  1
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++

The part that worries me is the 'sequential input' bit: 65Mb/second. When I read the 'raw' LVM volume on the target using 'dd' I get read speeds of 130Mb/second and above. Testing the ethernet connection using iperf gives me an almost perfect 945Mbit/s full duplex. Dd'ing the raw disk on the initiator side gives me about 80Mb/s.
My question is: why is the initiator so much slower on reading than it is when writing? I have tried using 'NULLIO' to measure performance and the funny thing is, it's perfect: 112 Mb/second, just about what a gigabit ethernet link can deliver. Also, when I read a large file over iSCSI that is already in the target's cache, I see similar speeds. But as soon as the disk is involved, things slow down considerably. I have tuned all kinds of parameters: blocksize, max. send/receive data segment size, readahead settings, but nothing seems to help. I hope you can tell me why that is, or what I'm doing wrong.

Thanks in advance.

Kind Regards,
Ronald.
|
From: Stoyan Marinov <stoyan@ma...> - 2009-06-03 15:34:51
|
Try using a different scheduler for the target disk. Using deadline should produce better benchmark results. However, I personally think cfq is better with more initiators and under load.

On Jun 3, 2009, at 5:14 PM, Beheer InterCommIT wrote:
> [ ... original message quoted in full ... ]
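For reference, switching the I/O scheduler is a sysfs write on the target; a minimal sketch, assuming the /dev/cciss/c0d0 device mentioned later in the thread (sysfs replaces '/' in device names with '!'):

```shell
# Show the available schedulers for the RAID volume (the active one is in
# brackets), then switch to deadline. Device path assumed from this thread.
cat /sys/block/cciss!c0d0/queue/scheduler
echo deadline > /sys/block/cciss!c0d0/queue/scheduler
```

Note this only affects the queue on the target side; the initiator's /dev/sdX has its own scheduler setting.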
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-04 08:36:46
|
2009/6/3 Stoyan Marinov <stoyan@...>:
> Try using different scheduler for the target disk. Using deadline should
> produce better benchmark results. However I personally think cfq is better
> with more initiators and under load.
Hi,
I just tried this, but the performance is actually slightly worse with
the deadline scheduler:
Version 1.03d      ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
acc-db01      8G          118885 18 28741  2          58850  2 489.2  1
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
Thanks for the suggestion though. I'll keep using CFQ; there will
be more than one initiator using this target.
Thanks,
Ronald.
|
|
From: Bart Van Assche <bart.vanassche@gm...> - 2009-06-03 17:01:02
|
On Wed, Jun 3, 2009 at 4:14 PM, Beheer InterCommIT <intercommit@...> wrote:
> [ ... original message quoted in full ... ]

It would help if you would use the usual abbreviations for "megabytes per second" and "megabits per second" (MB/s and Mb/s respectively). It would also help if you could explain which (dd) performance numbers apply to buffered reads and which apply to non-buffered reads.

Please have a look at the script scripts/blockdev-perftest in the SCST repository. This script allows you to perform reproducible measurements on storage media -- either local disks or iSCSI initiators. It passes certain flags to dd to make sure that the results are reproducible (oflag=sync for buffered I/O; caches are flushed for direct I/O).

The large difference between read and write throughput as reported by bonnie++ is probably because of write buffering.

Please have a look at the file scst/README. In this document you will find several performance recommendations. And as you probably know, there are faster filesystems available than ext3, e.g. XFS and ext4.

Met vriendelijke groeten,
Bart Van Assche.
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-04 08:57:39
|
2009/6/3 Bart Van Assche <bart.vanassche@...>:
> On Wed, Jun 3, 2009 at 4:14 PM, Beheer InterCommIT
> <intercommit@...> wrote:
>>
>> Hello,
[ ... measurements ... ]
>
> It would help if you would use the usual abbreviations for "megabytes
> per second" and "megabits per second" (MB/s and Mb/s respectively). It
> would also help if you could explain which (dd) performance numbers
> apply to buffered reads and which performance numbers apply to
> non-buffered reads.
Good point. I reread my own e-mail and I indeed made a mess of Mb, MB,
etc. It is as follows:
When I do this:
dd if=/dev/sda of=/dev/null bs=512K count=2000
I get this:
72.5 MB/s
Doing a dd on a file on an ext3 filesystem gives roughly the same
throughput, 70 MB/s.
Iperf results are in Mb/s (Megabits per second), and they are
consistent at 940Mb/s
> Please have a look at the script scripts/blockdev-perftest in the SCST
> repository. This script allows to perform reproducible measurements on
> storage media -- either local disks or iSCSI initiators. This script
> passes certain flags to dd to make sure that the results are
> reproducible (oflag=sync for buffered I/O, caches are flushed for
> direct I/O).
Thanks, I used the script and see this:
On the target machine:
# ./blockdev-perftest -r -d /dev/cciss/c0d0
blocksize W W W R R R
67108864 -1 -1 -1 4.43054 4.53515 4.09654
33554432 -1 -1 -1 5.02689 4.92622 5.91466
16777216 -1 -1 -1 5.92167 5.94618 5.77884
8388608 -1 -1 -1 7.92642 7.88264 8.13229
4194304 -1 -1 -1 8.05637 8.25026 9.40499
2097152 -1 -1 -1 7.49901 7.23422 8.01207
1048576 -1 -1 -1 7.09388 7.31321 8.30574
524288 -1 -1 -1 12.0139 11.1748 10.1806
262144 -1 -1 -1 5.27295 5.59759 5.59222
131072 -1 -1 -1 6.30552 6.30552 6.97137
65536 -1 -1 -1 8.78964 8.78272 8.75832
32768 -1 -1 -1 8.91387 8.92108 8.92622
16384 -1 -1 -1 10.4863 9.81457 9.69688
8192 -1 -1 -1 10.3729 10.4808 9.98869
4096 -1 -1 -1 13.4932 12.7701 12.7941
2048 -1 -1 -1 21.4519 22.1776 21.2721
1024 -1 -1 -1 40.0147 39.5766 39.0789
512 -1 -1 -1 75.7187 77.2054 75.5061
Here you can see the disk can do 1GB in 4.43 seconds = 231 MB/s
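As a side note, the 231 MB/s figure implies the "1GB" transferred by blockdev-perftest is 1 GiB (1024 MB); a quick check of the arithmetic:

```shell
# 1024 MB read in 4.43 s gives the quoted throughput.
awk 'BEGIN { printf "%.0f MB/s\n", 1024 / 4.43 }'   # -> 231 MB/s
```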
And on the initiator:
# ./blockdev-perftest -r -d /dev/sda
blocksize W W W R R R
67108864 -1 -1 -1 14.7363 9.26515 9.12347
33554432 -1 -1 -1 9.27393 9.10125 9.12389
16777216 -1 -1 -1 9.14465 9.11383 9.12217
8388608 -1 -1 -1 9.31766 9.15109 9.24628
4194304 -1 -1 -1 9.56073 9.41906 9.53268
2097152 -1 -1 -1 10.1216 9.99816 9.79712
1048576 -1 -1 -1 10.0281 10.0149 10.0884
524288 -1 -1 -1 10.2743 10.3298 10.5449
262144 -1 -1 -1 12.1592 12.169 12.2282
131072 -1 -1 -1 14.474 13.831 14.1786
65536 -1 -1 -1 15.3866 14.6369 16.32
32768 -1 -1 -1 19.2831 19.2891 19.287
16384 -1 -1 -1 29.5137 29.2757 29.5453
8192 -1 -1 -1 38.1043 37.3658 37.5391
4096 -1 -1 -1 67.4633 68.9313 67.3177
2048 -1 -1 -1 0.000148064 0.000115727 0.000112724
1024 -1 -1 -1 0.000114121 0.000180819 0.000183753
512 -1 -1 -1 0.000238927 0.000113702 0.000116425
The results on the initiator are pretty much useless, because after
the first run the data is in the target's cache. So after that, the
network becomes the bottleneck and I see a nice 112MB/s. The first run
is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s
> The large difference between read and write throughput as reported by
> Bonnie++ is probably because of write buffering.
>
> Please have a look at the file scst/README. In this document you find
> several performance recommendations.
I have re-read the README but found nothing I missed. I should have
mentioned I'm running a vanilla 2.6.29.4 kernel with all SCST patches
applied and options enabled. I have tried all the recommendations in
the README wrt. readahead settings (default, 512KB, 4MB),
max_sectors_kb, kernel config options, etc. but the 70MB/s is still
the limit.
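For reference, the read-ahead values tried above (default, 512KB, 4MB) are usually set per block device with blockdev; a minimal sketch, assuming the /dev/vg/db-master volume from earlier in the thread (the value is given in 512-byte sectors):

```shell
# Read-ahead is specified in 512-byte sectors: 1024 sectors = 512 KB,
# 8192 sectors = 4 MB. Volume name assumed from this thread.
blockdev --getra /dev/vg/db-master        # print the current read-ahead
blockdev --setra 8192 /dev/vg/db-master   # set 4 MB read-ahead
```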
> And as you probably know there are faster filesystems available than
> ext3, e.g. XFS and ext4.
Indeed. However, in my current tests this does not seem to make a
difference: reading the raw device gives the same throughput as reading
a file from a filesystem.
> Bart.
>
Thanks very much for your suggestions, they were very useful (although
the problem is not yet fixed).
Kind Regards,
Ronald.
|
|
From: Bart Van Assche <bart.vanassche@gm...> - 2009-06-04 09:24:06
|
On Thu, Jun 4, 2009 at 10:57 AM, Beheer InterCommIT <intercommit@...> wrote:
> [ ... ]
> Here you can see the disk can do 1GB in 4.43 seconds = 231 MB/s
>
> And on the initiator:
> [ ... ]
> The results on the initiator are pretty much useless, because after
> the first run the data is in the target's cache. So after that, the
> network becomes the bottleneck and I see a nice 112MB/s. The first run
> is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s

Did I understand correctly that you want to optimize performance for cold cache linear reads? Are you aware that when reading from a cold cache through a single initiator the theoretical maximum throughput is (1 / (1/231 + 1/112)) = 75 MB/s?

Bart.
|
From: Pasi Kärkkäinen <pasik@ik...> - 2009-06-15 14:17:18
|
On Thu, Jun 04, 2009 at 11:24:04AM +0200, Bart Van Assche wrote:
> [ ... ]
> Did I understand correctly that you want to optimize performance for
> cold cache linear reads? Are you aware that when reading from a cold
> cache through a single initiator the theoretical maximum throughput is
> (1 / (1/231 + 1/112)) = 75 MB/s?

Can you please explain this calculation a bit more? 231 is the throughput of the disk, and 112 is the throughput of the network, but I didn't quite get where the formula comes from.

Thanks.

-- Pasi
|
From: Richard Sharpe <realrichardsharpe@gm...> - 2009-06-15 14:59:39
|
On Mon, Jun 15, 2009 at 6:50 AM, Pasi Kärkkäinen<pasik@...> wrote:
> On Thu, Jun 04, 2009 at 11:24:04AM +0200, Bart Van Assche wrote:
>> > [ ... measurements ... ]
>>
>>
>> Did I understand correctly that you want to optimize performance for
>> cold cache linear reads ? Are you aware that when reading from a cold
>> cache through a single initiator the theoretical maximum throughput is
>> (1 / (1/231 + 1/112)) = 75 MB/s ?
>>
>
> Can you please explain this calculation a bit more..
> 231 is the throughput of the disk, and 112 is the throughput of the network,
> but I didn't quite get where the formula comes from..
I think what he is saying is:
1. First you have to get the data into the target's cache, which you
can do at 231MB/s.
2. Then you have to transfer it over the wire to the initiator, which
you can do at 112MB/s.
So, time taken to read 1MB from a cold cache is:
time taken to read 1MB from disk into cache + time taken to read
1MB from cache to initiator
which is:
(1/231 + 1/112)
The max throughput you can achieve is just the inverse of that.
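The arithmetic can be checked directly; a quick sketch using the figures from this thread (231 MB/s disk, 112 MB/s wire):

```shell
# Two serial stages: the total time per MB is the sum of the per-stage
# times, so the combined throughput is the inverse of that sum.
awk 'BEGIN {
    disk = 231; net = 112                        # MB/s, measured earlier in the thread
    printf "%.1f MB/s\n", 1 / (1/disk + 1/net)   # -> 75.4 MB/s
}'
```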
--
Regards,
Richard Sharpe
|
|
From: Bart Van Assche <bart.vanassche@gm...> - 2009-06-15 15:04:36
|
On Mon, Jun 15, 2009 at 4:59 PM, Richard Sharpe <realrichardsharpe@...> wrote:
> [ ... explanation of the 1 / (1/231 + 1/112) formula ... ]

Exactly :-)

All I can add to the above is that the above calculation is based on the following assumptions:
- Readahead on the target has been disabled or does not work.
- During the transfer the data is copied once in the target by SCST.

The above calculation is reasonably accurate if the disk and network transfer speeds are significantly smaller than the memory-to-memory transfer rate (about 3000 MB/s on a modern system). This assumption is valid for 1 GbE networks, but not for 10 GbE or IB networks.

Bart.
|
From: Chris Stelter <robotbeat@gm...> - 2009-06-15 23:31:19
|
Is it best, then, to use Block I/O rather than File I/O in those situations? Or is pass-through an even better option? Say I have a ram-disk-based SAS RAID array. What's the fastest way to move this data with iSCSI? With SRP? With Fibre Channel?

-Chris S.

On Mon, Jun 15, 2009 at 10:04 AM, Bart Van Assche <bart.vanassche@...> wrote:
> [ ... ]
|
From: Sam Haxor <generationgnu@ya...> - 2009-06-22 16:51:56
|
----- Original Message ----
> From: Chris Stelter <robotbeat@...>
> To: Bart Van Assche <bart.vanassche@...>
> Cc: Vladislav Bolkhovitin <vst@...>; scst-devel@...; Beheer InterCommIT <intercommit@...>
> Sent: Monday, June 15, 2009 7:30:47 PM
> Subject: Re: [Scst-devel] Performance Question
>
> Is it best, then, to use Block I/O rather than File I/O in those
> situations? Or, is pass-through an even better option? Say I have a
> ram-disk-based SAS RAID array. What's the fastest way to move this
> data with iSCSI? With SRP? With Fibre Channel?
>
> -Chris S.
Depends what the goal is.

Say you are only interested in measuring throughput or line-performance testing. Here data verification is not important. In that case you can use SCST's vdisk NULLIO handler or disk handler.

Chetan
|
|
From: Sam Haxor <generationgnu@ya...> - 2009-06-22 19:01:12
|
In scst_lib.c/scst_copy_sg()
we shouldn't copy the data if the dev->handler == NULL_IO or DISK_HANDLER, right?
Chetan
|
|
From: Vladislav Bolkhovitin <vst@vl...> - 2009-06-03 18:52:28
|
Beheer InterCommIT, on 06/03/2009 06:14 PM wrote:
> [ ... original message quoted in full ... ]

Search this list archive. Brief summary:

- Use 2.6.27+ kernels with all SCST patches applied and the CFQ scheduler.

- For kernels below 2.6.27, use the deadline scheduler and set 4M read-ahead *BEFORE* loading SCST modules.

Also read the SCST README.
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-04 09:03:27
|
2009/6/3 Vladislav Bolkhovitin <vst@...>:
> Search this list archive.
>
> Brief summary:
>
> - Use 2.6.27+ kernels with all SCST patches applied and CFQ scheduler
>
> - For kernels below 2.6.27 use deadline scheduler and set 4M read-ahead
> *BEFORE* loading SCST modules.
>
> Also read SCST README.

Hi Vladislav,

Thanks for your reply. I forgot to mention that I'm already running a vanilla 2.6.29.4 kernel with SCST patches applied and kernel config options enabled. Also, all SCST modules are compiled without extra_checks, debug etc. Using different I/O schedulers does not seem to make much difference, so I'll stick to CFQ for now. All tests in my first e-mail were done using CFQ on this patched kernel. Readahead was set to 512KB on both the target and the initiator. Also, please see the reply I just sent on the list to Bart.

Kind Regards,
Ronald.
|
From: Vladislav Bolkhovitin <vst@vl...> - 2009-06-04 17:32:59
|
Beheer InterCommIT, on 06/04/2009 01:03 PM wrote:
> Thanks for your reply. I forgot to mention that I'm already running a
> vanilla 2.6.29.4 kernel with SCST patches applied and kernel config
> options enabled. [ ... ]

I guess you use FILEIO, correct? You should use it.

You can also try to test with different numbers of FILEIO threads via the num_threads module parameter.

Also it's worth trying this patch: http://lkml.org/lkml/2009/5/21/319
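For reference, a module parameter like num_threads takes effect at load time; a sketch, assuming the vdisk handler module is named scst_vdisk as in SCST 1.0.x and using an example thread count:

```shell
# Reload the vdisk handler with more FILEIO threads (8 is an example
# value; module name assumed). All exported vdisks must be idle first.
modprobe -r scst_vdisk
modprobe scst_vdisk num_threads=8
```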
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-08 09:34:25
|
> I guess, you use FILEIO, correct? You should use it.

Yes indeed.

> You can also try to test with different numbers of FILEIO threads via
> num_threads module parameter.
>
> Also it's worth to try this patch: http://lkml.org/lkml/2009/5/21/319

I have tried it and: wow! Now I get around 90 MB/s with 'dd', using 512KB readahead on both the target and the initiator:

dd if=/dev/sdc of=/dev/null bs=512K count=4000
4000+0 records in
4000+0 records out
2097152000 bytes (2.1 GB) copied, 23.3725 s, 89.7 MB/s

Now that's more like it! Thanks! I hope this patch makes it into the mainline kernel.

Kind Regards,
Ronald.
|
From: Vladislav Bolkhovitin <vst@vl...> - 2009-06-08 17:04:37
|
Beheer InterCommIT, on 06/08/2009 01:34 PM wrote:
> I have tried it and: wow! Now I get around 90 MB/s with 'dd', using
> 512KB readahead on both the target and the initiator:
> [ ... ]

Good!

I committed this patch as readahead-2.6.X.patch for kernels .25-.29. Patches for kernels prior to .29 were NOT tested. Backports to earlier kernels are welcome.

Thanks,
Vlad
|
From: Klaus Hochlehnert <Mailings@kh...> - 2009-06-26 02:16:26
|
Hi Vlad,
here's the readahead patch for Ubuntu 8.04 - 2.6.24 kernel:
--- linux-2.6.24-24.53/mm/readahead.c 2008-02-11 06:51:11.000000000 +0100
+++ linux-2.6.24-24.53.copy/mm/readahead.c 2009-06-09 21:59:13.640647726 +0200
@@ -472,5 +472,8 @@ page_cache_async_readahead(struct addres
 
 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+
+	if (PageUptodate(page))
+		blk_run_backing_dev(mapping->backing_dev_info, NULL);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
I'm running it on one target without problems.
Regards, Klaus
-----Original Message-----
From: Vladislav Bolkhovitin [mailto:vst@...]
Sent: Monday, June 08, 2009 7:02 PM
To: Beheer InterCommIT
Cc: scst-devel@...
Subject: Re: [Scst-devel] Performance Question
Beheer InterCommIT, on 06/08/2009 01:34 PM wrote:
>> I guess, you use FILEIO, correct? You should use it.
>
> Yes indeed.
>
>> You can also try to test with different numbers of FILEIO threads via
>> num_threads module parameter.
>>
>> Also it's worth to try this patch: http://lkml.org/lkml/2009/5/21/319
>
> I have tried it and: wow! Now I get around 90 MB/s with 'dd', using
> 512KB readahead on both the target and the initiator:
>
> dd if=/dev/sdc of=/dev/null bs=512K count=4000
> 4000+0 records in
> 4000+0 records out
> 2097152000 bytes (2.1 GB) copied, 23.3725 s, 89.7 MB/s
Good!
I committed this patch as readahead-2.6.X.patch for kernels .25-.29.
Patches for kernels prior .29 were NOT tested. Backports on earlier
kernels are welcome.
Thanks,
Vlad
|
|
From: Vladislav Bolkhovitin <vst@vl...> - 2009-06-27 08:20:22
|
Hi Klaus,

Klaus Hochlehnert, on 06/26/2009 05:35 AM wrote:
> Hi Vlad,
>
> here's the readahead patch for Ubuntu 8.04 - 2.6.24 kernel:
>
> --- linux-2.6.24-24.53/mm/readahead.c 2008-02-11 06:51:11.000000000 +0100
> +++ linux-2.6.24-24.53.copy/mm/readahead.c 2009-06-09 21:59:13.640647726 +0200
> @@ -472,5 +472,8 @@ page_cache_async_readahead(struct addres
>
>  /* do read-ahead */
>  ondemand_readahead(mapping, ra, filp, true, offset, req_size);
> +
> +  if (PageUptodate(page))
> +    blk_run_backing_dev(mapping->backing_dev_info, NULL);
> }
> EXPORT_SYMBOL_GPL(page_cache_async_readahead);
>
> I'm running it on one target without problems.

Unfortunately, your patch is corrupted (tabs replaced by spaces) and I
can't commit it as is. Can you resend it as an attachment?

Thanks,
Vlad
|
|
From: Klaus Hochlehnert <Mailings@kh...> - 2009-06-27 18:47:09
Attachments:
0003-readahead-2.6.24.patch
|
Hi Vlad,

here's the patch as an attachment. I also tested this file on my Ubuntu
sources and it worked there. If it doesn't work on a vanilla kernel,
please let me know.

Regards,
Klaus

-----Original Message-----
From: Vladislav Bolkhovitin [mailto:vst@...]
Sent: Saturday, June 27, 2009 10:20 AM
To: Klaus Hochlehnert
Cc: 'Beheer InterCommIT'; 'scst-devel@...'
Subject: Re: [Scst-devel] Performance Question

Hi Klaus,

Klaus Hochlehnert, on 06/26/2009 05:35 AM wrote:
> Hi Vlad,
>
> here's the readahead patch for Ubuntu 8.04 - 2.6.24 kernel:
>
> --- linux-2.6.24-24.53/mm/readahead.c 2008-02-11 06:51:11.000000000 +0100
> +++ linux-2.6.24-24.53.copy/mm/readahead.c 2009-06-09 21:59:13.640647726 +0200
> @@ -472,5 +472,8 @@ page_cache_async_readahead(struct addres
>
>  /* do read-ahead */
>  ondemand_readahead(mapping, ra, filp, true, offset, req_size);
> +
> +  if (PageUptodate(page))
> +    blk_run_backing_dev(mapping->backing_dev_info, NULL);
> }
> EXPORT_SYMBOL_GPL(page_cache_async_readahead);
>
> I'm running it on one target without problems.

Unfortunately, your patch is corrupted (tabs replaced by spaces) and I
can't commit it as is. Can you resend it as an attachment?

Thanks,
Vlad
|
|
From: Vladislav Bolkhovitin <vst@vl...> - 2009-06-29 18:12:46
|
Klaus Hochlehnert, on 06/27/2009 10:34 PM wrote:
> Hi Vlad,
>
> here's the patch as attachment.
> I also tested this file on my Ubuntu sources and it worked there.
> If it doesn't work on a vanilla kernel, please let me know.
>
> Regards, Klaus
>
> -----Original Message-----
> From: Vladislav Bolkhovitin [mailto:vst@...]
> Sent: Saturday, June 27, 2009 10:20 AM
> To: Klaus Hochlehnert
> Cc: 'Beheer InterCommIT'; 'scst-devel@...'
> Subject: Re: [Scst-devel] Performance Question
>
> Hi Klaus,
>
> Klaus Hochlehnert, on 06/26/2009 05:35 AM wrote:
>> Hi Vlad,
>>
>> here's the readahead patch for Ubuntu 8.04 - 2.6.24 kernel:
>>
>> --- linux-2.6.24-24.53/mm/readahead.c 2008-02-11 06:51:11.000000000 +0100
>> +++ linux-2.6.24-24.53.copy/mm/readahead.c 2009-06-09 21:59:13.640647726 +0200
>> @@ -472,5 +472,8 @@ page_cache_async_readahead(struct addres
>>
>>  /* do read-ahead */
>>  ondemand_readahead(mapping, ra, filp, true, offset, req_size);
>> +
>> +  if (PageUptodate(page))
>> +    blk_run_backing_dev(mapping->backing_dev_info, NULL);
>> }
>> EXPORT_SYMBOL_GPL(page_cache_async_readahead);
>>
>> I'm running it on one target without problems.

Committed, thanks
|
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-04 09:35:31
|
2009/6/4 Bart Van Assche <bart.vanassche@...>:
> On Thu, Jun 4, 2009 at 10:57 AM, Beheer InterCommIT
> <intercommit@...> wrote:
>>
>> [ ... ]
>>
>> Here you can see the disk can do 1 GB in 4.43 seconds = 231 MB/s
>>
>> And on the initiator:
>> [ ... ]
>>
>> The results on the initiator are pretty much useless, because after
>> the first run the data is in the target's cache. So after that, the
>> network becomes the bottleneck and I see a nice 112 MB/s. The first run
>> is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s
>
> Did I understand correctly that you want to optimize performance for
> cold cache linear reads? Are you aware that when reading from a cold
> cache through a single initiator the theoretical maximum throughput is
> 1 / (1/231 + 1/112) = 75 MB/s?

Really? No, I was not aware of that. Is that because of the overhead of
using iSCSI? Well, in that case there is no problem and everything
works just fine. The only way to scale up is to get faster network
connections, I guess.

Thanks,
Ronald.
|
|
From: Bart Van Assche <bart.vanassche@gm...> - 2009-06-04 09:49:17
|
On Thu, Jun 4, 2009 at 11:35 AM, Beheer InterCommIT
<intercommit@...> wrote:
> 2009/6/4 Bart Van Assche <bart.vanassche@...>:
>> On Thu, Jun 4, 2009 at 10:57 AM, Beheer InterCommIT
>> <intercommit@...> wrote:
>>>
>>> [ ... ]
>>>
>>> Here you can see the disk can do 1 GB in 4.43 seconds = 231 MB/s
>>>
>>> And on the initiator:
>>> [ ... ]
>>>
>>> The results on the initiator are pretty much useless, because after
>>> the first run the data is in the target's cache. So after that, the
>>> network becomes the bottleneck and I see a nice 112 MB/s. The first run
>>> is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s
>>
>> Did I understand correctly that you want to optimize performance for
>> cold cache linear reads? Are you aware that when reading from a cold
>> cache through a single initiator the theoretical maximum throughput is
>> 1 / (1/231 + 1/112) = 75 MB/s?
>
> Really? No, I was not aware of that. Is that because of the overhead of
> using iSCSI? Well, in that case there is no problem and everything
> works just fine. The only way to scale up is to get faster network
> connections, I guess.

When a single initiator reads data from a cold cache, reading from
disk and sending data over the iSCSI link happen one after the other,
so the bandwidth of the iSCSI link is only partially used. When a
single initiator uses buffered I/O to write data, sending data over
the iSCSI link and writing the data to disk happen simultaneously, so
the full bandwidth of the iSCSI link can be used if the disks are fast
enough. This behavior is not specific to iSCSI; it also occurs with
other storage protocols.

If I remember correctly, Vlad has invented a clever algorithm that
improves read bandwidth for linear I/O. This algorithm has not yet
been implemented, though.

Bart.
|
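Bart's serial-pipeline bound can be checked with a quick calculation: when the two stages run strictly one after the other, the combined rate is the harmonic combination of the disk rate (231 MB/s) and the network rate (112 MB/s) measured earlier in the thread.

```shell
# Throughput bound for two strictly serialized stages:
# 1 / (1/disk + 1/net), with the rates reported in this thread.
awk 'BEGIN { disk = 231; net = 112; printf "%.1f MB/s\n", 1 / (1/disk + 1/net) }'
# prints: 75.4 MB/s
```

This matches the ~75 MB/s ceiling Bart quoted, and explains why the observed 69.6 MB/s cold-cache read is close to the best a single initiator can achieve here.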
|
From: Beheer InterCommIT <intercommit@gm...> - 2009-06-04 10:01:22
|
2009/6/4 Bart Van Assche <bart.vanassche@...>:
> On Thu, Jun 4, 2009 at 11:35 AM, Beheer InterCommIT
> <intercommit@...> wrote:
>>
>> 2009/6/4 Bart Van Assche <bart.vanassche@...>:
>>> On Thu, Jun 4, 2009 at 10:57 AM, Beheer InterCommIT
>>> <intercommit@...> wrote:
>>>>
>>>> [ ... ]
>>>>
>>>> Here you can see the disk can do 1 GB in 4.43 seconds = 231 MB/s
>>>>
>>>> And on the initiator:
>>>> [ ... ]
>>>>
>>>> The results on the initiator are pretty much useless, because after
>>>> the first run the data is in the target's cache. So after that, the
>>>> network becomes the bottleneck and I see a nice 112 MB/s. The first run
>>>> is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s
>>>
>>> Did I understand correctly that you want to optimize performance for
>>> cold cache linear reads? Are you aware that when reading from a cold
>>> cache through a single initiator the theoretical maximum throughput is
>>> 1 / (1/231 + 1/112) = 75 MB/s?
>>
>> Really? No, I was not aware of that. Is that because of the overhead of
>> using iSCSI? Well, in that case there is no problem and everything
>> works just fine. The only way to scale up is to get faster network
>> connections, I guess.
>
> When a single initiator reads data from a cold cache, reading from
> disk and sending data over the iSCSI link happen one after the other,
> so the bandwidth of the iSCSI link is only partially used. When a
> single initiator uses buffered I/O to write data, sending data over
> the iSCSI link and writing the data to disk happen simultaneously, so
> the full bandwidth of the iSCSI link can be used if the disks are fast
> enough. This behavior is not specific to iSCSI; it also occurs with
> other storage protocols.

OK, that explains a lot. So when more than one initiator is involved,
speed will be good for both of them and the full network bandwidth will
be utilized. Great!

> If I remember correctly, Vlad has invented a clever algorithm that
> improves read bandwidth for linear I/O. This algorithm has not yet
> been implemented, though.

Sounds interesting. So my "problem" really isn't a problem at all.
Thank you very much for taking the time to answer my questions.

Kind Regards,
Ronald.
|
|
From: Bart Van Assche <bart.vanassche@gm...> - 2009-06-08 09:37:44
|
On Thu, Jun 4, 2009 at 11:49 AM, Bart Van Assche
<bart.vanassche@...> wrote:
> On Thu, Jun 4, 2009 at 11:35 AM, Beheer InterCommIT
> <intercommit@...> wrote:
>>
>> 2009/6/4 Bart Van Assche <bart.vanassche@...>:
>>> On Thu, Jun 4, 2009 at 10:57 AM, Beheer InterCommIT
>>> <intercommit@...> wrote:
>>>>
>>>> [ ... ]
>>>>
>>>> Here you can see the disk can do 1 GB in 4.43 seconds = 231 MB/s
>>>>
>>>> And on the initiator:
>>>> [ ... ]
>>>>
>>>> The results on the initiator are pretty much useless, because after
>>>> the first run the data is in the target's cache. So after that, the
>>>> network becomes the bottleneck and I see a nice 112 MB/s. The first run
>>>> is valid though, so 1 GB in 14.7 seconds = 69.6 MB/s
>>>
>>> Did I understand correctly that you want to optimize performance for
>>> cold cache linear reads? Are you aware that when reading from a cold
>>> cache through a single initiator the theoretical maximum throughput is
>>> 1 / (1/231 + 1/112) = 75 MB/s?
>>
>> Really? No, I was not aware of that. Is that because of the overhead of
>> using iSCSI? Well, in that case there is no problem and everything
>> works just fine. The only way to scale up is to get faster network
>> connections, I guess.
>
> When a single initiator reads data from a cold cache, reading from
> disk and sending data over the iSCSI link happen one after the other,
> so the bandwidth of the iSCSI link is only partially used. When a
> single initiator uses buffered I/O to write data, sending data over
> the iSCSI link and writing the data to disk happen simultaneously, so
> the full bandwidth of the iSCSI link can be used if the disks are fast
> enough. This behavior is not specific to iSCSI; it also occurs with
> other storage protocols.

An update: the above is only correct when read-ahead has been disabled.

Bart.
|