From: Griffiths, R. A <ric...@in...> - 2002-06-20 15:26:35

We ran without highmem enabled, so the kernel only saw 1GB of memory.

Richard

-----Original Message-----
From: Jens Axboe [mailto:ax...@su...]
Sent: Wednesday, June 19, 2002 11:05 PM
To: Andrew Morton
Cc: mg...@un...; Linux Kernel Mailing List; lse...@li...; ric...@in...
Subject: Re: ext3 performance bottleneck as the number of spindles gets large

On Wed, Jun 19 2002, Andrew Morton wrote:
> mgross wrote:
> >
> > We've been doing some throughput comparisons and benchmarks of block I/O
> > throughput for 8KB writes as the number of SCSI adapters and drives per
> > adapter is increased.
> >
> > The Linux platform is a dual processor 1.2GHz PIII, 2GB of RAM, 2U box.
> > Similar results have been seen with both the 2.4.16 and 2.4.18 base
> > kernels, as well as one of those patched-up O(1) 2.4.18 kernels out there.
>
> umm. Are you not using block-highmem? That is a must-have.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

please use

http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.4/2.4.19-pre10/block-highmem-all-19.bz2

--
Jens Axboe
From: Griffiths, R. A <ric...@in...> - 2002-06-20 20:45:37

No. The platform group is set on a journaling file system. They had already
run a comparison of the ones available and, based on their criteria, ext3
was the best choice. Based on the lockmeter data, it does look as though
the scaling is trapped behind the BKL.

Richard

-----Original Message-----
From: Andrew Morton [mailto:ak...@zi...]
Sent: Thursday, June 20, 2002 1:19 PM
To: Griffiths, Richard A
Cc: 'Jens Axboe'; mg...@un...; Linux Kernel Mailing List; lse...@li...
Subject: Re: ext3 performance bottleneck as the number of spindles gets large

"Griffiths, Richard A" wrote:
>
> We ran without highmem enabled, so the kernel only saw 1GB of memory.

Yup. I take it back - high ext3 lock contention happens on 2.5 as well,
which has block-highmem. With heavy write traffic onto six disks, two
controllers, six filesystems, four CPUs, the machine spends about 40% of
the time spinning on locks in fs/ext3/inode.c. You're on dual CPU, so the
contention is less.

Not very nice. But given that the longest spin time was some tens of
milliseconds, with the average much lower, it shouldn't affect overall
I/O throughput. Possibly something else is happening. Have you tested ext2?
From: Griffiths, R. A <ric...@in...> - 2002-06-20 21:50:47

I should have mentioned that the throughput we saw on 4 adapters, 6 drives
was 126KB/s. The max theoretical bus bandwidth is 640MB/s.

-----Original Message-----
From: Andrew Morton [mailto:ak...@zi...]
Sent: Thursday, June 20, 2002 2:26 PM
To: mg...@un...
Cc: Griffiths, Richard A; 'Jens Axboe'; Linux Kernel Mailing List; lse...@li...
Subject: Re: ext3 performance bottleneck as the number of spindles gets large

mgross wrote:
>
> On Thursday 20 June 2002 04:18 pm, Andrew Morton wrote:
> > Yup. I take it back - high ext3 lock contention happens on 2.5
> > as well, which has block-highmem. With heavy write traffic onto
> > six disks, two controllers, six filesystems, four CPUs, the machine
> > spends about 40% of the time spinning on locks in fs/ext3/inode.c.
> > You're on dual CPU, so the contention is less.
> >
> > Not very nice. But given that the longest spin time was some
> > tens of milliseconds, with the average much lower, it shouldn't
> > affect overall I/O throughput.
>
> How could losing 40% of your CPUs to spin locks NOT spank your throughput?

The limiting factor is usually disk bandwidth, seek latency, rotational
latency. That's why I want to know your bandwidth.

> Can you copy your lockmeter data from its kernel_flag section? I'd like
> to see it.

I don't find lockmeter very useful. Here's oprofile output for 2.5.23:

c013ec08 873    1.07487   rmqueue
c018a8e4 950    1.16968   do_get_write_access
c013b00c 969    1.19307   kmem_cache_alloc_batch
c018165c 1120   1.37899   ext3_writepage
c0193120 1457   1.79392   journal_add_journal_head
c0180e30 1458   1.79515   ext3_prepare_write
c0136948 6546   8.05969   generic_file_write
c01838ac 42608  52.4606   .text.lock.inode

So I lost two CPUs on the BKL in fs/ext3/inode.c. The remaining two
should be enough to saturate all but the most heroic disk subsystems.

A couple of possibilities come to mind:

1: Processes which should be submitting I/O against disk "A" are instead
   spending tons of time asleep in the page allocator waiting for I/O to
   complete against disk "B".

2: ext3 is just too slow for the rate of data which you're trying to push
   at it. This exhibits as lock contention, but the root cause is the cost
   of things like ext3_mark_inode_dirty(). And *that* is something we can
   fix - can shave 75% off the cost of that.

Need more data...

> > Possibly something else is happening. Have you tested ext2?
>
> No. We're attempting to see if we can scale to large numbers of spindles
> with EXT3 at the moment. Perhaps we can effect positive changes to ext3
> before giving up on it and moving to another journaled FS.

Have you tried *any* other fs?
From: Andrew M. <ak...@zi...> - 2002-06-21 07:54:09

"Griffiths, Richard A" wrote:
>
> I should have mentioned that the throughput we saw on 4 adapters, 6 drives
> was 126KB/s. The max theoretical bus bandwidth is 640MB/s.

I hope that was 128MB/s?

Please try the below patch (against 2.4.19-pre10). It halves the lock
contention, and it does that by making the fs twice as efficient, so
that's a bonus.

I wouldn't be surprised if it made no difference. I'm not seeing much
difference between ext2 and ext3 here.

If you have time, please test ext2 and/or reiserfs and/or ext3 in
writeback mode.

And please tell us some more details regarding the performance bottleneck.
I assume that you mean that the I/O rate per disk slows as more disks are
added to an adapter? Or does the total throughput through the adapter fall
as more disks are added?

Thanks.

--- 2.4.19-pre10/fs/ext3/inode.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/ext3/inode.c	Fri Jun 21 00:28:59 2002
@@ -1016,21 +1016,20 @@ static int ext3_prepare_write(struct fil
 	int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
 	handle_t *handle;
 
-	lock_kernel();
 	handle = ext3_journal_start(inode, needed_blocks);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 		goto out;
 	}
-	unlock_kernel();
 	ret = block_prepare_write(page, from, to, ext3_get_block);
-	lock_kernel();
 	if (ret != 0)
 		goto prepare_write_failed;
 
 	if (ext3_should_journal_data(inode)) {
+		lock_kernel();
 		ret = walk_page_buffers(handle, page->buffers,
 			from, to, NULL, do_journal_get_write_access);
+		unlock_kernel();
 		if (ret) {
 			/*
 			 * We're going to fail this prepare_write(),
@@ -1043,10 +1042,12 @@ static int ext3_prepare_write(struct fil
 		}
 	}
 prepare_write_failed:
-	if (ret)
+	if (ret) {
+		lock_kernel();
 		ext3_journal_stop(handle, inode);
+		unlock_kernel();
+	}
 out:
-	unlock_kernel();
 	return ret;
 }
 
@@ -1094,7 +1095,6 @@ static int ext3_commit_write(struct file
 	struct inode *inode = page->mapping->host;
 	int ret = 0, ret2;
 
-	lock_kernel();
 	if (ext3_should_journal_data(inode)) {
 		/*
 		 * Here we duplicate the generic_commit_write() functionality
@@ -1102,22 +1102,43 @@ static int ext3_commit_write(struct file
 		int partial = 0;
 		loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
 
+		lock_kernel();
 		ret = walk_page_buffers(handle, page->buffers,
 			from, to, &partial, commit_write_fn);
+		unlock_kernel();
 		if (!partial)
 			SetPageUptodate(page);
 		kunmap(page);
 		if (pos > inode->i_size)
 			inode->i_size = pos;
 		EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
+		if (inode->i_size > inode->u.ext3_i.i_disksize) {
+			inode->u.ext3_i.i_disksize = inode->i_size;
+			lock_kernel();
+			ret2 = ext3_mark_inode_dirty(handle, inode);
+			unlock_kernel();
+			if (!ret)
+				ret = ret2;
+		}
 	} else {
 		if (ext3_should_order_data(inode)) {
+			lock_kernel();
 			ret = walk_page_buffers(handle, page->buffers,
 				from, to, NULL, journal_dirty_sync_data);
+			unlock_kernel();
 		}
 		/* Be careful here if generic_commit_write becomes a
 		 * required invocation after block_prepare_write. */
 		if (ret == 0) {
+			/*
+			 * generic_commit_write() will run mark_inode_dirty()
+			 * if i_size changes. So let's piggyback the
+			 * i_disksize mark_inode_dirty into that.
+			 */
+			loff_t new_i_size =
+				((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+			if (new_i_size > EXT3_I(inode)->i_disksize)
+				EXT3_I(inode)->i_disksize = new_i_size;
 			ret = generic_commit_write(file, page, from, to);
 		} else {
 			/*
@@ -1129,12 +1150,7 @@ static int ext3_commit_write(struct file
 			kunmap(page);
 		}
 	}
-	if (inode->i_size > inode->u.ext3_i.i_disksize) {
-		inode->u.ext3_i.i_disksize = inode->i_size;
-		ret2 = ext3_mark_inode_dirty(handle, inode);
-		if (!ret)
-			ret = ret2;
-	}
+	lock_kernel();
 	ret2 = ext3_journal_stop(handle, inode);
 	unlock_kernel();
 	if (!ret)
@@ -2165,9 +2181,11 @@ bad_inode:
 /*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache. This gobbles the caller's reference to the
- * buffer_head in the inode location struct.
+ * buffer_head in the inode location struct.
+ *
+ * On entry, the caller *must* have journal write access to the inode's
+ * backing block, at iloc->bh.
  */
-
 static int ext3_do_update_inode(handle_t *handle,
 				struct inode *inode,
 				struct ext3_iloc *iloc)
@@ -2176,12 +2194,6 @@ static int ext3_do_update_inode(handle_t
 	struct buffer_head *bh = iloc->bh;
 	int err = 0, rc, block;
 
-	if (handle) {
-		BUFFER_TRACE(bh, "get_write_access");
-		err = ext3_journal_get_write_access(handle, bh);
-		if (err)
-			goto out_brelse;
-	}
 	raw_inode->i_mode = cpu_to_le16(inode->i_mode);
 	if(!(test_opt(inode->i_sb, NO_UID32))) {
 		raw_inode->i_uid_low = cpu_to_le16(low_16_bits(inode->i_uid));
--- 2.4.19-pre10/mm/filemap.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/mm/filemap.c	Fri Jun 21 00:28:59 2002
@@ -2924,6 +2924,7 @@ generic_file_write(struct file *file,con
 	long	status = 0;
 	int	err;
 	unsigned bytes;
+	time_t	time_now;
 
 	if ((ssize_t) count < 0)
 		return -EINVAL;
@@ -3026,8 +3027,12 @@ generic_file_write(struct file *file,con
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	time_now = CURRENT_TIME;
+	if (inode->i_ctime != time_now || inode->i_mtime != time_now) {
+		inode->i_ctime = time_now;
+		inode->i_mtime = time_now;
+		mark_inode_dirty_sync(inode);
+	}
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;
--- 2.4.19-pre10/fs/jbd/transaction.c~ext3-speedup-1	Fri Jun 21 00:28:59 2002
+++ 2.4.19-pre10-akpm/fs/jbd/transaction.c	Fri Jun 21 00:28:59 2002
@@ -237,7 +237,9 @@ handle_t *journal_start(journal_t *journ
 	handle->h_ref = 1;
 	current->journal_info = handle;
 
+	lock_kernel();
 	err = start_this_handle(journal, handle);
+	unlock_kernel();
 	if (err < 0) {
 		kfree(handle);
 		current->journal_info = NULL;
@@ -1388,8 +1390,10 @@ int journal_stop(handle_t *handle)
 	transaction->t_outstanding_credits -= handle->h_buffer_credits;
 	transaction->t_updates--;
 	if (!transaction->t_updates) {
-		wake_up(&journal->j_wait_updates);
-		if (journal->j_barrier_count)
+		if (waitqueue_active(&journal->j_wait_updates))
+			wake_up(&journal->j_wait_updates);
+		if (journal->j_barrier_count &&
+		    waitqueue_active(&journal->j_wait_transaction_locked))
 			wake_up(&journal->j_wait_transaction_locked);
 	}
 
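
The recurring change in the patch is easiest to see in isolation: stop
holding the BKL across whole prepare/commit paths and take it only around
the journal manipulation that still needs it. Below is a user-space
analogue of that narrowing (a sketch only - the helper names are
hypothetical and a pthread mutex stands in for the BKL; this is not the
real ext3 code):

/*
 * User-space analogue of the BKL narrowing in the patch above.
 * Hypothetical helpers; a pthread mutex plays the role of the BKL.
 */
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static void journal_work(void)   { /* must run serialized */ }
static void expensive_work(void) { /* safe to run concurrently */ }

/* Before: everything runs under the big lock. */
static void prepare_write_old(void)
{
	pthread_mutex_lock(&big_lock);
	journal_work();
	expensive_work();	/* needlessly serialized */
	pthread_mutex_unlock(&big_lock);
}

/* After: the lock brackets only the part that needs serialization. */
static void prepare_write_new(void)
{
	expensive_work();	/* runs with no lock held */
	pthread_mutex_lock(&big_lock);
	journal_work();
	pthread_mutex_unlock(&big_lock);
}

int main(void)
{
	prepare_write_old();
	prepare_write_new();
	return 0;
}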
From: mgross <mg...@un...> - 2002-06-21 18:42:37

Andrew Morton wrote:
> "Griffiths, Richard A" wrote:
> >
> > I should have mentioned that the throughput we saw on 4 adapters, 6
> > drives was 126KB/s. The max theoretical bus bandwidth is 640MB/s.
>
> I hope that was 128MB/s?

Yes, that was MB/s - the data was taken in KB and a set of 3 zeros was
missing.

> Please try the below patch (against 2.4.19-pre10). It halves the lock
> contention, and it does that by making the fs twice as efficient, so
> that's a bonus.

We'll give it a try. I'm on travel right now, so it may be a few days if
Richard doesn't get to it before I get back.

> I wouldn't be surprised if it made no difference. I'm not seeing much
> difference between ext2 and ext3 here.
>
> If you have time, please test ext2 and/or reiserfs and/or ext3 in
> writeback mode.

Soon after we finish beating the ext3 file system up, I'll take a swing
at some other file systems.

> And please tell us some more details regarding the performance
> bottleneck. I assume that you mean that the I/O rate per disk slows as
> more disks are added to an adapter? Or does the total throughput through
> the adapter fall as more disks are added?

No, the block write throughput for the system goes down as drives are
added under this workload. We measure the system throughput, not the
per-drive throughput, but one could infer the per-drive throughput by
dividing.

Running bonnie++ with 300MB files doing 8KB sequential writes, we get the
following system-wide throughput as a function of the number of drives
attached and the number of adapters:

One adapter:
  1 drive per adapter     127,702 KB/sec
  2 drives per adapter     93,283 KB/sec
  6 drives per adapter     85,626 KB/sec

2 adapters:
  1 drive per adapter      92,095 KB/sec
  2 drives per adapter    110,956 KB/sec
  6 drives per adapter    106,883 KB/sec

4 adapters:
  1 drive per adapter     121,125 KB/sec
  2 drives per adapter    117,575 KB/sec
  6 drives per adapter    116,570 KB/sec

Not too pretty.

--mgross
From: Chris M. <ma...@su...> - 2002-06-21 19:27:09

On Fri, 2002-06-21 at 14:46, mgross wrote:
> Andrew Morton wrote:
> >
> > Please try the below patch (against 2.4.19-pre10). It halves the lock
> > contention, and it does that by making the fs twice as efficient, so
> > that's a bonus.
>
> We'll give it a try. I'm on travel right now, so it may be a few days if
> Richard doesn't get to it before I get back.

You might want to try this too. Andrew fixed UPDATE_ATIME() to only call
the dirty_inode method once per second, but generic_file_write should do
the same. It reduces BKL contention by reducing calls to the ext3 and
reiserfs dirty_inode methods, which are much more expensive than simply
marking the inode dirty.

-chris

--- linux/mm/filemap.c	Mon, 28 Jan 2002 09:51:50 -0500
+++ linux/mm/filemap.c	Sun, 12 May 2002 16:16:59 -0400
@@ -2826,6 +2826,14 @@
 	}
 }
 
+static void update_inode_times(struct inode *inode)
+{
+	time_t now = CURRENT_TIME;
+	if (inode->i_ctime != now || inode->i_mtime != now) {
+		inode->i_ctime = inode->i_mtime = now;
+		mark_inode_dirty_sync(inode);
+	}
+}
 /*
  * Write to a file through the page cache.
  *
@@ -2955,8 +2963,7 @@
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	update_inode_times(inode);
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;
From: Andrew M. <ak...@zi...> - 2002-06-21 19:58:27

mgross wrote:
>
> ...
> > And please tell us some more details regarding the performance
> > bottleneck. I assume that you mean that the I/O rate per disk slows
> > as more disks are added to an adapter? Or does the total throughput
> > through the adapter fall as more disks are added?
>
> No, the block write throughput for the system goes down as drives are
> added under this workload. We measure the system throughput, not the
> per-drive throughput, but one could infer the per-drive throughput by
> dividing.
>
> Running bonnie++ with 300MB files doing 8KB sequential writes, we get
> the following system-wide throughput as a function of the number of
> drives attached and the number of adapters:
>
> One adapter:
>   1 drive per adapter     127,702 KB/sec
>   2 drives per adapter     93,283 KB/sec
>   6 drives per adapter     85,626 KB/sec

127 megabytes/sec to a single disk? Either that's a very fast disk, or
you're using very small bytes :)

> 2 adapters:
>   1 drive per adapter      92,095 KB/sec
>   2 drives per adapter    110,956 KB/sec
>   6 drives per adapter    106,883 KB/sec
>
> 4 adapters:
>   1 drive per adapter     121,125 KB/sec
>   2 drives per adapter    117,575 KB/sec
>   6 drives per adapter    116,570 KB/sec

Possibly what is happening here is that a significant amount of dirty
data is being left in memory and is escaping the measurement period.

When you run the test against more disks, the *total* amount of dirty
memory is increased, so the kernel is forced to perform more writeback
within the measurement period. So with two filesystems, you're actually
performing more I/O.

You need to either ensure that all I/O is occurring *within the
measurement interval*, or make the test write so much data (wrt main
memory size) that any leftover unwritten stuff is insignificant.

bonnie++ is too complex for this work. Suggest you use

	http://www.zip.com.au/~akpm/linux/write-and-fsync.c

which will just write and fsync a file. Time how long that takes. Or you
could experiment with bonnie++'s fsync option.

My suggestion is to work with this workload:

	for i in /mnt/1 /mnt/2 /mnt/3 /mnt/4 ...
	do
		write-and-fsync $i/foo 4000 &
	done

which will write a 4 gig file to each disk. This will defeat any caching
effects and is just a way simpler workload, which will allow you to test
one thing in isolation.

So anyway. All this possibly explains the "negative scalability" in the
single-adapter case. For four adapters with one disk on each, 120 megs/sec
seems reasonable, assuming the sustained write bandwidth of a single disk
is 30 megs/sec.

For four adapters, six disks on each, you should be doing better.
Something does appear to be wrong there.
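
For concreteness, a minimal test in the spirit of write-and-fsync might
look like the sketch below (an assumed reconstruction - the actual
write-and-fsync.c at the URL above may differ). It writes the requested
number of megabytes, fsync()s, and times the whole thing, so no dirty
data escapes the measurement interval:

/*
 * Minimal write-and-fsync style test (sketch; the real program may
 * differ). Usage: write-and-fsync <file> <megabytes>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
	static char buf[1024 * 1024];	/* 1MB of zeroes per write() */
	struct timeval t0, t1;
	double secs;
	long mb, i;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <megabytes>\n", argv[0]);
		exit(1);
	}
	mb = atol(argv[2]);
	memset(buf, 0, sizeof(buf));

	fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < mb; i++) {
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			exit(1);
		}
	}
	if (fsync(fd) < 0) {	/* force all dirty data to disk */
		perror("fsync");
		exit(1);
	}
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%ld MB in %.2f seconds (%.2f MB/sec)\n", mb, secs, mb / secs);
	close(fd);
	return 0;
}

Run one instance per disk, e.g. write-and-fsync /mnt/1/foo 4000, as in
the loop above.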
From: Christopher E. B. <cb...@wo...> - 2002-06-23 04:10:55

On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> I should have mentioned that the throughput we saw on 4 adapters, 6
> drives was 126KB/s. The max theoretical bus bandwidth is 640MB/s.

This is *NOT* correct. Assuming a 64bit 66MHz PCI bus, your MAX is
503MB/sec minus PCI overhead... This of course assumes nothing else is
using the PCI bus.

120-something MB/sec sounds a hell of a lot like topping out a 32bit
33MHz PCI bus, but IIRC the earlier posting listed 39160 cards, which are
PCI 64bit with backward compatibility to 32bit. You do have *ALL* of
these cards plugged into a full PCI 64bit/66MHz slot, right? Not plugged
into a 32bit/33MHz slot?

32bit/33MHz  (32 * 33,000,000) / (1024 * 1024 * 8) = 125.89 MByte/sec
64bit/33MHz  (64 * 33,000,000) / (1024 * 1024 * 8) = 251.77 MByte/sec
64bit/66MHz  (64 * 66,000,000) / (1024 * 1024 * 8) = 503.54 MByte/sec

NOTE: PCI transfer rates are often listed as

32bit/33MHz, 132 MByte/sec
64bit/33MHz, 264 MByte/sec
64bit/66MHz, 528 MByte/sec

This is somewhat true, but only if we start with Mbit rates as used in
transmission rates (1,000,000 bits/sec) and work from there, instead of
2^20 (1,048,576). I will not argue about PCI 32bit/33MHz being 1056Mbit
if we are talking about line rate, but when we are talking about storage
media and transfers to/from it as measured by files, remember to convert.

--
I route, therefore you are.
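
The arithmetic above is easy to reproduce; for illustration, a small C
program that prints the same three figures (bus width in bits times clock
rate gives bits/sec; dividing by 2^20 * 8 converts to MByte/sec):

#include <stdio.h>

int main(void)
{
	struct { int bits; long hz; } bus[] = {
		{ 32, 33000000 },	/* 32bit/33MHz */
		{ 64, 33000000 },	/* 64bit/33MHz */
		{ 64, 66000000 },	/* 64bit/66MHz */
	};
	int i;

	for (i = 0; i < 3; i++)
		printf("%dbit/%ldMHz: %.2f MByte/sec\n",
		       bus[i].bits, bus[i].hz / 1000000,
		       (double)bus[i].bits * bus[i].hz / (1024 * 1024 * 8));
	return 0;
}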
From: Andreas D. <ad...@cl...> - 2002-06-23 04:34:51

On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> > I should have mentioned that the throughput we saw on 4 adapters, 6
> > drives was 126KB/s. The max theoretical bus bandwidth is 640MB/s.
>
> This is *NOT* correct. Assuming a 64bit 66MHz PCI bus, your MAX is
> 503MB/sec minus PCI overhead...

Assuming you only have a single PCI bus...

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
From: Christopher E. B. <cb...@wo...> - 2002-06-23 06:08:22

On Sat, 22 Jun 2002, Andreas Dilger wrote:
> On Jun 22, 2002 22:02 -0600, Christopher E. Brown wrote:
> > On Thu, 20 Jun 2002, Griffiths, Richard A wrote:
> > > I should have mentioned that the throughput we saw on 4 adapters, 6
> > > drives was 126KB/s. The max theoretical bus bandwidth is 640MB/s.
> >
> > This is *NOT* correct. Assuming a 64bit 66MHz PCI bus, your MAX is
> > 503MB/sec minus PCI overhead...
>
> Assuming you only have a single PCI bus...

Yes, we could (for example) assume a DP264 board, which features 2/4/8-way
memory interleave, dual 21264 CPUs, and 2 separate PCI 64bit 66MHz buses.

However, multiple buses are *rare* on x86. There are a lot of chained
buses via PCI-to-PCI bridges, but few systems with 2 or more PCI buses of
any type with parallel access to the CPU.

--
I route, therefore you are.
From: William L. I. I. <wl...@ho...> - 2002-06-23 06:36:35

On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> However, multiple buses are *rare* on x86. There are a lot of chained
> buses via PCI-to-PCI bridges, but few systems with 2 or more PCI buses
> of any type with parallel access to the CPU.

NUMA-Q has them.

Cheers,
Bill
From: Dave H. <hav...@us...> - 2002-06-23 07:30:26

William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>
>> However, multiple buses are *rare* on x86. There are a lot of chained
>> buses via PCI-to-PCI bridges, but few systems with 2 or more PCI buses
>> of any type with parallel access to the CPU.
>
> NUMA-Q has them.

Yep, 2 independent buses per quad. That's a _lot_ of buses when you have
an 8 or 16 quad system. (I wonder who has one of those... ;) Almost all
of the server-type boxes that we play with have multiple PCI buses. Even
my old dual-PPro has 2.

--
Dave Hansen
hav...@us...
From: William L. I. I. <wl...@ho...> - 2002-06-23 07:37:35

>> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>>> However, multiple buses are *rare* on x86. There are a lot of chained
>>> buses via PCI-to-PCI bridges, but few systems with 2 or more PCI buses
>>> of any type with parallel access to the CPU.

William Lee Irwin III wrote:
>> NUMA-Q has them.

On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> Yep, 2 independent buses per quad. That's a _lot_ of buses when you have
> an 8 or 16 quad system. (I wonder who has one of those... ;) Almost all
> of the server-type boxes that we play with have multiple PCI buses. Even
> my old dual-PPro has 2.

I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
"independent" bit coming into play.

Cheers,
Bill
From: Dave H. <hav...@us...> - 2002-06-23 07:45:29

William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
>> Yep, 2 independent buses per quad. That's a _lot_ of buses when you
>> have an 8 or 16 quad system. (I wonder who has one of those... ;)
>> Almost all of the server-type boxes that we play with have multiple
>> PCI buses. Even my old dual-PPro has 2.
>
> I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
> "independent" bit coming into play.

Hmmmm. Maybe there is another one for the onboard devices. I thought
that there were 8 slots, 4 per bus. I could be wrong. BTW, the ISA slot
is EISA and, as far as I can tell, is only used for the MDC.

--
Dave Hansen
hav...@us...
From: Christopher E. B. <cb...@wo...> - 2002-06-23 08:03:52

On Sun, 23 Jun 2002, Dave Hansen wrote:
> William Lee Irwin III wrote:
> > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> >> Yep, 2 independent buses per quad. That's a _lot_ of buses when you
> >> have an 8 or 16 quad system. (I wonder who has one of those... ;)
> >> Almost all of the server-type boxes that we play with have multiple
> >> PCI buses. Even my old dual-PPro has 2.
> >
> > I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
> > "independent" bit coming into play.
>
> Hmmmm. Maybe there is another one for the onboard devices. I thought
> that there were 8 slots, 4 per bus. I could be wrong. BTW, the ISA slot
> is EISA and, as far as I can tell, is only used for the MDC.

Do you mean independent in that there are 2 sets of 4 slots, each detected
as a separate PCI bus, or independent in that each set of 4 has *direct*
access to the CPU side and *does not* go through a PCI:PCI bridge?

I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have chained
buses. Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys HR/HS6000) had 2
PCI buses; however, the second bus hung off of a PCI:PCI bridge.

--
I route, therefore you are.
From: David L. <dav...@di...> - 2002-06-23 08:16:22

Most chipsets only have one PCI bus on them, so any others need to be
bridged to that one.

David Lang

On Sun, 23 Jun 2002, Christopher E. Brown wrote:

> On Sun, 23 Jun 2002, Dave Hansen wrote:
> > William Lee Irwin III wrote:
> > > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> > >> Yep, 2 independent buses per quad. That's a _lot_ of buses when
> > >> you have an 8 or 16 quad system. (I wonder who has one of
> > >> those... ;) Almost all of the server-type boxes that we play with
> > >> have multiple PCI buses. Even my old dual-PPro has 2.
> > >
> > > I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
> > > "independent" bit coming into play.
> >
> > Hmmmm. Maybe there is another one for the onboard devices. I thought
> > that there were 8 slots, 4 per bus. I could be wrong. BTW, the ISA
> > slot is EISA and, as far as I can tell, is only used for the MDC.
>
> Do you mean independent in that there are 2 sets of 4 slots, each
> detected as a separate PCI bus, or independent in that each set of 4
> has *direct* access to the CPU side and *does not* go through a PCI:PCI
> bridge?
>
> I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
> chained buses. Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
> HR/HS6000) had 2 PCI buses; however, the second bus hung off of a
> PCI:PCI bridge.
>
> --
> I route, therefore you are.
From: Dave H. <hav...@us...> - 2002-06-23 08:32:32

Christopher E. Brown wrote:
> Do you mean independent in that there are 2 sets of 4 slots, each
> detected as a separate PCI bus, or independent in that each set of 4
> has *direct* access to the CPU side and *does not* go through a PCI:PCI
> bridge?

No PCI:PCI bridges, at least for NUMA-Q.
http://telia.dl.sourceforge.net/sourceforge/lse/linux_on_numaq.pdf

--
Dave Hansen
hav...@us...
From: Martin J. B. <Mar...@us...> - 2002-06-23 16:23:24

> >> Yep, 2 independent buses per quad. That's a _lot_ of buses when you
> >> have an 8 or 16 quad system. (I wonder who has one of those... ;)
> >> Almost all of the server-type boxes that we play with have multiple
> >> PCI buses. Even my old dual-PPro has 2.
> >
> > I thought I saw 3 PCI and 1 ISA per quad, but maybe that's the
> > "independent" bit coming into play.
>
> Hmmmm. Maybe there is another one for the onboard devices. I thought
> that there were 8 slots, 4 per bus. I could be wrong. BTW, the ISA slot
> is EISA and, as far as I can tell, is only used for the MDC.

NUMA-Q has 2 PCI buses per quad: 3 slots on one, 4 on the other, plus the
EISA slots. Multiple independent PCI buses are also available on other
more common architectures, e.g. Netfinity 8500R, x360, x440, etc. Anything
with the Intel Profusion chipset will have this feature; the bottleneck
becomes the "P6 system bus" backplane they're all connected to, which has
a theoretical limit of 800Mb/s IIRC, though nobody's been able to get more
than 420Mb/s out of it in practice, as far as I know.

The thing that makes the NUMA-Q a massive IO-shovelling engine is having
one of these IO backplanes per quad too ... 16 x 800Mb/s = 12.8Gb/s ;-)

M.
From: <ebi...@xm...> - 2002-06-23 17:17:08

William Lee Irwin III <wl...@ho...> writes:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> > However, multiple buses are *rare* on x86. There are a lot of chained
> > buses via PCI-to-PCI bridges, but few systems with 2 or more PCI buses
> > of any type with parallel access to the CPU.
>
> NUMA-Q has them.

As do the latest round of dual P4 Xeon chipsets: the Intel E7500 and the
Serverworks Grand Champion. So on new systems this is easy to get if you
want it.

Eric
From: Griffiths, R. A <ric...@in...> - 2002-06-24 22:51:38

Andrew,
I ran your write-and-fsync program. Here are the results:

#cntrlrs x #drives
2x2  avg = 25.04 MB/s  aggregate = 100 MB/s
2x4  avg =  9.17 MB/s  aggregate = 110 MB/s
4x2  avg = 14.55 MB/s  aggregate = 116.4 MB/s
4x6  avg =  4.94 MB/s  aggregate = 118.6 MB/s

Your program only addresses large I/O (1MB) against a fairly large file
(4GB). We did that as well with bonnie++ (2GB file, 1MB I/O requests);
the results without the fsync option ran about 94-100 MB/s. Our concern
was creating a more real-world mix of I/O: how well does the system scale
against a variety of I/O request sizes on various size files? Where we
saw the worst overall scaling was with 8KB requests.

Richard