From: Martin J. B. <Mar...@us...> - 2002-08-28 22:06:00
> So initializing these fields causes us to pull them in because they are
> not in our cache, and moving the initialization to the free causes them
> to access warm cache entries for them (presumably they just got done
> being used for something.) If this structure is being allocated
> because the calling function wants to use it, doesn't that just move
> the dirty cache issue to another routine (which would have had the
> cache warmed by the allocator before?) Stated another way, how do you
> know you haven't just moved cache misses rather than eliminating them?

Presumably we allocated the skb header because we actually wanted to use
it - therefore we will touch at least some of the cachelines in it on
alloc anyway. If that's the case, it would be more efficient to zero at
least those lines that we touch at alloc time (assuming they're not
going to be in the freeing CPU's cache already most of the time).

M.
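[A minimal sketch of the split being debated - zeroing at alloc time
instead of at free time. This is not the actual posted patch; the
offsetof() bound and the elided details are illustrative for a 2.5-era
struct sk_buff.]

------------------------------------------

/* Before: the header is scrubbed at free time.  When allocs and frees
 * land on different CPUs, the freeing CPU dirties cachelines that the
 * next allocator must then pull across. */
void __kfree_skb(struct sk_buff *skb)
{
	/* ... release dst, run destructor, netfilter put ... */
	skb_headerinit(skb, NULL, 0);	/* zeroes mostly cold lines */
	kfree_skbmem(skb);
}

/* After: zero at alloc time instead.  The allocating CPU is about to
 * write most of these lines anyway, so the stores hit warm cache. */
struct sk_buff *alloc_skb(unsigned int size, int gfp_mask)
{
	struct sk_buff *skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask);

	if (!skb)
		return NULL;
	memset(skb, 0, offsetof(struct sk_buff, truesize));
	/* ... set the few fields that need non-zero initial values ... */
	return skb;
}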
From: Mala A. <ma...@us...> - 2002-08-29 04:04:40
Troy drives his system to get the maximum number of compliant concurrent
connections. Yes, there is no direct relationship between conforming
connections and CPU utilization - you can get 10% conforming connections
with 10% CPU - but that is not how performance runs are done. Troy told
me that his CPU utilization is 100% so far in all his compliant
SPECweb99 runs. I have neither used SPECweb99 nor do I have any detailed
knowledge of it, so I don't know why the CPU utilization is not reported
in the results table of the report.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail: ma...@us...
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone: 838-8088; Tie-line: 678-8088
From: Mala A. <ma...@us...> - 2002-08-29 04:22:22
>Martin wrote..
>> Cycles for the whole routines alloc_skb and __kfree_skb are as follows:
>>
>> 2.5.25+skbinit patch
>> --------------------
>>
>> alloc/free average cycles
>> -------------------------
>> Runs:   1st       2nd       3rd
>>
>> CPU0: 447/1015  580/846   402/905
>> CPU1: 419/1003  383/915   547/856
>
>OK, so after you moved the zeroing of the skb header out of
>__kfree_skb (which I agree is a good idea, at least in principle),
>what is left in there that actually takes any time?
>
>Is it the dst_release's atomic_dec? As to the rest of the function,
>I was under the impression (maybe incorrectly) that we don't use
>destructors, I presume netfilter is not configured, you took
>out skb_headerinit, and kfree_skbmem is not inlined. How come
>this is still burning so many cycles?
>
>Do you have some instruction level profiling data that shows
>you what's still there?

The cycles reported for __kfree_skb include kfree_skbmem as well. I will
measure cycles excluding kfree_skbmem; I don't think it should take many
cycles. I think we are burning more cycles in the free-miss case - the
case where the skb is freed on a different CPU than the one that
allocated it. The free-miss case is what I wanted to avoid, and I got
side-tracked with skb initialization.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail: ma...@us...
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone: 838-8088; Tie-line: 678-8088
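[Mala's instrumentation was not posted; below is a minimal sketch of the
kind of per-routine, per-CPU cycle counting described above, assuming
32-bit x86 and gcc. The wrapper name and counter arrays are
hypothetical.]

------------------------------------------

/* "=A" returns the 64-bit timestamp counter in edx:eax on 32-bit x86. */
static inline unsigned long long rdtsc(void)
{
	unsigned long long t;

	__asm__ __volatile__("rdtsc" : "=A" (t));
	return t;
}

/* Per-CPU accumulators, so CPU0 and CPU1 can be reported separately
 * as in the tables quoted above. */
static unsigned long long free_cycles[NR_CPUS], free_calls[NR_CPUS];

static void timed_kfree_skb(struct sk_buff *skb)
{
	int cpu = smp_processor_id();
	unsigned long long t0 = rdtsc();

	__kfree_skb(skb);
	free_cycles[cpu] += rdtsc() - t0;
	free_calls[cpu]++;
}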
From: Martin J. B. <Mar...@us...> - 2002-08-29 04:37:18
> The cycles reported for __kfree_skb include kfree_skbmem as well.

Ah, OK ... that makes more sense - I missed that.

> I will measure cycles excluding kfree_skbmem; I don't think it
> should take many cycles. I think we are burning more cycles in the
> free-miss case. The free-miss case is what I wanted to avoid, and I
> got side-tracked with skb initialization.

OK, the last set of specweb data I saw (without your patch, with
readprofile on an 8-way SMP):

__kfree_skb  [8026da70]: 22066
kfree_skbmem [8026da00]: 2589

So kfree_skbmem should have been a small part of the equation - only
about 12% of the __kfree_skb ticks - but your cycles for __kfree_skb
don't seem to be shrinking in the same proportion? It will be
interesting to see your data on the remainder of __kfree_skb. I wonder
if there's something else expensive lurking in there? Or perhaps the
costs are just different on an 8-way machine.

Thanks,

M.
From: Benjamin L. <bc...@re...> - 2002-08-22 18:32:58
On Thu, Aug 22, 2002 at 12:22:34PM -0500, Mala Anand wrote:
> I would like to stress again that this patch helps only when the
> allocations and frees occur on two different CPUs. I measured it in
> a UNI system and did not see any impact.

Thanks, that looks a lot more complete. We discussed this on irc a bit,
and Andi Kleen pointed out that several years of hacking on skbs has
probably changed the layout significantly from the original intention
of keeping all the initializations to a cacheline or two. I also
pointed out that it might be worth looking at cache misses and perhaps
adding a prefetch instruction or two, especially during allocation,
when an skb will be used immediately.

Another point is to check the order of writes that gcc is generating to
the skb: if the writes are sequential, the cpu can combine them and
make use of the internal 64-bit bus to the cache. In combination with
write buffers in the cpu, that makes the writes in __kfree_skb almost
free; but if the cache lines are spread out or cold, that would explain
the degradation you're seeing.

Cheers,

-ben
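[A sketch of the prefetch idea, assuming the prefetchw() helper from
<linux/prefetch.h>, which compiles to a no-op on CPUs without a
write-prefetch hint; the placement and the number of lines prefetched
are illustrative, not measured.]

------------------------------------------

#include <linux/prefetch.h>

struct sk_buff *alloc_skb(unsigned int size, int gfp_mask)
{
	struct sk_buff *skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask);

	if (!skb)
		return NULL;
	/* Pull the header's first cachelines toward this CPU in
	 * exclusive state before the initialization stores land. */
	prefetchw(skb);
	prefetchw((char *)skb + L1_CACHE_BYTES);
	/* ... initialization stores follow ... */
	return skb;
}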
From: Dave H. <hav...@us...> - 2002-08-22 19:03:11
Mala Anand wrote:
> The third scope would be measuring this patch in a workload
> environment. We measured it in a web serving workload and found that
> we get 0.7% improvement.

First of all, the patch doesn't apply at all against the current
bitkeeper tree. I can post the exact one I used if you like.

I tried this under our Specweb99 setup. Here's a snippet of readprofile
with, then without the patch:

  8788 __kfree_skb
  8970 mod_timer
  9095 file_read_actor
 10778 alloc_skb
 10905 skb_clone
 11368 e1000_clean_tx_irq
 13595 e1000_intr
 18367 csum_partial_copy_generic
 27848 e1000_xmit_frame
225838 poll_idle
623160 total                     0.4107

alloc:free ratio: 1.226
(__kfree_skb+alloc_skb)/total = 3.14%

  4535 alloc_skb
  4559 do_tcp_sendpages
  4596 e1000_clean_rx_irq
  4847 dev_queue_xmit
  5020 tcp_clean_rtx_queue
  5155 batch_entropy_store
  5165 kmalloc
  5309 tcp_transmit_skb
  6060 do_schedule
  6138 qdisc_restart
  6235 tcp_v4_rcv
  6393 kfree
  6787 do_gettimeofday
  7089 __d_lookup
  7810 ip_queue_xmit
  8303 skb_clone
  8858 file_read_actor
  8885 mod_timer
  9375 .text.lock.namei
 10267 .text.lock.dec_and_lock
 10936 e1000_clean_tx_irq
 13001 __kfree_skb
 13322 skb_release_data
 13562 e1000_intr
 18099 csum_partial_copy_generic
 27447 e1000_xmit_frame
225023 poll_idle
628695 total                     0.4143

alloc:free ratio: 0.348
(__kfree_skb+alloc_skb)/total = 2.79%

You can see the entire readprofile here:
http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-08-22-2002-11.20.17/
http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-mala-08-22-2002-11.44.25/

No, I don't know why I have so much idle time.

--
Dave Hansen
hav...@us...
From: Nivedita S. <ni...@us...> - 2002-08-22 20:59:58
Dave,

Just FYI: the profile at the second link (*mala*) is the one you're
quoting first in this message, and the profile at the first link
(presumably prior to Mala's patch) is the one you've quoted second.
Hopefully the links are just misnamed, and the before/after profiles
listed in the mail are the right ones. :)

It would be useful to know how consistent these profiles are, and the
variance you're seeing with these runs, before reaching any
conclusions. For instance, skb_release_data(), which wasn't altered,
increased from 7259 to 13322, which is on a par with the kind of gain
expected from the patch in the other functions. So is this just normal
variance, or a result of the patch? Looking at most of the Specweb
profiles and networking in general - so much here depends on which CPU
the code gets run on, and on cache behaviour - I'd say you're going to
get a lot of variance...

thanks,
Nivedita

Quoting Dave Hansen <hav...@us...>:
[snip]
> First of all, the patch doesn't apply at all against the current
> bitkeeper tree. I can post the exact one I used if you like.
>
> I tried this under our Specweb99 setup. Here's a snippet of
> readprofile with, then without the patch:
>
> [profiles and links snipped - quoted in full in Dave's mail above]
From: William L. I. I. <wl...@ho...> - 2002-08-22 22:09:56
On Thu, Aug 22, 2002 at 12:02:27PM -0700, Dave Hansen wrote:
> You can see the entire readprofile here:
> http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-08-22-2002-11.20.17/
> http://www.sr71.net/~specweb99/run-specweb-100sec-2400-2.5.31-bk+4-kmap-mala-08-22-2002-11.44.25/
> No, I don't know why I have so much idle time.

Hmm, I found that tiobench was spending a lot of time idle due to
__wait_on_inode() and get_request_wait(). I bumped up the size of the
inode wait table to 1024 and the request queue size to 16384, and saw
that most of the processes then spent their time stuck on ->i_sem
during the initial open of the file they were going to pound on for the
duration of the run. I determined this by just ^C'ing with kgdb and
backtracing various "stuck" processes. I think various profiling
patches might be able to give you an idea of what people are going to
sleep on, too.

Cheers,
Bill
From: Bill H. <ha...@au...> - 2002-08-23 19:11:30
Mala Anand wrote:
>
> Baseline 2.5.25
> ---------------
>
> alloc/free average cycles
> -------------------------
> Runs:   1st       2nd       3rd
>
> CPU0: 337/1163  336/1132  304/1100
> CPU1: 318/1164  309/1153  311/1127
>
> 2.5.25+skbinit patch
> --------------------
>
> alloc/free average cycles
> -------------------------
> Runs:   1st       2nd       3rd
>
> CPU0: 447/1015  580/846   402/905
> CPU1: 419/1003  383/915   547/856
>
> The above figures indicate that the cycles spent in alloc_skb and
> __kfree_skb have gained 5% in the patch case. However, if you take
> the absolute cycles and average them over the three runs, it comes
> to around a 145-cycle saving, which is close to what I posted
> earlier by measuring just the changed code. As the scope of the
> code measured widens, the percentage improvement comes down.

Measuring just the initialization code yielded a reduction of 156
cycles. Measuring alloc_skb and __kfree_skb yielded a reduction of 145
cycles. This was on a 2-CPU system. The worst case scenario would be
allocating the skb header on one CPU and then freeing it on another
CPU. The best case would be doing all of the allocs and frees on one
CPU. You can use process/irq affinity to create both of these cases (a
sketch follows below). Can you measure these?

Bill
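[A user-space sketch of setting up the two cases Bill describes,
assuming the sched_setaffinity() interface in its glibc cpu_set_t form
(the early-2.5 syscall took a raw unsigned long mask instead); the NIC
interrupt is steered separately by writing a CPU mask to
/proc/irq/<N>/smp_affinity.]

------------------------------------------

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one CPU.  With the NIC irq also routed
 * to CPU 0, allocs and frees stay on one cache (best case); routing
 * the irq to CPU 1 instead forces the cross-CPU alloc/free pattern
 * (worst case). */
int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}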
From: Martin J. B. <Mar...@us...> - 2002-08-28 22:01:21
> Cycles for the whole routines alloc_skb and __kfree_skb are as follows:
>
> 2.5.25+skbinit patch
> --------------------
>
> alloc/free average cycles
> -------------------------
> Runs:   1st       2nd       3rd
>
> CPU0: 447/1015  580/846   402/905
> CPU1: 419/1003  383/915   547/856

OK, so after you moved the zeroing of the skb header out of
__kfree_skb (which I agree is a good idea, at least in principle),
what is left in there that actually takes any time?

Is it the dst_release's atomic_dec? As to the rest of the function,
I was under the impression (maybe incorrectly) that we don't use
destructors, I presume netfilter is not configured, you took
out skb_headerinit, and kfree_skbmem is not inlined. How come
this is still burning so many cycles?

Do you have some instruction level profiling data that shows
you what's still there?

Thanks,

Martin

------------------------------------------

void __kfree_skb(struct sk_buff *skb)
{
	if (skb->list) {
		printk(KERN_WARNING "Warning: kfree_skb passed an skb still "
		       "on a list (from %p).\n", NET_CALLER(skb));
		BUG();
	}

	dst_release(skb->dst);
	if (skb->destructor) {
		if (in_irq())
			printk(KERN_WARNING "Warning: kfree_skb on "
			       "hard IRQ %p\n", NET_CALLER(skb));
		skb->destructor(skb);
	}
#ifdef CONFIG_NETFILTER
	nf_conntrack_put(skb->nfct);
#endif
	skb_headerinit(skb, NULL, 0);	/* clean state */
	kfree_skbmem(skb);
}

static inline void dst_release(struct dst_entry *dst)
{
	if (dst)
		atomic_dec(&dst->__refcnt);
}