From: Mala A. <ma...@us...> - 2002-07-27 02:30:09
|
I found a problem with the per-cpu slab allocator implementation in the Linux kernel. The per-cpu array of objects gets emptied unnecessarily when there is no room to put back a freed object. This can lead to a scenario where the object array is emptied, only for subsequent alloc requests to re-populate the array with objects. These wasted cycles could be saved by simply changing the array to a singly linked list of free objects.

To see how often this really happens, I looked at the slab stats collected using a SPECweb99 workload. The following slabinfo stats are edited for clarity's sake. The allocmiss and freemiss counters indicate how many times we re-populated the object array with objects (allocmiss) and how many times the object array was emptied to make room for freed objects (freemiss).

slabinfo - version: 1.1 (statistics) (SMP)
                     allochit  allocmiss    freehit  freemiss
tcp_tw_bucket         7297373      60783    7299082     56577
tcp_open_request     13236826       1427   13236852      1369
file lock cache      13020821       6467   13020878      6336
skbuff_head_cache      770394      38817     401689      3201
sock                 13231542       6584   13231816      5948
buffer_head           5886789     119467    3793394     10946
size-4096           333688059    3327893  333690264   3322182
size-2048            91797861     451537   91798246    450602
size-256            355278409     773333  355281803    766049
size-32              32253719        306   32246987       150

The slab stats above show that allocmiss and freemiss happen less than 1% of the time, which by itself is not significant enough to justify changing the code. However, it is not only the frequency of these misses, relative to allochit and freehit, that matters; the amount of processing done when they happen is also important.
Next I looked at a readprofile taken using the same SPECweb99 workload:

31653 total                      0.0209
 1374 e1000_xmit_frame           0.8218
 1266 __kfree_skb                5.4569
 1261 ProcessTransmitInterrupts  2.7898
 1202 csum_partial_copy_generic  4.8468
 1158 skb_release_data           9.9828
 1114 __wake_up                  9.6034
 1024 ProcessReceiveInterrupts   1.3913
  795 tcp_clean_rtx_queue        1.0461
  754 net_rx_action              1.2240
  696 kfree                      4.8333 ***
  369 kmalloc                    1.0853
  247 kfree_skbmem               2.3750
  181 __free_pages               5.6562
   63 kmem_cache_alloc           0.2351 ***
   57 __free_pages_ok            0.1122
   55 kmem_cache_free            0.4435 ***

kfree is one of the top 10 hot routines, and this is where freemiss processing happens. kmalloc and kmem_cache_alloc include allocmiss processing. I am working on fixing this. Comments and suggestions are welcome.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:512-838-8088; |
From: Luck, T. <ton...@in...> - 2002-07-29 16:48:10
|
Mala,

You don't specify any details of how the "singly linked list of free objects" would be implemented. You cannot use any of the memory in the freed object (as the constructor for a slab is only called when memory is first allocated, not when an object is recycled), so using any part of the object might confuse a caller by giving them a corrupted object.

Are you going to have some external structure to maintain the linked list? Or secretly enlarge the object to provide space for the link, or some other way?

-Tony |
From: Mala A. <ma...@us...> - 2002-07-30 12:36:07
|
>Tony Luck writes ..
>You don't specify any details of how the "singly linked list of
>free objects" would be implemented. You cannot use any of the
>memory in the freed object (as the constructor for a slab is only
>called when memory is first allocated, not when an object is recycled)
>so using any part of the object might confuse a caller by giving them
>a corrupted object.

I am creating a linked list of free objects per cpu. When objects are deallocated by the caller, they get added to that cpu's free object linked list. The freed objects do not migrate to other caches; they are put back on the present cpu's linked list, so they don't have to be re-initialized. I am planning on putting a (configurable) limit on the number of free objects that can stay in a free list.

>Are you going to have some external structure to maintain the linked
>list? Or secretly enlarge the object to provide space for the link,
>or some other way?

No, I am using the beginning space of the object to store the links. When the object is allocated, I can initialize the space occupied by the link address.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Luck, T. <ton...@in...> - 2002-07-30 16:20:24
|
Mala Anand wrote:
> >Are you going to have some external structure to maintain the linked
> >list? Or secretly enlarge the object to provide space for the link,
> >or some other way?
> No I am using the object(beginning space) to store the links. When
> allocated, I can initialize the space occupied by the link address.

You can't use the start of the object (or any other part) in this way; you'll have no way to restore the value you overwrote. Take a look at Jeff Bonwick's paper on slab allocators, which explains this a lot better than I can:

http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.a

-Tony |
From: Mala A. <ma...@us...> - 2002-08-01 12:42:26
|
Tony Luck wrote..
>> No I am using the object(beginning space) to store the links. When
>> allocated, I can initialize the space occupied by the link address.

>You can't use the start of the object (or any other part) in this way,
>you'll have no way to restore the value you overwrote.

>Take a look at Jeff Bonwick's paper on slab allocators which explains
>this a lot better than I can:
> http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.a

I read the document and see that I cannot use the object space even when the object is free. However, I would like to know how much this assumption is relied upon in the Linux kernel code. Looking through the TCP/IP stack code I realized that most of the caches (tcp_open_request, tcp_bind_bucket, tcp_tw_bucket, inet_peer_cache, etc.) do not have constructors and destructors. Some of the caches have them, of course. Skbs have a constructor; however, the code executes the constructor before freeing the object and initializes the rest of the fields during allocation. In this case, the feature of preserving the object between uses does not eliminate any of the object initialization code.

Having said that, I would like to know whether any part of the Linux code takes advantage of this assumption: that is, relies on the object being preserved by the slab allocator between uses so the user does not have to reinitialize it.

Furthermore, I think this design does not take into consideration multiprocessor issues such as cache bouncing, cache warmth, etc. Also, the original implementation of the slab cache in the Linux kernel did not have per-cpu support (I am not sure whether the paper considers SMP either). So this assumption needs to be examined in the light of SMP, NUMA, etc. I would like to explore the possibility of changing this assumption, given SMP/NUMA cache effects.

In the present design there is a limit on how many free objects are held in the per-cpu array. So when an object is freed, it may end up on another cpu more often. The main cost lies in memory latency rather than in executing the field initialization. I doubt we get the same gain as described in the paper from preserving the fields between uses on SMP/NUMA machines.

I agree that preserving read-only variables between uses will help performance. We could still do that by revising the assumption to leave the first 4 (or however many) bytes free to store the links. What do you think?

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Dipankar S. <dip...@in...> - 2002-08-01 13:18:31
|
On Thu, Aug 01, 2002 at 07:42:10AM -0500, Mala Anand wrote:
> In the present design there is a limit on how many free objects are held
> in the per-cpu array. So when an object is freed, it may end up on
> another cpu more often. The main cost lies in memory latency rather than
> in executing the field initialization. I doubt we get the same gain as
> described in the paper from preserving the fields between uses on
> SMP/NUMA machines.
>
> I agree that preserving read-only variables between uses will help
> performance. We could still do that by revising the assumption to leave
> the first 4 (or however many) bytes free to store the links. What do you
> think?

Mala,

Isn't it possible to tune the cpucache limit by writing to /proc/slabinfo so that you avoid frequent draining of free objects? Am I missing something here?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Mala A. <ma...@us...> - 2002-08-01 13:31:47
|
Dipankar wrote..
>Isn't it possible to tune the cpucache limit by writing to
>/proc/slabinfo so that you avoid frequent draining of free objects ?
>Am I missing something here ?

Are you referring to raising the per-cpu array limit? I don't think you can tune that using /proc/slabinfo. In any case, that does not solve the problem; it only delays it. It needs to grow/shrink dynamically based on need. I am not referring only to frequent draining of free objects, but also to the refilling of the object array by subsequent allocations as a result, and so on.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Dipankar S. <dip...@in...> - 2002-08-01 13:40:33
|
On Thu, Aug 01, 2002 at 08:31:45AM -0500, Mala Anand wrote:
> Are you referring to raising the per-cpu array limit? I don't think you
> can tune that using /proc/slabinfo. In any case, that does not solve the
> problem,

Hmm... then what does slabinfo_write()->kmem_tune_cpucache() do?

> it only delays it. It needs to grow/shrink dynamically based on need. I
> am not referring only to frequent draining of free objects, but also to
> the refilling of the object array by subsequent allocations, and so on.

If draining of free objects becomes rare, shouldn't refilling of the object array also become rare?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Luck, T. <ton...@in...> - 2002-08-01 21:31:11
|
> Furthermore, I think this design does not take into consideration
> multiprocessor issues such as cache bouncing, cache warmth, etc. Also,
> the original implementation of the slab cache in the Linux kernel did
> not have per-cpu support (I am not sure whether the paper considers SMP
> either). So this assumption needs to be examined in the light of SMP,
> NUMA, etc. I would like to explore the possibility of changing this
> assumption, given SMP/NUMA cache effects.
>
> In the present design there is a limit on how many free objects are held
> in the per-cpu array. So when an object is freed, it may end up on
> another cpu more often. The main cost lies in memory latency rather than
> in executing the field initialization. I doubt we get the same gain as
> described in the paper from preserving the fields between uses on
> SMP/NUMA machines.

Bonwick has a newer paper (http://www.usenix.org/events/usenix01/bonwick.html) that describes how per-cpu support can be added. I've forgotten my Usenix password, so I can't get the full text of the paper online at the moment. But, if I recall correctly, his magazine layer included support to dynamically adjust the size of the per-cpu lists.

The question becomes: are the performance benefits high enough to justify the extra code complexity? Especially as tuning via /proc/slabinfo is already available to mitigate problems that are bad enough for people to notice. Can you quantify the SMP/NUMA benefits? I took some measurements a while ago that showed that a huge percentage of slab allocations were freed by the same cpu after a very short lifetime. I didn't look into how often the problems that you cite occur.

> I agree that preserving read-only variables between uses will help
> performance. We could still do that by revising the assumption to leave
> the first 4 (or however many) bytes free to store the links. What do you
> think?

You'd need enough bytes to store your pointer (so "whatever" == 8 on 64-bit architectures). Users who care to arrange the fields of their structures in "used together" order for better cache locality tend to put their efforts into the first elements of a structure. You might get less resistance to change if you use the tail end of the object. But this is a potentially big change: drivers can create their own slab caches, and if you change the semantics, you may well break something.

-Tony |