From: Mala A. <ma...@us...> - 2002-07-27 02:30:09
|
I found a problem with the per-cpu slab allocator implementation in the Linux kernel. The per-cpu array of objects gets emptied unnecessarily when there is no room to put back a freed object. This can lead to a scenario where the object array is emptied, only for subsequent alloc requests to re-populate the array with objects. These wasted cycles could be saved by simply changing the array to a singly linked list of free objects.

To see how often this really happens, I looked at the slab stats collected using a SPECweb99 workload. The following slabinfo stats are edited for clarity's sake. The allocmiss and freemiss counters indicate how many times we re-populated the object array with objects (allocmiss) and how many times the object array was emptied to make room for freed objects (freemiss).

slabinfo - version: 1.1 (statistics) (SMP)
                     allochit  allocmiss    freehit  freemiss
tcp_tw_bucket         7297373      60783    7299082     56577
tcp_open_request     13236826       1427   13236852      1369
file lock cache      13020821       6467   13020878      6336
skbuff_head_cache      770394      38817     401689      3201
sock                 13231542       6584   13231816      5948
buffer_head           5886789     119467    3793394     10946
size-4096           333688059    3327893  333690264   3322182
size-2048            91797861     451537   91798246    450602
size-256            355278409     773333  355281803    766049
size-32              32253719        306   32246987       150

The slab stats above show that allocmiss and freemiss happen less than 1% of the time, which by itself is not significant enough to justify changing the code. However, it is not only the frequency of these misses, relative to allochit and freehit, that matters; the amount of processing done when they happen is also important.
Next I looked at a readprofile taken using the same SPECweb99 workload:

31653 total                      0.0209
 1374 e1000_xmit_frame           0.8218
 1266 __kfree_skb                5.4569
 1261 ProcessTransmitInterrupts  2.7898
 1202 csum_partial_copy_generic  4.8468
 1158 skb_release_data           9.9828
 1114 __wake_up                  9.6034
 1024 ProcessReceiveInterrupts   1.3913
  795 tcp_clean_rtx_queue        1.0461
  754 net_rx_action              1.2240
  696 kfree                      4.8333 ***
  369 kmalloc                    1.0853
  247 kfree_skbmem               2.3750
  181 __free_pages               5.6562
   63 kmem_cache_alloc           0.2351 ***
   57 __free_pages_ok            0.1122
   55 kmem_cache_free            0.4435 ***

kfree is one of the top 10 hot routines, and this is where freemiss processing happens. kmalloc and kmem_cache_alloc include allocmiss processing. I am working on fixing this. Comments and suggestions are welcome.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:512-838-8088; |
From: Luck, T. <ton...@in...> - 2002-07-29 16:48:10
|
Mala,

You don't specify any details of how the "singly linked list of free objects" would be implemented. You cannot use any of the memory in the freed object (as the constructor for a slab is only called when memory is first allocated, not when an object is recycled), so using any part of the object might confuse a caller by giving them a corrupted object.

Are you going to have some external structure to maintain the linked list? Or secretly enlarge the object to provide space for the link, or some other way?

-Tony |
From: Mala A. <ma...@us...> - 2002-07-30 12:36:07
|
>Tony Luck writes ..
>You don't specify any details of how the "singly linked list of
>free objects" would be implemented. You cannot use any of the
>memory in the freed object (as the constructor for a slab is only
>called when memory is first allocated, not when an object is recycled)
>so using any part of the object might confuse a caller by giving them
>a corrupted object.

I am creating a linked list of free objects per cpu. When objects are deallocated by the caller, they get added to that cpu's free object linked list. The freed objects do not migrate to other caches; they are put back on the present cpu's linked list, so they don't have to be re-initialized. I am planning on putting a (configurable) limit on the number of free objects that can stay in a free list.

>Are you going to have some external structure to maintain the linked
>list? Or secretly enlarge the object to provide space for the link,
>or some other way?

No, I am using the beginning space of the object to store the links. When the object is allocated, I can initialize the space occupied by the link address.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Luck, T. <ton...@in...> - 2002-07-30 16:20:24
|
Mala Anand wrote:
> >Are you going to have some external structure to maintain the linked
> >list? Or secretly enlarge the object to provide space for the link,
> >or some other way?
> No I am using the object(beginning space) to store the links. When
> allocated, I can initialize the space occupied by the link address.

You can't use the start of the object (or any other part) in this way; you'll have no way to restore the value you overwrote. Take a look at Jeff Bonwick's paper on slab allocators, which explains this a lot better than I can:

http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.a

-Tony |
From: Mala A. <ma...@us...> - 2002-08-01 12:42:26
|
Tony Luck wrote..
>> No I am using the object(beginning space) to store the links. When
>> allocated, I can initialize the space occupied by the link address.

>You can't use the start of the object (or any other part) in this way,
>you'll have no way to restore the value you overwrote.

>Take a look at Jeff Bonwick's paper on slab allocators which explains
>this a lot better than I can:
> http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.a

I read the document and see that I cannot use the object space even when the object is free. However, I would like to know how much this assumption is relied upon in the Linux kernel code. Looking through the TCP/IP stack code I realized that most of the caches (tcp_open_request, tcp_bind_bucket, tcp_tw_bucket, inet_peer_cache, etc.) do not have constructors and destructors. Some of the caches have them, of course. Skbs have a constructor; however, the code executes the constructor before freeing the object and initializes the rest of the fields during allocation. In this case, the feature of preserving the object between uses does not eliminate any of the object initialization code.

Having said that, I would like to know whether any part of the Linux code takes advantage of this assumption: that is, relies on the object being preserved by the slab allocator between uses so the user does not have to reinitialize it.

Furthermore, I think this design does not take into consideration multiprocessor issues such as cache bouncing, cache warmth, etc. Also, the original implementation of the slab cache in the Linux kernel did not have per-cpu support (I am not sure whether the paper considers SMP either). So this assumption needs to be examined in the light of SMP, NUMA, etc. I would like to explore the possibility of changing this assumption, given SMP/NUMA cache effects.

In the present design there is a limit on how many free objects are held in the per-cpu array. So when an object is freed, it may end up on another cpu more often. The main cost lies in memory latency rather than in executing the field initialization. I doubt we get the same gain as described in the paper from preserving the fields between uses on SMP/NUMA machines.

I agree that preserving read-only variables between uses will help performance. We could still do that by revising the assumption to leave the first 4 (or however many) bytes free to store the links. What do you think?

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Dipankar S. <dip...@in...> - 2002-08-01 13:18:31
|
On Thu, Aug 01, 2002 at 07:42:10AM -0500, Mala Anand wrote:
> In the present design there is a limit on how many free objects are held
> in the per-cpu array. So when an object is freed, it may end up on
> another cpu more often. The main cost lies in memory latency rather than
> in executing the field initialization. I doubt we get the same gain as
> described in the paper from preserving the fields between uses on
> SMP/NUMA machines.
>
> I agree that preserving read-only variables between uses will help
> performance. We could still do that by revising the assumption to leave
> the first 4 (or however many) bytes free to store the links. What do you
> think?

Mala,

Isn't it possible to tune the cpucache limit by writing to /proc/slabinfo so that you avoid frequent draining of free objects? Am I missing something here?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Mala A. <ma...@us...> - 2002-08-01 13:31:47
|
Dipankar wrote..
>Isn't it possible to tune the cpucache limit by writing to
>/proc/slabinfo so that you avoid frequent draining of free objects ?
>Am I missing something here ?

Are you referring to raising the per-cpu array limit? I don't think you can tune that using /proc/slabinfo. In any case, that does not solve the problem; it only delays it. It needs to grow/shrink dynamically based on need. I am not referring only to frequent draining of free objects, but also to the refilling of the object array by subsequent allocations as a result, and so on.

Regards,
Mala

Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:ma...@us...
Phone:838-8088; Tie-line:678-8088 |
From: Dipankar S. <dip...@in...> - 2002-08-01 13:40:33
|
On Thu, Aug 01, 2002 at 08:31:45AM -0500, Mala Anand wrote:
> Are you referring to raising the per-cpu array limit? I don't think you
> can tune that using /proc/slabinfo. In any case, that does not solve the
> problem,

Hmm... then what does slabinfo_write()->kmem_tune_cpucache() do?

> it only delays it. It needs to grow/shrink dynamically based on need. I
> am not referring only to frequent draining of free objects, but also to
> the refilling of the object array by subsequent allocations, and so on.

If draining of free objects becomes rare, shouldn't refilling of the object array also become rare?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Luck, T. <ton...@in...> - 2002-08-01 21:31:11
|
> Furthermore, I think this design does not take into consideration
> multiprocessor issues such as cache bouncing, cache warmth, etc. Also,
> the original implementation of the slab cache in the Linux kernel did
> not have per-cpu support (I am not sure whether the paper considers SMP
> either). So this assumption needs to be examined in the light of SMP,
> NUMA, etc. I would like to explore the possibility of changing this
> assumption, given SMP/NUMA cache effects.
>
> In the present design there is a limit on how many free objects are held
> in the per-cpu array. So when an object is freed, it may end up on
> another cpu more often. The main cost lies in memory latency rather than
> in executing the field initialization. I doubt we get the same gain as
> described in the paper from preserving the fields between uses on
> SMP/NUMA machines.

Bonwick has a newer paper (http://www.usenix.org/events/usenix01/bonwick.html) that describes how per-cpu support can be added. I've forgotten my Usenix password, so I can't get the full text of the paper online at the moment. But, if I recall correctly, his magazine layer included support to dynamically adjust the size of the per-cpu lists.

The question becomes: are the performance benefits high enough to justify the extra code complexity? Especially as tuning via /proc/slabinfo is already available to mitigate problems that are bad enough for people to notice. Can you quantify the SMP/NUMA benefits? I took some measurements a while ago that showed that a huge percentage of slab allocations were freed by the same cpu after a very short lifetime. I didn't look into how often the problems that you cite occur.

> I agree that preserving read-only variables between uses will help
> performance. We could still do that by revising the assumption to leave
> the first 4 (or however many) bytes free to store the links. What do you
> think?

You'd need enough bytes to store your pointer (so "whatever" == 8 on 64-bit architectures). Users who care to arrange the fields of their structures in "used together" order for better cache locality tend to put their efforts into the first elements of a structure. You might get less resistance to change if you use the tail end of the object. But this is a potentially big change: drivers can create their own slab caches, and if you change the semantics, you may well break something.

-Tony |