We've recently started to see evidence that VMemPool::allocate is taking a long time on production servers. To investigate further, I started adding some simple data recording to VMemPool in this branch:
/vesta/vestasys.org/vesta/repos/86.VMemPool_stats
Looking at some data from that, it seems clear that the longer the repository runs without weeding, the longer the VMemPool free list becomes: more free blocks get considered and rejected on each allocation before one can be used, and each allocation takes longer. Since the current VMemPool free list handling is simplistic, it shouldn't be too hard to improve the situation.
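For reference, here's a minimal sketch (not the actual VMemPool code) of the kind of first-fit free list scan where this cost shows up; every allocation may walk past many too-small blocks before finding one that fits:

    #include <cstddef>

    struct FreeBlock {
        std::size_t size;
        FreeBlock *next;
    };

    // First-fit scan: cost grows with the number of free blocks that are
    // too small for the request, which is exactly the count the recorded
    // data shows increasing between weeds.
    void *allocateFirstFit(FreeBlock *&head, std::size_t want)
    {
        for (FreeBlock **link = &head; *link != 0; link = &(*link)->next) {
            FreeBlock *b = *link;
            if (b->size >= want) {
                *link = b->next;  // unlink the chosen block
                return b;         // (splitting off the remainder omitted)
            }
            // rejected: too small, keep scanning
        }
        return 0;  // no fit; the caller would grow the pool
    }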
One thing we can do is partition the free list by block size. This will make it easier to avoid spending time considering free blocks that are too small to satisfy the current allocation request. I'm not sure yet exactly how we should partition the free list, and I would like to gather additional data before making a decision. A common method in other allocators is to use powers of 2 (4 bytes, 8 bytes, 16 bytes, etc.), but that might be too coarse-grained for VMemPool, which is designed to use tightly packed data structures. Several of the VMemPool allocations are for fixed block sizes, and we may want to keep separate free lists for blocks of those sizes.
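As a sketch of what size partitioning could look like, assuming (purely for illustration, since as noted they may be too coarse for VMemPool) power-of-two size classes:

    #include <cstddef>

    struct FreeBlock { std::size_t size; FreeBlock *next; };

    enum { kNumClasses = 16 };
    // Class c holds free blocks with size in (4 << (c-1), 4 << c];
    // class 0 holds blocks up to 4 bytes.
    static FreeBlock *freeLists[kNumClasses];

    static unsigned sizeClass(std::size_t n)
    {
        unsigned c = 0;
        std::size_t limit = 4;
        while (limit < n && c + 1 < kNumClasses) { limit <<= 1; c++; }
        return c;
    }

    // Only the request's own class can contain blocks that are too small;
    // every block in a higher class is guaranteed to satisfy the request.
    void *allocate(std::size_t want)
    {
        for (unsigned c = sizeClass(want); c < kNumClasses; c++) {
            for (FreeBlock **link = &freeLists[c]; *link; link = &(*link)->next) {
                if ((*link)->size >= want) {
                    FreeBlock *b = *link;
                    *link = b->next;
                    return b;
                }
            }
        }
        return 0;  // no fit in any class; grow the pool
    }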
Another thing we could do is improve our ability to coalesce free space, reducing fragmentation in the VMemPool heap. The only coalescing that happens currently is that a freed block will absorb any free bytes that follow it. It would be easy to also combine a freed block with a free block that immediately follows it, but to combine with free bytes or free blocks that precede the freed block we will need additional data structures. Probably the best way to do this is to keep a table that maps the end of each free region to its beginning. Such a table would be efficient both to keep up to date and to use for coalescing when a block is freed.
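A minimal sketch of that table, assuming std::map and hypothetical names (none of this is existing VMemPool code):

    #include <cstddef>
    #include <map>

    // Maps the address just past each free region to that region's start.
    typedef std::map<char *, char *> FreeEndTable;
    static FreeEndTable freeEnds;

    // Record that [start, start + len) has been freed, coalescing with any
    // free region that ends exactly where this one starts.
    void noteFree(char *start, std::size_t len)
    {
        char *end = start + len;
        FreeEndTable::iterator prev = freeEnds.find(start);
        if (prev != freeEnds.end()) {
            start = prev->second;  // absorb the preceding free region
            freeEnds.erase(prev);
        }
        freeEnds[end] = start;
        // Coalescing with a *following* free region is the easy case noted
        // above and could be handled through the free list itself.
    }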
Dave Andrews from Intel had an interesting suggestion for how to implement better free space coalescing without adding an additional data structure: have a background thread that periodically wakes up and looks for opportunities to coalesce blocks on the free list. Such a thread would only need to coalesce each block with the free space that follows it; run for long enough, it would eventually find every free block that precedes another.
This is worth considering, as it could be a bit easier to implement. The difficult part seems to be balancing having such a thread do a useful amount of work against having it starve out threads doing real work by holding the VMemPool lock. It would have to use the existing freedMemCursor as its starting point each time it wakes up. If it did too little work before giving up, some parts of the free list might never get checked.
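A rough sketch of how such a thread's main loop might bound its work per wakeup (all names except freedMemCursor are hypothetical, and the real balance point would need tuning):

    #include <pthread.h>
    #include <unistd.h>

    extern pthread_mutex_t vmemPoolLock;  // assumed: the global VMemPool lock

    // Hypothetical helper: advance freedMemCursor by one free block, merging
    // any free space that immediately follows it; returns false once it has
    // wrapped around the whole free list.
    extern bool coalesceOneBlock();

    const unsigned kMaxBlocksPerPass = 64;  // tuning knob: progress vs. lock hold time

    void *coalescerMain(void *)
    {
        for (;;) {
            sleep(1);  // wake periodically
            pthread_mutex_lock(&vmemPoolLock);
            for (unsigned i = 0; i < kMaxBlocksPerPass; i++)
                if (!coalesceOneBlock()) break;  // covered the whole list
            pthread_mutex_unlock(&vmemPoolLock);
        }
        return 0;
    }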
I should also mention that I now have some data on allocation block size popularity. In a recent test-vesta run, the top 10 block sizes and their allocation counts were:
size : count
 16 : 412269
512 :  51693
 32 :   8342
 40 :   7205
 24 :   3739
 56 :   2665
336 :   2589
 88 :   1447
120 :   1260
376 :   1241
16 is now the minimum allocation size due to the increase in the minimum free block size. It's probably the most popular because of VDirEvaluator, but there are undoubtedly other kinds of blocks of that size as well. 512 is the size of the block VDirChangeable allocates when it needs a new block for more directory entries, so it's not surprising that it's in second place.
Before I make a decision about free list partitioning, I'm planning on gathering similar data from a production repository.
It occurred to me today that there's an even simpler solution for free block coalescing than having a background thread do it: when considering a block from the free list for an allocation request, attempt to coalesce it with any following free space. This gets the job done just in time to satisfy an allocation request and minimizes the extra work we spend on coalescing. It wouldn't be quite as good as coalescing with preceding free space when each block is freed, but it could get pretty close.
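A minimal sketch of this just-in-time approach (helper names hypothetical); the only change from a plain first-fit scan is that each candidate first absorbs any free space immediately after it:

    #include <cstddef>

    struct FreeBlock { std::size_t size; FreeBlock *next; };

    // Hypothetical helper: if free space immediately follows b, merge it
    // into b (unlinking any absorbed free block from the list) and return
    // true; return false when nothing follows.
    extern bool absorbFollowingFreeSpace(FreeBlock *b);

    void *allocate(FreeBlock *&head, std::size_t want)
    {
        for (FreeBlock **link = &head; *link; link = &(*link)->next) {
            FreeBlock *b = *link;
            while (absorbFollowingFreeSpace(b))
                ;  // grow this candidate as far as possible
            if (b->size >= want) {
                *link = b->next;
                return b;
            }
        }
        return 0;
    }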
The free list handling is much improved with the changes I just checked in as:
/vesta/vestasys.org/vesta/repos/89
However, data from production repositories at Intel still shows a noticeable number of blocks from the free list being considered and rejected per allocation. This suggests that we could still squeeze out more performance with dynamic free list partitioning rather than the static partitioning we're using now.
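A sketch of one possible dynamic scheme, assuming one exact-size free list created on demand (all names hypothetical; this isn't a decided design): std::map::lower_bound finds the smallest size able to satisfy a request, so no too-small block is ever examined.

    #include <cstddef>
    #include <map>

    struct FreeBlock { FreeBlock *next; };

    // One free list per exact block size, created on demand.
    typedef std::map<std::size_t, FreeBlock *> SizeLists;
    static SizeLists freeBySize;

    void *allocate(std::size_t want)
    {
        // Smallest size >= want; every block on that list fits.
        SizeLists::iterator it = freeBySize.lower_bound(want);
        if (it == freeBySize.end()) return 0;  // grow the pool
        FreeBlock *b = it->second;
        it->second = b->next;
        if (it->second == 0) freeBySize.erase(it);  // drop empty lists
        return b;  // (taking a larger size wastes the remainder; splitting omitted)
    }

    void deallocate(void *p, std::size_t size)
    {
        FreeBlock *b = static_cast<FreeBlock *>(p);
        FreeBlock *&head = freeBySize[size];  // creates the list if absent
        b->next = head;
        head = b;
    }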
For this reason, I'm going to leave this tracker entry open until we can get that implemented.