On Sat, 16 Jun 2001, Bill Hartner wrote:
> So (26.6% utilization) + (18.6 % CPU * 8 cpus spinning) = 1.75 CPUs
> spinning or holding this lock. Needless to say, this is substantial
> on an 8-way. Looking at kmap_high(), kunmap_high(), map_new_virtual(),
> and flush_all_zero_pkmaps() - there is room for improvement - the code
> searches the 1024 slots for a free virtual_addr .... cache effects on
> SMP ....
I've been doing some more work on the kmap_lock.
The first part isn't very interesting - it modifies highmem.c to use a
freelist linkage. Under "normal" usage patterns, I wouldn't expect this
to give much of an improvement.
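To make the idea concrete, here's a minimal sketch of a freelist for the
pkmap slots (illustration only - the attached patch is the real thing,
and the names below are made up; everything is assumed to run under
kmap_lock):

#define LAST_PKMAP 1024

static int pkmap_free_head;             /* first free slot, -1 if none */
static int pkmap_free_next[LAST_PKMAP]; /* next-free-slot linkage */

static void pkmap_freelist_init(void)
{
        int i;

        for (i = 0; i < LAST_PKMAP - 1; i++)
                pkmap_free_next[i] = i + 1;
        pkmap_free_next[LAST_PKMAP - 1] = -1;
}

/* O(1) allocation, instead of scanning up to 1024 slots. */
static int pkmap_alloc_slot(void)
{
        int slot = pkmap_free_head;

        if (slot >= 0)
                pkmap_free_head = pkmap_free_next[slot];
        return slot;    /* -1 means all slots are in use */
}

static void pkmap_free_slot(int slot)
{
        pkmap_free_next[slot] = pkmap_free_head;
        pkmap_free_head = slot;
}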
The second part is more interesting. :)
Currently, I don't have access to an 8-way box (only a 4-way), but I'm
guessing the kmap_lock contention is caused by the flush_tlb_all() which
is done with the kmap_lock held.
Now, flush_tlb_all() sends an IPI to all engines (processors) and
_waits_ for them all to perform the shootdown.
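For reference, the stock 2.4 i386 code is roughly this (quoting from
memory, so take it as a sketch - the exact local-flush helper may differ):

static void flush_tlb_all_ipi(void *info)
{
        __flush_tlb_all();
}

void flush_tlb_all(void)
{
        /* wait == 1: spin until every engine has run the flush */
        smp_call_function(flush_tlb_all_ipi, 0, 1, 1);
        __flush_tlb_all();
}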
If any of the engines have interrupts blocked, then flush_tlb_all() busy-
waits until the interrupt (IPI) is delivered and processed. This can be a
significant number of CPU cycles - especially as locks which require
interrupts to be blocked spin with interrupts disabled (imagine such a
lock under contention - it can be quite some time until the last contender
leaves the critical region and re-enables interrupts).
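To spell that out (some_lock is just an example name):

unsigned long flags;

/* Each waiter spins here with interrupts disabled, so a pending
 * shootdown IPI sits undelivered until the lock is acquired, the
 * critical section runs, and the unlock re-enables interrupts. */
spin_lock_irqsave(&some_lock, flags);
/* ... critical section ... */
spin_unlock_irqrestore(&some_lock, flags);

The flush_tlb_all() sender spins for that entire window.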
Analysing the usage of flush_tlb_all() shows that it does not need to
busy-wait - it can simply send the "shootdown" IPI and continue; it
doesn't even need to busy-wait for an ack from the other engines.
i.e. flush_tlb_all() can become asynchronous.
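In sketch form (illustration only - send_IPI_allbutself() and
INVALIDATE_TLB_VECTOR are borrowed from arch/i386 for the example;
the real change is in the attached patch):

void flush_tlb_all(void)
{
        /* Fire the shootdown IPI and return - no waiting for
         * delivery, no waiting for an ack from the other engines. */
        send_IPI_allbutself(INVALIDATE_TLB_VECTOR);
        __flush_tlb_all();      /* flush this engine now */
}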
For example, think about the flush_tlb_all() for highmem. New mappings
cannot be created with interrupts disabled, nor from within an interrupt
handler - else the original, synchronous flush_tlb_all() could deadlock
the system: the engine holding kmap_lock would spin waiting for an IPI
that an engine spinning on kmap_lock with interrupts disabled can never
take. The same holds for "dropping" a mapping, and for gaining a
reference to an existing mapping.
In fact, an engine's TLB doesn't need to be flushed until its next call
to kmap_high() or until a context-switch occurs on the engine. But as the
highmem TLB entries are marked "global", they survive the CR3 reload at
context-switch, so we'd need to add an extra test in schedule() - which
I'd rather stay away from.
So, instead, we can wait for the flush on an engine to occur when it
enables interrupts. This works for the highmem case, and for other uses of
flush_tlb_all(). At least, I believe it does - can anyone find an
existing case where it doesn't?
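The receiving side then needs no ack path at all; something like this
(handler name made up for the example):

asmlinkage void smp_flush_tlb_all_interrupt(void)
{
        ack_APIC_irq();         /* APIC-level ack only, no sender sync */
        __flush_tlb_all();
}

If the target engine has interrupts blocked, the IPI simply stays pending
and the flush happens the moment that engine re-enables interrupts -
which is exactly the deferral described above.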
Ingo, you know the relevant code better than anyone else. Is the idea
sound?
Does this sound fragile? Yes, it is, but if it improves scalability and
is well documented, then it is worth doing.
I've attached a patch against 2.4.5. The original code was pulled from
a highly modified tree, but I don't think I've made any mistakes...