Isembard-OS Blog

A tiny ARM kernel with a novel OO threading model

Brought to you by: simonwillcocks

Speed limits

After making my previous blog entry, I realised that 250k system calls per second wasn't really acceptable, so I set out to find any bottlenecks in the process.

My biggest worry was that using the data abort mechanism to enter the kernel was flawed, and that the traditional SWI mechanism was significantly faster.

What I did find was that reading the abort address from coprocessor register 15 is extremely slow, taking the equivalent of over 30 normal instructions. I had assumed (hoped) that it would be almost like reading a processor register or, at worst, like reading a word from memory. It has to be read, however, so I modified the kernel to store the value in memory and retrieve it as necessary.

At the moment, I have a small suite of tests that run either from the UART driver map, or another, client, map. They are designed to run for over a second each (usually significantly more) by repeating the test over a million times and then output a single character to the serial port. A simple program at the other end of the serial port measures and displays the time taken between characters to centisecond accuracy. Each test is run several times, to check for inconsistency.

The results from yesterday are as follows:

loop_count=0x1000000 = 16,777,216

// Test A: Run two instruction loop 10*loop_count times. (5 attempts.)

A 96 => 172960980 loops per second
A 101 => 164482500 loops per second
A 100 => 166111040 loops per second
A 101 => 164482500 loops per second
A 101 => 164482500 loops per second

// Test B: Enter kernel with data abort loop_count times. (5 attempts.)
// Test code in the kernel returns as soon as possible after finding abort
// address is 0.
// Checks abort is from system or user mode, reads and stores Data Fault Address
// Register, then compares address with zero and returns if equal.
// Also checks r12 for magic number (see test b, below).

B 944 => 1775366 loops per second
B 944 => 1775366 loops per second
B 944 => 1775366 loops per second
B 944 => 1775366 loops per second
B 944 => 1775366 loops per second

// Test B1: Enter kernel with data abort loop_count times. (5 attempts.)
// Test code in the kernel returns as soon as possible, without checking the abort address.
// Compares r12 against an unusual number, 0x002a0000, returns if equal, before other checks.

B 682 => 2456400 loops per second
B 682 => 2456400 loops per second
B 682 => 2456400 loops per second
B 682 => 2456400 loops per second
B 681 => 2460002 loops per second

// Test B2: Enter kernel with data abort loop_count times. (5 attempts.)
// Test code in the kernel returns as soon as possible, without checking the abort address.
// Compares r12 against an unusual number, 0x00002a00, returns if equal, after SPSR check.

B 737 => 2273335 loops per second
B 736 => 2276420 loops per second
B 736 => 2276420 loops per second
B 737 => 2273335 loops per second
B 736 => 2276420 loops per second

// Test C: Set SWI handler with immediate return routine, perform loop_count swis.
// Note: Including a LDR r13, [r14, #-4], to read the SWI instruction has no noticable
// effect.

C 910 => 1841626 loops per second
C 910 => 1841626 loops per second
C 910 => 1841626 loops per second
C 910 => 1841626 loops per second
C 911 => 1839607 loops per second

// Test D: Call isembard_object_is_a_string on a remote object (that isn't a string)
// loop_count times.

D 4016 => 417655 loops per second
D 4016 => 417655 loops per second
D 4016 => 417655 loops per second
D 4016 => 417655 loops per second

From these results, I conclude that:

There is an upper limit on system calls using either SWI or data abort to enter the kernel code of under 2.5 million calls per second, see Test B1 and Test C (running at 300MIPS, according to the control loop in Test A)
The practical upper limit is less than 1.8 million calls, based on Test C and Test B

While I was writing this, I got the inter-map build to work (it turned out to be a bug in the build script that meant that drivers larger than the uart driver wouldn't be fully copied into the image).

Here are the results:

Inter-map version (third results only, since variation within a test is at most 1cs):

A 67 => 246723760 loops per second
B 736 => 2276420 loops per second
B 465 => 3600260 loops per second
B 519 => 3226387 loops per second
C 911 => 1839607 loops per second
D 6426 => 261042 loops per second

That is extremely odd. Repeatable, too.

Rebuilding the intra-map version results in the same (slower) timings as
earlier; the re-build generates exactly the same kernel, just the drivers'
code changes:

A 101 => 164482500 loops per second
B 944 => 1775366 loops per second
B 682 => 2456400 loops per second
B 736 => 2276420 loops per second
C 910 => 1841626 loops per second
D 4016 => 417655 loops per second

Even the control loop speed has changed! Presumably, this has to do with buffer/cache flags in the L2TT (or, somehow, a TLB?) being wrong.

Posted by 2013-04-11 Labels: timings