Some early timings

Now that I've got two programs running in separate maps, one a simple UART (serial port) driver that uses busy-waiting and the other a program that uses the driver's services, I thought I'd try to work out how long a simple inter-map call takes. Since there are no timers running yet, the simplest way of timing the calls is to do lots of them in a loop, output a character after every million or so, and time how often the characters appear.
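As a rough sketch of the arithmetic behind that approach (not code that runs on the target), the call rate falls out of the spacing of the output characters. The function name, the timestamps and the figure of 2^20 calls per character are all made up for illustration; on the real system the timing was done by watching the serial output.

```python
def calls_per_second(calls_per_char, char_times):
    """Estimate the call rate from the arrival times (in seconds) of the
    characters written to the serial port, one per batch of calls."""
    if len(char_times) < 2:
        raise ValueError("need at least two character timestamps")
    intervals = len(char_times) - 1            # batches fully timed
    elapsed = char_times[-1] - char_times[0]   # seconds between first and last
    return calls_per_char * intervals / elapsed

# Hypothetical numbers: 2**20 calls per character, one character every 0.5 s.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
rate = calls_per_second(1 << 20, times)
print(f"{rate:.0f} calls per second")
```

Timing over several characters rather than one averages out the jitter of starting and stopping a stopwatch by hand.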

As a control, I also perform the following loop ten times, writing a character to the serial port each time:

    MOV r0, #0x100000
0:  SUBS r0, r0, #1
    BNE 0b

The number of times around the inner loop is 1048576 (0x100000), chosen because it takes a measurable amount of time to complete (about 220 ms, in initial tests). The control loop is included in case I find better combinations of cache flags that affect the execution time.

The inter-map call I make is to ask the service object I got from the UART driver if it is a string or not (it isn't). I chose this call because, apart from the decoding of the feature code to get to the call, the routine returns after just two instructions.

    MOV r0, #0
    MOV pc, #-16 ; Non-exceptional return from inter-map call

Thus, the bulk of the time taken in the loop will be used in the kernel.

The first results showed approximately 117000 calls per second, with the control loop managing approximately 4560000 loops per second. At two instructions per loop, that is only about 9 MIPS. Clearly, that's not right! (Update: see below.)
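For anyone wanting to check the sums, here is a small sketch of how those figures relate. The two rates are the ones quoted above. The instruction count of two per loop (SUBS and BNE) comes from the control loop listing. The "instructions per call" figure is only an equivalence at the control loop's rate, not a count of instructions the kernel actually executes.

```python
LOOPS_PER_SECOND = 4_560_000   # control loop rate from the first results
INSTRUCTIONS_PER_LOOP = 2      # SUBS + BNE
CALLS_PER_SECOND = 117_000     # inter-map call rate from the first results

instructions_per_second = LOOPS_PER_SECOND * INSTRUCTIONS_PER_LOOP
mips = instructions_per_second / 1e6

# Time spent on one inter-map call, in microseconds.
us_per_call = 1e6 / CALLS_PER_SECOND

# How many control-loop instructions fit into the time of one call.
equivalent_instructions = instructions_per_second / CALLS_PER_SECOND

print(f"{mips:.2f} MIPS")
print(f"{us_per_call:.2f} us per call")
print(f"about {equivalent_instructions:.0f} control-loop instructions per call")
```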

The approach taken in this kernel was to clear the non-kernel entries in the translation table on map changes, so that a hardware translation table walk would only encounter invalid entries. This seemed like it might be too much work on every map change, so I modified the kernel to change the domain permissions instead, with one domain for the kernel and another for user code. The idea was to disable the user code domain on map changes, to force an abort on hardware table lookups (at which point I would clear the table, populate it for the requested address, and enable the user domain again). Unfortunately, I had missed the fact that the TLB matches also check the domain permissions:

ARM-ARM, Page B4-4:

If a matching TLB entry is found then the information it contains is used as follows:
1. The access permission bits and the domain are used to determine whether access is permitted. If the access is not permitted the MMU signals a memory abort. Otherwise the access is allowed to proceed.

As a result of this, the inter-map call rate more than halved, from about 117000 to about 52000 calls per second. I can't see an easy way around this, so I will be shelving the domain permissions approach for the foreseeable future.
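To make the domain idea concrete, here is a sketch of the two Domain Access Control Register values involved. The register holds a two-bit field for each of 16 domains (00 for no access, 01 for client, 11 for manager). The domain numbers below (0 for the kernel, 1 for user code) and the choice of client access are assumptions for illustration, not necessarily what the kernel used.

```python
# Two-bit access field encodings in the ARM Domain Access Control Register.
NO_ACCESS, CLIENT, MANAGER = 0b00, 0b01, 0b11

# Hypothetical domain numbering, for illustration only.
KERNEL_DOMAIN, USER_DOMAIN = 0, 1

def dacr_value(domain_access):
    """Pack a {domain number: access field} mapping into a 32-bit DACR value."""
    value = 0
    for domain, access in domain_access.items():
        if not 0 <= domain < 16:
            raise ValueError("domain number must be in the range 0-15")
        value |= access << (2 * domain)
    return value

# Normal running: both domains checked against the page permissions.
user_enabled = dacr_value({KERNEL_DOMAIN: CLIENT, USER_DOMAIN: CLIENT})

# After a map change: every access through the user domain aborts,
# including accesses whose translation is already in the TLB.
user_disabled = dacr_value({KERNEL_DOMAIN: CLIENT, USER_DOMAIN: NO_ACCESS})

print(hex(user_enabled), hex(user_disabled))
```

Switching between the two values is a single register write, which is what made the scheme look attractive. The cost shows up afterwards, because every user address touched after a map change takes an abort, even when its translation is still in the TLB.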

The 9 MIPS thing, though, needs some looking into!

Update: Enabling the data cache (it was disabled on entry to Isembard-OS) increases the call rate to 183000 per second, with no change to the control loop time. However, the real problem was that the RAM for the drivers was not cached, because it was in SRAM rather than SDRAM. A quick hack to fix that (on top of the existing quick hack, which caches all SDRAM but no other physical addresses) takes the call rate to 250000 calls per second, and speeds up the control loop approximately 30-fold. I think that will probably do, for now.
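Putting the before and after figures side by side gives a feel for the improvement. This is only a sketch using the rates quoted in this post. The "30-fold" figure is approximate, so the resulting control loop rate is a ballpark, not a measurement.

```python
BASELINE_CALLS = 117_000      # first results
DCACHE_CALLS = 183_000        # data cache enabled
SRAM_CACHED_CALLS = 250_000   # driver SRAM cached as well

BASELINE_LOOPS = 4_560_000    # control loop, first results
CONTROL_SPEEDUP = 30          # "approximately 30-fold"

dcache_speedup = DCACHE_CALLS / BASELINE_CALLS
total_speedup = SRAM_CACHED_CALLS / BASELINE_CALLS
us_per_call = 1e6 / SRAM_CACHED_CALLS

# Two instructions (SUBS, BNE) per control loop iteration.
control_mips = BASELINE_LOOPS * CONTROL_SPEEDUP * 2 / 1e6

print(f"data cache alone: x{dcache_speedup:.2f}")
print(f"with SRAM cached: x{total_speedup:.2f}, {us_per_call:.1f} us per call")
print(f"control loop: roughly {control_mips:.0f} MIPS")
```

The control loop gains far more than the call rate does, which suggests the inter-map call is still dominated by something other than instruction fetch speed, although nothing measured here says what.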

Posted by Simon Willcocks 2013-04-03