Notes:
(This file is in the patch at linux/Documentation/i386/no_hz_tick.txt)
The no HZ tick system is an experimental system designed to test if it
is reasonable to eliminate the HZ tick.
Building a test system:
A. Apply the patch.
B. Read this document.
C. Configure the system however you like to do it. The following
options are provided:
Under "processor type and features":
Configure tick less system
Configure both ticked and tickless system
Do timepeg stats on tick less system
Under "Kernel Hacking":
Timepeg instrumentation
Interrupt overhead test
Interrupt latency instrumentation
System call timing
Say yes to all except "Interrupt latency instrumentation" and "System
call timing". These are features of the time peg patch that we don't
want to use here.
When you configure both the ticked and tickless system you will be able
to switch from one to the other by:
echo 0 >/proc/sys/kernel/use_no_tick_timers (to get ticks or)
echo 1 >/proc/sys/kernel/use_no_tick_timers (to get no ticks)
The three time peg options turn on instrumentation that will allow you
to see how well the system does with each option.
D. cd ../Documentation/i386/tpt (cd into the kernel tree)
make
This step makes the tpt time peg program that reads the time pegs and
(in our case) pipes them into awk. You can either move "tpt" and
"format.awk" to a convenient directory or execute them from here.
E. Install the new system and boot it.
F. Run the tpt/awk instrumentation package:
dir/tpt -s | awk -f dir/format.awk
where "dir" is ../Documentation/i386/tpt, or wherever you put tpt and
the format.awk script.
The first run turns on the time peg package. Each additional run will
provide useful (we hope) data.
G. Run a load (or several) of interest on the system and, while it is
running, or before and after, run the instrumentation code as in (F)
above. Do this both for ticked and tick less.
H. Look at the numbers, form an opinion and/or question and send it to
the l-k mailing list (linux-kernel@vger.kernel.org) and/or to me
(george@mvista.com).
Background:
The HZ tick is used for the following things in the Linux kernel:
Keeping time: the jiffie counter is bumped each tick and, when a second
has elapsed, the wall clock function is called.
Expiring timers requested by other kernel and user code.
Calculating system load (actually done every 5 seconds).
Arming the "tq_timer" tasklet list.
In addition there are "accounting" tasks. These are tasks that are
directed at the current executing task, and in SMP x86 systems are, in
fact, triggered by interrupts from the cpu running the task. These tasks
include:
Time slice count down.
Tracking process times, user and system times for each task.
The profile timer.
The virtual task timer.
The execution limit timer.
The approach:
The first task was to pry the jiffie counter away from the HZ tick.
This was done by making jiffie a function which reads the "tsc" and,
from that, calculates the current jiffie. (Actually, it calculates the
change in jiffies since the last time.) A patch from IBM was used to
rename those items named "jiffie" in the system that were not referring
to the counter, so a "#define" can redefine jiffie to a function.
This patch also provided code to scan the timer lists to find the next
timer for which to schedule an interrupt.
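The timer scan mentioned above can be pictured with the following
user-space sketch. It is not the code from the patch: the kernel keeps
its timers in a more elaborate structure, while this sketch assumes a
flat linked list. The names sketch_timer, before, next_timer and
jiffies_until are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified timer: a flat singly linked list. */
struct sketch_timer {
    unsigned long expires;        /* absolute expiry time, in jiffies */
    struct sketch_timer *next;
};

/*
 * Wrap-safe "a is before b". Relies on the usual two's complement
 * conversion from unsigned long to long.
 */
static int before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/* Return the timer that expires first, or NULL if the list is empty. */
static struct sketch_timer *next_timer(struct sketch_timer *head)
{
    struct sketch_timer *best = head;

    for (struct sketch_timer *t = head; t != NULL; t = t->next)
        if (before(t->expires, best->expires))
            best = t;
    return best;
}

/* Jiffies from "now" until the timer is due; 0 if it is already due. */
static unsigned long jiffies_until(const struct sketch_timer *t,
                                   unsigned long now)
{
    assert(t != NULL);
    return before(now, t->expires) ? t->expires - now : 0;
}
```

The value returned by jiffies_until() is what would be used to decide
how far ahead to schedule the next interrupt.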
The gettimeofday code was rewritten to use the jiffie and to call
update_time if a second had elapsed.
Several places in the system pick up the time of day to one second
resolution by just loading the wall clock second value. In order to not
turn this into a function also, it was decided to provide a second timer
and to update the wall clock at least every second.
Accounting:
Schedule was changed to start a timer for the time slice each time a
task switch occurs. To cut down on overhead, the other accounting tasks
are run only when a task blocks or the slice timer expires. This means
that a task can exceed its profile, virtual, or execution limit by up to
a time slice.
A task's "system tsc" time is kept in a couple of new elements of the
task structure. On each interrupt or system call (i.e. system entry) the
current "tsc" is saved in the task structure. On system exit or task
switch, the "tsc" is again read and the elapsed "tsc" count is added to
the task's "system tsc" time. When the task is switched out, the elapsed
time is divided up into system and user time by converting this "system
tsc" count to jiffies.
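The bookkeeping described above might look roughly like the following
user-space sketch. It is an illustration under stated assumptions, not
the patch itself: the structure and function names (task_acct,
acct_switch_in and so on) are hypothetical, the "tsc" value is passed in
as a parameter instead of being read from the hardware, the task is
treated as starting outside the system when it is switched in, and the
conversion to jiffies uses a plain divide by an assumed
cycles_per_jiffy for clarity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task accounting fields. */
struct task_acct {
    uint64_t switch_in_tsc;  /* tsc when the task was switched in */
    uint64_t entry_tsc;      /* tsc at the most recent system entry */
    uint64_t system_tsc;     /* tsc cycles spent in the system this run */
    int in_system;           /* nonzero between system entry and exit */
    uint64_t utime;          /* accumulated user time, in jiffies */
    uint64_t stime;          /* accumulated system time, in jiffies */
};

static void acct_switch_in(struct task_acct *t, uint64_t now)
{
    t->switch_in_tsc = now;
    t->system_tsc = 0;
    t->in_system = 0;
}

/* Interrupt or system call: remember when we entered the system. */
static void acct_system_entry(struct task_acct *t, uint64_t now)
{
    t->entry_tsc = now;
    t->in_system = 1;
}

/* Return to user code: add the elapsed cycles to the system total. */
static void acct_system_exit(struct task_acct *t, uint64_t now)
{
    t->system_tsc += now - t->entry_tsc;
    t->in_system = 0;
}

/* Task switch: close any open system interval, then split the run. */
static void acct_switch_out(struct task_acct *t, uint64_t now,
                            uint64_t cycles_per_jiffy)
{
    assert(cycles_per_jiffy > 0);
    if (t->in_system) {
        t->system_tsc += now - t->entry_tsc;
        t->in_system = 0;
    }
    /* Truncating divides: sub-jiffy remainders are dropped here. */
    uint64_t total = (now - t->switch_in_tsc) / cycles_per_jiffy;
    uint64_t sys = t->system_tsc / cycles_per_jiffy;
    if (sys > total)
        sys = total;
    t->stime += sys;
    t->utime += total - sys;
}
```

Because the split into user and system time happens only in
acct_switch_out(), a task that reads its own numbers while running sees
the totals as of its last switch.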
Things not done:
The task accounting numbers are not updated when a task asks for them,
thus it gets stale numbers. This is only an issue if the task asks for
its own numbers. If another task asks, the task being examined must not
be running and so its numbers will be right (as long as we don't do SMP).
SMP is not addressed.
Non "tsc" systems (and non x86 systems) are not addressed.
Delay was not changed. Delay is calibrated using the jiffie, which is
now a function, which will introduce inaccuracy in the number. It
should be timed using a timer.
Limitations in the hardware:
The x86 hardware clock is provided by a 16 bit counter that is clocked
at 1.193180 MHz. This means that the longest time the timer can be set
for is 54.924 ms, i.e. the best we can hope for is to reduce the
interrupt rate by a factor of ~5. When longer times are requested the
maximum time is programmed and a "dry" interrupt is taken.
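The arithmetic behind these figures can be checked with a small sketch.
It is not taken from the patch. The names PIT_HZ, PIT_MAX_COUNT and the
pit_* functions are hypothetical, and the maximum count is assumed to be
65535 because that value reproduces the 54.924 ms figure. The "~5"
follows if the ordinary tick is assumed to be 10 ms (HZ=100), since
54.924 / 10 is about 5.5.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_HZ        1193180u  /* counter clock from the text, in Hz */
#define PIT_MAX_COUNT 65535u    /* assumed largest 16 bit count */

/* Longest programmable interval, in microseconds (truncated). */
static uint32_t pit_max_usec(void)
{
    return (uint32_t)(((uint64_t)PIT_MAX_COUNT * 1000000u) / PIT_HZ);
}

/* Count to program for a requested delay, clamped to the maximum. */
static uint32_t pit_count_for_usec(uint32_t usec)
{
    assert(usec > 0);
    uint64_t count = ((uint64_t)usec * PIT_HZ) / 1000000u;

    if (count < 1)
        count = 1;
    if (count > PIT_MAX_COUNT)
        count = PIT_MAX_COUNT;
    return (uint32_t)count;
}

/*
 * Interrupts needed to cover a delay when each one can span at most
 * PIT_MAX_COUNT counts. All but the last are "dry" interrupts.
 */
static uint32_t pit_interrupts_for_usec(uint32_t usec)
{
    assert(usec > 0);
    uint64_t count = ((uint64_t)usec * PIT_HZ) / 1000000u;

    if (count < 1)
        count = 1;
    return (uint32_t)((count + PIT_MAX_COUNT - 1) / PIT_MAX_COUNT);
}
```

For example, a 100 ms request does not fit in one interval, so the
sketch reports two interrupts, the first of which is dry.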
Accuracy:
Care was taken to not only keep the correct time, but to do so
quickly. No divides are used in any of the calculations, save the boot
time set up.
The jiffie calculation keeps 32 bits of "sub jiffie" time and accumulates
them on each jiffie request.
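One way to get this behaviour, shown here only as a sketch and not as
the patch's actual code, is to compute a fixed point scale factor once
(the single divide, at set up time) and then convert each "tsc" delta
with a multiply and a shift. The names jiffy_clock, jiffy_clock_init,
jiffy_clock_read and SCALE_SHIFT are hypothetical, and the "tsc" value
is passed in as a parameter. The sketch assumes that fewer than about
4096 jiffies elapse between reads, so the multiply cannot overflow
64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 20  /* extra precision bits kept in the scale */

struct jiffy_clock {
    uint64_t last_tsc;   /* tsc at the previous request */
    uint64_t scale;      /* 2^(32 + SCALE_SHIFT) / cycles_per_jiffy */
    uint32_t sub_jiffy;  /* fraction of a jiffy, in units of 2^-32 */
    uint64_t jiffies;    /* whole jiffies accumulated so far */
};

/* Set up: the only place a divide is used. */
static void jiffy_clock_init(struct jiffy_clock *c, uint64_t tsc_now,
                             uint64_t cycles_per_jiffy)
{
    assert(cycles_per_jiffy > 0);
    c->last_tsc = tsc_now;
    c->scale = (UINT64_C(1) << (32 + SCALE_SHIFT)) / cycles_per_jiffy;
    c->sub_jiffy = 0;
    c->jiffies = 0;
}

/* Each request: multiply, shift, and carry the sub jiffie remainder. */
static uint64_t jiffy_clock_read(struct jiffy_clock *c, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - c->last_tsc;
    /* Elapsed jiffies as a 32.32 fixed point number. */
    uint64_t fp = (delta * c->scale) >> SCALE_SHIFT;

    fp += c->sub_jiffy;              /* add the fraction carried over */
    c->jiffies += fp >> 32;          /* whole jiffies */
    c->sub_jiffy = (uint32_t)fp;     /* keep the new fraction */
    c->last_tsc = tsc_now;
    return c->jiffies;
}
```

Carrying the fraction forward means that two reads half a jiffy apart
still add up to one full jiffy. When cycles_per_jiffy is not a power of
two the truncated scale undercounts slightly.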
Instrumentation:
The timer functions are instrumented by Andrew Morton's time peg code.
A few minor changes were made to this code to allow percentage
calculations and to measure interrupt overhead. In addition an awk
script was written to format and add the percentage calculations. The
part that is outside the kernel is (after applying the patch) in
"../Documentation/i386/tpt/".
In addition, code was added to switch between tick less and ticked using
the proc interface. The switch is:
/proc/sys/kernel/use_no_tick_timers
The system will come up with this set to 1 and "tick less". To go back
to a ticked system just:
echo 0 >/proc/sys/kernel/use_no_tick_timers
(Usually you must be root to do this.)
To grab the time peg data and format it do:
/usr/src/linux/Documentation/i386/tpt/tpt -s | awk -f \
/usr/src/linux/Documentation/i386/tpt/format.awk
The number columns are: count, min time, max time, average time, total
time. Times are in microseconds.
The time peg labels are, for the most part, the function names. The
exceptions are:
no_hz_stats: interrupt:code is the full timer interrupt time (excluding
the softirq stuff, i.e. run_timer_list).
no_hz_stats: interrupt:overhead is derived by the awk script using the
time peg measured asm("int %x"), which is expanded by awk to the number
of interrupts seen. See ../Documentation/i386/tpt/tpt.txt for more on
this.
no_hz_stats: one_sec_tick is a new function that is timer driven to
update the wall clock each second.
no_hz_stats: setup is the accounting code in schedule that starts a
timer for the time slice.
no_hz_stats: end is the schedule code that deletes the above timer.
no_hz_stats: timer_fn is the schedule function that gets called by the
above timer (and by schedule if it deletes a timer).
no_hz_stats_outlaw: This group contains times of interest that are part
of larger times already accounted for. next_timer is the function that
finds the next timer for which to set up the interrupt.
This work was done by George Anzinger (george@mvista.com).