Python/CRASH API Wiki

Brought to you by: alexsid, mooremar

Algorithms For Automated First Pass Dump-analysis

Please feel free to add your own suggestions or improvements. Each entry should contain:

a brief description of what we are trying to do
an explanation why this could be useful
the algorithm itself

The general idea: we run 'crash' commands and Python scripts that are already available and produce nicely formatted, easy-to-read output. In addition, we analyze the obtained data and, if needed, print warnings about any unusual or suspicious values. All these warnings should have easy-to-grep markup, e.g.

+++WARNING+++ high load average

Some algorithms can be very time-consuming, so it makes sense to implement a fast check as the default and run a thorough check only when it is explicitly requested.
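A minimal sketch of these two conventions, the grep-friendly warning markup and the fast/thorough split. The helper names `warn` and `run_checks` and the `(name, func, is_slow)` tuple layout are assumptions made for illustration; they are not part of any existing API.

```python
import sys

WARNING_TAG = "+++WARNING+++"


def warn(message, out=sys.stdout):
    """Print a warning with the grep-friendly markup and return the line."""
    line = f"{WARNING_TAG} {message}"
    print(line, file=out)
    return line


def run_checks(checks, thorough=False):
    """Run (name, func, is_slow) entries; skip slow ones unless thorough is set.

    Returns the names of the checks that were actually executed.
    """
    executed = []
    for name, func, is_slow in checks:
        if is_slow and not thorough:
            continue
        func()
        executed.append(name)
    return executed
```

With this layout the default run stays fast, and something like `run_checks(checks, thorough=True)` enables the expensive checks on request.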

Checking memory

What: check whether there is an evident memory shortage, print the output of 'kmem -i', check whether enough swap is free
Why: quite often the system appears to be hanging if memory is low or swapping heavily
How: run 'kmem -i', print the output. Parse the output and check whether certain thresholds are reached (e.g. 80% of swap)
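The parsing step could look roughly like the sketch below. It works on the text of 'kmem -i' and assumes rows of the form `LABEL  PAGES  SIZE  NN% of TOTAL ...` with labels such as `FREE` and `SWAP USED`; the exact layout may differ between crash versions. The function names and the 5% free-memory threshold are illustrative assumptions; only the 80% swap figure comes from the text above.

```python
import re

# Matches rows such as "    SWAP USED    81978     320.2 MB    3% of TOTAL SWAP"
# (assumed layout of 'kmem -i' output).
ROW_RE = re.compile(r"^\s*([A-Z][A-Z ]*?)\s+(\d+)\s+.*?(\d+)%\s+of\s+TOTAL")


def parse_kmem_i(text):
    """Return {row label: (pages, percentage)} for rows that carry a percentage."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return rows


def memory_warnings(text, swap_used_limit=80, free_mem_limit=5):
    """Return warning lines when the (hypothetical) thresholds are reached."""
    rows = parse_kmem_i(text)
    warnings = []
    if "SWAP USED" in rows and rows["SWAP USED"][1] >= swap_used_limit:
        warnings.append(f"+++WARNING+++ {rows['SWAP USED'][1]}% of swap is used")
    if "FREE" in rows and rows["FREE"][1] <= free_mem_limit:
        warnings.append(f"+++WARNING+++ only {rows['FREE'][1]}% of memory is free")
    return warnings
```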

Checking load averages

What: check whether load averages are high
Why: high load averages may make the system appear to hang
How: run the 'sys' command, parse the output and analyze the values
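As a sketch, assuming the 'sys' output contains lines like `CPUS: 4` and `LOAD AVERAGE: 0.00, 0.01, 0.05`, the values can be extracted with two regular expressions. Comparing the load with the CPU count and the limit of 2.0 per CPU are assumptions chosen for illustration, not values from this page.

```python
import re


def parse_load_averages(sys_output):
    """Return the (1min, 5min, 15min) load averages, or None if not found."""
    m = re.search(r"LOAD AVERAGE:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", sys_output)
    if not m:
        return None
    return tuple(float(x) for x in m.groups())


def load_warning(sys_output, per_cpu_limit=2.0):
    """Return a warning line if the highest load per CPU exceeds the limit."""
    loads = parse_load_averages(sys_output)
    cpus = re.search(r"CPUS:\s*(\d+)", sys_output)
    if loads is None or cpus is None:
        return None
    ncpu = int(cpus.group(1))
    if ncpu > 0 and max(loads) / ncpu > per_cpu_limit:
        return f"+++WARNING+++ high load average {loads} on {ncpu} CPUs"
    return None
```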

Checking the number of uninterruptible processes

What: if load averages are high, check both nr_running() and nr_uninterruptible() numbers
Why: if we have disk (or NFS) problems, we can have a huge number of 'df' processes in the uninterruptible state, which increases the load even though they do not use CPU
How: check the output of the 'ps' and 'ps -l' commands
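One way to get the counts from the 'ps' output is sketched below. It assumes the column order `PID PPID CPU TASK ST %MEM VSZ RSS COMM`, with active tasks marked by a leading `>` and states abbreviated as `RU` (running) and `UN` (uninterruptible); the function names and the limit of 50 are hypothetical.

```python
from collections import Counter


def count_task_states(ps_output):
    """Count tasks per state code (the ST column) in 'ps' output."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Drop the ">" marker that flags tasks active on a CPU.
        fields = line.lstrip().lstrip(">").split()
        # Data rows: PID PPID CPU TASK ST %MEM VSZ RSS COMM
        if len(fields) >= 9 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


def uninterruptible_warning(counts, limit=50):
    """Return a warning line if too many tasks are uninterruptible."""
    if counts["UN"] > limit:
        return f"+++WARNING+++ {counts['UN']} tasks in uninterruptible state"
    return None
```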

Checking the type of the dump and how it was produced

What: check whether this was a panic or a hang, what has triggered dump creation
Why: I remember a case where a customer told us that it was a panic and that he had not touched the console at all, but I was able to see that keyboard sysrq had been used after all
How: check the panic string as reported by 'sys'. Check the stacks (e.g. 'bt -a') and see whether we see specific patterns.

For example:

 PID: 27598  TASK: e15a2000  CPU: 3   COMMAND: "container"
 #0 [e15a3d8c] sysrq_handle_crash at c01d0c60
 #1 [e15a3d88] __handle_sysrq_nolock at c01d12e8
 #2 [e15a3da8] handle_sysrq at c01d1248
 #3 [e15a3dcc] handle_scancode at c01ced86
 #4 [e15a3df4] handle_kbd_event at c01cff68

shows that the dump was created by a sysrq from the keyboard.
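The pattern search could be sketched as follows: split the 'bt -a' text into per-task stacks and look for `sysrq_handle_crash`, then for the keyboard handlers seen in the trace above. The function names `split_stacks` and `dump_triggers` are made up for this example, and the keyboard function names are taken from this one trace only; other kernels will use different ones.

```python
import re

HEADER_RE = re.compile(r'PID:\s*(\d+)\s+TASK:\s*(\S+)\s+CPU:\s*(\d+)\s+COMMAND:\s*"(.*)"')
FRAME_RE = re.compile(r"#\d+\s+\[\S+\]\s+(\S+)\s+at\s+\S+")

# Keyboard-path functions as they appear in the example trace (an assumption
# for other kernel versions).
KEYBOARD_FUNCS = {"handle_scancode", "handle_kbd_event"}


def split_stacks(bt_output):
    """Return a list of (pid, command, [function names]) tuples."""
    stacks = []
    for line in bt_output.splitlines():
        h = HEADER_RE.search(line)
        if h:
            stacks.append((int(h.group(1)), h.group(4), []))
            continue
        f = FRAME_RE.search(line)
        if f and stacks:
            stacks[-1][2].append(f.group(1))
    return stacks


def dump_triggers(bt_output):
    """Return one message per task whose stack shows a sysrq-triggered dump."""
    messages = []
    for pid, comm, funcs in split_stacks(bt_output):
        if "sysrq_handle_crash" in funcs:
            source = "sysrq from keyboard" if KEYBOARD_FUNCS & set(funcs) else "sysrq"
            messages.append(f'PID {pid} ("{comm}"): dump triggered by {source}')
    return messages
```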

Checking whether the host was 'alive'

What: check what system activity was visible at the moment the dump was created
Why: it helps to narrow down the problem
How:
- check how many threads have been scheduled during the last 5s/1min/5min
- check how many threads are in 'RUNNING' state
- check when the network cards last transmitted/received any packets
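Once the raw numbers are extracted from the dump, the summary itself is simple. The sketch below assumes two hypothetical inputs: a list with the number of seconds since each thread last ran, and a mapping from interface name to seconds since its last transmitted/received packet. How these are obtained from kernel structures is not shown, and the 60-second NIC limit is an arbitrary illustration.

```python
def recently_scheduled(ages_seconds, windows=(5, 60, 300)):
    """Count threads that ran within each time window (in seconds)."""
    return {w: sum(1 for age in ages_seconds if age <= w) for w in windows}


def activity_warnings(ages_seconds, nic_idle_seconds, nic_limit=60):
    """Return warning lines for a host that looks inactive."""
    warnings = []
    counts = recently_scheduled(ages_seconds)
    if counts[5] == 0:
        warnings.append("+++WARNING+++ no threads scheduled during the last 5s")
    for name, idle in sorted(nic_idle_seconds.items()):
        if idle > nic_limit:
            warnings.append(f"+++WARNING+++ {name}: no packets for {idle}s")
    return warnings
```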

Checking whether auditd stalls execution

What: check that there are no processes stalled due to the auditd subsystem
Why: some processes appear to hang because auditd suspends them
How: check the stack traces to see whether we find something like

PID: 3374   TASK: f4828000  CPU: 3   COMMAND: "login"
 #0 [f4829ef0] schedule at c0124230
 #1 [f4829f34] rwsem_down_failed_common at c02a4357
 #2 [f4829f48] rwsem_down_read_failed at c02a4164
 #3 [f4829f6c] .text.lock.control (via auditf_ioctl) at f89fd7e5
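A rough sketch of this search over 'bt' text: cut the output into per-task blocks at each `PID:` header and report tasks whose block mentions both an rwsem wait and `auditf_ioctl`, as in the trace above. The function name is hypothetical and the two markers are taken from this single example, so treat them as assumptions.

```python
import re

TASK_HEADER_RE = re.compile(r'(?m)^[ \t]*PID:\s*(\d+).*COMMAND:\s*"(.*)"')


def auditd_stalled_tasks(bt_output):
    """Return (pid, command) for tasks that look blocked in the audit ioctl."""
    headers = list(TASK_HEADER_RE.finditer(bt_output))
    stalled = []
    for i, h in enumerate(headers):
        # A task's block runs from its header to the next header (or the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bt_output)
        block = bt_output[h.start():end]
        if "auditf_ioctl" in block and "rwsem_down" in block:
            stalled.append((int(h.group(1)), h.group(2)))
    return stalled
```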

Checking for obvious memory corruption

What: check for slab cache corruption (kmem -s command)
Why: most kernel data structures are allocated from the slab cache and the kmem -s command will report any corruption in the slab structures.
How: check the errors reported by the command. The command may need to be made resilient to infinite loops (it has not been tested whether it already is).

Checking IRQ balance (a la do_irq_balance)

What: check whether IRQ load is spread uniformly between CPUs
Why: it is not evident whether this is useful on Linux, but on HP-UX I remember many cases where performance issues were caused by one CPU processing far more IRQs than the others (mainly for networking)
How: check the contents of per-cpu kstat variables
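Assuming the per-CPU interrupt totals have already been read from the kstat variables into a plain mapping (that extraction is not shown here), the uniformity check could be sketched like this. The function name and the factor of 2 over the fair share are arbitrary choices for illustration.

```python
def overloaded_irq_cpus(per_cpu_irqs, factor=2.0):
    """Return CPUs whose IRQ count exceeds `factor` times the fair share.

    per_cpu_irqs: hypothetical {cpu number: total interrupts handled} mapping.
    """
    total = sum(per_cpu_irqs.values())
    if total == 0 or len(per_cpu_irqs) < 2:
        return []
    fair_share = total / len(per_cpu_irqs)
    return [cpu for cpu, n in sorted(per_cpu_irqs.items()) if n > factor * fair_share]
```

For example, with counts of 9000, 300, 400 and 300 on four CPUs, the fair share is 2500, so only the first CPU is above the 5000 threshold.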

Checking Memory Fragmentation

What: check whether zone 0 (DMA/LOW) and zone 1 (NORMAL) are fragmented
Why: some hangs or problems occur because free memory is available, but only in small chunks. This makes it difficult for some allocations to succeed, e.g. fork() needs an 8 KB chunk on 2.4 and loopback can use up to a 16 KB chunk
How: analyze the output of 'kmem -f'
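A sketch of the analysis, assuming 'kmem -f' prints for each zone a row `ZONE NAME SIZE FREE ...` followed by per-order rows `AREA SIZE FREE_AREA_STRUCT BLOCKS PAGES` with sizes written like `8k`. The layout varies between crash versions, and this version keys the result by zone name only, so it assumes a single node. The function names are hypothetical; the 8 KB default follows the fork() example above.

```python
import re

# Assumed row layouts of 'kmem -f' output.
ZONE_RE = re.compile(r"^\s*(\d+)\s+([A-Za-z]\w*)\s+(\d+)\s+(\d+)\s")
AREA_RE = re.compile(r"^\s*\d+\s+(\d+)k\s+[0-9a-f]+\s+(\d+)\s+(\d+)\s*$")


def large_free_blocks(kmem_f_output, min_kb=8):
    """Return {zone name: number of free blocks of at least min_kb}."""
    result = {}
    zone = None
    for line in kmem_f_output.splitlines():
        z = ZONE_RE.match(line)
        if z:
            zone = z.group(2)
            result[zone] = 0
            continue
        a = AREA_RE.match(line)
        if a and zone is not None and int(a.group(1)) >= min_kb:
            result[zone] += int(a.group(2))
    return result


def fragmented_zones(kmem_f_output, min_kb=8):
    """Return the zones that have no free block of at least min_kb."""
    blocks = large_free_blocks(kmem_f_output, min_kb)
    return [name for name, n in blocks.items() if n == 0]
```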

Checking for REAL-TIME Processes

What: check whether runqueues on all CPUs have SCHED_RR processes
Why: if we have SCHED_RR processes as RUNNABLE on all CPUs, normal processes will appear to hang
How: traverse runqueues of all CPUs and check for policy!=0
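Assuming the runqueue traversal has already produced, for each CPU, a list of `(pid, policy)` pairs for its runnable tasks (a hypothetical input format; the traversal itself is not shown), the final test reduces to the following sketch.

```python
SCHED_NORMAL = 0  # policy value of ordinary (non-real-time) tasks


def rt_on_all_cpus(runqueues):
    """True if every CPU has at least one runnable task with policy != 0.

    runqueues: hypothetical {cpu number: [(pid, policy), ...]} mapping.
    """
    return bool(runqueues) and all(
        any(policy != SCHED_NORMAL for _pid, policy in tasks)
        for tasks in runqueues.values()
    )
```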

Are We Looping in an Active Process

What: check whether currently running processes use CPU for a long time even though there are other processes waiting to be scheduled
Why: this is a typical situation when the kernel loops in a system call
How: get the output of 'bt -a' and check last_ran for these processes
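A sketch of the decision step, assuming that for each CPU we already know the active task, the number of seconds since its last_ran timestamp, and how many other tasks are waiting on that CPU's runqueue. The tuple layout, the function name, and the 10-second limit are all hypothetical.

```python
def looping_suspects(active_tasks, limit_seconds=10):
    """Return warning lines for active tasks that may be looping.

    active_tasks: hypothetical list of
        (cpu, pid, command, seconds_since_last_ran, tasks_waiting) tuples.
    """
    warnings = []
    for cpu, pid, comm, age, waiting in active_tasks:
        # Only suspicious if other tasks are waiting for this CPU.
        if age > limit_seconds and waiting > 0:
            warnings.append(
                f"+++WARNING+++ CPU {cpu}: PID {pid} ({comm}) last_ran {age}s ago, "
                f"{waiting} tasks waiting"
            )
    return warnings
```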
