FirstPassAlg

Alex Sidorenko

Algorithms For Automated First Pass Dump-analysis

Please feel free to add your own suggestions or improvements. Each entry should contain:

  • a brief description of what we are trying to do
  • an explanation of why this could be useful
  • the algorithm itself

The general idea: we run some 'crash' commands and Python scripts that are already available and produce nicely formatted, easy-to-read output. In addition, we analyze the obtained data and print warnings (if needed) about any unusual or suspicious values. All these warnings should use markup that is easy to grep for, e.g.

+++WARNING+++ high load average

Some algorithms can be very time-consuming, so it makes sense to implement a fast check as the default and run a thorough check only when it is explicitly requested.
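
The warning markup and the fast/thorough split described above could be sketched as follows. This is only an illustration: the names format_warning and run_checks, and the idea of tagging each check as slow or fast, are assumptions and not part of any existing tool.

```python
WARNING_PREFIX = "+++WARNING+++"


def format_warning(message):
    """Return a one-line warning that is easy to find with grep."""
    return f"{WARNING_PREFIX} {message}"


def run_checks(checks, thorough=False):
    """Run fast checks always and slow checks only when thorough=True.

    checks is a list of (function, is_slow) pairs (a hypothetical layout);
    each function returns a list of warning messages, possibly empty.
    """
    warnings = []
    for check, is_slow in checks:
        if is_slow and not thorough:
            continue  # skip time-consuming checks in the default mode
        warnings.extend(format_warning(m) for m in check())
    return warnings
```

With this layout, 'grep "+++WARNING+++"' on the report finds every flagged value regardless of which check produced it.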

Checking memory

  • What: check whether there is an evident memory shortage, print the output of 'kmem -i', check whether enough swap is free
  • Why: quite often the system appears to be hanging if memory is low or swapping heavily
  • How: run 'kmem -i', print the output. Parse the output and check whether certain thresholds are reached (e.g. 80% of swap)
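
A minimal sketch of the parsing step, assuming that 'kmem -i' prints one row per counter with a trailing "N% of TOTAL ..." column. The sample text below is illustrative, not taken from a real dump, and the exact layout may differ between crash versions. The 5% free-memory threshold is an assumption; only the 80% swap threshold comes from the entry above.

```python
import re

# Illustrative sample; real 'kmem -i' output has more rows.
SAMPLE_KMEM_I = """\
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  1974231       7.5 GB         ----
         FREE   208962     816.3 MB   10% of TOTAL MEM
         USED  1765269       6.7 GB   89% of TOTAL MEM

   TOTAL SWAP  1048570         4 GB         ----
    SWAP USED   943713       3.6 GB   90% of TOTAL SWAP
    SWAP FREE   104857     409.6 MB   10% of TOTAL SWAP
"""


def percentages(kmem_output):
    """Map a row label such as 'SWAP USED' to its percentage column."""
    result = {}
    for line in kmem_output.splitlines():
        m = re.match(r"\s*([A-Z][A-Z ]*?)\s+\d+\s+.*?(\d+)% of TOTAL", line)
        if m:
            result[m.group(1)] = int(m.group(2))
    return result


def check_memory(kmem_output, swap_limit=80, free_limit=5):
    """Return warnings when swap usage or free memory crosses a threshold."""
    pct = percentages(kmem_output)
    warnings = []
    if pct.get("SWAP USED", 0) >= swap_limit:
        warnings.append(f"+++WARNING+++ {pct['SWAP USED']}% of swap is in use")
    if pct.get("FREE", 100) <= free_limit:
        warnings.append(f"+++WARNING+++ only {pct['FREE']}% of memory is free")
    return warnings
```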

Checking load averages

  • What: check whether load averages are high
  • Why: high load averages may make the system appear to hang
  • How: run the 'sys' command, parse the output and analyze the values
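
One possible way to analyze the values, assuming the 'sys' output contains "CPUS:" and "LOAD AVERAGE:" lines as in the illustrative sample below. Comparing the load against the CPU count, and the limit of 2.0 per CPU, are assumptions chosen for the example.

```python
import re

# Illustrative excerpt; real 'sys' output has more fields.
SAMPLE_SYS = """\
        CPUS: 4
LOAD AVERAGE: 25.31, 18.04, 9.77
       TASKS: 412
"""


def parse_sys(sys_output):
    """Return (number of CPUs, [1 min, 5 min, 15 min load averages])."""
    cpus = int(re.search(r"^\s*CPUS:\s*(\d+)", sys_output, re.M).group(1))
    loads = re.search(r"^\s*LOAD AVERAGE:\s*(.+)$", sys_output, re.M).group(1)
    return cpus, [float(x) for x in loads.split(",")]


def check_load(sys_output, per_cpu_limit=2.0):
    """Warn when any load average exceeds per_cpu_limit times the CPU count."""
    cpus, loads = parse_sys(sys_output)
    if max(loads) > per_cpu_limit * cpus:
        return [f"+++WARNING+++ high load average {loads} on {cpus} CPUs"]
    return []
```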

Checking the number of uninterruptible processes

  • What: if load averages are high, check both nr_running() and nr_uninterruptible() numbers
  • Why: if we have disk (or NFS) problems, we can have a huge number of 'df' processes stuck in uninterruptible state. They increase the load even though they do not use any CPU
  • How: check the output of the 'ps' and 'ps -l' commands
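
A rough sketch of counting task states from 'ps' output. It assumes the state is the fifth column (ST) and the command name the ninth (COMM), as in the illustrative sample; the column layout should be verified against the crash version in use.

```python
from collections import Counter

# Illustrative sample; '>' marks a task that is active on a CPU.
SAMPLE_PS = """\
   PID    PPID  CPU   TASK    ST  %MEM     VSZ    RSS  COMM
>     0      0   0  c03a4000  RU   0.0       0      0  [swapper]
      1      0   1  f7e0c000  IN   0.0    1520    512  init
   4101   4000   2  e15a2000  UN   0.1    2104    640  df
   4102   4000   3  e15b4000  UN   0.1    2104    640  df
"""


def count_states(ps_output):
    """Count tasks per state code (RU, IN, UN, ...)."""
    states = Counter()
    for line in ps_output.splitlines()[1:]:      # skip the header row
        fields = line.lstrip("> ").split()
        if len(fields) >= 5:
            states[fields[4]] += 1
    return states


def uninterruptible_commands(ps_output):
    """Count command names among tasks in uninterruptible (UN) state."""
    comms = Counter()
    for line in ps_output.splitlines()[1:]:
        fields = line.lstrip("> ").split()
        if len(fields) >= 9 and fields[4] == "UN":
            comms[fields[8]] += 1
    return comms
```

Grouping by command name makes a pile-up of identical processes, such as many 'df' instances, stand out immediately.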

Checking the type of the dump and how it was produced

  • What: check whether this was a panic or a hang, what has triggered dump creation
  • Why: I remember a case where a customer told us that it was a panic and that he had not touched the console at all, but I was able to see that keyboard sysrq had been used after all
  • How: check the panic string as reported by 'sys'. Check the stacks (e.g. 'bt -a') and see whether we see specific patterns.

For example:

 PID: 27598  TASK: e15a2000  CPU: 3   COMMAND: "container"
 #0 [e15a3d8c] sysrq_handle_crash at c01d0c60
 #1 [e15a3d88] __handle_sysrq_nolock at c01d12e8
 #2 [e15a3da8] handle_sysrq at c01d1248
 #3 [e15a3dcc] handle_scancode at c01ced86
 #4 [e15a3df4] handle_kbd_event at c01cff68

shows that the dump was created by a sysrq from the keyboard
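
The pattern check on this stack could be sketched like this. The frame regex assumes the "#N [address] function at address" layout shown above, and the SYSRQ_PATTERNS table is a hypothetical starting point that would need more entries for other dump triggers.

```python
import re

# Matches frame lines such as " #0 [e15a3d8c] sysrq_handle_crash at c01d0c60"
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+", re.M)

# Hypothetical table: function name -> what its presence tells us.
SYSRQ_PATTERNS = {
    "sysrq_handle_crash": "dump was triggered via sysrq",
    "handle_kbd_event": "sysrq came from the keyboard",
}

SAMPLE_BT = """\
 PID: 27598  TASK: e15a2000  CPU: 3   COMMAND: "container"
 #0 [e15a3d8c] sysrq_handle_crash at c01d0c60
 #1 [e15a3d88] __handle_sysrq_nolock at c01d12e8
 #2 [e15a3da8] handle_sysrq at c01d1248
 #3 [e15a3dcc] handle_scancode at c01ced86
 #4 [e15a3df4] handle_kbd_event at c01cff68
"""


def dump_origin(bt_output):
    """Return notes for every known pattern found among the stack frames."""
    functions = FRAME_RE.findall(bt_output)
    return [note for name, note in SYSRQ_PATTERNS.items() if name in functions]
```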

Checking whether the host was 'alive'

  • What: check what system activity was visible at the moment the dump was created
  • Why: it helps to narrow down the problem
  • How:
    • check how many threads have been scheduled during last 5s/1min/5min
    • check how many threads are in 'RUNNING' state
    • check when the network cards last transmitted or received any packets
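
The first step above could be sketched as a simple bucketing function. It assumes that the time since each thread last ran has already been converted to seconds (for instance from the timestamps printed by 'ps -l', which is an assumption about where the data comes from); the extraction itself is not shown.

```python
def recently_scheduled(ages_seconds, windows=(5, 60, 300)):
    """Count threads whose last run falls inside each time window.

    ages_seconds: seconds between each thread's last run and the dump.
    windows: window lengths in seconds (5 s, 1 min, 5 min by default).
    """
    return {w: sum(1 for age in ages_seconds if age <= w) for w in windows}


def check_alive(ages_seconds, min_recent=2):
    """Warn when fewer than min_recent threads ran in the last 5 seconds.

    min_recent is an arbitrary illustrative threshold.
    """
    recent = recently_scheduled(ages_seconds)[5]
    if recent < min_recent:
        return [f"+++WARNING+++ only {recent} thread(s) ran in the last 5s"]
    return []
```

The windows are cumulative, so a thread that ran 3 seconds before the dump is counted in all three buckets.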

Checking whether auditd stalls execution

  • What: check that there are no processes stalled due to auditd subsystem
  • Why: some processes appear to hang because auditd suspends them
  • How: check the stack traces to see whether we find something like:
PID: 3374   TASK: f4828000  CPU: 3   COMMAND: "login"
 #0 [f4829ef0] schedule at c0124230
 #1 [f4829f34] rwsem_down_failed_common at c02a4357
 #2 [f4829f48] rwsem_down_read_failed at c02a4164
 #3 [f4829f6c] .text.lock.control (via auditf_ioctl) at f89fd7e5
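
A sketch of scanning a multi-task backtrace (such as 'bt -a' output) for this signature. It assumes each task starts with a "PID: ... COMMAND: ..." header line and treats any frame that mentions auditf_ioctl as the marker, which is a simplification of the stack shown above.

```python
import re

HEADER_RE = re.compile(
    r'^\s*PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]*)"')

SAMPLE_BT = """\
PID: 3374   TASK: f4828000  CPU: 3   COMMAND: "login"
 #0 [f4829ef0] schedule at c0124230
 #1 [f4829f34] rwsem_down_failed_common at c02a4357
 #2 [f4829f48] rwsem_down_read_failed at c02a4164
 #3 [f4829f6c] .text.lock.control (via auditf_ioctl) at f89fd7e5

PID: 27598  TASK: e15a2000  CPU: 3   COMMAND: "container"
 #0 [e15a3d8c] sysrq_handle_crash at c01d0c60
"""


def tasks_blocked_in_audit(bt_output):
    """Return (pid, command) for every task with an auditf_ioctl frame."""
    blocked, current = [], None
    for line in bt_output.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
        elif current and "auditf_ioctl" in line:
            blocked.append(current)
            current = None  # report each task only once
    return blocked
```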

Checking for obvious memory corruption

  • What: check for slab cache corruption (kmem -s command)
  • Why: most kernel data structures are allocated from the slab cache and the kmem -s command will report any corruption in the slab structures.
  • How: check the errors reported by the command. The command may need to be guarded against infinite loops (whether it already is has not been tested).
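
One way to guard against a command that never returns is to run it as a separate process with a timeout. The sketch below uses only the Python standard library; how 'kmem -s' is actually launched (the argv list) depends on the setup and is left to the caller, and the Python one-liners in the usage note are stand-ins for it.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run a command; return its stdout, or None if it did not finish in time.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


def check_command(argv, timeout_s=60):
    """Return (output, warnings) for a possibly looping command."""
    output = run_with_timeout(argv, timeout_s)
    if output is None:
        return "", [f"+++WARNING+++ command timed out after {timeout_s}s"]
    return output, []
```

For example, check_command([sys.executable, "-c", "import time; time.sleep(5)"], 0.5) returns an empty output together with a timeout warning instead of blocking the whole analysis.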

Checking IRQ balance (a la do_irq_balance)

  • What: check whether IRQ load is spread uniformly between CPUs
  • Why: it is not evident whether this is useful on Linux, but on HP-UX I remember many cases where performance issues were caused by one CPU processing far more IRQs than the others (mainly for networking)
  • How: check the contents of per-cpu kstat variables
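
Once the per-CPU interrupt counts have been read from the kstat variables (that step is not shown here), the balance check itself is simple. The sketch below flags CPUs that handled more than a given multiple of the mean; the factor of 3 is an arbitrary illustrative threshold.

```python
def irq_imbalance(irqs_per_cpu, factor=3.0):
    """Return CPUs that handled more than `factor` times the mean IRQ count.

    irqs_per_cpu: mapping of CPU number -> total interrupts handled.
    """
    if not irqs_per_cpu:
        return []
    mean = sum(irqs_per_cpu.values()) / len(irqs_per_cpu)
    return sorted(cpu for cpu, n in irqs_per_cpu.items()
                  if mean and n > factor * mean)


def check_irq_balance(irqs_per_cpu, factor=3.0):
    """Return one warning per overloaded CPU."""
    return [f"+++WARNING+++ CPU {cpu} handles a disproportionate share of IRQs"
            for cpu in irq_imbalance(irqs_per_cpu, factor)]
```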

Checking Memory Fragmentation

  • What: check whether zone 0 (DMA/LOW) and zone 1 (NORMAL) are fragmented
  • Why: some hangs or problems occur because we have free memory, but only in small chunks. This makes it difficult for some allocations to succeed, e.g. fork() needs an 8 KB chunk on 2.4 kernels and loopback can use chunks of up to 16 KB
  • How: analyze the output of 'kmem -f'
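
Assuming the free-area table of 'kmem -f' has already been reduced to a mapping of chunk size (in KB) to the number of free blocks for one zone, the fragmentation test could look like this. The parsing step is omitted on purpose, and the sizes to test (8 KB and 16 KB) are taken from the examples above.

```python
def largest_free_chunk_kb(free_blocks):
    """Return the size in KB of the largest chunk with at least one free block."""
    sizes = [kb for kb, blocks in free_blocks.items() if blocks > 0]
    return max(sizes, default=0)


def check_fragmentation(zone, free_blocks, needed_kb=(8, 16)):
    """Warn for every needed chunk size that no free block can satisfy.

    A larger free block can be split, so only the largest one matters.
    """
    largest = largest_free_chunk_kb(free_blocks)
    return [f"+++WARNING+++ zone {zone}: no free {kb} KB chunk "
            f"(largest is {largest} KB)"
            for kb in needed_kb if kb > largest]
```

For instance, a zone with thousands of free 4 KB pages but nothing larger would produce a warning for both the 8 KB and the 16 KB request.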

Checking for REAL-TIME Processes

  • What: check whether runqueues on all CPUs have SCHED_RR processes
  • Why: if SCHED_RR processes are RUNNABLE on all CPUs, normal processes will appear to hang
  • How: traverse runqueues of all CPUs and check for policy!=0
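
A sketch of the policy test, assuming the runqueues have already been traversed and reduced to a mapping of CPU number to a list of (pid, policy) pairs. That data layout is hypothetical; reading the real runqueue structures is kernel-version specific and not shown.

```python
SCHED_NORMAL = 0  # policy value of ordinary time-sharing tasks


def cpus_with_realtime(runqueues):
    """Return CPUs whose runqueue holds at least one task with policy != 0."""
    return sorted(cpu for cpu, tasks in runqueues.items()
                  if any(policy != SCHED_NORMAL for _pid, policy in tasks))


def check_realtime(runqueues):
    """Warn only when every CPU has a runnable real-time task."""
    busy = cpus_with_realtime(runqueues)
    if runqueues and len(busy) == len(runqueues):
        return ["+++WARNING+++ real-time tasks are runnable on every CPU"]
    return []
```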

Are We Looping in an Active Process

  • What: check whether currently running processes have been using the CPU for a long time even though other processes are waiting to be scheduled
  • Why: this is a typical situation when the kernel loops in a system call
  • How: get the output of 'bt -a' and check last_ran for these processes
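
A sketch of the final comparison, assuming that for each CPU we already know the active task, how many seconds have passed since its last_ran value, and how many other runnable tasks are queued behind it. The input layout and the 10-second limit are assumptions made for the example.

```python
def long_running_active(active, waiting, limit_s=10.0):
    """Flag active tasks that hold a CPU for too long while others wait.

    active:  CPU number -> (pid, seconds since the task was scheduled in)
    waiting: CPU number -> number of other runnable tasks on that CPU
    """
    warnings = []
    for cpu, (pid, on_cpu_s) in sorted(active.items()):
        if on_cpu_s > limit_s and waiting.get(cpu, 0) > 0:
            warnings.append(
                f"+++WARNING+++ PID {pid} has been on CPU {cpu} for "
                f"{on_cpu_s:.0f}s with {waiting[cpu]} task(s) waiting")
    return warnings
```

A long-running task on an otherwise idle CPU is not flagged, because nothing is being starved in that case.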
