From: MoJiong Q. <qm...@ho...> - 2008-10-01 01:14:44
Dear Valgrind community,
A couple of us at VMware have been playing with valgrind code for a while. It is a lot of fun! :-)
As I mentioned in my last email, we intend to add record and replay support to valgrind. There are several benefits we can see:
* Catch hard-to-reproduce buggy executions with valgrind recording
* Improve valgrind usability by moving tool overhead to a deterministic replay phase
- Some tools, like memcheck, have side effects on the client program when turned on. We do not support those yet, but we do have some preliminary ideas.
- Other tools, like cachegrind, have no client side effects. Therefore, they can be turned on at replay time.
* _In the future_, work with other record and replay systems, like VMware VM record and replay (VM-RR)
- VM-RR may help further reduce valgrind record overhead
- VM-RR may produce a valgrind-compliant replay log to enable valgrind to debug a process that was recorded inside a VM
I am going to outline what we have done in the following sections. The code patch is also available at http://www.xhfamily.com/x/vg/valgrind-rr.tgz . We are looking for early feedback on improving this code so that it can eventually become part of the valgrind code base. Currently, the code is incomplete; I just want to see whether people think I am going in the right direction.
(Disclaimer: what I say in this email is my own opinion. VMware does not officially endorse or support the code of this project.)
Ideas
=====
The basic idea here is to record all sources of non-deterministic information for a client program. The ones we have dealt with are:
1. system calls (some of them, so far)
2. thread scheduling
3. non-deterministic instructions
We have not dealt with signals and tools-induced non-determinism.
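To make the idea concrete, here is a minimal sketch of the record/replay principle, not the code in the patch. The names RRLog, nondet and client are hypothetical, and time.perf_counter_ns merely stands in for a non-deterministic source such as RDTSC or a syscall result. In record mode the real source is consulted and its value is logged; in replay mode the logged value is returned and the real source is never touched.

```python
import time


class RRLog:
    """Toy record/replay log: an ordered list of (kind, value) events."""

    def __init__(self, mode, events=None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.events = [] if events is None else list(events)
        self.pos = 0  # index of the next event to consume during replay

    def nondet(self, kind, source):
        """Return a non-deterministic value, recording or replaying it."""
        if self.mode == "record":
            value = source()  # ask the real source and log what it said
            self.events.append((kind, value))
            return value
        logged_kind, value = self.events[self.pos]
        if logged_kind != kind:
            # The replayed run asked for something the recorded run did not.
            raise RuntimeError(f"divergence: log has {logged_kind}, run asked for {kind}")
        self.pos += 1
        return value  # the real source is not consulted in replay


def client(log):
    """Stand-in client program whose result depends on two timestamps."""
    t0 = log.nondet("rdtsc", time.perf_counter_ns)
    t1 = log.nondet("rdtsc", time.perf_counter_ns)
    return (t0, t1)


recorded = RRLog("record")
first_run = client(recorded)

replayed = RRLog("replay", recorded.events)
second_run = client(replayed)  # sees exactly the values of the first run
```

The same wrapper shape applies to any source listed above: only the "kind" tag and the function that produces the value change.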
Specifically, our valgrind record/replay works on x86-linux (amd64-linux code is there but not tested). I have:
1. wrapped a few syscalls, enough to run splash2, plus the socket-related syscalls.
splash2 and simple socket programs can be recorded.
2. introduced a record/replay module and its APIs
3. recorded the non-determinism of RDTSC instructions. Other non-deterministic instructions can be supported similarly, but I have not done that yet.
4. added thread scheduling support.
5. added a replay divergence detection tool
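As an illustration of the thread scheduling item, the sketch below records the order in which a toy scheduler picks threads and then forces the same order on a second run. It is only an assumption about the general technique; the email does not describe how the patch logs scheduling decisions, and run, worker and make_threads are hypothetical names. Threads are modelled as Python generators, each yield being one scheduling slice.

```python
import random


def run(threads, schedule=None, seed=None):
    """Run generator-based 'threads' and return (trace, picks).

    With schedule=None the scheduler picks a live thread at random and
    records every pick (record mode). With a schedule it follows the
    recorded picks exactly (replay mode).
    """
    rng = random.Random(seed)
    live = dict(threads)
    trace, picks = [], []
    replay = iter(schedule) if schedule is not None else None
    while live:
        tid = rng.choice(sorted(live)) if replay is None else next(replay)
        picks.append(tid)
        try:
            trace.append((tid, next(live[tid])))  # run one slice of that thread
        except StopIteration:
            del live[tid]  # the thread has exited
    return trace, picks


def worker(name, n):
    """Toy thread body: produces n visible steps."""
    for i in range(n):
        yield f"{name}{i}"


def make_threads():
    return {1: worker("a", 3), 2: worker("b", 3)}


trace1, sched = run(make_threads())                   # record
trace2, sched2 = run(make_threads(), schedule=sched)  # replay
```

Replaying with the recorded picks reproduces the interleaving of the first run, whatever the random choices were.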
Overview of the patch
=====================
The patch is available at http://www.xhfamily.com/x/vg/valgrind-rr.tgz . After unpacking the tarball, you will find three files. recordreplay.patch can be applied directly to the original valgrind-3.3.1, and the command "diff_view.py 20080926_11538.tgz" is a convenient way to view the changes in the patch.
Most of the code is in four locations:
1) coregrind/pub_core_recordreplay.h: the record/replay APIs and global variables exposed to other modules
2) coregrind/m_recordreplay: a new module that deals with record/replay
- recordreplay.c implementation of record/replay APIs
- record.c some record-only functions
- replay.c replay-only functions that read replay log
- priv_recordreplay.c module-private header file
- instrument.c an instrumentation callback that deals with non-deterministic instructions
3) coregrind/m_syswrap
- record/replay wrappers of the syscalls are in new files syswrapRR-xxx.c
- syswrap_main.c is modified to log syscall results during a record execution, and to feed the logged results back during a replay execution
4) rrcheck: replay divergence checking tool
How to run it
=============
First, download http://www.xhfamily.com/x/vg/valgrind-rr.tgz , which contains the code patch "recordreplay.patch".
Then, compile valgrind with record/replay:
- tar xvjf valgrind-3.3.1.tar.bz2 (http://www.valgrind.org/downloads/valgrind-3.3.1.tar.bz2)
- cd valgrind-3.3.1; and apply recordreplay.patch to valgrind-3.3.1
- aclocal; autoconf; automake; ./configure; make; make install
For the splash2 benchmark:
- download splash2 benchmark from http://www-flash.stanford.edu/apps/SPLASH/splash2.tar.gz or http://www.grid-appliance.org/files/archer/archive/benchmarks/splash2_072508.tar.gz
- set up multi-threaded version of splash2 according to instructions at http://www.capsl.udel.edu/splash/Download.html
Record and replay splash2 programs as follows (taking fft as an example).
To record:
valgrind --record-replay=1 --log-file-rr=fft.replaylog.none ./FFT -p4 -m20
To replay fft with cachegrind analysis turned on:
valgrind --record-replay=2 --log-file-rr=fft.replaylog.none --tool=cachegrind
or turn on helgrind:
valgrind --record-replay=2 --log-file-rr=fft.replaylog.none --tool=helgrind
There is a limitation for Memcheck for now. To record/replay with memcheck, we need to specify --tool=memcheck in both the record run and the replay run:
valgrind --record-replay=1 --log-file-rr=fft.replaylog.memcheck --tool=memcheck ./FFT -p4 -m20
valgrind --record-replay=2 --log-file-rr=fft.replaylog.memcheck --tool=memcheck ./FFT -p4 -m20
Newly added command line options for record/replay:
--record-replay=1|2 selects the mode of this execution: 1 means record, 2 means replay.
--log-file-rr= where to save the replay log (in record) or to read it from (in replay). Defaults to "_temp_rr_.log".
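For scripting the two phases, a small helper that assembles these command lines may be handy. This is a sketch only: rr_command is a hypothetical name, it uses just the two options listed above plus the standard --tool option, and it builds the argument list without running valgrind.

```python
def rr_command(mode, log_file, client=(), tool=None):
    """Build a valgrind command line for a record or replay run.

    mode is "record" or "replay"; client is the client program and its
    arguments; tool is an optional valgrind tool name.
    """
    modes = {"record": 1, "replay": 2}  # values of --record-replay
    cmd = [
        "valgrind",
        f"--record-replay={modes[mode]}",
        f"--log-file-rr={log_file}",
    ]
    if tool is not None:
        cmd.append(f"--tool={tool}")
    cmd.extend(client)
    return cmd


fft = ["./FFT", "-p4", "-m20"]
record_cmd = rr_command("record", "fft.replaylog.none", fft)
replay_cmd = rr_command("replay", "fft.replaylog.none", fft, tool="cachegrind")
print(" ".join(record_cmd))
print(" ".join(replay_cmd))
```

The resulting lists can be passed to subprocess.run on a machine where the patched valgrind is installed.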
Supported programs:
multi-threaded version of splash2, compress
(socket related syscalls have been wrapped and socket programs should be supported too)
Not supported programs:
programs that receive signals; programs that use IPC shared memory communication; and programs that mmap memory with the MAP_SHARED flag.
Other programs may replay correctly or crash, depending on the syscalls they call.
Some results
============
Here fft and lu are run with 4 threads, while compress is a single-threaded program which operates on a 144M eclipse tarball.
Size of replay log: fft 100K, lu 700K, compress 150M.
Tested in a VMware Workstation virtual machine running FC8, with 512M of memory and one 2.2G CPU.
--------------------------------------------------------------------------------------------------------------------------------------
|| workload      || no Vg  || Vg-no-RR, none || Vg record, none || Vg replay, none || Vg-no-RR, cachegrind || Vg replay, cachegrind ||
|| lu -p4 -n768  || 1.52s  || 6.87s (4.5x)   || 12.49s (8.2x)   || 12.74s (8.4x)   || 84.97s (55.9x)       || 156.46s (103.0x)      ||
|| lu -p4 -n1024 || 3.38s  || 15.25s (4.5x)  || 26.64s (7.9x)   || 26.88s (8.0x)   || 201.94s (59.7x)      || 351.57s (104.0x)      ||
|| fft -p4 -m20  || 1.08s  || 3.22s (3.0x)   || 4.92s (4.6x)    || 4.14s (3.8x)    || 44.65s (41.3x)       || 60.92s (56.4x)        ||
|| compress      || 17.96s || 23.55s (1.3x)  || 193.53s (10.8x) || 14.78s (0.8x)   || 188.10s (10.5x)      || 180.59s (10.1x)       ||
--------------------------------------------------------------------------------------------------------------------------------------
The results indicate that vg-RR is useful for moving cachegrind's overhead out of the recording run. For lu -p4 -n768, recording with the none tool costs 8.2x, versus 55.9x for running cachegrind directly, at the price of a slower cachegrind replay (103.0x). This may enable cachegrind on more user-interactive programs where response time is important.
The compress results show that a lot of the recording overhead comes from recording the input of the client program. If we leverage something like VMware RR, we can reduce that overhead dramatically by only actually writing to the log file during VM replay.
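For reference, the factors in parentheses in the table are simply the run time divided by the "no Vg" time of the same workload, rounded to one decimal. A quick check for two of the lu -p4 -n768 entries (slowdown is just an illustrative helper name):

```python
def slowdown(seconds, native_seconds):
    """Slowdown factor relative to the run without valgrind."""
    return round(seconds / native_seconds, 1)


lu_native = 1.52  # lu -p4 -n768, no Vg

lu_record_none = slowdown(12.49, lu_native)        # Vg record, none
lu_cachegrind_direct = slowdown(84.97, lu_native)  # Vg-no-RR, cachegrind

print(lu_record_none, lu_cachegrind_direct)
```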
Thanks,
Mojiong