|
From: Nikolaus R. <Nik...@ra...> - 2011-01-10 02:15:17
|
Hello, I am trying to use valgrind to debug a FUSE program, i.e. a program that implements a file system in user space. The problem is that the program deadlocks when running under Valgrind. More precisely, it deadlocks in a stat64() syscall of the mountpoint. My guess is that Valgrind is not expecting that a syscall may call back into a different thread of the program that valgrind runs, and that some valgrind internal lock prevents that the callback is handled before the syscall itself has finished - deadlock. Is that correct? Is there a way to work around the problem? Best, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Tom H. <to...@co...> - 2011-01-10 07:12:56
|
On 10/01/11 01:13, Nikolaus Rath wrote: > The problem is that the program deadlocks when running under Valgrind. > More precisely, it deadlocks in a stat64() syscall of the mountpoint. > > My guess is that Valgrind is not expecting that a syscall may call back > into a different thread of the program that valgrind runs, and that some > valgrind internal lock prevents that the callback is handled before the > syscall itself has finished - deadlock. My guess would be that stat64 is not marked as a system call that can block and that it needs the MayBlock flag adding. Tom -- Tom Hughes (to...@co...) http://compton.nu/ |
|
From: Julian S. <js...@ac...> - 2011-01-10 17:08:07
|
> My guess would be that stat64 is not marked as a system call that can > block and that it needs the MayBlock flag adding. Yes, that could be it. I guess, normally, a stat64 (etc) call is completed eventually by the kernel, without the need for any other thread to run. But in this case a stat64 call might somehow require some other thread to progress. Nikolaus, try adding *flags |= SfMayBlock; to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is 32-bit x86). Does that help? J |
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-10 17:20:26
|
On 01/10/2011 12:08 PM, Julian Seward wrote: > >> My guess would be that stat64 is not marked as a system call that can >> block and that it needs the MayBlock flag adding. > > Yes, that could be it. I guess, normally, a stat64 (etc) call is > completed eventually by the kernel, without the need for any other > thread to run. But in this case a stat64 call might somehow require > some other thread to progress. In my case it definitely requires another thread to progress. No "might" or "somehow" there :-). > Nikolaus, try adding > > *flags |= SfMayBlock; > > to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is > 32-bit x86). Does that help? Yes, it's 32bit x86. I'll try this tonight and report back. Thanks for the quick help! Best, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-10 18:08:34
|
Tom Hughes <to...@co...> writes: > On 10/01/11 01:13, Nikolaus Rath wrote: > >> The problem is that the program deadlocks when running under Valgrind. >> More precisely, it deadlocks in a stat64() syscall of the mountpoint. >> >> My guess is that Valgrind is not expecting that a syscall may call back >> into a different thread of the program that valgrind runs, and that some >> valgrind internal lock prevents that the callback is handled before the >> syscall itself has finished - deadlock. > > My guess would be that stat64 is not marked as a system call that can > block and that it needs the MayBlock flag adding. Can you give me specific instructions how to add the flag? Then I can recompile valgrind and test it. Thanks, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-10 23:13:27
|
Julian Seward <js...@ac...> writes:
>> My guess would be that stat64 is not marked as a system call that can
>> block and that it needs the MayBlock flag adding.
>
> Yes, that could be it. I guess, normally, a stat64 (etc) call is
> completed eventually by the kernel, without the need for any other
> thread to run. But in this case a stat64 call might somehow require
> some other thread to progress.
>
> Nikolaus, try adding
>
> *flags |= SfMayBlock;
>
> to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is
> 32-bit x86). Does that help?
No, I'm afraid it's still blocking in the same call.
Did I make the change correctly? The file
coregrind/m_syswrap/syswrap-x86-linux.c now looks like this:
PRE(sys_stat64)
{
*flags |= SfMayBlock;
PRINT("sys_stat64 ( %#lx(%s), %#lx )",ARG1,(char*)ARG1,ARG2);
PRE_REG_READ2(long, "stat64", char *, file_name, struct stat64 *, buf);
PRE_MEM_RASCIIZ( "stat64(file_name)", ARG1 );
PRE_MEM_WRITE( "stat64(buf)", ARG2, sizeof(struct vki_stat64) );
}
Best,
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
|
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-11 02:07:44
Attachments:
fuse.c
|
Nikolaus Rath <Nik...@ra...> writes: > Julian Seward <js...@ac...> writes: >>> My guess would be that stat64 is not marked as a system call that can >>> block and that it needs the MayBlock flag adding. >> >> Yes, that could be it. I guess, normally, a stat64 (etc) call is >> completed eventually by the kernel, without the need for any other >> thread to run. But in this case a stat64 call might somehow require >> some other thread to progress. >> >> Nikolaus, try adding >> >> *flags |= SfMayBlock; >> >> to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is >> 32-bit x86). Does that help? > > > No, I'm afraid it's still blocking in the same call. The devil's in the detail. I had to write a minimal example and observe it mysteriously working before I looked closely enough, but I finally noticed that the call is *l*stat, and not stat. After adding the flag to sys_lstat64 the problem is fixed. Is there any chance of getting the flag added in valgrind by default, or would that hurt performance too much? I'm attaching my minimal example anyway, maybe it's useful for someone at some point. Best, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Julian S. <js...@ac...> - 2011-01-11 00:48:30
|
Yes, you made the change correctly.
Maybe some other thread is blocking? Can you figure out the
state of all threads? --trace-syscalls=yes gives you an
strace like output for your program, which might be useful.
J
On Tuesday, January 11, 2011, Nikolaus Rath wrote:
> Julian Seward <js...@ac...> writes:
> >> My guess would be that stat64 is not marked as a system call that can
> >> block and that it needs the MayBlock flag adding.
> >
> > Yes, that could be it. I guess, normally, a stat64 (etc) call is
> > completed eventually by the kernel, without the need for any other
> > thread to run. But in this case a stat64 call might somehow require
> > some other thread to progress.
> >
> > Nikolaus, try adding
> >
> > *flags |= SfMayBlock;
> >
> > to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is
> > 32-bit x86). Does that help?
>
> No, I'm afraid it's still blocking in the same call.
>
> Did I make the change correctly? The file
> coregrind/m_syswrap/syswrap-x86-linux.c now looks like this:
>
> PRE(sys_stat64)
> {
> *flags |= SfMayBlock;
> PRINT("sys_stat64 ( %#lx(%s), %#lx )",ARG1,(char*)ARG1,ARG2);
> PRE_REG_READ2(long, "stat64", char *, file_name, struct stat64 *, buf);
> PRE_MEM_RASCIIZ( "stat64(file_name)", ARG1 );
> PRE_MEM_WRITE( "stat64(buf)", ARG2, sizeof(struct vki_stat64) );
> }
>
>
>
> Best,
>
>
> -Nikolaus
|
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-11 01:23:46
|
On 01/10/2011 07:48 PM, Julian Seward wrote: > Maybe some other thread is blocking? Can you figure out the > state of all threads? --trace-syscalls=yes gives you an > strace like output for your program, which might be useful. I am pretty sure that the deadlock always occurs when thread 2 tries to check if the mountpoint has been brought up (which I assume translates to a simple stat call on the mountpoint) after thread 1 has initialized the mountpoint. It does not seem to matter what thread 1 is doing after that, or thread 2 before that. Going further than that would require me to instrument and recompile the FUSE code, which I'd rather avoid if possible. Does Valgrind provide any means to determine the thread state? I could not find the --trace-syscalls option in the manual. Could you explain the output to me? What are the two numbers after the PID? Here are the last lines before the deadlock: SYSCALL[23148,2](240) sys_futex ( 0x53e80a0, 129, 1, 0x50744d8, 0x43ddbf4 ) --> [async] ... SYSCALL[23148,1](240) ... [async] --> Success(0x0:0x0) SYSCALL[23148,1](240) sys_futex ( 0x53e80a0, 129, 1, 0x47f8ccc, 0x44de1e8 ) --> [async] ... SYSCALL[23148,3](240) ... [async] --> Success(0x0:0x0) SYSCALL[23148,3](142) sys_select ( 0, 0x0, 0x0, 0x0, 0xc2e7254 ) --> [async] ... SYSCALL[23148,1](240) ... [async] --> Success(0x0:0x0) SYSCALL[23148,1](196) sys_lstat64 ( 0x50d77d0(/tmp/tmp1d2rmU), 0xbea601ec ) Best, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-11 02:21:21
|
Julian Seward <js...@ac...> writes: >> My guess would be that stat64 is not marked as a system call that can >> block and that it needs the MayBlock flag adding. > > Yes, that could be it. I guess, normally, a stat64 (etc) call is > completed eventually by the kernel, without the need for any other > thread to run. But in this case a stat64 call might somehow require > some other thread to progress. > > Nikolaus, try adding > > *flags |= SfMayBlock; > > to PRE(sys_stat64) in syswrap-x86-linux.c (I assume this is > 32-bit x86). As I said in my other mail, this fixed the stat64() hang. However, now I am facing the same problem with the chown() syscall. I couldn't find a PRE(sys_chown) structure in the same file, but a PRE(sys_chown16) in syswrap-linux.c. Is that the right one to modify? I anticipate that I will need to do the same thing for chmod setxattr getxattr statvfs readlink mknod mkdir rmdir unlink symlink rename link opendir readdir listxattr removexattr create open close Can I safely add the flag using the same procedure, or are there any issues I should look out for? Best, -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Julian S. <js...@ac...> - 2011-01-11 13:08:00
|
> As I said in my other mail, this fixed the stat64() hang. Good. > Can I safely add the flag using the same procedure, I think so, yes. I suggest to do it on a demand-driven basis. > or are there any issues I should look out for? None that come obviously to mind. Send us the patch when you are done. J |
|
From: Nikolaus R. <Nik...@ra...> - 2011-01-17 20:02:19
Attachments:
valgrind.diff
|
Julian Seward <js...@ac...> writes: >> or are there any issues I should look out for? > > None that come obviously to mind. > > Send us the patch when you are done. I tried the attached patch against 3.6.0~svn20100724, but I seem to have introduced a new problem: SYSCALL[3677,1]( 85) sys_readlink ( 0x82e3ca0(/usr/bin/python-dbg), 0xbeb01cfa, 4096 ) --> [pre-success] Success(0x0:0xd) valgrind: m_syswrap/syswrap-main.c:1473 (vgPlain_client_syscall): Assertion '0 == (sci->flags & ~(SfPollAfter | SfYieldAfter | SfNoWriteResult))' failed. ==3677== at 0x380269BF: report_and_quit (m_libcassert.c:191) ==3677== by 0x38026C74: vgPlain_assert_fail (m_libcassert.c:265) ==3677== by 0x380676FB: vgPlain_client_syscall (syswrap-main.c:1473) ==3677== by 0x38063927: handle_syscall (scheduler.c:885) ==3677== by 0x38065402: vgPlain_scheduler (scheduler.c:1081) ==3677== by 0x38096ACA: run_a_thread_NORETURN (syswrap-linux.c:94) sched status: running_tid=1 Any idea what I did wrong? -Nikolaus -- »Time flies like an arrow, fruit flies like a Banana.« PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C |
|
From: Julian S. <js...@ac...> - 2011-01-18 12:37:51
|
IIRC there is some special-case kludgery with handling sys_readlink, to catch the case where the arg is /proc/self/exe. Maybe worth checking for that and disabling it? J On Monday, January 17, 2011, Nikolaus Rath wrote: > Julian Seward <js...@ac...> writes: > >> or are there any issues I should look out for? > > > > None that come obviously to mind. > > > > Send us the patch when you are done. > > I tried the attached patch against 3.6.0~svn20100724, but I seem to have > introduced a new problem: > > SYSCALL[3677,1]( 85) sys_readlink ( 0x82e3ca0(/usr/bin/python-dbg), > 0xbeb01cfa, 4096 ) --> [pre-success] Success(0x0:0xd) valgrind: > m_syswrap/syswrap-main.c:1473 (vgPlain_client_syscall): Assertion '0 == > (sci->flags & ~(SfPollAfter | SfYieldAfter | SfNoWriteResult))' failed. > ==3677== at 0x380269BF: report_and_quit (m_libcassert.c:191) > ==3677== by 0x38026C74: vgPlain_assert_fail (m_libcassert.c:265) > ==3677== by 0x380676FB: vgPlain_client_syscall (syswrap-main.c:1473) > ==3677== by 0x38063927: handle_syscall (scheduler.c:885) > ==3677== by 0x38065402: vgPlain_scheduler (scheduler.c:1081) > ==3677== by 0x38096ACA: run_a_thread_NORETURN (syswrap-linux.c:94) > > sched status: > running_tid=1 > > > Any idea what I did wrong? > > > > -Nikolaus |