From: Carl L. <ce...@li...> - 2024-04-26 15:46:23
Paul:
I tried the --sanity-level=4 option with valgrind. The output was identical to running without that option.
Using gdb on valgrind, we make a call to VG_(client_syscall). I have put print statements in that routine to print the syscall number. The function eventually calls putSyscallStatusIntoGuestState, where I added print statements to count how many times the function has been called. I have stepped through putSyscallStatusIntoGuestState and it seems to be fine (note, I have not yet looked at the contents of the SyscallInfo and ThreadState data structures, but I will work on that). Eventually, after putSyscallStatusIntoGuestState, we get back to running code in the guest state. The segfault occurs after we have been running in the guest state for a while.
One odd thing is that the segfault varies: sometimes it happens on sysnum 90, which looks to be mmap, and other times it is on syscall 6. I wonder if there are actually two threads running and I can't tell which thread actually hits the segfault. Furthermore, I see each of these system calls being made 10 to 20 times before the segfault. I am not sure how we would get the call correct multiple times and then mess it up, but checking the SyscallInfo and ThreadState data structures for each call might give some clues. The variability of which syscall we are on when the segfault occurs, and the fact that it isn't the first time we process that syscall, makes it harder in gdb to see what is happening when things fail, since you don't know exactly when it is going to fail.
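Since the failure point varies from run to run, one way to pin it down in the outer gdb (a sketch only; the function name is from this thread, and you may need to adjust how the outer gdb handles the SIGSEGVs Valgrind expects to field itself) is to let a breakpoint count the calls silently and read the count back once the segfault stops gdb:

```
(gdb) set $n = 0
(gdb) break putSyscallStatusIntoGuestState
(gdb) commands
> silent
> set $n = $n + 1
> continue
> end
(gdb) handle SIGSEGV stop nopass
(gdb) run
...             # eventually the segfault stops gdb
(gdb) print $n  # how many calls happened before the crash
```

On a fresh run you can then set an ignore count just below the observed value (e.g. `ignore 1 2470`) so gdb only stops on the last few calls before the crash, which avoids single-stepping through thousands of good iterations.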
Do you know whether there are actually multiple Valgrind threads running when we are doing syscalls? I thought the execution of Valgrind was all single-threaded, but I can't say for sure.
I will add some more print statements to dump the SyscallInfo and ThreadState data structures to see if I can spot any differences between a call that succeeds and one that fails. Thanks for the suggestions; I will let you know what I find.
Carl
On 4/25/24 12:44, Paul Floyd wrote:
>
>
> On 23-04-24 15:52, Carl Love wrote:
>> Paul:
>>
>> I have been digging some more with gdb. I have also put in some print statements to try to figure out when, and on which syscall, the issue occurs. The issue occurs after processing a system call, when we return to running the user code. While running the user code we encounter the segfault.
>>
>> Valgrind calls VG_(client_syscall) in syswrap-main.c to process a system call. That function calls putSyscallStatusIntoGuestState twice as part of processing the system call. I see a variable number of calls to putSyscallStatusIntoGuestState before hitting the segfault, even when just running valgrind the same way: I see 2474 or 2478 calls to the function while processing system call 90, or 3604 or 3608 calls for system call 6, before we hit the segfault. I am puzzled by the inconsistency in the number of calls and in the syscall number at the segmentation fault; it doesn't "feel" like the system call processing per se is the issue, but maybe some other issue, just guessing? Also, when the failure occurs, it isn't the first time that system call has been handled. It looks like we have processed that system call 10 to 20 times previously without a failure. I don't know
>> if that helps or not.
>>
>> If you have any thoughts as to a possible root cause, please let me know and I will look into it. Thanks.
>>
>
> Hi Carl
>
> For client syscalls Valgrind will make several syscalls itself. The sequence is
>
> 1. gettid
> 2. write (that's the syscall 6) to release the fifo big lock
> 3. sigprocmask to set the guest signal mask
> 4. the actual client syscall
> 5. sigprocmask to set the Valgrind signal mask
> 6. read to acquire the fifo big lock
>
> On multithreaded apps it might be a different thread that performs the read.
>
> Going back to your first error
>
> ==2282131== Process terminating with default action of signal 11 (SIGSEGV)
> ==2282131== Bad permissions for mapped region at address 0x6B50000
> ==2282131== at 0x43962B8: __memset_power10 (in /usr/lib64/glibc-hwcaps/power9/libc-2.28.so)
> ==2282131== by 0x1013FBFF: PyTuple_New (in /home/carll/anaconda3/envs/faiss/bin/python3.11)
>
>
> I would expect memcheck to redirect the __memset_power10 call, which is why I was wondering whether other tools were OK.
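One way to see whether that redirect is actually being established is Valgrind's redirection trace (a debug option; the binary name here is a placeholder):

```
valgrind --trace-redir=yes ./your_app 2>&1 | grep -i memset
```

If the trace never shows a redirect covering the memset variant the crash is in, that would narrow things down to the redirection machinery.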
>
> Do the SyscallInfo and ThreadState data structures look OK in gdb when you hit the segfault?
>
>
> Does running with --sanity-level=4 change anything (other than making it a lot slower)?
>
> A+
> Paul
>