From: William C. <wc...@re...> - 2016-06-29 15:22:29
|
On 06/24/2016 08:06 PM, 大平怜 wrote: > I forgot to include oprofile-list. > > > Regards, > Rei Odaira Thanks for the patches. Right now I am at Red Hat Summit and away from the computer much of the time this week. However, I will look at the patches this week. -Will > > ---------- Forwarded message ---------- > From: *大平怜* <rei...@gm... <mailto:rei...@gm...>> > Date: 2016-06-24 19:04 GMT-05:00 > Subject: Re: Potential deadlock in operf when using --pid > To: William Cohen <wc...@re... <mailto:wc...@re...>> > > > Hi, > > Now I come back to this problem. > > I am attaching two patches to solve it. They apply to the latest master in the git repository. Please apply oprofile_avoid_deadlock_1.patch first and then oprofile_avoid_deadlock_2.patch. The first patch is just for refactoring the code to make the second patch possible, and the second patch is for the actual problem solving. > > So, the problem is that the operf-record process is blocked on a write to the sample output pipe when the pipe is full, while the operf-read process, which is the consumer of the pipe, cannot read from the pipe because it is blocked on a read from the comm pipe, to which the operf-record process is supposed to write. Please refer to the previous emails for the detailed problem description. > > |<--sample output pipe--| > operf-read |-------comm pipe------>| operf-record > |<------comm pipe-------| > > My solution is to make the sample output pipe non-blocking on the operf-record side. When the sample output pipe is full, instead of waiting on it, the operf-record process polls both the sample output pipe and the (read-end of) comm pipe. If there is a message in the comm pipe from the operf-read process, the operf-record process handles it and writes a response to the comm pipe, so that the operf-read process can finish being blocked on the read from the comm pipe and eventually can consume the sample output pipe. > > There are some other ways to solve this problem, so any comments are welcome. With my patches, you can finish profiling oprofile_multithread_test by a single ctrl-c. They passed all the tests in oprofile-tests. > > > Regards, > Rei Odaira > > > 2015-10-02 14:19 GMT-05:00 William Cohen <wc...@re... <mailto:wc...@re...>>: > > On 10/02/2015 11:10 AM, 大平怜 wrote: > > Sorry, I sent wrong source code. I am attaching the right one. > > > > > > Regards, > > Rei Odaira > > > > 2015-10-01 18:32 GMT-05:00 大平怜 <rei...@gm... <mailto:rei...@gm...> <mailto:rei...@gm... <mailto:rei...@gm...>>>: > > > > Hi Will, > > > > How about the attached test program? This almost always causes the problem in my environment. > > > > > gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread > > > ./oprofile_multithread_test > > Usage: oprofile_multithread_test <number of spawns> <number of threads> <number of operations per thread> > > > ./oprofile_multithread_test -1 16 100000 > > > > In this example, oprofile_multithread_test spawns threads infinitely but runs maximum 16 threads simultaneously. Each thread performs addition 100000 times and then completes. Please use ^C to end this program if you specify -1 to the number of spawns. > > > > If you profile this program with operf --pid, I expect you will not be able to finish operf by a single ^C. > > > > > > Regards, > > Rei Odaira > > Hi, > > I was able to get the failure mentioned with the example code on a rhel7 machine. > > $ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi` > operf: Press Ctl-c or 'kill -SIGINT 8976' to stop profiling > operf: Profiler started > Unable to collect samples for forked process 9569. Process may have ended before recording could be started. > Unable to collect samples for forked process 9570. Process may have ended before recording could be started. > Unable to collect samples for forked process 9571. Process may have ended before recording could be started. > Unable to collect samples for forked process 9572. Process may have ended before recording could be started. > Unable to collect samples for forked process 9573. Process may have ended before recording could be started. > Unable to collect samples for forked process 9574. Process may have ended before recording could be started. > Unable to collect samples for forked process 9576. Process may have ended before recording could be started. > Unable to collect samples for forked process 9577. Process may have ended before recording could be started. > Unable to collect samples for forked process 9578. Process may have ended before recording could be started. > Unable to collect samples for forked process 9579. Process may have ended before recording could be started. > Unable to collect samples for forked process 9580. Process may have ended before recording could be started. > Unable to collect samples for forked process 9581. Process may have ended before recording could be started. > Unable to collect samples for forked process 9582. Process may have ended before recording could be started. > Unable to collect samples for forked process 9583. Process may have ended before recording could be started. > Unable to collect samples for forked process 9584. Process may have ended before recording could be started. > Unable to collect samples for forked process 9585. Process may have ended before recording could be started. > Unable to collect samples for forked process 9586. Process may have ended before recording could be started. > Unable to collect samples for forked process 9587. Process may have ended before recording could be started. > Unable to collect samples for forked process 9588. Process may have ended before recording could be started. > Unable to collect samples for forked process 9589. Process may have ended before recording could be started. > Unable to collect samples for forked process 9590. Process may have ended before recording could be started. > Unable to collect samples for forked process 9591. Process may have ended before recording could be started. > Unable to collect samples for forked process 9592. Process may have ended before recording could be started. > Unable to collect samples for forked process 9593. Process may have ended before recording could be started. > Unable to collect samples for forked process 9594. Process may have ended before recording could be started. > Unable to collect samples for forked process 9595. Process may have ended before recording could be started. > Unable to collect samples for forked process 9596. Process may have ended before recording could be started. > Unable to collect samples for forked process 9597. Process may have ended before recording could be started. > Unable to collect samples for forked process 9598. Process may have ended before recording could be started. > Unable to collect samples for forked process 9599. Process may have ended before recording could be started. > Unable to collect samples for forked process 9600. Process may have ended before recording could be started. > Unable to collect samples for forked process 9601. Process may have ended before recording could be started. > ^C^Cwaitpid for operf-record process failed: Interrupted system call > ^Cwaitpid for operf-read process failed: Interrupted system call > Error running profiler > > Threads are being created and destroyed very often in the example code. It took multiple times to get operf to connect up to all the threads. Many times I get messages like the following: > > $ /usr/local/bin/operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi` > !!!! No samples collected !!! > The target program/command ended before profiling was started. > operf record init failed > usage: operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ] > Error running profiler > > However, it operf gets started, it seems to reliably fail with the ctl-c. > > -Will > > > > > 2015-10-01 15:42 GMT-05:00 William Cohen <wc...@re... <mailto:wc...@re...> <mailto:wc...@re... <mailto:wc...@re...>>>: > > > > On 09/30/2015 06:07 PM, 大平怜 wrote: > > > Hello again, > > > > > > When using --pid, I have occasionally seen operf does not end by hitting ^C once. By hitting ^C multiple times, operf ends with error messages: > > > > Hi, > > > > I tried to replicate this on my local machine, but haven't seen it occur yet. How often does it happen? Also does it make a difference when the event sampling rate is changed? There is just one monitored process and it isn't spawning new processes? > > > > > > > > ----- > > > operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f CassandraDaemon` > > > Kernel profiling is not possible with current system config. > > > Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples. > > > operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling > > > operf: Profiler started > > > ^C^Cwaitpid for operf-record process failed: Interrupted system call > > > ^Cwaitpid for operf-read process failed: Interrupted system call > > > Error running profiler > > > ----- > > > > > > I am using the master branch in the git repository. > > > > > > Here is what I found: > > > The operf-read process was waiting for a read of a sample ID from the operf-record process to return: > > > (gdb) bt > > > #0 0x00007fd90e0fa480 in __read_nocancel () > > > at ../sysdeps/unix/syscall-template.S:81 > > > #1 0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620, > > > __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44 > > > #2 __handle_fork_event (event=0x98e860) at operf_utils.cpp:125 > > > #3 OP_perf_utils::op_write_event (event=event@entry=0x98e860, > > > sample_type=<optimized out>) at operf_utils.cpp:834 > > > #4 0x0000000000417250 in operf_read::convertPerfData ( > > > this=this@entry=0x648000 <operfRead>) at operf_counter.cpp:1147 > > > #5 0x000000000040a4cb in convert_sample_data () at operf.cpp:947 > > > #6 0x0000000000407482 in _run () at operf.cpp:625 > > > #7 main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539 > > > > > > The operf-record process was waiting for a write of sample data to the operf-read process to complete. Why did the write of the sample data block? My guess is that the sample_data_pipe was full: > > > (gdb) bt > > > #0 0x00007fbe0dc9f4e0 in __write_nocancel () > > > at ../sysdeps/unix/syscall-template.S:81 > > > #1 0x000000000040cd0e in OP_perf_utils::op_write_output (output=6, > > > buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989 > > > #2 0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data ( > > > md=0xd3c7f0, pr=pr@entry=0xd07900) at operf_utils.cpp:1443 > > > #3 0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900) > > > at operf_counter.cpp:846 > > > #4 0x00000000004098b8 in start_profiling () at operf.cpp:402 > > > #5 0x0000000000406305 in _run () at operf.cpp:596 > > > #6 main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539 > > > > > > As a result, when I hit ^C, the operf main process sent SIGUSR1 to the operf-record process, in which the write returned with EINTR and simply got retried. Since the operf-record process did not end, the operf main process waited at waitpid(2) forever. > > > > > > Do you think my guess makes sense? What would be a fundamental solution? Simply extending the pipe size would not be appropriate.... > > > > I am wondering if there are any other nuggets of information that can be gathered by using "--verbose debug,misc" and other "--verbose" variations on the operf command line. It would be worthwhile to take a close look at the code in operf.cpp and see how ctl-c is being handled. There could be a problem with the order that things are shutdown, causing the problem. I noticed around line 406 and of operf.cpp there is the following code: > > > > catch (const runtime_error & re) { > > /* If the user does ctl-c, the operf-record process may get interrupted > > * in a system call, causing problems with writes to the sample data pipe. > > * So we'll ignore such errors unless the user requests debug info. > > */ > > if (!ctl_c || (cverb << vmisc)) { > > cerr << "Caught runtime_error: " << re.what() << endl; > > exit_code = EXIT_FAILURE; > > } > > goto fail_out; > > } > > > > -Will > > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > > > > > > > _______________________________________________ > > oprofile-list mailing list > > opr...@li... <mailto:opr...@li...> > > https://lists.sourceforge.net/lists/listinfo/oprofile-list > > > > > > > > ------------------------------------------------------------------------------ > Attend Shape: An AT&T Tech Expo July 15-16. Meet us at AT&T Park in San > Francisco, CA to explore cutting-edge tech and listen to tech luminaries > present their vision of the future. This family event has something for > everyone, including kids. Get more information and register today. > http://sdm.link/attshape > > > > _______________________________________________ > oprofile-list mailing list > opr...@li... > https://lists.sourceforge.net/lists/listinfo/oprofile-list > |