From: Maynard J. <may...@us...> - 2014-05-23 17:43:17
|
Improve sample collection in multi-threaded apps when using "--pid" option See oprofile bug # 260 (https://sourceforge.net/p/oprofile/bugs/260/) for reference. operf may fail to collect samples for new threads or forked processes in a multi-threaded app under certain conditions: - If operf is started on a multi-threaded app (i.e., one that uses pthreads) with the "--pid" option - If said app has already started at least one thread - It the app either creates new threads or forks new processes after operf has started then no samples are collected for those new threads/processes. This patch fixes that problem (mostly) and documents the limitations and issues when using the "--pid" option. The limitations and issues are: 1. When using "--pid" to profile a multi-threaded application that also forks new processes, samples for processes that are forked before profiling is started may not be recorded (depending on timing of thread creation and when operf is started) 2. The "--lazy-conversion" option is not recommended to be used in conjunction with the --pid option for profiling multi-threaded processes. Depending on the order of thread creation (or forking of new processes), you may not get any samples for the new threads/processes. Signed-off-by: Maynard Johnson <may...@us...> --- doc/operf.1.in | 35 ++++++--- libperf_events/operf_counter.cpp | 151 ++++++++++++++++++++++++++++++++------ libperf_events/operf_counter.h | 24 ++++-- libperf_events/operf_utils.cpp | 39 +++++++++- pe_profiling/operf.cpp | 58 ++++++++++++-- 5 files changed, 254 insertions(+), 53 deletions(-) diff --git a/doc/operf.1.in b/doc/operf.1.in index 72a46ed..5ce43c0 100644 --- a/doc/operf.1.in +++ b/doc/operf.1.in @@ -35,21 +35,19 @@ and lost samples, are written to the operf.log file that can be found in the <session_dir>/samples directory. .br -.SH OPTIONS +.SH RUN MODES +One (and only one) of the following +.SB run modes +must be specified: .TP .BI command [args] The command or application to be profiled. .I args -are the input arguments that the command or application requires. One (and only one) of either -.I command -, -.I --pid -or -.I --system-wide -is required. +are the input arguments that the command or application requires. .br .TP .BI "--pid / -p " PID +.RS This option enables operf to profile a running application. .I PID should be the process ID of the process you wish to profile. When @@ -65,7 +63,13 @@ data it has collected. Use .BI -SIGINT .BI <operf-PID> for this purpose. -.br +.P +.B Limitation: +When using this option to profile a multi-threaded application that also forks +new processes, be aware that samples for processes that are forked before profiling +is started may not be recorded (depending on timing of thread creation and when +operf is started). +.RE .TP .BI "--system-wide / -s" This option is for performing a system-wide profile. You must @@ -86,6 +90,7 @@ that when running operf with this option, the user's current working directory should be /root or a subdirectory of /root to avoid storing sample data files in locations accessible by regular users. .br +.SH OTHER OPTIONS .TP .BI "--vmlinux / -k " vmlinux_path .RS @@ -188,6 +193,7 @@ directory on the current path. .br .TP .BI "--lazy-conversion / -l" +.RS Use this option to reduce the overhead of .BI operf during profiling. Normally, profile data received from the kernel is converted @@ -200,7 +206,16 @@ particularly on busy multi-processor systems. The option directs .BI operf to wait until profiling is completed to do the conversion of profile data. -.br +.P +.B Note: +This option is +.B not +recommended to be used in conjunction with the +.I --pid +option for profiling multi-threaded processes. Depending +on the order of thread creation (or forking of new processes), +you may not get any samples for the new threads/processes. +.RE .TP .BI "--append / -a" By default, diff --git a/libperf_events/operf_counter.cpp b/libperf_events/operf_counter.cpp index dd738aa..9986576 100644 --- a/libperf_events/operf_counter.cpp +++ b/libperf_events/operf_counter.cpp @@ -1,5 +1,5 @@ /** - * @file pe_profiling/operf_counter.cpp + * @file libperf_events/operf_counter.cpp * C++ class implementation that abstracts the user-to-kernel interface * for using Linux Performance Events Subsystem. * @@ -11,7 +11,7 @@ * (C) Copyright IBM Corp. 2011 * * Modified by Maynard Johnson <may...@us...> - * (C) Copyright IBM Corporation 2012 + * (C) Copyright IBM Corporation 2012, 2014 * */ @@ -50,6 +50,7 @@ extern bool throttled; extern size_t mmap_size; extern size_t pg_sz; extern bool use_cpu_minus_one; +extern bool track_new_forks; namespace { @@ -233,7 +234,7 @@ operf_counter::~operf_counter() { } -int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) +int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec, bool print_error) { struct { u64 count; @@ -249,16 +250,21 @@ int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) int ret = -1; cverb << vrecord << "perf_event_open failed: " << strerror(errno) << endl; if (errno == EBUSY) { - cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl - << "On some architectures, tools such as oprofile and perf being used in system-wide " - << "mode can cause this problem." << endl; + if (print_error) { + cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl + << "On some architectures, tools such as oprofile and perf being used in system-wide " + << "mode can cause this problem." << endl; + } ret = OP_PERF_HANDLED_ERROR; } else if (errno == ESRCH) { - cerr << "!!!! No samples collected !!!" << endl; - cerr << "The target program/command ended before profiling was started." << endl; + if (print_error) { + cerr << "!!!! No samples collected !!!" << endl; + cerr << "The target program/command ended before profiling was started." << endl; + } ret = OP_PERF_HANDLED_ERROR; } else { - cerr << "perf_event_open failed with " << strerror(errno) << endl; + if (print_error) + cerr << "perf_event_open failed with " << strerror(errno) << endl; } return ret; } @@ -299,7 +305,8 @@ operf_record::~operf_record() operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_running, vector<operf_event_t> & events, vmlinux_info_t vi, bool do_cg, - bool separate_by_cpu, bool out_fd_is_file) + bool separate_by_cpu, bool out_fd_is_file, + int _convert_read_pipe, int _convert_write_pipe) { struct sigaction sa; sigset_t ss; @@ -317,6 +324,8 @@ operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_ru valid = false; poll_data = NULL; output_fd = out_fd; + read_comm_pipe = _convert_read_pipe; + write_comm_pipe = _convert_write_pipe; write_to_file = out_fd_is_file; opHeader.data_size = 0; num_cpus = -1; @@ -657,6 +666,7 @@ void operf_record::setup() * "inherit" on the perf_event_attr we pass to perf_event_open when cpu is '-1'. */ bool inherit = !profile_process_group; + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); for (unsigned proc_idx = 0; proc_idx < num_procs; proc_idx++) { for (unsigned event = 0; event < evts.size(); event++) { /* For a parent process, comm.tid==comm.pid, but for child @@ -664,14 +674,17 @@ void operf_record::setup() * we must use comm.tid for the perf_event_open call. So * we can use comm.tid for all cases. */ - pid_t pid_for_open = profile_process_group ? procs[proc_idx].tid - : pid_to_profile; + pid_t pid_for_open; + if (profile_process_group) + pid_for_open = proc_it++->second.tid; + else + pid_for_open = pid_to_profile; operf_counter op_ctr(operf_counter(evts[event], (!pid_started && !system_wide), callgraph, separate_cpu, inherit, event)); if ((rc = op_ctr.perf_event_open(pid_for_open, - real_cpu, this)) < 0) { + real_cpu, this, true)) < 0) { err_msg = "Internal Error. Perf event setup failed."; goto error; } @@ -716,24 +729,93 @@ void operf_record::record_process_info(void) { map<unsigned int, unsigned int> pids_mapped; pid_t last_tgid = -1; - for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++) + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); + for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++, proc_it++) { - int num = OP_perf_utils::op_write_output(output_fd, &procs[proc_idx], - procs[proc_idx].header.size); + struct comm_event ce = proc_it->second; + int num = OP_perf_utils::op_write_output(output_fd, &ce, ce.header.size); add_to_total(num); if (cverb << vrecord) - cout << "Created COMM event for " << procs[proc_idx].comm << endl; + cout << "Created COMM event for " << ce.comm << endl; - if (((pid_t)(procs[proc_idx].pid) == last_tgid) || - (pids_mapped.find(procs[proc_idx].pid) != pids_mapped.end())) + if (((pid_t)(ce.pid) == last_tgid) || + (pids_mapped.find(ce.pid) != pids_mapped.end())) continue; - OP_perf_utils::op_record_process_exec_mmaps(procs[proc_idx].tid, - procs[proc_idx].pid, + OP_perf_utils::op_record_process_exec_mmaps(ce.tid, + ce.pid, output_fd, this); - pids_mapped[procs[proc_idx].pid] = last_tgid = procs[proc_idx].pid; + pids_mapped[ce.pid] = last_tgid = ce.pid; + } +} + +int operf_record::_start_recoding_new_thread(pid_t id) +{ + string err_msg; + int num_mmaps, rc, fd_for_set_output = -1; + struct comm_event ce; + u64 sample_id; + struct pollfd * old_polldata = poll_data; + + num_mmaps = sizeof(poll_data)/sizeof(poll_data[0]); + num_mmaps++; + poll_data = new struct pollfd [num_mmaps]; + // Copy only the existing pollfd objects from the array. The new pollfd will + // be filled in via the call to _prepare_to_record_one_fd. + for (int i = 0; i < num_mmaps - 1; i++) + poll_data[i] = old_polldata[i]; + delete[] old_polldata; + // Make a pseudo comm_event object. At this point, the + // only field we need to set is tid. + memset(&ce, sizeof(ce), 0); + ce.tid = id; + add_process(ce); + + for (unsigned event = 0; event < evts.size(); event++) { + operf_counter op_ctr(operf_counter(evts[event], + (!pid_started && !system_wide), + callgraph, separate_cpu, + false, event)); + if (op_ctr.perf_event_open(id, -1, this, false) < 0) { + sample_id = OP_PERF_NO_SAMPLE_ID; + // Send special value to convert process to indicate failure + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); + if (len < 0) + perror("Internal error on convert write_comm_pipe"); + return -1; + } + perfCounters.push_back(op_ctr); + int fd = op_ctr.get_fd(); + if (event == 0) { + rc = _prepare_to_record_one_fd(num_mmaps - 1, fd); + fd_for_set_output = fd; + } else { + if ((rc = ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, + fd_for_set_output)) < 0) + perror("_start_recoding_new_thread: ioctl #1 failed"); + } + + if (rc < 0) + return rc; + + if ((rc = ioctl(fd, PERF_EVENT_IOC_ENABLE)) < 0) { + perror("_start_recoding_new_thread: ioctl #2 failed"); + return rc; + } + + sample_id = opHeader.h_attrs[event].ids.back(); + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); + if (len < 0) + perror("Internal error on convert write_comm_pipe"); + else if (len != sizeof(sample_id)) + cerr << "Incomplete write convert to write_comm_pipe" << endl; + else + cverb << vrecord << "Sent sample_id " << sample_id << " to convert process" << endl; } + + return 0; } + void operf_record::recordPerfData(void) { bool disabled = false; @@ -744,6 +826,8 @@ void operf_record::recordPerfData(void) cerr << "operf: Profiler started" << endl; while (1) { int prev = sample_reads; + pid_t pi; + ssize_t len; for (size_t i = 0; i < samples_array.size(); i++) { if (samples_array[i].base) @@ -755,6 +839,20 @@ void operf_record::recordPerfData(void) if (prev == sample_reads) { (void)poll(poll_data, poll_count, -1); } + if (!quit && track_new_forks && procs.size() > 1) { + len = read(read_comm_pipe, &pi, sizeof(pi)); + + if (len < 0 && errno != EAGAIN) { + cverb << vrecord << "Non-fatal error: read_comm_pipe returned too few bytes" << endl; + } else if (len == sizeof(pi) && (procs.find(pi) == procs.end())) { + // Start profiling this new thread + cverb << vrecord << "Start recording for new thread " << pi << endl; + // Don't treat as fatal error if it doesn't work + if (_start_recoding_new_thread(pi) < 0) + cerr << "Unable to collect samples for forked process " << pi + << ". Process may have ended before recording could be started." << endl; + } + } if (quit) { for (unsigned int i = 0; i < perfCounters.size(); i++) @@ -768,9 +866,11 @@ void operf_record::recordPerfData(void) } void operf_read::init(int sample_data_pipe_fd, string input_filename, string samples_loc, op_cpu cputype, - bool systemwide) + bool systemwide, int _record_write_pipe, int _record_read_pipe) { sample_data_fd = sample_data_pipe_fd; + read_comm_pipe = _record_read_pipe; + write_comm_pipe = _record_write_pipe; inputFname = input_filename; sampledir = samples_loc; cpu_type = cputype; @@ -782,6 +882,11 @@ operf_read::~operf_read() evts.clear(); } +void operf_read::add_sample_id_to_opHeader(u64 sample_id) +{ + for (unsigned int i = 0; i < evts.size(); i++) + opHeader.h_attrs[i].ids.push_back(sample_id); +} int operf_read::_read_header_info_with_ifstream(void) { diff --git a/libperf_events/operf_counter.h b/libperf_events/operf_counter.h index 0874616..dfb4336 100644 --- a/libperf_events/operf_counter.h +++ b/libperf_events/operf_counter.h @@ -1,5 +1,5 @@ /** - * @file pe_profiling/operf_counter.h + * @file libperf_events/operf_counter.h * C++ class definition that abstracts the user-to-kernel interface * for using Linux Performance Events Subsystem. * @@ -11,7 +11,7 @@ * (C) Copyright IBM Corp. 2011 * * Modified by Maynard Johnson <may...@us...> - * (C) Copyright IBM Corporation 2012 + * (C) Copyright IBM Corporation 2012, 2014 * */ @@ -51,6 +51,7 @@ op_perf_event_open(struct perf_event_attr * attr, } #define OP_PERF_HANDLED_ERROR -101 +#define OP_PERF_NO_SAMPLE_ID 0xdeadbeefdeadbeefULL class operf_counter { @@ -58,7 +59,7 @@ public: operf_counter(operf_event_t & evt, bool enable_on_exec, bool callgraph, bool separate_by_cpu, bool inherit, int event_number); ~operf_counter(); - int perf_event_open(pid_t pid, int cpu, operf_record * pr); + int perf_event_open(pid_t pid, int cpu, operf_record * pr, bool print_error); const struct perf_event_attr * the_attr(void) const { return &attr; } int get_fd(void) const { return fd; } int get_id(void) const { return id; } @@ -82,12 +83,13 @@ public: */ operf_record(int output_fd, bool sys_wide, pid_t the_pid, bool pid_running, std::vector<operf_event_t> & evts, OP_perf_utils::vmlinux_info_t vi, - bool callgraph, bool separate_by_cpu, bool output_fd_is_file); + bool callgraph, bool separate_by_cpu, bool output_fd_is_file, + int _convert_read_pipe, int _convert_write_pipe); ~operf_record(); void recordPerfData(void); int out_fd(void) const { return output_fd; } void add_to_total(int n) { total_bytes_recorded += n; } - void add_process(struct comm_event proc) { procs.push_back(proc); } + void add_process(struct comm_event proc) { procs[proc.tid] = proc; } unsigned int get_total_bytes_recorded(void) const { return total_bytes_recorded; } void register_perf_event_id(unsigned counter, u64 id, perf_event_attr evt_attr); bool get_valid(void) { return valid; } @@ -97,11 +99,14 @@ private: void setup(void); int prepareToRecord(void); int _prepare_to_record_one_fd(int idx, int fd); + int _start_recoding_new_thread(pid_t id); void record_process_info(void); void write_op_header_info(void); int _write_header_to_file(void); int _write_header_to_pipe(void); int output_fd; + int read_comm_pipe; + int write_comm_pipe; bool write_to_file; // Array of size 'num_cpus_used_for_perf_event_open * num_pids * num_events' struct pollfd * poll_data; @@ -113,7 +118,7 @@ private: * that information in a collection of type 'comm_event'. We'll use this collection * for synthesizing PERF_RECORD_COMM events into the profile data stream. */ - std::vector<struct comm_event> procs; + std::map<u32, struct comm_event> procs; bool pid_started; bool system_wide; bool callgraph; @@ -134,16 +139,21 @@ public: : sample_data_fd(-1), inputFname(""), evts(_evts), cpu_type(CPU_NO_GOOD) { valid = syswide = false;} void init(int sample_data_pipe_fd, std::string input_filename, std::string samples_dir, op_cpu cputype, - bool systemwide); + bool systemwide, int _record_write_pipe, int _record_read_pipe); ~operf_read(); int readPerfHeader(void); unsigned int convertPerfData(void); bool is_valid(void) {return valid; } int get_eventnum_by_perf_event_id(u64 id) const; inline const operf_event_t * get_event_by_counter(u32 counter) { return &evts[counter]; } + int get_write_comm_pipe(void) { return write_comm_pipe; } + int get_read_comm_pipe(void) { return read_comm_pipe; } + void add_sample_id_to_opHeader(u64 sample_id); private: int sample_data_fd; + int write_comm_pipe; + int read_comm_pipe; std::string inputFname; std::string sampledir; std::ifstream istrm; diff --git a/libperf_events/operf_utils.cpp b/libperf_events/operf_utils.cpp index 68010c1..54a0ace 100644 --- a/libperf_events/operf_utils.cpp +++ b/libperf_events/operf_utils.cpp @@ -10,7 +10,7 @@ * (C) Copyright IBM Corp. 2011 * * Modified by Maynard Johnson <may...@us...> - * (C) Copyright IBM Corporation 2012, 2013 + * (C) Copyright IBM Corporation 2012, 2013, 2014 * */ @@ -48,6 +48,7 @@ extern pid_t app_PID; extern verbose vrecord; extern verbose vconvert; extern void __set_event_throttled(int index); +extern bool track_new_forks; using namespace std; @@ -87,6 +88,8 @@ static void __handle_fork_event(event_t * event) operf_process_info * parent = NULL; operf_process_info * forked_proc = NULL; + // First, see if we already have a proc_info object for the parent process + // that did the fork it = process_map.find(event->fork.ppid); if (it != process_map.end()) { parent = it->second; @@ -101,14 +104,42 @@ static void __handle_fork_event(event_t * event) process_map[event->fork.ppid] = parent; } - /* If the forked process's pid is the same as the parent's, we simply ignore the FORK - * event. This is because operf_process_info objects are stored in the map collection - * by pid, meaning that the forked process and its parent reference the same + /* If the user requested to profile by "--pid", then we must notify the + * recording process whenever we see a fork event. If the record process + * isn't already recording samples for this thread/process, it will start + * recording now. + */ + if (track_new_forks) { + if (cverb << vconvert) + cout << "Inform record process of new pid/tid " + << event->fork.pid << "/" << event->fork.tid << endl; + pid_t id = (event->fork.pid == event->fork.ppid) ? event->fork.tid : + event->fork.pid; + ssize_t len = write(operfRead.get_write_comm_pipe(), &id, sizeof(id)); + if (len < 0) + perror("Internal error on record write_comm_pipe"); + else if (len != sizeof(id)) + cerr << "Incomplete write to record write_comm_pipe" << endl; + u64 sample_id; + // get sample id from recording process + len = read(operfRead.get_read_comm_pipe(), &sample_id, sizeof(sample_id)); + if (sample_id == OP_PERF_NO_SAMPLE_ID) { + cverb << vconvert << "convert: No sample_id from record process" << endl; + } else { + cverb << vconvert << "Add sample_id " << sample_id << " to opHeader" << endl; + operfRead.add_sample_id_to_opHeader(sample_id); + } + } + + /* If the forked process's pid is the same as the parent's, we simply ignore + * the FORK event. This is because operf_process_info objects are stored in the map + * collection by pid, meaning that the forked process and its parent reference the same * operf_process_info object. */ if (event->fork.pid == event->fork.ppid) return; + // Now try to find a proc_info for the forked process itself. it = process_map.find(event->fork.pid); if (it == process_map.end()) { forked_proc = new operf_process_info(event->fork.pid, NULL, false, false); diff --git a/pe_profiling/operf.cpp b/pe_profiling/operf.cpp index ec85f6c..51f55a5 100644 --- a/pe_profiling/operf.cpp +++ b/pe_profiling/operf.cpp @@ -11,7 +11,7 @@ * (C) Copyright IBM Corp. 2011 * * Modified by Maynard Johnson <may...@us...> - * (C) Copyright IBM Corporation 2012, 2013 + * (C) Copyright IBM Corporation 2012, 2013, 2014 * */ @@ -77,12 +77,19 @@ int kptr_restrict; char * start_time_human_readable; std::vector<operf_event_t> events; operf_read operfRead(events); +/* With certain operf options, we have to take extra steps to track new threads + * and processes that an app may create via pthread_create, fork, etc. Note that + * any such thread or process creation APIs will result in a PERF_RECORD_FORK event, + * so we handle these new threads/processes in operf_utils::__handle_fork_event. + */ +bool track_new_forks; #define DEFAULT_OPERF_OUTFILE "operf.data" #define KERN_ADDR_SPACE_START_SYMBOL "_text" #define KERN_ADDR_SPACE_END_SYMBOL "_etext" +static operf_record * operfRecord = NULL; static char * app_name_SAVE = NULL; static char ** app_args = NULL; static pid_t jitconv_pid = -1; @@ -96,6 +103,14 @@ static char start_time_str[32]; static bool jit_conversion_running; static void convert_sample_data(void); static int sample_data_pipe[2]; +static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; +// The operf_convert_record_write_pipe is used for the convert process to send +// forked PID data to the record process. +static int operf_convert_record_write_pipe[2]; +// The operf_record_convert_write_pipe is used for the record process to send +// data to the convert process in response to the forked PID data. +static int operf_record_convert_write_pipe[2]; + bool ctl_c = false; bool pipe_closed = false; @@ -225,8 +240,6 @@ void set_signals_for_parent(void) } } -static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; - static string args_to_string(void) { string ret; @@ -310,6 +323,14 @@ int start_profiling(void) perror("Internal error: could not create pipe"); return -1; } + if (pipe2(operf_convert_record_write_pipe, O_NONBLOCK) < 0) { + perror("Internal error: could not create pipe"); + return -1; + } + if (pipe(operf_record_convert_write_pipe) < 0) { + perror("Internal error: could not create pipe"); + return -1; + } operf_record_pid = fork(); if (operf_record_pid < 0) { return -1; @@ -318,6 +339,8 @@ int start_profiling(void) int exit_code = EXIT_SUCCESS; _set_basic_SIGINT_handler_for_child(); close(operf_record_ready_pipe[0]); + close(operf_convert_record_write_pipe[1]); + close(operf_record_convert_write_pipe[0]); if (!operf_options::post_conversion) close(sample_data_pipe[0]); /* @@ -343,11 +366,12 @@ int start_profiling(void) } else { outfd = sample_data_pipe[1]; } - operf_record operfRecord(outfd, operf_options::system_wide, app_PID, + operfRecord = new operf_record(outfd, operf_options::system_wide, app_PID, (operf_options::pid == app_PID), events, vi, operf_options::callgraph, - operf_options::separate_cpu, operf_options::post_conversion); - if (operfRecord.get_valid() == false) { + operf_options::separate_cpu, operf_options::post_conversion, + operf_convert_record_write_pipe[0], operf_record_convert_write_pipe[1]); + if (operfRecord->get_valid() == false) { /* If valid is false, it means that one of the "known" errors has * occurred: * - profiled process has already ended @@ -370,9 +394,10 @@ int start_profiling(void) } // start recording - operfRecord.recordPerfData(); + operfRecord->recordPerfData(); cverb << vdebug << "Total bytes recorded from perf events: " << dec - << operfRecord.get_total_bytes_recorded() << endl; + << operfRecord->get_total_bytes_recorded() << endl; + delete operfRecord; } catch (const runtime_error & re) { /* If the user does ctl-c, the operf-record process may get interrupted * in a system call, causing problems with writes to the sample data pipe. @@ -388,6 +413,9 @@ int start_profiling(void) _exit(exit_code); fail_out: + if (operfRecord) + delete operfRecord; + if (!ready){ /* ready==0 means we've not yet told parent we're ready, * but the parent is reading our pipe. So we tell the @@ -615,6 +643,10 @@ static end_code_t _run(void) // parent close(sample_data_pipe[0]); close(sample_data_pipe[1]); + close(operf_convert_record_write_pipe[0]); + close(operf_convert_record_write_pipe[1]); + close(operf_record_convert_write_pipe[0]); + close(operf_record_convert_write_pipe[1]); } } @@ -909,7 +941,11 @@ static void convert_sample_data(void) inputfd = sample_data_pipe[0]; inputfname = ""; } - operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, operf_options::system_wide); + close(operf_record_convert_write_pipe[1]); + close(operf_convert_record_write_pipe[0]); + operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, + operf_options::system_wide, operf_convert_record_write_pipe[1], + operf_record_convert_write_pipe[0]); if ((rc = operfRead.readPerfHeader()) < 0) { if (rc != OP_PERF_HANDLED_ERROR) cerr << "Error: Cannot create read header info for sample data " << endl; @@ -1389,6 +1425,10 @@ static void process_args(int argc, char * const argv[]) string startEnd = _process_vmlinux(operf_options::vmlinux); operf_create_vmlinux(operf_options::vmlinux.c_str(), startEnd.c_str()); } + if (operf_options::pid && !operf_options::post_conversion) + track_new_forks = true; + else + track_new_forks = false; return; } -- 1.7.1 |
From: Maynard J. <may...@us...> - 2014-06-09 14:51:40
|
On 05/23/2014 12:43 PM, Maynard Johnson wrote: > Improve sample collection in multi-threaded apps when using "--pid" option To all, I would greatly appreciate any feedback on this patch. Has anyone in the oprofile community encountered the problem which this patch fixes? I imagine a lot of people have encountered the problem, but may not have been aware. For those who have run into this issue, this fix plugs a big gaping hole in the operf sample collection. As such, I view this as a critical fix that I want to commit. But since the patch is non-trivial, I would prefer to have some reviewer(s) take a look at it. Thanks. -Maynard > > See oprofile bug # 260 (https://sourceforge.net/p/oprofile/bugs/260/) for reference. > > operf may fail to collect samples for new threads or forked > processes in a multi-threaded app under certain conditions: > > - If operf is started on a multi-threaded app (i.e., one that uses > pthreads) with the "--pid" option > - If said app has already started at least one thread > - It the app either creates new threads or forks new processes > after operf has started > > then no samples are collected for those new threads/processes. > > This patch fixes that problem (mostly) and documents the limitations and > issues when using the "--pid" option. The limitations and issues are: > > 1. When using "--pid" to profile a multi-threaded application that also forks > new processes, samples for processes that are forked before profiling is > started may not be recorded (depending on timing of thread creation and > when operf is started) > > 2. The "--lazy-conversion" option is not recommended to be used in conjunction > with the --pid option for profiling multi-threaded processes. Depending on > the order of thread creation (or forking of new processes), you may not get > any samples for the new threads/processes. > > Signed-off-by: Maynard Johnson <may...@us...> > --- > doc/operf.1.in | 35 ++++++--- > libperf_events/operf_counter.cpp | 151 ++++++++++++++++++++++++++++++++------ > libperf_events/operf_counter.h | 24 ++++-- > libperf_events/operf_utils.cpp | 39 +++++++++- > pe_profiling/operf.cpp | 58 ++++++++++++-- > 5 files changed, 254 insertions(+), 53 deletions(-) > > diff --git a/doc/operf.1.in b/doc/operf.1.in > index 72a46ed..5ce43c0 100644 > --- a/doc/operf.1.in > +++ b/doc/operf.1.in > @@ -35,21 +35,19 @@ and lost samples, are written to the operf.log file that can be found in the > <session_dir>/samples directory. > .br > > -.SH OPTIONS > +.SH RUN MODES > +One (and only one) of the following > +.SB run modes > +must be specified: > .TP > .BI command [args] > The command or application to be profiled. > .I args > -are the input arguments that the command or application requires. One (and only one) of either > -.I command > -, > -.I --pid > -or > -.I --system-wide > -is required. > +are the input arguments that the command or application requires. > .br > .TP > .BI "--pid / -p " PID > +.RS > This option enables operf to profile a running application. > .I PID > should be the process ID of the process you wish to profile. When > @@ -65,7 +63,13 @@ data it has collected. Use > .BI -SIGINT > .BI <operf-PID> > for this purpose. > -.br > +.P > +.B Limitation: > +When using this option to profile a multi-threaded application that also forks > +new processes, be aware that samples for processes that are forked before profiling > +is started may not be recorded (depending on timing of thread creation and when > +operf is started). > +.RE > .TP > .BI "--system-wide / -s" > This option is for performing a system-wide profile. You must > @@ -86,6 +90,7 @@ that when running operf with this option, the user's current working > directory should be /root or a subdirectory of /root to avoid > storing sample data files in locations accessible by regular users. > .br > +.SH OTHER OPTIONS > .TP > .BI "--vmlinux / -k " vmlinux_path > .RS > @@ -188,6 +193,7 @@ directory on the current path. > .br > .TP > .BI "--lazy-conversion / -l" > +.RS > Use this option to reduce the overhead of > .BI operf > during profiling. Normally, profile data received from the kernel is converted > @@ -200,7 +206,16 @@ particularly on busy multi-processor systems. The > option directs > .BI operf > to wait until profiling is completed to do the conversion of profile data. > -.br > +.P > +.B Note: > +This option is > +.B not > +recommended to be used in conjunction with the > +.I --pid > +option for profiling multi-threaded processes. Depending > +on the order of thread creation (or forking of new processes), > +you may not get any samples for the new threads/processes. > +.RE > .TP > .BI "--append / -a" > By default, > diff --git a/libperf_events/operf_counter.cpp b/libperf_events/operf_counter.cpp > index dd738aa..9986576 100644 > --- a/libperf_events/operf_counter.cpp > +++ b/libperf_events/operf_counter.cpp > @@ -1,5 +1,5 @@ > /** > - * @file pe_profiling/operf_counter.cpp > + * @file libperf_events/operf_counter.cpp > * C++ class implementation that abstracts the user-to-kernel interface > * for using Linux Performance Events Subsystem. > * > @@ -11,7 +11,7 @@ > * (C) Copyright IBM Corp. 2011 > * > * Modified by Maynard Johnson <may...@us...> > - * (C) Copyright IBM Corporation 2012 > + * (C) Copyright IBM Corporation 2012, 2014 > * > */ > > @@ -50,6 +50,7 @@ extern bool throttled; > extern size_t mmap_size; > extern size_t pg_sz; > extern bool use_cpu_minus_one; > +extern bool track_new_forks; > > namespace { > > @@ -233,7 +234,7 @@ operf_counter::~operf_counter() { > } > > > -int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) > +int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec, bool print_error) > { > struct { > u64 count; > @@ -249,16 +250,21 @@ int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) > int ret = -1; > cverb << vrecord << "perf_event_open failed: " << strerror(errno) << endl; > if (errno == EBUSY) { > - cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl > - << "On some architectures, tools such as oprofile and perf being used in system-wide " > - << "mode can cause this problem." << endl; > + if (print_error) { > + cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl > + << "On some architectures, tools such as oprofile and perf being used in system-wide " > + << "mode can cause this problem." << endl; > + } > ret = OP_PERF_HANDLED_ERROR; > } else if (errno == ESRCH) { > - cerr << "!!!! No samples collected !!!" << endl; > - cerr << "The target program/command ended before profiling was started." << endl; > + if (print_error) { > + cerr << "!!!! No samples collected !!!" << endl; > + cerr << "The target program/command ended before profiling was started." << endl; > + } > ret = OP_PERF_HANDLED_ERROR; > } else { > - cerr << "perf_event_open failed with " << strerror(errno) << endl; > + if (print_error) > + cerr << "perf_event_open failed with " << strerror(errno) << endl; > } > return ret; > } > @@ -299,7 +305,8 @@ operf_record::~operf_record() > > operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_running, > vector<operf_event_t> & events, vmlinux_info_t vi, bool do_cg, > - bool separate_by_cpu, bool out_fd_is_file) > + bool separate_by_cpu, bool out_fd_is_file, > + int _convert_read_pipe, int _convert_write_pipe) > { > struct sigaction sa; > sigset_t ss; > @@ -317,6 +324,8 @@ operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_ru > valid = false; > poll_data = NULL; > output_fd = out_fd; > + read_comm_pipe = _convert_read_pipe; > + write_comm_pipe = _convert_write_pipe; > write_to_file = out_fd_is_file; > opHeader.data_size = 0; > num_cpus = -1; > @@ -657,6 +666,7 @@ void operf_record::setup() > * "inherit" on the perf_event_attr we pass to perf_event_open when cpu is '-1'. > */ > bool inherit = !profile_process_group; > + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); > for (unsigned proc_idx = 0; proc_idx < num_procs; proc_idx++) { > for (unsigned event = 0; event < evts.size(); event++) { > /* For a parent process, comm.tid==comm.pid, but for child > @@ -664,14 +674,17 @@ void operf_record::setup() > * we must use comm.tid for the perf_event_open call. So > * we can use comm.tid for all cases. > */ > - pid_t pid_for_open = profile_process_group ? procs[proc_idx].tid > - : pid_to_profile; > + pid_t pid_for_open; > + if (profile_process_group) > + pid_for_open = proc_it++->second.tid; > + else > + pid_for_open = pid_to_profile; > operf_counter op_ctr(operf_counter(evts[event], > (!pid_started && !system_wide), > callgraph, separate_cpu, > inherit, event)); > if ((rc = op_ctr.perf_event_open(pid_for_open, > - real_cpu, this)) < 0) { > + real_cpu, this, true)) < 0) { > err_msg = "Internal Error. Perf event setup failed."; > goto error; > } > @@ -716,24 +729,93 @@ void operf_record::record_process_info(void) > { > map<unsigned int, unsigned int> pids_mapped; > pid_t last_tgid = -1; > - for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++) > + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); > + for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++, proc_it++) > { > - int num = OP_perf_utils::op_write_output(output_fd, &procs[proc_idx], > - procs[proc_idx].header.size); > + struct comm_event ce = proc_it->second; > + int num = OP_perf_utils::op_write_output(output_fd, &ce, ce.header.size); > add_to_total(num); > if (cverb << vrecord) > - cout << "Created COMM event for " << procs[proc_idx].comm << endl; > + cout << "Created COMM event for " << ce.comm << endl; > > - if (((pid_t)(procs[proc_idx].pid) == last_tgid) || > - (pids_mapped.find(procs[proc_idx].pid) != pids_mapped.end())) > + if (((pid_t)(ce.pid) == last_tgid) || > + (pids_mapped.find(ce.pid) != pids_mapped.end())) > continue; > - OP_perf_utils::op_record_process_exec_mmaps(procs[proc_idx].tid, > - procs[proc_idx].pid, > + OP_perf_utils::op_record_process_exec_mmaps(ce.tid, > + ce.pid, > output_fd, this); > - pids_mapped[procs[proc_idx].pid] = last_tgid = procs[proc_idx].pid; > + pids_mapped[ce.pid] = last_tgid = ce.pid; > + } > +} > + > +int operf_record::_start_recoding_new_thread(pid_t id) > +{ > + string err_msg; > + int num_mmaps, rc, fd_for_set_output = -1; > + struct comm_event ce; > + u64 sample_id; > + struct pollfd * old_polldata = poll_data; > + > + num_mmaps = sizeof(poll_data)/sizeof(poll_data[0]); > + num_mmaps++; > + poll_data = new struct pollfd [num_mmaps]; > + // Copy only the existing pollfd objects from the array. The new pollfd will > + // be filled in via the call to _prepare_to_record_one_fd. > + for (int i = 0; i < num_mmaps - 1; i++) > + poll_data[i] = old_polldata[i]; > + delete[] old_polldata; > + // Make a pseudo comm_event object. At this point, the > + // only field we need to set is tid. > + memset(&ce, sizeof(ce), 0); > + ce.tid = id; > + add_process(ce); > + > + for (unsigned event = 0; event < evts.size(); event++) { > + operf_counter op_ctr(operf_counter(evts[event], > + (!pid_started && !system_wide), > + callgraph, separate_cpu, > + false, event)); > + if (op_ctr.perf_event_open(id, -1, this, false) < 0) { > + sample_id = OP_PERF_NO_SAMPLE_ID; > + // Send special value to convert process to indicate failure > + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); > + if (len < 0) > + perror("Internal error on convert write_comm_pipe"); > + return -1; > + } > + perfCounters.push_back(op_ctr); > + int fd = op_ctr.get_fd(); > + if (event == 0) { > + rc = _prepare_to_record_one_fd(num_mmaps - 1, fd); > + fd_for_set_output = fd; > + } else { > + if ((rc = ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, > + fd_for_set_output)) < 0) > + perror("_start_recoding_new_thread: ioctl #1 failed"); > + } > + > + if (rc < 0) > + return rc; > + > + if ((rc = ioctl(fd, PERF_EVENT_IOC_ENABLE)) < 0) { > + perror("_start_recoding_new_thread: ioctl #2 failed"); > + return rc; > + } > + > + sample_id = opHeader.h_attrs[event].ids.back(); > + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); > + if (len < 0) > + perror("Internal error on convert write_comm_pipe"); > + else if (len != sizeof(sample_id)) > + cerr << "Incomplete write convert to write_comm_pipe" << endl; > + else > + cverb << vrecord << "Sent sample_id " << sample_id << " to convert process" << endl; > } > + > + return 0; > } > > + > void operf_record::recordPerfData(void) > { > bool disabled = false; > @@ -744,6 +826,8 @@ void operf_record::recordPerfData(void) > cerr << "operf: Profiler started" << endl; > while (1) { > int prev = sample_reads; > + pid_t pi; > + ssize_t len; > > for (size_t i = 0; i < samples_array.size(); i++) { > if (samples_array[i].base) > @@ -755,6 +839,20 @@ void operf_record::recordPerfData(void) > if (prev == sample_reads) { > (void)poll(poll_data, poll_count, -1); > } > + if (!quit && track_new_forks && procs.size() > 1) { > + len = read(read_comm_pipe, &pi, sizeof(pi)); > + > + if (len < 0 && errno != EAGAIN) { > + cverb << vrecord << "Non-fatal error: read_comm_pipe returned too few bytes" << endl; > + } else if (len == sizeof(pi) && (procs.find(pi) == procs.end())) { > + // Start profiling this new thread > + cverb << vrecord << "Start recording for new thread " << pi << endl; > + // Don't treat as fatal error if it doesn't work > + if (_start_recoding_new_thread(pi) < 0) > + cerr << "Unable to collect samples for forked process " << pi > + << ". Process may have ended before recording could be started." << endl; > + } > + } > > if (quit) { > for (unsigned int i = 0; i < perfCounters.size(); i++) > @@ -768,9 +866,11 @@ void operf_record::recordPerfData(void) > } > > void operf_read::init(int sample_data_pipe_fd, string input_filename, string samples_loc, op_cpu cputype, > - bool systemwide) > + bool systemwide, int _record_write_pipe, int _record_read_pipe) > { > sample_data_fd = sample_data_pipe_fd; > + read_comm_pipe = _record_read_pipe; > + write_comm_pipe = _record_write_pipe; > inputFname = input_filename; > sampledir = samples_loc; > cpu_type = cputype; > @@ -782,6 +882,11 @@ operf_read::~operf_read() > evts.clear(); > } > > +void operf_read::add_sample_id_to_opHeader(u64 sample_id) > +{ > + for (unsigned int i = 0; i < evts.size(); i++) > + opHeader.h_attrs[i].ids.push_back(sample_id); > +} > > int operf_read::_read_header_info_with_ifstream(void) > { > diff --git a/libperf_events/operf_counter.h b/libperf_events/operf_counter.h > index 0874616..dfb4336 100644 > --- a/libperf_events/operf_counter.h > +++ b/libperf_events/operf_counter.h > @@ -1,5 +1,5 @@ > /** > - * @file pe_profiling/operf_counter.h > + * @file libperf_events/operf_counter.h > * C++ class definition that abstracts the user-to-kernel interface > * for using Linux Performance Events Subsystem. > * > @@ -11,7 +11,7 @@ > * (C) Copyright IBM Corp. 2011 > * > * Modified by Maynard Johnson <may...@us...> > - * (C) Copyright IBM Corporation 2012 > + * (C) Copyright IBM Corporation 2012, 2014 > * > */ > > @@ -51,6 +51,7 @@ op_perf_event_open(struct perf_event_attr * attr, > } > > #define OP_PERF_HANDLED_ERROR -101 > +#define OP_PERF_NO_SAMPLE_ID 0xdeadbeefdeadbeefULL > > > class operf_counter { > @@ -58,7 +59,7 @@ public: > operf_counter(operf_event_t & evt, bool enable_on_exec, bool callgraph, > bool separate_by_cpu, bool inherit, int event_number); > ~operf_counter(); > - int perf_event_open(pid_t pid, int cpu, operf_record * pr); > + int perf_event_open(pid_t pid, int cpu, operf_record * pr, bool print_error); > const struct perf_event_attr * the_attr(void) const { return &attr; } > int get_fd(void) const { return fd; } > int get_id(void) const { return id; } > @@ -82,12 +83,13 @@ public: > */ > operf_record(int output_fd, bool sys_wide, pid_t the_pid, bool pid_running, > std::vector<operf_event_t> & evts, OP_perf_utils::vmlinux_info_t vi, > - bool callgraph, bool separate_by_cpu, bool output_fd_is_file); > + bool callgraph, bool separate_by_cpu, bool output_fd_is_file, > + int _convert_read_pipe, int _convert_write_pipe); > ~operf_record(); > void recordPerfData(void); > int out_fd(void) const { return output_fd; } > void add_to_total(int n) { total_bytes_recorded += n; } > - void add_process(struct comm_event proc) { procs.push_back(proc); } > + void add_process(struct comm_event proc) { procs[proc.tid] = proc; } > unsigned int get_total_bytes_recorded(void) const { return total_bytes_recorded; } > void register_perf_event_id(unsigned counter, u64 id, perf_event_attr evt_attr); > bool get_valid(void) { return valid; } > @@ -97,11 +99,14 @@ private: > void setup(void); > int prepareToRecord(void); > int _prepare_to_record_one_fd(int idx, int fd); > + int _start_recoding_new_thread(pid_t id); > void record_process_info(void); > void write_op_header_info(void); > int _write_header_to_file(void); > int _write_header_to_pipe(void); > int output_fd; > + int read_comm_pipe; > + int write_comm_pipe; > bool write_to_file; > // Array of size 'num_cpus_used_for_perf_event_open * num_pids * num_events' > struct pollfd * poll_data; > @@ -113,7 +118,7 @@ private: > * that information in a collection of type 'comm_event'. We'll use this collection > * for synthesizing PERF_RECORD_COMM events into the profile data stream. > */ > - std::vector<struct comm_event> procs; > + std::map<u32, struct comm_event> procs; > bool pid_started; > bool system_wide; > bool callgraph; > @@ -134,16 +139,21 @@ public: > : sample_data_fd(-1), inputFname(""), evts(_evts), cpu_type(CPU_NO_GOOD) > { valid = syswide = false;} > void init(int sample_data_pipe_fd, std::string input_filename, std::string samples_dir, op_cpu cputype, > - bool systemwide); > + bool systemwide, int _record_write_pipe, int _record_read_pipe); > ~operf_read(); > int readPerfHeader(void); > unsigned int convertPerfData(void); > bool is_valid(void) {return valid; } > int get_eventnum_by_perf_event_id(u64 id) const; > inline const operf_event_t * get_event_by_counter(u32 counter) { return &evts[counter]; } > + int get_write_comm_pipe(void) { return write_comm_pipe; } > + int get_read_comm_pipe(void) { return read_comm_pipe; } > + void add_sample_id_to_opHeader(u64 sample_id); > > private: > int sample_data_fd; > + int write_comm_pipe; > + int read_comm_pipe; > std::string inputFname; > std::string sampledir; > std::ifstream istrm; > diff --git a/libperf_events/operf_utils.cpp b/libperf_events/operf_utils.cpp > index 68010c1..54a0ace 100644 > --- a/libperf_events/operf_utils.cpp > +++ b/libperf_events/operf_utils.cpp > @@ -10,7 +10,7 @@ > * (C) Copyright IBM Corp. 2011 > * > * Modified by Maynard Johnson <may...@us...> > - * (C) Copyright IBM Corporation 2012, 2013 > + * (C) Copyright IBM Corporation 2012, 2013, 2014 > * > */ > > @@ -48,6 +48,7 @@ extern pid_t app_PID; > extern verbose vrecord; > extern verbose vconvert; > extern void __set_event_throttled(int index); > +extern bool track_new_forks; > > using namespace std; > > @@ -87,6 +88,8 @@ static void __handle_fork_event(event_t * event) > operf_process_info * parent = NULL; > operf_process_info * forked_proc = NULL; > > + // First, see if we already have a proc_info object for the parent process > + // that did the fork > it = process_map.find(event->fork.ppid); > if (it != process_map.end()) { > parent = it->second; > @@ -101,14 +104,42 @@ static void __handle_fork_event(event_t * event) > process_map[event->fork.ppid] = parent; > } > > - /* If the forked process's pid is the same as the parent's, we simply ignore the FORK > - * event. This is because operf_process_info objects are stored in the map collection > - * by pid, meaning that the forked process and its parent reference the same > + /* If the user requested to profile by "--pid", then we must notify the > + * recording process whenever we see a fork event. If the record process > + * isn't already recording samples for this thread/process, it will start > + * recording now. > + */ > + if (track_new_forks) { > + if (cverb << vconvert) > + cout << "Inform record process of new pid/tid " > + << event->fork.pid << "/" << event->fork.tid << endl; > + pid_t id = (event->fork.pid == event->fork.ppid) ? event->fork.tid : > + event->fork.pid; > + ssize_t len = write(operfRead.get_write_comm_pipe(), &id, sizeof(id)); > + if (len < 0) > + perror("Internal error on record write_comm_pipe"); > + else if (len != sizeof(id)) > + cerr << "Incomplete write to record write_comm_pipe" << endl; > + u64 sample_id; > + // get sample id from recording process > + len = read(operfRead.get_read_comm_pipe(), &sample_id, sizeof(sample_id)); > + if (sample_id == OP_PERF_NO_SAMPLE_ID) { > + cverb << vconvert << "convert: No sample_id from record process" << endl; > + } else { > + cverb << vconvert << "Add sample_id " << sample_id << " to opHeader" << endl; > + operfRead.add_sample_id_to_opHeader(sample_id); > + } > + } > + > + /* If the forked process's pid is the same as the parent's, we simply ignore > + * the FORK event. This is because operf_process_info objects are stored in the map > + * collection by pid, meaning that the forked process and its parent reference the same > * operf_process_info object. > */ > if (event->fork.pid == event->fork.ppid) > return; > > + // Now try to find a proc_info for the forked process itself. > it = process_map.find(event->fork.pid); > if (it == process_map.end()) { > forked_proc = new operf_process_info(event->fork.pid, NULL, false, false); > diff --git a/pe_profiling/operf.cpp b/pe_profiling/operf.cpp > index ec85f6c..51f55a5 100644 > --- a/pe_profiling/operf.cpp > +++ b/pe_profiling/operf.cpp > @@ -11,7 +11,7 @@ > * (C) Copyright IBM Corp. 2011 > * > * Modified by Maynard Johnson <may...@us...> > - * (C) Copyright IBM Corporation 2012, 2013 > + * (C) Copyright IBM Corporation 2012, 2013, 2014 > * > */ > > @@ -77,12 +77,19 @@ int kptr_restrict; > char * start_time_human_readable; > std::vector<operf_event_t> events; > operf_read operfRead(events); > +/* With certain operf options, we have to take extra steps to track new threads > + * and processes that an app may create via pthread_create, fork, etc. Note that > + * any such thread or process creation APIs will result in a PERF_RECORD_FORK event, > + * so we handle these new threads/processes in operf_utils::__handle_fork_event. > + */ > +bool track_new_forks; > > > #define DEFAULT_OPERF_OUTFILE "operf.data" > #define KERN_ADDR_SPACE_START_SYMBOL "_text" > #define KERN_ADDR_SPACE_END_SYMBOL "_etext" > > +static operf_record * operfRecord = NULL; > static char * app_name_SAVE = NULL; > static char ** app_args = NULL; > static pid_t jitconv_pid = -1; > @@ -96,6 +103,14 @@ static char start_time_str[32]; > static bool jit_conversion_running; > static void convert_sample_data(void); > static int sample_data_pipe[2]; > +static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; > +// The operf_convert_record_write_pipe is used for the convert process to send > +// forked PID data to the record process. > +static int operf_convert_record_write_pipe[2]; > +// The operf_record_convert_write_pipe is used for the record process to send > +// data to the convert process in response to the forked PID data. > +static int operf_record_convert_write_pipe[2]; > + > bool ctl_c = false; > bool pipe_closed = false; > > @@ -225,8 +240,6 @@ void set_signals_for_parent(void) > } > } > > -static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; > - > static string args_to_string(void) > { > string ret; > @@ -310,6 +323,14 @@ int start_profiling(void) > perror("Internal error: could not create pipe"); > return -1; > } > + if (pipe2(operf_convert_record_write_pipe, O_NONBLOCK) < 0) { > + perror("Internal error: could not create pipe"); > + return -1; > + } > + if (pipe(operf_record_convert_write_pipe) < 0) { > + perror("Internal error: could not create pipe"); > + return -1; > + } > operf_record_pid = fork(); > if (operf_record_pid < 0) { > return -1; > @@ -318,6 +339,8 @@ int start_profiling(void) > int exit_code = EXIT_SUCCESS; > _set_basic_SIGINT_handler_for_child(); > close(operf_record_ready_pipe[0]); > + close(operf_convert_record_write_pipe[1]); > + close(operf_record_convert_write_pipe[0]); > if (!operf_options::post_conversion) > close(sample_data_pipe[0]); > /* > @@ -343,11 +366,12 @@ int start_profiling(void) > } else { > outfd = sample_data_pipe[1]; > } > - operf_record operfRecord(outfd, operf_options::system_wide, app_PID, > + operfRecord = new operf_record(outfd, operf_options::system_wide, app_PID, > (operf_options::pid == app_PID), events, vi, > operf_options::callgraph, > - operf_options::separate_cpu, operf_options::post_conversion); > - if (operfRecord.get_valid() == false) { > + operf_options::separate_cpu, operf_options::post_conversion, > + operf_convert_record_write_pipe[0], operf_record_convert_write_pipe[1]); > + if (operfRecord->get_valid() == false) { > /* If valid is false, it means that one of the "known" errors has > * occurred: > * - profiled process has already ended > @@ -370,9 +394,10 @@ int start_profiling(void) > } > > // start recording > - operfRecord.recordPerfData(); > + operfRecord->recordPerfData(); > cverb << vdebug << "Total bytes recorded from perf events: " << dec > - << operfRecord.get_total_bytes_recorded() << endl; > + << operfRecord->get_total_bytes_recorded() << endl; > + delete operfRecord; > } catch (const runtime_error & re) { > /* If the user does ctl-c, the operf-record process may get interrupted > * in a system call, causing problems with writes to the sample data pipe. > @@ -388,6 +413,9 @@ int start_profiling(void) > _exit(exit_code); > > fail_out: > + if (operfRecord) > + delete operfRecord; > + > if (!ready){ > /* ready==0 means we've not yet told parent we're ready, > * but the parent is reading our pipe. So we tell the > @@ -615,6 +643,10 @@ static end_code_t _run(void) > // parent > close(sample_data_pipe[0]); > close(sample_data_pipe[1]); > + close(operf_convert_record_write_pipe[0]); > + close(operf_convert_record_write_pipe[1]); > + close(operf_record_convert_write_pipe[0]); > + close(operf_record_convert_write_pipe[1]); > } > } > > @@ -909,7 +941,11 @@ static void convert_sample_data(void) > inputfd = sample_data_pipe[0]; > inputfname = ""; > } > - operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, operf_options::system_wide); > + close(operf_record_convert_write_pipe[1]); > + close(operf_convert_record_write_pipe[0]); > + operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, > + operf_options::system_wide, operf_convert_record_write_pipe[1], > + operf_record_convert_write_pipe[0]); > if ((rc = operfRead.readPerfHeader()) < 0) { > if (rc != OP_PERF_HANDLED_ERROR) > cerr << "Error: Cannot create read header info for sample data " << endl; > @@ -1389,6 +1425,10 @@ static void process_args(int argc, char * const argv[]) > string startEnd = _process_vmlinux(operf_options::vmlinux); > operf_create_vmlinux(operf_options::vmlinux.c_str(), startEnd.c_str()); > } > + if (operf_options::pid && !operf_options::post_conversion) > + track_new_forks = true; > + else > + track_new_forks = false; > > return; > } > |
From: Maynard J. <may...@us...> - 2014-06-23 20:04:31
|
On 06/09/2014 09:51 AM, Maynard Johnson wrote: > On 05/23/2014 12:43 PM, Maynard Johnson wrote: >> Improve sample collection in multi-threaded apps when using "--pid" option > To all, > I would greatly appreciate any feedback on this patch. Has anyone in the oprofile > community encountered the problem which this patch fixes? I imagine a lot of people > have encountered the problem, but may not have been aware. For those who have run > into this issue, this fix plugs a big gaping hole in the operf sample collection. > As such, I view this as a critical fix that I want to commit. But since the patch > is non-trivial, I would prefer to have some reviewer(s) take a look at it. Carl, thanks for testing and reviewing the patch (Carl's comments can be seen at https://sourceforge.net/p/oprofile/bugs/260/). Patch applied. -Maynard > > Thanks. > > -Maynard >> >> See oprofile bug # 260 (https://sourceforge.net/p/oprofile/bugs/260/) for reference. >> >> operf may fail to collect samples for new threads or forked >> processes in a multi-threaded app under certain conditions: >> >> - If operf is started on a multi-threaded app (i.e., one that uses >> pthreads) with the "--pid" option >> - If said app has already started at least one thread >> - It the app either creates new threads or forks new processes >> after operf has started >> >> then no samples are collected for those new threads/processes. >> >> This patch fixes that problem (mostly) and documents the limitations and >> issues when using the "--pid" option. The limitations and issues are: >> >> 1. When using "--pid" to profile a multi-threaded application that also forks >> new processes, samples for processes that are forked before profiling is >> started may not be recorded (depending on timing of thread creation and >> when operf is started) >> >> 2. The "--lazy-conversion" option is not recommended to be used in conjunction >> with the --pid option for profiling multi-threaded processes. Depending on >> the order of thread creation (or forking of new processes), you may not get >> any samples for the new threads/processes. >> >> Signed-off-by: Maynard Johnson <may...@us...> >> --- >> doc/operf.1.in | 35 ++++++--- >> libperf_events/operf_counter.cpp | 151 ++++++++++++++++++++++++++++++++------ >> libperf_events/operf_counter.h | 24 ++++-- >> libperf_events/operf_utils.cpp | 39 +++++++++- >> pe_profiling/operf.cpp | 58 ++++++++++++-- >> 5 files changed, 254 insertions(+), 53 deletions(-) >> >> diff --git a/doc/operf.1.in b/doc/operf.1.in >> index 72a46ed..5ce43c0 100644 >> --- a/doc/operf.1.in >> +++ b/doc/operf.1.in >> @@ -35,21 +35,19 @@ and lost samples, are written to the operf.log file that can be found in the >> <session_dir>/samples directory. >> .br >> >> -.SH OPTIONS >> +.SH RUN MODES >> +One (and only one) of the following >> +.SB run modes >> +must be specified: >> .TP >> .BI command [args] >> The command or application to be profiled. >> .I args >> -are the input arguments that the command or application requires. One (and only one) of either >> -.I command >> -, >> -.I --pid >> -or >> -.I --system-wide >> -is required. >> +are the input arguments that the command or application requires. >> .br >> .TP >> .BI "--pid / -p " PID >> +.RS >> This option enables operf to profile a running application. >> .I PID >> should be the process ID of the process you wish to profile. When >> @@ -65,7 +63,13 @@ data it has collected. Use >> .BI -SIGINT >> .BI <operf-PID> >> for this purpose. >> -.br >> +.P >> +.B Limitation: >> +When using this option to profile a multi-threaded application that also forks >> +new processes, be aware that samples for processes that are forked before profiling >> +is started may not be recorded (depending on timing of thread creation and when >> +operf is started). >> +.RE >> .TP >> .BI "--system-wide / -s" >> This option is for performing a system-wide profile. You must >> @@ -86,6 +90,7 @@ that when running operf with this option, the user's current working >> directory should be /root or a subdirectory of /root to avoid >> storing sample data files in locations accessible by regular users. >> .br >> +.SH OTHER OPTIONS >> .TP >> .BI "--vmlinux / -k " vmlinux_path >> .RS >> @@ -188,6 +193,7 @@ directory on the current path. >> .br >> .TP >> .BI "--lazy-conversion / -l" >> +.RS >> Use this option to reduce the overhead of >> .BI operf >> during profiling. Normally, profile data received from the kernel is converted >> @@ -200,7 +206,16 @@ particularly on busy multi-processor systems. The >> option directs >> .BI operf >> to wait until profiling is completed to do the conversion of profile data. >> -.br >> +.P >> +.B Note: >> +This option is >> +.B not >> +recommended to be used in conjunction with the >> +.I --pid >> +option for profiling multi-threaded processes. Depending >> +on the order of thread creation (or forking of new processes), >> +you may not get any samples for the new threads/processes. >> +.RE >> .TP >> .BI "--append / -a" >> By default, >> diff --git a/libperf_events/operf_counter.cpp b/libperf_events/operf_counter.cpp >> index dd738aa..9986576 100644 >> --- a/libperf_events/operf_counter.cpp >> +++ b/libperf_events/operf_counter.cpp >> @@ -1,5 +1,5 @@ >> /** >> - * @file pe_profiling/operf_counter.cpp >> + * @file libperf_events/operf_counter.cpp >> * C++ class implementation that abstracts the user-to-kernel interface >> * for using Linux Performance Events Subsystem. >> * >> @@ -11,7 +11,7 @@ >> * (C) Copyright IBM Corp. 2011 >> * >> * Modified by Maynard Johnson <may...@us...> >> - * (C) Copyright IBM Corporation 2012 >> + * (C) Copyright IBM Corporation 2012, 2014 >> * >> */ >> >> @@ -50,6 +50,7 @@ extern bool throttled; >> extern size_t mmap_size; >> extern size_t pg_sz; >> extern bool use_cpu_minus_one; >> +extern bool track_new_forks; >> >> namespace { >> >> @@ -233,7 +234,7 @@ operf_counter::~operf_counter() { >> } >> >> >> -int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) >> +int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec, bool print_error) >> { >> struct { >> u64 count; >> @@ -249,16 +250,21 @@ int operf_counter::perf_event_open(pid_t pid, int cpu, operf_record * rec) >> int ret = -1; >> cverb << vrecord << "perf_event_open failed: " << strerror(errno) << endl; >> if (errno == EBUSY) { >> - cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl >> - << "On some architectures, tools such as oprofile and perf being used in system-wide " >> - << "mode can cause this problem." << endl; >> + if (print_error) { >> + cerr << "The performance monitoring hardware reports EBUSY. Is another profiling tool in use?" << endl >> + << "On some architectures, tools such as oprofile and perf being used in system-wide " >> + << "mode can cause this problem." << endl; >> + } >> ret = OP_PERF_HANDLED_ERROR; >> } else if (errno == ESRCH) { >> - cerr << "!!!! No samples collected !!!" << endl; >> - cerr << "The target program/command ended before profiling was started." << endl; >> + if (print_error) { >> + cerr << "!!!! No samples collected !!!" << endl; >> + cerr << "The target program/command ended before profiling was started." << endl; >> + } >> ret = OP_PERF_HANDLED_ERROR; >> } else { >> - cerr << "perf_event_open failed with " << strerror(errno) << endl; >> + if (print_error) >> + cerr << "perf_event_open failed with " << strerror(errno) << endl; >> } >> return ret; >> } >> @@ -299,7 +305,8 @@ operf_record::~operf_record() >> >> operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_running, >> vector<operf_event_t> & events, vmlinux_info_t vi, bool do_cg, >> - bool separate_by_cpu, bool out_fd_is_file) >> + bool separate_by_cpu, bool out_fd_is_file, >> + int _convert_read_pipe, int _convert_write_pipe) >> { >> struct sigaction sa; >> sigset_t ss; >> @@ -317,6 +324,8 @@ operf_record::operf_record(int out_fd, bool sys_wide, pid_t the_pid, bool pid_ru >> valid = false; >> poll_data = NULL; >> output_fd = out_fd; >> + read_comm_pipe = _convert_read_pipe; >> + write_comm_pipe = _convert_write_pipe; >> write_to_file = out_fd_is_file; >> opHeader.data_size = 0; >> num_cpus = -1; >> @@ -657,6 +666,7 @@ void operf_record::setup() >> * "inherit" on the perf_event_attr we pass to perf_event_open when cpu is '-1'. >> */ >> bool inherit = !profile_process_group; >> + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); >> for (unsigned proc_idx = 0; proc_idx < num_procs; proc_idx++) { >> for (unsigned event = 0; event < evts.size(); event++) { >> /* For a parent process, comm.tid==comm.pid, but for child >> @@ -664,14 +674,17 @@ void operf_record::setup() >> * we must use comm.tid for the perf_event_open call. So >> * we can use comm.tid for all cases. >> */ >> - pid_t pid_for_open = profile_process_group ? procs[proc_idx].tid >> - : pid_to_profile; >> + pid_t pid_for_open; >> + if (profile_process_group) >> + pid_for_open = proc_it++->second.tid; >> + else >> + pid_for_open = pid_to_profile; >> operf_counter op_ctr(operf_counter(evts[event], >> (!pid_started && !system_wide), >> callgraph, separate_cpu, >> inherit, event)); >> if ((rc = op_ctr.perf_event_open(pid_for_open, >> - real_cpu, this)) < 0) { >> + real_cpu, this, true)) < 0) { >> err_msg = "Internal Error. Perf event setup failed."; >> goto error; >> } >> @@ -716,24 +729,93 @@ void operf_record::record_process_info(void) >> { >> map<unsigned int, unsigned int> pids_mapped; >> pid_t last_tgid = -1; >> - for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++) >> + std::map<u32, struct comm_event>::iterator proc_it = procs.begin(); >> + for (unsigned int proc_idx = 0; proc_idx < procs.size(); proc_idx++, proc_it++) >> { >> - int num = OP_perf_utils::op_write_output(output_fd, &procs[proc_idx], >> - procs[proc_idx].header.size); >> + struct comm_event ce = proc_it->second; >> + int num = OP_perf_utils::op_write_output(output_fd, &ce, ce.header.size); >> add_to_total(num); >> if (cverb << vrecord) >> - cout << "Created COMM event for " << procs[proc_idx].comm << endl; >> + cout << "Created COMM event for " << ce.comm << endl; >> >> - if (((pid_t)(procs[proc_idx].pid) == last_tgid) || >> - (pids_mapped.find(procs[proc_idx].pid) != pids_mapped.end())) >> + if (((pid_t)(ce.pid) == last_tgid) || >> + (pids_mapped.find(ce.pid) != pids_mapped.end())) >> continue; >> - OP_perf_utils::op_record_process_exec_mmaps(procs[proc_idx].tid, >> - procs[proc_idx].pid, >> + OP_perf_utils::op_record_process_exec_mmaps(ce.tid, >> + ce.pid, >> output_fd, this); >> - pids_mapped[procs[proc_idx].pid] = last_tgid = procs[proc_idx].pid; >> + pids_mapped[ce.pid] = last_tgid = ce.pid; >> + } >> +} >> + >> +int operf_record::_start_recoding_new_thread(pid_t id) >> +{ >> + string err_msg; >> + int num_mmaps, rc, fd_for_set_output = -1; >> + struct comm_event ce; >> + u64 sample_id; >> + struct pollfd * old_polldata = poll_data; >> + >> + num_mmaps = sizeof(poll_data)/sizeof(poll_data[0]); >> + num_mmaps++; >> + poll_data = new struct pollfd [num_mmaps]; >> + // Copy only the existing pollfd objects from the array. The new pollfd will >> + // be filled in via the call to _prepare_to_record_one_fd. >> + for (int i = 0; i < num_mmaps - 1; i++) >> + poll_data[i] = old_polldata[i]; >> + delete[] old_polldata; >> + // Make a pseudo comm_event object. At this point, the >> + // only field we need to set is tid. >> + memset(&ce, sizeof(ce), 0); >> + ce.tid = id; >> + add_process(ce); >> + >> + for (unsigned event = 0; event < evts.size(); event++) { >> + operf_counter op_ctr(operf_counter(evts[event], >> + (!pid_started && !system_wide), >> + callgraph, separate_cpu, >> + false, event)); >> + if (op_ctr.perf_event_open(id, -1, this, false) < 0) { >> + sample_id = OP_PERF_NO_SAMPLE_ID; >> + // Send special value to convert process to indicate failure >> + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); >> + if (len < 0) >> + perror("Internal error on convert write_comm_pipe"); >> + return -1; >> + } >> + perfCounters.push_back(op_ctr); >> + int fd = op_ctr.get_fd(); >> + if (event == 0) { >> + rc = _prepare_to_record_one_fd(num_mmaps - 1, fd); >> + fd_for_set_output = fd; >> + } else { >> + if ((rc = ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, >> + fd_for_set_output)) < 0) >> + perror("_start_recoding_new_thread: ioctl #1 failed"); >> + } >> + >> + if (rc < 0) >> + return rc; >> + >> + if ((rc = ioctl(fd, PERF_EVENT_IOC_ENABLE)) < 0) { >> + perror("_start_recoding_new_thread: ioctl #2 failed"); >> + return rc; >> + } >> + >> + sample_id = opHeader.h_attrs[event].ids.back(); >> + ssize_t len = write(write_comm_pipe, &sample_id, sizeof(sample_id)); >> + if (len < 0) >> + perror("Internal error on convert write_comm_pipe"); >> + else if (len != sizeof(sample_id)) >> + cerr << "Incomplete write convert to write_comm_pipe" << endl; >> + else >> + cverb << vrecord << "Sent sample_id " << sample_id << " to convert process" << endl; >> } >> + >> + return 0; >> } >> >> + >> void operf_record::recordPerfData(void) >> { >> bool disabled = false; >> @@ -744,6 +826,8 @@ void operf_record::recordPerfData(void) >> cerr << "operf: Profiler started" << endl; >> while (1) { >> int prev = sample_reads; >> + pid_t pi; >> + ssize_t len; >> >> for (size_t i = 0; i < samples_array.size(); i++) { >> if (samples_array[i].base) >> @@ -755,6 +839,20 @@ void operf_record::recordPerfData(void) >> if (prev == sample_reads) { >> (void)poll(poll_data, poll_count, -1); >> } >> + if (!quit && track_new_forks && procs.size() > 1) { >> + len = read(read_comm_pipe, &pi, sizeof(pi)); >> + >> + if (len < 0 && errno != EAGAIN) { >> + cverb << vrecord << "Non-fatal error: read_comm_pipe returned too few bytes" << endl; >> + } else if (len == sizeof(pi) && (procs.find(pi) == procs.end())) { >> + // Start profiling this new thread >> + cverb << vrecord << "Start recording for new thread " << pi << endl; >> + // Don't treat as fatal error if it doesn't work >> + if (_start_recoding_new_thread(pi) < 0) >> + cerr << "Unable to collect samples for forked process " << pi >> + << ". Process may have ended before recording could be started." << endl; >> + } >> + } >> >> if (quit) { >> for (unsigned int i = 0; i < perfCounters.size(); i++) >> @@ -768,9 +866,11 @@ void operf_record::recordPerfData(void) >> } >> >> void operf_read::init(int sample_data_pipe_fd, string input_filename, string samples_loc, op_cpu cputype, >> - bool systemwide) >> + bool systemwide, int _record_write_pipe, int _record_read_pipe) >> { >> sample_data_fd = sample_data_pipe_fd; >> + read_comm_pipe = _record_read_pipe; >> + write_comm_pipe = _record_write_pipe; >> inputFname = input_filename; >> sampledir = samples_loc; >> cpu_type = cputype; >> @@ -782,6 +882,11 @@ operf_read::~operf_read() >> evts.clear(); >> } >> >> +void operf_read::add_sample_id_to_opHeader(u64 sample_id) >> +{ >> + for (unsigned int i = 0; i < evts.size(); i++) >> + opHeader.h_attrs[i].ids.push_back(sample_id); >> +} >> >> int operf_read::_read_header_info_with_ifstream(void) >> { >> diff --git a/libperf_events/operf_counter.h b/libperf_events/operf_counter.h >> index 0874616..dfb4336 100644 >> --- a/libperf_events/operf_counter.h >> +++ b/libperf_events/operf_counter.h >> @@ -1,5 +1,5 @@ >> /** >> - * @file pe_profiling/operf_counter.h >> + * @file libperf_events/operf_counter.h >> * C++ class definition that abstracts the user-to-kernel interface >> * for using Linux Performance Events Subsystem. >> * >> @@ -11,7 +11,7 @@ >> * (C) Copyright IBM Corp. 2011 >> * >> * Modified by Maynard Johnson <may...@us...> >> - * (C) Copyright IBM Corporation 2012 >> + * (C) Copyright IBM Corporation 2012, 2014 >> * >> */ >> >> @@ -51,6 +51,7 @@ op_perf_event_open(struct perf_event_attr * attr, >> } >> >> #define OP_PERF_HANDLED_ERROR -101 >> +#define OP_PERF_NO_SAMPLE_ID 0xdeadbeefdeadbeefULL >> >> >> class operf_counter { >> @@ -58,7 +59,7 @@ public: >> operf_counter(operf_event_t & evt, bool enable_on_exec, bool callgraph, >> bool separate_by_cpu, bool inherit, int event_number); >> ~operf_counter(); >> - int perf_event_open(pid_t pid, int cpu, operf_record * pr); >> + int perf_event_open(pid_t pid, int cpu, operf_record * pr, bool print_error); >> const struct perf_event_attr * the_attr(void) const { return &attr; } >> int get_fd(void) const { return fd; } >> int get_id(void) const { return id; } >> @@ -82,12 +83,13 @@ public: >> */ >> operf_record(int output_fd, bool sys_wide, pid_t the_pid, bool pid_running, >> std::vector<operf_event_t> & evts, OP_perf_utils::vmlinux_info_t vi, >> - bool callgraph, bool separate_by_cpu, bool output_fd_is_file); >> + bool callgraph, bool separate_by_cpu, bool output_fd_is_file, >> + int _convert_read_pipe, int _convert_write_pipe); >> ~operf_record(); >> void recordPerfData(void); >> int out_fd(void) const { return output_fd; } >> void add_to_total(int n) { total_bytes_recorded += n; } >> - void add_process(struct comm_event proc) { procs.push_back(proc); } >> + void add_process(struct comm_event proc) { procs[proc.tid] = proc; } >> unsigned int get_total_bytes_recorded(void) const { return total_bytes_recorded; } >> void register_perf_event_id(unsigned counter, u64 id, perf_event_attr evt_attr); >> bool get_valid(void) { return valid; } >> @@ -97,11 +99,14 @@ private: >> void setup(void); >> int prepareToRecord(void); >> int _prepare_to_record_one_fd(int idx, int fd); >> + int _start_recoding_new_thread(pid_t id); >> void record_process_info(void); >> void write_op_header_info(void); >> int _write_header_to_file(void); >> int _write_header_to_pipe(void); >> int output_fd; >> + int read_comm_pipe; >> + int write_comm_pipe; >> bool write_to_file; >> // Array of size 'num_cpus_used_for_perf_event_open * num_pids * num_events' >> struct pollfd * poll_data; >> @@ -113,7 +118,7 @@ private: >> * that information in a collection of type 'comm_event'. We'll use this collection >> * for synthesizing PERF_RECORD_COMM events into the profile data stream. >> */ >> - std::vector<struct comm_event> procs; >> + std::map<u32, struct comm_event> procs; >> bool pid_started; >> bool system_wide; >> bool callgraph; >> @@ -134,16 +139,21 @@ public: >> : sample_data_fd(-1), inputFname(""), evts(_evts), cpu_type(CPU_NO_GOOD) >> { valid = syswide = false;} >> void init(int sample_data_pipe_fd, std::string input_filename, std::string samples_dir, op_cpu cputype, >> - bool systemwide); >> + bool systemwide, int _record_write_pipe, int _record_read_pipe); >> ~operf_read(); >> int readPerfHeader(void); >> unsigned int convertPerfData(void); >> bool is_valid(void) {return valid; } >> int get_eventnum_by_perf_event_id(u64 id) const; >> inline const operf_event_t * get_event_by_counter(u32 counter) { return &evts[counter]; } >> + int get_write_comm_pipe(void) { return write_comm_pipe; } >> + int get_read_comm_pipe(void) { return read_comm_pipe; } >> + void add_sample_id_to_opHeader(u64 sample_id); >> >> private: >> int sample_data_fd; >> + int write_comm_pipe; >> + int read_comm_pipe; >> std::string inputFname; >> std::string sampledir; >> std::ifstream istrm; >> diff --git a/libperf_events/operf_utils.cpp b/libperf_events/operf_utils.cpp >> index 68010c1..54a0ace 100644 >> --- a/libperf_events/operf_utils.cpp >> +++ b/libperf_events/operf_utils.cpp >> @@ -10,7 +10,7 @@ >> * (C) Copyright IBM Corp. 2011 >> * >> * Modified by Maynard Johnson <may...@us...> >> - * (C) Copyright IBM Corporation 2012, 2013 >> + * (C) Copyright IBM Corporation 2012, 2013, 2014 >> * >> */ >> >> @@ -48,6 +48,7 @@ extern pid_t app_PID; >> extern verbose vrecord; >> extern verbose vconvert; >> extern void __set_event_throttled(int index); >> +extern bool track_new_forks; >> >> using namespace std; >> >> @@ -87,6 +88,8 @@ static void __handle_fork_event(event_t * event) >> operf_process_info * parent = NULL; >> operf_process_info * forked_proc = NULL; >> >> + // First, see if we already have a proc_info object for the parent process >> + // that did the fork >> it = process_map.find(event->fork.ppid); >> if (it != process_map.end()) { >> parent = it->second; >> @@ -101,14 +104,42 @@ static void __handle_fork_event(event_t * event) >> process_map[event->fork.ppid] = parent; >> } >> >> - /* If the forked process's pid is the same as the parent's, we simply ignore the FORK >> - * event. This is because operf_process_info objects are stored in the map collection >> - * by pid, meaning that the forked process and its parent reference the same >> + /* If the user requested to profile by "--pid", then we must notify the >> + * recording process whenever we see a fork event. If the record process >> + * isn't already recording samples for this thread/process, it will start >> + * recording now. >> + */ >> + if (track_new_forks) { >> + if (cverb << vconvert) >> + cout << "Inform record process of new pid/tid " >> + << event->fork.pid << "/" << event->fork.tid << endl; >> + pid_t id = (event->fork.pid == event->fork.ppid) ? event->fork.tid : >> + event->fork.pid; >> + ssize_t len = write(operfRead.get_write_comm_pipe(), &id, sizeof(id)); >> + if (len < 0) >> + perror("Internal error on record write_comm_pipe"); >> + else if (len != sizeof(id)) >> + cerr << "Incomplete write to record write_comm_pipe" << endl; >> + u64 sample_id; >> + // get sample id from recording process >> + len = read(operfRead.get_read_comm_pipe(), &sample_id, sizeof(sample_id)); >> + if (sample_id == OP_PERF_NO_SAMPLE_ID) { >> + cverb << vconvert << "convert: No sample_id from record process" << endl; >> + } else { >> + cverb << vconvert << "Add sample_id " << sample_id << " to opHeader" << endl; >> + operfRead.add_sample_id_to_opHeader(sample_id); >> + } >> + } >> + >> + /* If the forked process's pid is the same as the parent's, we simply ignore >> + * the FORK event. This is because operf_process_info objects are stored in the map >> + * collection by pid, meaning that the forked process and its parent reference the same >> * operf_process_info object. >> */ >> if (event->fork.pid == event->fork.ppid) >> return; >> >> + // Now try to find a proc_info for the forked process itself. >> it = process_map.find(event->fork.pid); >> if (it == process_map.end()) { >> forked_proc = new operf_process_info(event->fork.pid, NULL, false, false); >> diff --git a/pe_profiling/operf.cpp b/pe_profiling/operf.cpp >> index ec85f6c..51f55a5 100644 >> --- a/pe_profiling/operf.cpp >> +++ b/pe_profiling/operf.cpp >> @@ -11,7 +11,7 @@ >> * (C) Copyright IBM Corp. 2011 >> * >> * Modified by Maynard Johnson <may...@us...> >> - * (C) Copyright IBM Corporation 2012, 2013 >> + * (C) Copyright IBM Corporation 2012, 2013, 2014 >> * >> */ >> >> @@ -77,12 +77,19 @@ int kptr_restrict; >> char * start_time_human_readable; >> std::vector<operf_event_t> events; >> operf_read operfRead(events); >> +/* With certain operf options, we have to take extra steps to track new threads >> + * and processes that an app may create via pthread_create, fork, etc. Note that >> + * any such thread or process creation APIs will result in a PERF_RECORD_FORK event, >> + * so we handle these new threads/processes in operf_utils::__handle_fork_event. >> + */ >> +bool track_new_forks; >> >> >> #define DEFAULT_OPERF_OUTFILE "operf.data" >> #define KERN_ADDR_SPACE_START_SYMBOL "_text" >> #define KERN_ADDR_SPACE_END_SYMBOL "_etext" >> >> +static operf_record * operfRecord = NULL; >> static char * app_name_SAVE = NULL; >> static char ** app_args = NULL; >> static pid_t jitconv_pid = -1; >> @@ -96,6 +103,14 @@ static char start_time_str[32]; >> static bool jit_conversion_running; >> static void convert_sample_data(void); >> static int sample_data_pipe[2]; >> +static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; >> +// The operf_convert_record_write_pipe is used for the convert process to send >> +// forked PID data to the record process. >> +static int operf_convert_record_write_pipe[2]; >> +// The operf_record_convert_write_pipe is used for the record process to send >> +// data to the convert process in response to the forked PID data. >> +static int operf_record_convert_write_pipe[2]; >> + >> bool ctl_c = false; >> bool pipe_closed = false; >> >> @@ -225,8 +240,6 @@ void set_signals_for_parent(void) >> } >> } >> >> -static int app_ready_pipe[2], start_app_pipe[2], operf_record_ready_pipe[2]; >> - >> static string args_to_string(void) >> { >> string ret; >> @@ -310,6 +323,14 @@ int start_profiling(void) >> perror("Internal error: could not create pipe"); >> return -1; >> } >> + if (pipe2(operf_convert_record_write_pipe, O_NONBLOCK) < 0) { >> + perror("Internal error: could not create pipe"); >> + return -1; >> + } >> + if (pipe(operf_record_convert_write_pipe) < 0) { >> + perror("Internal error: could not create pipe"); >> + return -1; >> + } >> operf_record_pid = fork(); >> if (operf_record_pid < 0) { >> return -1; >> @@ -318,6 +339,8 @@ int start_profiling(void) >> int exit_code = EXIT_SUCCESS; >> _set_basic_SIGINT_handler_for_child(); >> close(operf_record_ready_pipe[0]); >> + close(operf_convert_record_write_pipe[1]); >> + close(operf_record_convert_write_pipe[0]); >> if (!operf_options::post_conversion) >> close(sample_data_pipe[0]); >> /* >> @@ -343,11 +366,12 @@ int start_profiling(void) >> } else { >> outfd = sample_data_pipe[1]; >> } >> - operf_record operfRecord(outfd, operf_options::system_wide, app_PID, >> + operfRecord = new operf_record(outfd, operf_options::system_wide, app_PID, >> (operf_options::pid == app_PID), events, vi, >> operf_options::callgraph, >> - operf_options::separate_cpu, operf_options::post_conversion); >> - if (operfRecord.get_valid() == false) { >> + operf_options::separate_cpu, operf_options::post_conversion, >> + operf_convert_record_write_pipe[0], operf_record_convert_write_pipe[1]); >> + if (operfRecord->get_valid() == false) { >> /* If valid is false, it means that one of the "known" errors has >> * occurred: >> * - profiled process has already ended >> @@ -370,9 +394,10 @@ int start_profiling(void) >> } >> >> // start recording >> - operfRecord.recordPerfData(); >> + operfRecord->recordPerfData(); >> cverb << vdebug << "Total bytes recorded from perf events: " << dec >> - << operfRecord.get_total_bytes_recorded() << endl; >> + << operfRecord->get_total_bytes_recorded() << endl; >> + delete operfRecord; >> } catch (const runtime_error & re) { >> /* If the user does ctl-c, the operf-record process may get interrupted >> * in a system call, causing problems with writes to the sample data pipe. >> @@ -388,6 +413,9 @@ int start_profiling(void) >> _exit(exit_code); >> >> fail_out: >> + if (operfRecord) >> + delete operfRecord; >> + >> if (!ready){ >> /* ready==0 means we've not yet told parent we're ready, >> * but the parent is reading our pipe. So we tell the >> @@ -615,6 +643,10 @@ static end_code_t _run(void) >> // parent >> close(sample_data_pipe[0]); >> close(sample_data_pipe[1]); >> + close(operf_convert_record_write_pipe[0]); >> + close(operf_convert_record_write_pipe[1]); >> + close(operf_record_convert_write_pipe[0]); >> + close(operf_record_convert_write_pipe[1]); >> } >> } >> >> @@ -909,7 +941,11 @@ static void convert_sample_data(void) >> inputfd = sample_data_pipe[0]; >> inputfname = ""; >> } >> - operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, operf_options::system_wide); >> + close(operf_record_convert_write_pipe[1]); >> + close(operf_convert_record_write_pipe[0]); >> + operfRead.init(inputfd, inputfname, current_sampledir, cpu_type, >> + operf_options::system_wide, operf_convert_record_write_pipe[1], >> + operf_record_convert_write_pipe[0]); >> if ((rc = operfRead.readPerfHeader()) < 0) { >> if (rc != OP_PERF_HANDLED_ERROR) >> cerr << "Error: Cannot create read header info for sample data " << endl; >> @@ -1389,6 +1425,10 @@ static void process_args(int argc, char * const argv[]) >> string startEnd = _process_vmlinux(operf_options::vmlinux); >> operf_create_vmlinux(operf_options::vmlinux.c_str(), startEnd.c_str()); >> } >> + if (operf_options::pid && !operf_options::post_conversion) >> + track_new_forks = true; >> + else >> + track_new_forks = false; >> >> return; >> } >> > > > ------------------------------------------------------------------------------ > HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions > Find What Matters Most in Your Big Data with HPCC Systems > Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. > Leverages Graph Analysis for Fast Processing & Easy Data Exploration > http://www.hpccsystems.com > _______________________________________________ > oprofile-list mailing list > opr...@li... > https://lists.sourceforge.net/lists/listinfo/oprofile-list > |