LaunchMON

Tracker: Bugs

5 race condition in thread debug handling - ID: 2889642
Last Update: Settings changed ( dongahn )

I was playing around with STATBench, which uses LaunchMON's
launchandspawndaemons, and there may be a race condition. The session below
was run on 8 nodes of atlas:

bash-3.2$ STATBench
#######################################
# STATBench: STAT emulation Benchmark #
#######################################
This benchmark emulates STAT, the Stack Trace
Analysis Tool and can determine the expected
performance for a specified machine architecture
and application profile.

Running 5 iterations with 128 tasks, 3 traces per task,
7 max call depth after main, 2 function fanout, and -1
equivalence classes.

Launching tool daemons...
<Oct 30 09:06:03> SDBG_TRACER_ERROR (ERROR): [filename: sdbg_linux_ptracer.hxx, linenum: 81] [linux_ptracer_t::tracer_attach] error returned from ptrace No such process
<Oct 30 09:06:03> SDBG_TRACER_ERROR (ERROR): BACKTRACE:
launchmon(_ZN24linux_tracer_exception_tC1ERKSs14tracer_error_e+0x152) [0x420952]
launchmon(_ZN15linux_ptracer_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE13tracer_attachER14process_base_tImmmS0_S1_S2_S3_Ebi+0x13d) [0x42124d]
launchmon(_ZN21linux_thread_tracer_tImmm16user_regs_struct18user_fpregs_structE23linux_thread_callback_t35ttracer_thread_iter_attach_callbackEPK12td_thrhandlePv+0x15e) [0x41d6be]
/lib64/libthread_db.so.1 [0x2aaaabbcafa6]
launchmon(_ZN21linux_thread_tracer_tImmm16user_regs_struct18user_fpregs_structE14ttracer_attachER14process_base_tImmmS0_S1_10td_thrinfo11elf_wrapperE+0x1b5) [0x426865]
launchmon(_ZN17linux_launchmon_t25handle_thrcreate_bp_eventER14process_base_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE+0x9c) [0x41842c]
launchmon(_ZN15event_manager_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE14poll_processesER16launchmon_base_tImmmS0_S1_S2_S3_E+0x136) [0x411e16]
launchmon(_ZN15event_manager_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE16multiplex_eventsER14process_base_tImmmS0_S1_S2_S3_ER16launchmon_base_tImmmS0_S1_S2_S3_E+0x24) [0x411f94]
launchmon(_ZN13driver_base_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE12drive_engineEP11opts_args_t+0x130) [0x4120e0]
launchmon(_ZN13driver_base_tImmm16user_regs_struct18user_fpregs_struct10td_thrinfo11elf_wrapperE5driveEiPPc+0x44) [0x412664]
launchmon(_ZN14linux_driver_tImmm16user_regs_struct18user_fpregs_structE11driver_mainEiPPc+0x74) [0x4126f4]
launchmon(main+0x3f) [0x40ecaf]
/lib64/libc.so.6(__libc_start_main+0xf4) [0x2aaaab88e974]
launchmon(__gxx_personality_v0+0x101) [0x40ea79]

<Oct 30 09:06:03> <LMON FE API> (ERROR): read_lmonp_msg returned a negative return code



Sometimes I get this error and sometimes it works OK. This is running
against the 0.7X build in /usr/global for CHAOS 4. I was not able to
reproduce this with only 1 node on atlas.

-Greg


Submitted: Nobody/Anonymous ( nobody ) - 2009-10-30 19:04
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Nobody/Anonymous
Category: None
Group: None
Visibility: Public


Comment ( 1 )

Date: 2009-10-30 19:47
Sender: dongahn (Project Admin)

This appears to be caused by a bug in the td_thr_get_info call of the NPTL
thread debug library (libthread_db), which contains a conditional that
depends on an uninitialized value. Because this condition occurs only when
the thread is the main thread, and we don't really need to call
td_thr_get_info for that thread, I added a main thread check before calling
td_thr_get_info.
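
For readers following along, here is a rough sketch of what such a guard could look like. It is not the actual LaunchMON patch: thread_info_t, info_query_t and lwp_to_attach are hypothetical names, the td_thr_get_info call is stood in for by a callable so the sketch compiles without libthread_db, and how the tracer recognizes the main thread is simply passed in as a flag because the comment does not say how the check is made.

```cpp
#include <functional>
#include <optional>
#include <sys/types.h>

// Hypothetical stand-in for the part of td_thrinfo_t a tracer needs.
struct thread_info_t {
    pid_t lwp;  // kernel thread id to hand to the ptrace attach step
};

// Hypothetical wrapper type around a td_thr_get_info-style query.
// It returns no value when the query fails.
using info_query_t = std::function<std::optional<thread_info_t>()>;

// Decide which LWP, if any, the thread-iteration callback should attach to.
// Assumption: is_main_thread is computed elsewhere by the caller.
std::optional<pid_t> lwp_to_attach(bool is_main_thread,
                                   const info_query_t& query_info)
{
    if (is_main_thread) {
        // Per the comment above, the query is not needed for the main
        // thread, so the call that misbehaves is never made for it.
        return std::nullopt;
    }

    std::optional<thread_info_t> info = query_info();
    if (!info) {
        // Query failed: report nothing to attach to instead of passing
        // a meaningless id on to ptrace.
        return std::nullopt;
    }
    return info->lwp;
}
```

The point of the shape is only that the main-thread test comes before the query, so whatever the query would have returned for the main thread never reaches the attach step.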



Changes ( 4 )

Field           Old Value  Date              By
status_id       Open       2009-10-30 19:47  dongahn
resolution_id   None       2009-10-30 19:47  dongahn
allow_comments  1          2009-10-30 19:47  dongahn
close_date      -          2009-10-30 19:47  dongahn