From: Bernard L. <be...@va...> - 2007-07-19 17:54:04
Hi Andrea:

What version of glibc is on your server? Have you tried updating it? Since libpthread comes from glibc, a bug in libpthread may already be fixed in a newer version.

The thing is, I have not had this issue under CentOS 4 -- not the exact same setup, no, but I am quite surprised nobody else has encountered this (though I guess not everybody has updated to 3.0.4 yet...).

Just for reference, the version of glibc on CentOS 4.4 is glibc-2.3.4-2.25.

Cheers,

Bernard

On 7/19/07, Andrea Capriotti <a.c...@ci...> wrote:
> On Thu, 19/07/2007 at 03.16 -0700, Martin Knoblauch wrote:
> >
> > do you have a chance to run gmetad under control of a debugger to see
> > where exactly the segfault happens? Apparently the pointer that is
> > NULLified by the patch for bz#56 gets referenced later on, leading to
> > the problem.
>
> Yes.
>
> # gdb --args ./gmetad -d1
> GNU gdb 6.3
> [..]
> This GDB was configured as "i586-suse-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
>
> (gdb) r
> Starting program: /tmp/ganglia-3.0.4/gmetad/gmetad -d1
> [Thread debugging using libthread_db enabled]
> [New Thread 1075674112 (LWP 18165)]
> Sources are ...
> Source: [Cray_XD1_Linux_Cluster, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> Source: [Front_End_Cluster, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> Source: [GNU_Linux_Cluster, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> Source: [BCX_Linux_Cluster, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> Source: [SP5, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> Source: [BCC_Linux_Cluster, step 25] has 1 sources
> xxx.xxx.xxx.xxx
> [New Thread 1077779376 (LWP 18168)]
> [New Thread 1079880624 (LWP 18169)]
> [New Thread 1081981872 (LWP 18170)]
> [New Thread 1084083120 (LWP 18171)]
> [New Thread 1086184368 (LWP 18172)]
> [New Thread 1088285616 (LWP 18173)]
> [New Thread 1090386864 (LWP 18174)]
> Data thread 1090386864 is monitoring [Cray_XD1_Linux_Cluster] data source
> [New Thread 1092488112 (LWP 18175)]
> xxx.xxx.xxx.xxx
> Data thread 1092488112 is monitoring [Front_End_Cluster] data source
> xxx.xxx.xxx.xxx
> [New Thread 1094589360 (LWP 18176)]
> Data thread 1094589360 is monitoring [GNU_Linux_Cluster] data source
> xxx.xxx.xxx.xxx
> [New Thread 1096690608 (LWP 18177)]
> Data thread 1096690608 is monitoring [BCX_Linux_Cluster] data source
> xxx.xxx.xxx.xxx
> [New Thread 1099959216 (LWP 18178)]
> poll() error in data_thread for [GNU_Linux_Cluster] data source after 8120 bytes read
> [New Thread 1102060464 (LWP 18179)]
> Data thread 1099959216 is monitoring [SP5] data source
> xxx.xxx.xxx.xxx
> [New Thread 1104161712 (LWP 18180)]
> Data thread 1102060464 is monitoring [BCC_Linux_Cluster] data source
> xxx.xxx.xxx.xxx
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1092488112 (LWP 18175)]
> 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0
>
> (gdb) info thread
>   14 Thread 1104161712 (LWP 18180) 0xffffe410 in __kernel_vsyscall ()
>   13 Thread 1102060464 (LWP 18179) 0x40124d57 in strlen () from /lib/tls/libc.so.6
>   12 Thread 1099959216 (LWP 18178) 0xffffe410 in __kernel_vsyscall ()
>   11 Thread 1096690608 (LWP 18177) 0xffffe410 in __kernel_vsyscall ()
>   10 Thread 1094589360 (LWP 18176) 0xffffe410 in __kernel_vsyscall ()
> * 9 Thread 1092488112 (LWP 18175) 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0
>   8 Thread 1090386864 (LWP 18174) 0xffffe410 in __kernel_vsyscall ()
>   7 Thread 1088285616 (LWP 18173) 0xffffe410 in __kernel_vsyscall ()
>   6 Thread 1086184368 (LWP 18172) 0xffffe410 in __kernel_vsyscall ()
>   5 Thread 1084083120 (LWP 18171) 0xffffe410 in __kernel_vsyscall ()
>   4 Thread 1081981872 (LWP 18170) 0xffffe410 in __kernel_vsyscall ()
>   3 Thread 1079880624 (LWP 18169) 0xffffe410 in __kernel_vsyscall ()
>   2 Thread 1077779376 (LWP 18168) 0xffffe410 in __kernel_vsyscall ()
>   1 Thread 1075674112 (LWP 18165) 0xffffe410 in __kernel_vsyscall ()
>
> (gdb) where
> #0 0x400b4256 in __pthread_mutex_unlock_usercnt ()
>    from /lib/tls/libpthread.so.0
> #1 0x411dede8 in ?? ()
> #2 0x400b42e0 in pthread_mutex_unlock () from /lib/tls/libpthread.so.0
> #3 0x43223d45 in ?? ()
> #4 0x43454e49 in ?? ()
> #5 0x41202241 in ?? ()
> #6 0x4f485455 in ?? ()
> #7 0x59544952 in ?? ()
> #8 0x7468223d in ?? ()
> #9 0x2f3a7074 in ?? ()
> #10 0x6e61742f in ?? ()
>
> And so on...
>
> The stack seems to be corrupted.
>
> So I removed the comment on line 1081 of process_xml.c and added an
> error message on line 1079:
>
> # diff -u process_xml.c.old process_xml.c
> --- process_xml.c.old 2007-07-11 20:10:00.000000000 +0200
> +++ process_xml.c 2007-07-19 17:16:45.371327396 +0200
> @@ -972,8 +972,9 @@
>     summary = xmldata->source.metric_summary;
>
>     /* Release the partial sum mutex */
> +   err_msg("%s before releasing lock", xmldata->sourcename);
>     pthread_mutex_unlock(source->sum_finished);
> -   /*err_msg("%s releasing lock", xmldata->sourcename);*/
> +   err_msg("%s releasing lock", xmldata->sourcename);
>
>     hashkey.data = (void*) xmldata->sourcename;
>     hashkey.size = strlen(xmldata->sourcename) + 1;
>
> # gdb --args ./gmetad -d1
> [..]
> (gdb) run
> Starting program: /tmp/ganglia-3.0.4/gmetad/gmetad -d1
> [Thread debugging using libthread_db enabled]
> [New Thread 1075674112 (LWP 19051)]
> Sources are ...
> [..]
> [New Thread 1104161712 (LWP 19066)]
> Front_End_Cluster before releasing lock
> Front_End_Cluster releasing lock
> Front_End_Cluster before releasing lock
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1092488112 (LWP 19061)]
> 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0
>
> (gdb) where
> #0 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0
> #1 0x411dede8 in ?? ()
> #2 0x400b42e0 in pthread_mutex_unlock () from /lib/tls/libpthread.so.0
> Previous frame inner to this frame (corrupt stack?)
> > (gdb) thread apply 1-14 bt > > Thread 1 (Thread 1075674112 (LWP 19051)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x40145d66 in __nanosleep_nocancel () from /lib/tls/libc.so.6 > #2 0x40145b51 in sleep () from /lib/tls/libc.so.6 > #3 0x0804acdb in main (argc=2, argv=0xbfffdfd4) at gmetad.c:418 > > Thread 2 (Thread 1077779376 (LWP 19054)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b7768 in accept () from /lib/tls/libpthread.so.0 > #2 0x0804c1b4 in server_thread (arg=0x0) at server.c:528 > #3 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #4 0x4017721e in clone () from /lib/tls/libc.so.6 > #5 0x403d9bb0 in ?? () > > Thread 3 (Thread 1079880624 (LWP 19055)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? () > #4 0x00000000 in ?? () > #5 0x400bac10 in __JCR_LIST__ () from /lib/tls/libpthread.so.0 > #6 0x00000000 in ?? () > #7 0x405dabb0 in ?? () > #8 0x405da438 in ?? () > #9 0x0804c197 in server_thread (arg=0x8074ddc) at server.c:527 > #10 0x0804c197 in server_thread (arg=0x0) at server.c:527 > #11 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #12 0x4017721e in clone () from /lib/tls/libc.so.6 > #13 0x405dabb0 in ?? () > > Thread 4 (Thread 1081981872 (LWP 19056)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b7768 in accept () from /lib/tls/libpthread.so.0 > #2 0x0804c162 in server_thread (arg=0x1) at server.c:522 > #3 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #4 0x4017721e in clone () from /lib/tls/libc.so.6 > #5 0x407dbbb0 in ?? () > > Thread 5 (Thread 1084083120 (LWP 19057)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? () > #4 0x00000000 in ?? 
() > #5 0x400bac10 in __JCR_LIST__ () from /lib/tls/libpthread.so.0 > #6 0x00000000 in ?? () > ---Type <return> to continue, or q <return> to quit--- > #7 0x409dcbb0 in ?? () > #8 0x409dc438 in ?? () > #9 0x0804c145 in server_thread (arg=0x8074df4) at server.c:521 > #10 0x0804c145 in server_thread (arg=0x1) at server.c:521 > #11 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #12 0x4017721e in clone () from /lib/tls/libc.so.6 > #13 0x409dcbb0 in ?? () > > Thread 6 (Thread 1086184368 (LWP 19058)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? () > #4 0x00000000 in ?? () > #5 0x400bac10 in __JCR_LIST__ () from /lib/tls/libpthread.so.0 > #6 0x00000000 in ?? () > #7 0x40bddbb0 in ?? () > #8 0x40bdd438 in ?? () > #9 0x0804c145 in server_thread (arg=0x8074df4) at server.c:521 > #10 0x0804c145 in server_thread (arg=0x1) at server.c:521 > #11 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #12 0x4017721e in clone () from /lib/tls/libc.so.6 > #13 0x40bddbb0 in ?? () > > Thread 7 (Thread 1088285616 (LWP 19059)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? () > #4 0x00000000 in ?? () > #5 0x400bac10 in __JCR_LIST__ () from /lib/tls/libpthread.so.0 > #6 0x00000000 in ?? () > #7 0x40ddebb0 in ?? () > #8 0x40dde438 in ?? () > #9 0x0804c145 in server_thread (arg=0x8074df4) at server.c:521 > #10 0x0804c145 in server_thread (arg=0x1) at server.c:521 > #11 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #12 0x4017721e in clone () from /lib/tls/libc.so.6 > #13 0x40ddebb0 in ?? 
() > > Thread 8 (Thread 1090386864 (LWP 19060)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x400ff0bb in sprintf () from /lib/tls/libc.so.6 > #4 0x0804e491 in RRD_update ( > rrd=0x40fdcda0 "/dev/shm/ganglia/rrds/Cray_XD1_Linux_Cluster/ch476-n6.xd1.cineca.it/mem_free.rrd", > ---Type <return> to continue, or q <return> to quit--- > sum=0x41e010d0 "97288", num=0x0, process_time=1184858251) at rrd_helpers.c:49 > #5 0x0804e813 in push_data_to_rrd ( > rrd=0x40fdcda0 "/dev/shm/ganglia/rrds/Cray_XD1_Linux_Cluster/ch476-n6.xd1.cineca.it/mem_free.rrd", > sum=0x41e010d0 "97288", num=0x0, step=25, process_time=1184858251) at rrd_helpers.c:149 > #6 0x0804e99f in write_data_to_rrd (source=0x41e01568 "Cray_XD1_Linux_Cluster", host=0x41e19bb0 "ch476-n6.xd1.cineca.it", > metric=0x41e010c7 "mem_free", sum=0x41e010d0 "97288", num=0x0, step=25, process_time=1184858251) at rrd_helpers.c:184 > #7 0x0804d794 in startElement_METRIC (data=0x40fde070, el=0x41e010c0 "METRIC", attr=0x80797a8) at process_xml.c:630 > #8 0x0804deaa in start (data=0x40fde070, el=0x41e010c0 "METRIC", attr=0x80797a8) at process_xml.c:872 > #9 0x0805b640 in doContent (parser=0x8079380, startTagLevel=0, enc=0x806e9c0, > s=0x41f0f048 "<METRIC NAME=\"mem_free\" VAL=\"97288\" TYPE=\"uint32\" UNITS=\"KB\" TN=\"79\" TMAX=\"180\" DMAX=\"0\" SLOPE=\"both\" SOURCE=\"gmond\"/>\n<METRIC NAME=\"cpu_system\" VAL=\"1.5\" TYPE=\"float\" UNITS=\"%\" TN=\"5\" TMAX=\"90\" DMAX="..., end=0x41f4635b "", nextPtr=0x0) at xmlparse.c:1697 > #10 0x0805cfb9 in doProlog (parser=0x8079380, enc=0x806e9c0, > s=0x41f00920 "<GANGLIA_XML VERSION=\"3.0.2\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"Cray_XD1_Linux_Cluster\" LOCALTIME=\"1184858251\" OWNER="..., > end=0x41f4635b "", tok=29, > next=0x41f00920 "<GANGLIA_XML 
VERSION=\"3.0.2\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"Cray_XD1_Linux_Cluster\" LOCALTIME=\"1184858251\" OWNER="..., > nextPtr=0x0) at xmlparse.c:2692 > #11 0x0805dafa in prologProcessor (parser=0x8079380, > s=0x41f00008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"..., > end=0x41f4635b "", nextPtr=0x0) at xmlparse.c:2528 > #12 0x080579f8 in XML_ParseBuffer (parser=0x8079380, len=-4, isFinal=1) at xmlparse.c:1155 > #13 0x0804e33c in process_xml (d=0x80784f0, > buf=0x41d5d008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"...) > at process_xml.c:1054 > #14 0x0804b4c3 in data_thread (arg=0x80784f0) at data_thread.c:158 > #15 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #16 0x4017721e in clone () from /lib/tls/libc.so.6 > #17 0x40fdfbb0 in ?? () > > Thread 9 (Thread 1092488112 (LWP 19061)): > #0 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0 > #1 0x411dede8 in ?? () > #2 0x400b42e0 in pthread_mutex_unlock () from /lib/tls/libpthread.so.0 > Previous frame inner to this frame (corrupt stack?) > > Thread 10 (Thread 1094589360 (LWP 19062)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? () > #4 0x400b42e9 in _L_mutex_unlock_38 () from /lib/tls/libpthread.so.0 > #5 0x413e0070 in ?? () > #6 0x080b8b94 in ?? () > #7 0x080d3a60 in ?? () > ---Type <return> to continue, or q <return> to quit--- > #8 0x413ded78 in ?? 
() > #9 0x0804e9b7 in my_mkdir (dir=0x8074e0c "\002") at rrd_helpers.c:23 > #10 0x0804e9b7 in my_mkdir (dir=0x413deda0 "/dev/shm/ganglia/rrds/GNU_Linux_Cluster") at rrd_helpers.c:23 > #11 0x0804e894 in write_data_to_rrd (source=0x80ba038 "GNU_Linux_Cluster", host=0x80d2678 "node381.clx.cineca.it", > metric=0x80b9b97 "disk_total", sum=0x80b9ba2 "12949.450", num=0x0, step=25, process_time=1184858236) at rrd_helpers.c:166 > #12 0x0804d794 in startElement_METRIC (data=0x413e0070, el=0x80b9b90 "METRIC", attr=0x80d3a60) at process_xml.c:630 > #13 0x0804deaa in start (data=0x413e0070, el=0x80b9b90 "METRIC", attr=0x80d3a60) at process_xml.c:872 > #14 0x0805b640 in doContent (parser=0x80b8968, startTagLevel=0, enc=0x806e9c0, > s=0x42a12de0 "<METRIC NAME=\"disk_total\" VAL=\"12949.450\" TYPE=\"double\" UNITS=\"GB\" TN=\"3162\" TMAX=\"1200\" DMAX=\"0\" SLOPE=\"both\" SOURCE=\"gmond\"/>\n<METRIC NAME=\"cpu_idle\" VAL=\"0.0\" TYPE=\"float\" UNITS=\"%\" TN=\"81\" TMAX=\"9"..., end=0x42b80239 "", nextPtr=0x0) at xmlparse.c:1697 > #15 0x0805cfb9 in doProlog (parser=0x80b8968, enc=0x806e9c0, > s=0x42a07920 "<GANGLIA_XML VERSION=\"3.0.0\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"GNU_Linux_Cluster\" LOCALTIME=\"1184858236\" OWNER=\"CINE"..., > end=0x42b80239 "", tok=29, > next=0x42a07920 "<GANGLIA_XML VERSION=\"3.0.0\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"GNU_Linux_Cluster\" LOCALTIME=\"1184858236\" OWNER=\"CINE"..., > nextPtr=0x0) at xmlparse.c:2692 > #16 0x0805dafa in prologProcessor (parser=0x80b8968, > s=0x42a07008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"..., > end=0x42b80239 "", nextPtr=0x0) at xmlparse.c:2528 > #17 0x080579f8 in XML_ParseBuffer 
(parser=0x80b8968, len=-4, isFinal=1) at xmlparse.c:1155 > #18 0x0804e33c in process_xml (d=0x80783c0, > buf=0x421df008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"...) > at process_xml.c:1054 > #19 0x0804b4c3 in data_thread (arg=0x80783c0) at data_thread.c:158 > #20 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #21 0x4017721e in clone () from /lib/tls/libc.so.6 > #22 0x413e1bb0 in ?? () > > Thread 11 (Thread 1096690608 (LWP 19063)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x400ff0bb in sprintf () from /lib/tls/libc.so.6 > #4 0x0804e491 in RRD_update (rrd=0x415dfda0 "/dev/shm/ganglia/rrds/BCX_Linux_Cluster/node0451.bcx.cineca.it/cpu_idle.rrd", > sum=0x80ef9c0 "0.0", num=0x0, process_time=1184858240) at rrd_helpers.c:49 > #5 0x0804e813 in push_data_to_rrd ( > rrd=0x415dfda0 "/dev/shm/ganglia/rrds/BCX_Linux_Cluster/node0451.bcx.cineca.it/cpu_idle.rrd", sum=0x80ef9c0 "0.0", > num=0x0, step=25, process_time=1184858240) at rrd_helpers.c:149 > #6 0x0804e99f in write_data_to_rrd (source=0x80efe58 "BCX_Linux_Cluster", host=0x8108498 "node0451.bcx.cineca.it", > metric=0x80ef9b7 "cpu_idle", sum=0x80ef9c0 "0.0", num=0x0, step=25, process_time=1184858240) at rrd_helpers.c:184 > #7 0x0804d794 in startElement_METRIC (data=0x415e1070, el=0x80ef9b0 "METRIC", attr=0x8109880) at process_xml.c:630 > #8 0x0804deaa in start (data=0x415e1070, el=0x80ef9b0 "METRIC", attr=0x8109880) at process_xml.c:872 > #9 0x0805b640 in doContent (parser=0x80e7768, startTagLevel=0, enc=0x806e9c0, > ---Type <return> to continue, or q <return> to quit--- > s=0x430f1f1c "<METRIC NAME=\"cpu_idle\" VAL=\"0.0\" TYPE=\"float\" UNITS=\"%\" TN=\"39\" TMAX=\"90\" DMAX=\"0\" SLOPE=\"both\" 
SOURCE=\"gmond\"/>\n<METRIC NAME=\"cpu_user\" VAL=\"99.9\" TYPE=\"float\" UNITS=\"%\" TN=\"39\" TMAX=\"90\" DMAX=\"0\" S"..., end=0x435c8869 "", nextPtr=0x0) at xmlparse.c:1697 > #10 0x0805cfb9 in doProlog (parser=0x80e7768, enc=0x806e9c0, > s=0x430e9920 "<GANGLIA_XML VERSION=\"3.0.1\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"BCX_Linux_Cluster\" LOCALTIME=\"1184858240\" OWNER=\"CINE"..., > end=0x435c8869 "", tok=29, > next=0x430e9920 "<GANGLIA_XML VERSION=\"3.0.1\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"BCX_Linux_Cluster\" LOCALTIME=\"1184858240\" OWNER=\"CINE"..., > nextPtr=0x0) at xmlparse.c:2692 > #11 0x0805dafa in prologProcessor (parser=0x80e7768, > s=0x430e9008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"..., > end=0x435c8869 "", nextPtr=0x0) at xmlparse.c:2528 > #12 0x080579f8 in XML_ParseBuffer (parser=0x80e7768, len=-4, isFinal=1) at xmlparse.c:1155 > #13 0x0804e33c in process_xml (d=0x8078720, > buf=0x42c08008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"...) > at process_xml.c:1054 > #14 0x0804b4c3 in data_thread (arg=0x8078720) at data_thread.c:158 > #15 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #16 0x4017721e in clone () from /lib/tls/libc.so.6 > #17 0x415e2bb0 in ?? () > > Thread 12 (Thread 1099959216 (LWP 19064)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x400b728e in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0 > #2 0x400b4040 in _L_mutex_lock_34 () from /lib/tls/libpthread.so.0 > #3 0x00000000 in ?? 
() > #4 0x400b42e9 in _L_mutex_unlock_38 () from /lib/tls/libpthread.so.0 > #5 0x418ff070 in ?? () > #6 0x41600704 in ?? () > #7 0x4161b758 in ?? () > #8 0x418fdd78 in ?? () > #9 0x0804e9b7 in my_mkdir (dir=0x8074e0c "\002") at rrd_helpers.c:23 > #10 0x0804e9b7 in my_mkdir (dir=0x418fdda0 "/dev/shm/ganglia/rrds/SP5") at rrd_helpers.c:23 > #11 0x0804e894 in write_data_to_rrd (source=0x41601d58 "SP5", host=0x4161a390 "sp011", metric=0x416018b7 "bytes_out", > sum=0x416018c1 "6777.15", num=0x0, step=25, process_time=1184858236) at rrd_helpers.c:166 > #12 0x0804d794 in startElement_METRIC (data=0x418ff070, el=0x416018b0 "METRIC", attr=0x4161b758) at process_xml.c:630 > #13 0x0804deaa in start (data=0x418ff070, el=0x416018b0 "METRIC", attr=0x4161b758) at process_xml.c:872 > #14 0x0805b640 in doContent (parser=0x416004d8, startTagLevel=0, enc=0x806e9c0, > s=0x4204358c "<METRIC NAME=\"bytes_out\" VAL=\"6777.15\" TYPE=\"float\" UNITS=\"bytes/sec\" TN=\"41\" TMAX=\"300\" DMAX=\"0\" SLOPE=\"both\" SOURCE=\"gmond\"/>\n<METRIC NAME=\"gexec\" VAL=\"ON\" TYPE=\"string\" UNITS=\"\" TN=\"274\" TMAX=\"300\""..., end=0x42075d21 "", nextPtr=0x0) at xmlparse.c:1697 > #15 0x0805cfb9 in doProlog (parser=0x416004d8, enc=0x806e9c0, > s=0x42035920 "<GANGLIA_XML VERSION=\"3.0.0\" SOURCE=\"gmetad\">\n<GRID NAME=\"unspecified\" AUTHORITY=\"http://master.sp4.cineca.it/ganglia-webfrontend-2.5.7/\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"SP5\" LOCALTIME=\"1184858"..., > ---Type <return> to continue, or q <return> to quit--- > end=0x42075d21 "", tok=29, > next=0x42035920 "<GANGLIA_XML VERSION=\"3.0.0\" SOURCE=\"gmetad\">\n<GRID NAME=\"unspecified\" AUTHORITY=\"http://master.sp4.cineca.it/ganglia-webfrontend-2.5.7/\" LOCALTIME=\"1184858254\">\n<CLUSTER NAME=\"SP5\" LOCALTIME=\"1184858"..., > nextPtr=0x0) at xmlparse.c:2692 > #16 0x0805dafa in prologProcessor (parser=0x416004d8, > s=0x42035008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT 
GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"..., > end=0x42075d21 "", nextPtr=0x0) at xmlparse.c:2528 > #17 0x080579f8 in XML_ParseBuffer (parser=0x416004d8, len=-4, isFinal=1) at xmlparse.c:1155 > #18 0x0804e33c in process_xml (d=0x8078318, > buf=0x41da4008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"...) > at process_xml.c:1054 > #19 0x0804b4c3 in data_thread (arg=0x8078318) at data_thread.c:158 > #20 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0 > #21 0x4017721e in clone () from /lib/tls/libc.so.6 > #22 0x41900bb0 in ?? () > > Thread 13 (Thread 1102060464 (LWP 19065)): > #0 0xffffe410 in __kernel_vsyscall () > #1 0x40168d4b in __open_nocancel () from /lib/tls/libc.so.6 > #2 0x4011b303 in __GI__IO_file_open () from /lib/tls/libc.so.6 > #3 0x4011b523 in _IO_new_file_fopen () from /lib/tls/libc.so.6 > #4 0x40110609 in __fopen_internal () from /lib/tls/libc.so.6 > #5 0x4011065d in fopen@@GLIBC_2.1 () from /lib/tls/libc.so.6 > #6 0x08054b31 in rrd_open () > #7 0x08054ff3 in rrd_update () > #8 0x0804e4bc in RRD_update (rrd=0x41afeda0 "/dev/shm/ganglia/rrds/BCC_Linux_Cluster/node079/cpu_nice.rrd", > sum=0x41e20300 "0.0", num=0x0, process_time=1184858253) at rrd_helpers.c:52 > #9 0x0804e813 in push_data_to_rrd (rrd=0x41afeda0 "/dev/shm/ganglia/rrds/BCC_Linux_Cluster/node079/cpu_nice.rrd", > sum=0x41e20300 "0.0", num=0x0, step=25, process_time=1184858253) at rrd_helpers.c:149 > #10 0x0804e99f in write_data_to_rrd (source=0x41e20798 "BCC_Linux_Cluster", host=0x41e38dd8 "node079", > metric=0x41e202f7 "cpu_nice", sum=0x41e20300 "0.0", num=0x0, step=25, process_time=1184858253) at rrd_helpers.c:184 > #11 0x0804d794 in startElement_METRIC (data=0x41b00070, el=0x41e202f0 "METRIC", attr=0x41e3a1a0) at process_xml.c:630 > #12 0x0804deaa in start 
(data=0x41b00070, el=0x41e202f0 "METRIC", attr=0x41e3a1a0) at process_xml.c:872 > #13 0x0805b640 in doContent (parser=0x41e1f0f8, startTagLevel=0, enc=0x806e9c0, > s=0x4264697a "<METRIC NAME=\"cpu_nice\" VAL=\"0.0\" TYPE=\"float\" UNITS=\"%\" TN=\"73\" TMAX=\"90\" DMAX=\"0\" SLOPE=\"both\" SOURCE=\"gmond\"/>\n<METRIC NAME=\"cpu_speed\" VAL=\"2193\" TYPE=\"uint32\" UNITS=\"MHz\" TN=\"997\" TMAX=\"1200\" DMA"..., end=0x42744885 "", nextPtr=0x0) at xmlparse.c:1697 > #14 0x0805cfb9 in doProlog (parser=0x41e1f0f8, enc=0x806e9c0, > s=0x4263b920 "<GANGLIA_XML VERSION=\"3.0.1\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858255\">\n<CLUSTER NAME=\"BCC_Linux_Cluster\" LOCALTIME=\"1184858253\" OWNER=\"CINE"..., > end=0x42744885 "", tok=29, > next=0x4263b920 "<GANGLIA_XML VERSION=\"3.0.1\" SOURCE=\"gmetad\">\n<GRID NAME=\"CINECA\" AUTHORITY=\"http://tana.cineca.it/ganglia\" LOCALTIME=\"1184858255\">\n<CLUSTER NAME=\"BCC_Linux_Cluster\" LOCALTIME=\"1184858253\" OWNER=\"CINE"..., > nextPtr=0x0) at xmlparse.c:2692 > #15 0x0805dafa in prologProcessor (parser=0x41e1f0f8, > ---Type <return> to continue, or q <return> to quit--- > s=0x4263b008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"..., > end=0x42744885 "", nextPtr=0x0) at xmlparse.c:2528 > #16 0x080579f8 in XML_ParseBuffer (parser=0x41e1f0f8, len=15, isFinal=1) at xmlparse.c:1155 > #17 0x0804e33c in process_xml (d=0x8078648, > buf=0x42530008 "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<!DOCTYPE GANGLIA_XML [\n <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>\n <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>\n <!ATTLIST"...) 
>     at process_xml.c:1054
> #18 0x0804b4c3 in data_thread (arg=0x8078648) at data_thread.c:158
> #19 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0
> #20 0x4017721e in clone () from /lib/tls/libc.so.6
> #21 0x41b01bb0 in ?? ()
>
> Thread 14 (Thread 1104161712 (LWP 19066)):
> #0 0xffffe410 in __kernel_vsyscall ()
> #1 0x40145d66 in __nanosleep_nocancel () from /lib/tls/libc.so.6
> #2 0x40145b51 in sleep () from /lib/tls/libc.so.6
> #3 0x0804f7c0 in cleanup_thread (arg=0x0) at cleanup.c:190
> #4 0x400b2cb7 in start_thread () from /lib/tls/libpthread.so.0
> #5 0x4017721e in clone () from /lib/tls/libc.so.6
> #6 0x41d02bb0 in ?? ()
> #0 0x400b4256 in __pthread_mutex_unlock_usercnt () from /lib/tls/libpthread.so.0
>
> In my opinion the problem is that the source->sum_finished mutex is unlocked twice:
>
>     /* Release the partial sum mutex */
>     err_msg("%s before releasing lock", xmldata->sourcename);
>     pthread_mutex_unlock(source->sum_finished);
>     err_msg("%s releasing lock", xmldata->sourcename);
>
>     hashkey.data = (void*) xmldata->sourcename;
>     hashkey.size = strlen(xmldata->sourcename) + 1;
>
>     hashval.data = source;
>     /* Trim structure to the correct length. */
>     hashval.size = sizeof(*source) - GMETAD_FRAMESIZE + source->stringslen;
>
>     /* We insert here to get an accurate hosts up/down value. */
>     rdatum = hash_insert( &hashkey, &hashval, xmldata->root);
>     source->sum_finished = NULL; /* remember that we released the lock */
>
> Best Regards
> --
> Andrea Capriotti
> System Management Group - Cineca - www.cineca.it
> a.c...@ci... - Tel +39 051 6171890
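
For readers following the double-unlock theory above, here is a minimal sketch of how a second unlock through a pointer like source->sum_finished could be made detectable rather than crashing. It is not the gmetad code and not a proposed patch: `struct demo_source`, `demo_mutex_init` and `demo_release_sum_lock` are hypothetical names invented for this illustration, and only the `sum_finished` field is taken from the thread. It assumes the pointer is set to NULL at release time and that the mutex is created with the standard POSIX PTHREAD_MUTEX_ERRORCHECK type, so that an invalid unlock returns an error code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the gmetad source record; only the field
 * discussed in the thread is modeled here. */
struct demo_source {
    pthread_mutex_t *sum_finished;
};

/* Create an error-checking mutex, so that unlocking a mutex that is not
 * locked is reported through the return value of pthread_mutex_unlock(). */
static int demo_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Release the lock at most once.
 * Returns 0 on success, -1 if the lock was already released (pointer is
 * NULL), or the error code from pthread_mutex_unlock(). */
static int demo_release_sum_lock(struct demo_source *s)
{
    assert(s != NULL);
    if (s->sum_finished == NULL)
        return -1;                 /* already released: do not touch it */
    int rc = pthread_mutex_unlock(s->sum_finished);
    s->sum_finished = NULL;        /* remember that we released the lock */
    return rc;
}
```

With a helper of this shape, a caller that reaches the release path a second time gets -1 back instead of passing a NULL or stale pointer to pthread_mutex_unlock(). Whether that matches what actually happens in process_xml.c is exactly the open question in this thread.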