Anyone have any ideas what might be causing the following errors? These seem to periodically zap some of our background jobs. Could be related to TCP/IP, but I have not tried to isolate it yet. There seem to be a handful of common addresses...
Thanks for any suggestions or info,
Bob
%GTM-F-KILLBYSIGSINFO1, GT.M process 7899 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x6D6173AB)
%GTM-F-KILLBYSIGSINFO1, GT.M process 2804 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x267265A0)
%GTM-F-KILLBYSIGSINFO1, GT.M process 2809 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x267265A0)
%GTM-F-KILLBYSIGSINFO1, GT.M process 3881 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)
%GTM-F-KILLBYSIGSINFO1, GT.M process 11700 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)
%GTM-F-KILLBYSIGSINFO1, GT.M process 19090 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)
%GTM-F-KILLBYSIGSINFO1, GT.M process 14891 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x353030AE)
%GTM-F-KILLBYSIGSINFO1, GT.M process 24405 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)
%GTM-F-KILLBYSIGSINFO1, GT.M process 28219 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x3638326C)
%GTM-F-KILLBYSIGSINFO1, GT.M process 31319 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x6E6174AF)
%GTM-F-KILLBYSIGSINFO1, GT.M process 31948 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x30653261)
%GTM-F-KILLBYSIGSINFO1, GT.M process 31950 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x30653261)
%GTM-F-KILLBYSIGSINFO1, GT.M process 8331 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x316461AA)
%GTM-F-KILLBYSIGSINFO1, GT.M process 9718 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x504F5291)
%GTM-F-KILLBYSIGSINFO1, GT.M process 348 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x534D4590)
%GTM-F-KILLBYSIGSINFO1, GT.M process 9219 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x53595367)
%GTM-F-KILLBYSIGSINFO1, GT.M process 16770 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x44363261)
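To check the "handful of common addresses" impression, the log lines can be tallied by faulting instruction address. The following is a minimal sketch in Python; the regular expression is inferred from the message layout shown above rather than from any GT.M documentation, and `tally_by_address` and `SAMPLE` are names made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the KILLBYSIGSINFO1 lines above (an assumption,
# not an official format description).
PATTERN = re.compile(
    r"GT\.M process (\d+) has been killed by a signal (\d+) "
    r"at address (0x[0-9A-Fa-f]+) \(vaddr (0x[0-9A-Fa-f]+)\)"
)

def tally_by_address(lines):
    """Count how many kill messages share each instruction address."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts

# A few lines copied from the log above; feed the full log the same way.
SAMPLE = [
    "%GTM-F-KILLBYSIGSINFO1, GT.M process 7899 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x6D6173AB)",
    "%GTM-F-KILLBYSIGSINFO1, GT.M process 3881 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)",
    "%GTM-F-KILLBYSIGSINFO1, GT.M process 11700 has been killed by a signal 11 at address 0x808540A (vaddr 0x4)",
    "%GTM-F-KILLBYSIGSINFO1, GT.M process 14891 has been killed by a signal 11 at address 0x8082DB1 (vaddr 0x353030AE)",
]

if __name__ == "__main__":
    for address, count in tally_by_address(SAMPLE).most_common():
        print(address, count)
```

Run over the full listing, this kind of tally should show only two distinct instruction addresses, 0x8082DB1 and 0x808540A.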
First, it helps to know which version you are using. If you built your own version,
tell us on what platform, including its version.
You can find out where the SEGV (signal 11) occurred by "gdb $gtm_dist/mumps",
"disassemble 0x8082DB1" (or whatever the address reported is.) The vaddr value
is the virtual address whose access caused the SEGment Violation.
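One tentative observation about the vaddr values in the log: many of them are made up of bytes in the printable ASCII range, which can be a hint that string data has overwritten a pointer. This is only a hypothesis and nothing in the thread confirms it. A small Python sketch for eyeballing it follows; `vaddr_as_text` is a made-up helper name, and the bytes are shown in the order they are written in the hex value (on little-endian x86 the in-memory order would be reversed).

```python
def vaddr_as_text(vaddr):
    """Render a 32-bit address as 4 characters, '.' for non-printable bytes."""
    raw = vaddr.to_bytes(4, "big")  # byte order as written in the hex value
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# vaddr values copied from the error messages above.
VADDRS = [
    0x6D6173AB, 0x267265A0, 0x353030AE, 0x3638326C, 0x30653261,
    0x504F5291, 0x534D4590, 0x53595367, 0x44363261, 0x4,
]

if __name__ == "__main__":
    for v in VADDRS:
        print(f"0x{v:08X} -> {vaddr_as_text(v)!r}")
```

The vaddr 0x4 entries do not fit this pattern; a very small address like that looks more like a dereference through a null or nearly null pointer, which again is a guess rather than a diagnosis.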
There should have been a core file produced. If there was already a core file in the
directory, GT.M will rename the old core file to core1, core2, etc. so if you started
with no core files and have more than one now, core1 is usually the most interesting.
"gdb $gtm_dist/mumps core", "bt" will produce a stack traceback. This isn't always
very useful, especially with Linux and production images. A dbg image is a lot more
useful in tracking down problems. It also has asserts which may catch memory
corruption problems early enough to get a sensible core.
You wouldn't be doing any external calls?
Good luck, Sam
Sorry for the initial lack of detail, I was just trying to get a quick read on this before spending much more time on it. I seem to be able to reproduce it now.
GTM version is: GT.M V4.2-002 Linux x86
Standard production install without rebuilding...
Following is the traceback. I believe this happens when the symbol table is stressed, so it may be a garbage collection issue? Anyway, I will try this on a dbg image when I get a chance. Also, recent core files are only about 200K if you would like one.
Thanks,
Bob
(gdb) bt
#0 0x40075a66 in ?? ()
#1 0x4003f2b4 in ?? ()
#2 0x805f696 in secshr_db_clnup ()
#3 0x8060cba in secshr_db_clnup ()
#4 0x8060df6 in secshr_db_clnup ()
#5 0x806191c in secshr_db_clnup ()
#6 0x806171e in secshr_db_clnup ()
#7 0x80540f6 in parse_glvn ()
#8 0x8052da8 in unw_prof_frame ()
#9 0x806c0f8 in util_format ()
#10 0x8064daa in stp_gcol ()
#11 0x8064e3e in stp_gcol ()
#12 0x4003cc68 in ?? ()
#13 0x40070621 in ?? ()
#14 0x4003e0a1 in ?? ()
#15 0x80594af in op_srchindx ()
#16 0x805b811 in parse_file ()
#17 0x805f75b in secshr_db_clnup ()
#18 0x8060cba in secshr_db_clnup ()
#19 0x8060df6 in secshr_db_clnup ()
#20 0x806191c in secshr_db_clnup ()
#21 0x806171e in secshr_db_clnup ()
#22 0x80540f6 in parse_glvn ()
#23 0x8052da8 in unw_prof_frame ()
#24 0x806c0f8 in util_format ()
#25 0x805c5e6 in load_pattern_table ()
#26 0x805c1ab in load_pattern_table ()
#27 0x4003cc68 in ?? ()
#28 0x805f6e9 in secshr_db_clnup ()
#29 0x8060cba in secshr_db_clnup ()
#30 0x8060df6 in secshr_db_clnup ()
#31 0x806191c in secshr_db_clnup ()
#32 0x806171e in secshr_db_clnup ()
#33 0x80540f6 in parse_glvn ()
#34 0x8052da8 in unw_prof_frame ()
#35 0x806c0f8 in util_format ()
#36 0x8064daa in stp_gcol ()
#37 0x8064e3e in stp_gcol ()
#38 0x4003cc68 in ?? ()
#39 0x400769a4 in ?? ()
#40 0x40076bf0 in ?? ()
#41 0x40072fad in ?? ()
#42 0x400703e4 in ?? ()
#43 0x805f749 in secshr_db_clnup ()
#44 0x8060cba in secshr_db_clnup ()
#45 0x805e939 in secshr_db_clnup ()
#46 0x805e9a4 in secshr_db_clnup ()
#47 0x805e9ce in secshr_db_clnup ()
#48 0x805e480 in s2n ()
#49 0x805e64e in same_device_check ()
#50 0x805e75e in secshr_db_clnup ()
#51 0x80617ab in secshr_db_clnup ()
#52 0x806171e in secshr_db_clnup ()
#53 0x80540f6 in parse_glvn ()
#54 0x8052da8 in unw_prof_frame ()
#55 0x80532d5 in crt_gbl ()
---Type <return> to continue, or q <return> to quit---
#56 0x805267d in pcurrpos ()
#57 0x8053f2b in parse_glvn ()
#58 0x8053e8e in parse_glvn ()
#59 0x8052cd4 in new_prof_frame ()
#60 0x806c0f8 in util_format ()
#61 0x8070664 in wcs_verify ()
#62 0x80547c5 in mprof_tree_find_node ()
#63 0x8054bb7 in mprof_tree_find_node ()
#64 0x80545cc in mprof_tree_walk ()
#65 0x8052da8 in unw_prof_frame ()
#66 0x806c0f8 in util_format ()
#67 0x804afed in cli_is_hex ()
#68 0x804ad20 in tok_string_extract ()
#69 0x804a7c1 in sigemptyset ()
#70 0x400369cb in ?? ()
(gdb)
Well, with the debug version we get basically the same error:
%GTM-F-KILLBYSIGSINFO1, GT.M process 13771 has been killed by a signal 11 at address 0x808C801
(vaddr 0x32255080)
But the traceback is much more abbreviated (and perhaps useful?):
$ gdb $gtm_dist/mumps core
GNU gdb 5.0rh-5 Red Hat Linux 7.1
Copyright 2001 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/u/gtm/mumps -direct'.
Program terminated with signal 3, Quit.
Reading symbols from /usr/lib/libncurses.so.4...done.
Loaded symbols for /usr/lib/libncurses.so.4
Reading symbols from /lib/i686/libm.so.6...done.
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_nisplus.so.2...done.
Loaded symbols for /lib/libnss_nisplus.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
#0 0x400ba801 in __kill () from /lib/i686/libc.so.6
(gdb) bt
#0 0x400ba801 in __kill () from /lib/i686/libc.so.6
#1 0x08091b2f in gtm_dump_core () at /u/gtmsrc/gtm/sr_unix/gtm_dump_core.c:55
#2 0x08092385 in gtm_fork_n_core () at /u/gtmsrc/gtm/sr_unix/gtm_fork_n_core.c:161
#3 0x0808e620 in generic_signal_handler (sig=11, info=0xbffff560, context=0xbffff5e0)
at /u/gtmsrc/gtm/sr_unix/generic_signal_handler.c:269
#4 <signal handler called>
#5 0x0808c801 in fetch (__builtin_va_alist=1) at /u/gtmsrc/gtm/sr_port/fetch.c:41
#6 0x08058bad in op_linefetch () at /u/gtmsrc/gtm/sr_i386/op_linefetch.s:33
#7 0x0804acf7 in main (argc=2, argv=0xbffff9dc, envp=0xbffff9e8) at /u/gtmsrc/gtm/sr_unix/gtm.c:154
#8 0x400a9177 in __libc_start_main (main=0x804ab30 <main>, argc=2, ubp_av=0xbffff9dc,
init=0x8049fe4 <_init>, fini=0x81574ac <_fini>, rtld_fini=0x4000e184 <_dl_fini>,
stack_end=0xbffff9d4) at ../sysdeps/generic/libc-start.c:129
(gdb)
Still trying to do something with the symbol table it looks like.
Core file is about 13MB now.
Note: The debug image was compiled on a RH6.2 system and the test and
core dump were done on a RH7.1 system.
Any ideas?
Thanks again,
Bob
I have migrated this topic to the bugs section with sample code to reproduce it. It appears that it is caused by trying to read more than 90K from a socket. That, of course, should return the data in 32K chunks (since GT.M is thus limited in its local variable capacity), but it certainly should not dump core...
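For readers unfamiliar with the expected behaviour, here is a rough Python analogy of reading a large socket payload in bounded pieces. It is not GT.M code and says nothing about how GT.M implements socket reads; the 32K cap is taken from the post above, and `read_in_chunks` and `demo` are hypothetical names.

```python
import socket
import threading

CHUNK = 32 * 1024  # cap per read, mirroring the 32K limit mentioned above

def read_in_chunks(sock, chunk_size=CHUNK):
    """Yield successive pieces of at most chunk_size bytes until EOF."""
    while True:
        piece = sock.recv(chunk_size)
        if not piece:  # empty bytes means the peer closed the connection
            return
        yield piece

def demo(total=90 * 1024 + 1):
    """Send a bit more than 90K through a local socket pair and read it back."""
    writer_end, reader_end = socket.socketpair()

    def writer():
        writer_end.sendall(b"x" * total)
        writer_end.close()

    # Write from a separate thread so a full socket buffer cannot deadlock us.
    thread = threading.Thread(target=writer)
    thread.start()
    pieces = list(read_in_chunks(reader_end))
    thread.join()
    reader_end.close()
    return pieces

if __name__ == "__main__":
    sizes = [len(p) for p in demo()]
    print(len(sizes), "reads,", sum(sizes), "bytes total")
```

Each read returns at most 32K, so no single value ever exceeds the cap, however much data the peer sends.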