From: Mantis B. T. <no...@bu...> - 2011-10-31 09:57:58
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=1776 ====================================================================== Reported By: ba...@ip... Assigned To: ====================================================================== Project: bacula Issue ID: 1776 Category: Director Reproducibility: random Severity: crash Priority: immediate Status: new ====================================================================== Date Submitted: 2011-10-31 08:23 UTC Last Modified: 2011-10-31 09:57 UTC ====================================================================== Summary: Segmentation fault Description: The bacula-dir segmentation faults due to a NULL pointer sent to BSOCK::recv(). The problem appears randomly when running backups for a large amount of clients (about 750 clients). What triggers this is very hard to tell. I can see why it crashes but I'm not sure where in the code the bsock is instantiated. Here is the backtrace from gdb taken from the core dump. http://bugs.bacula.org/view.php?id=0 BSOCK::recv (this=0x0) at bsock.c:459 http://bugs.bacula.org/view.php?id=1 0x00007f8902e94e49 in cram_md5_respond (bs=0x0, password=0x7f87881480b8 "b59b365819df83442f902104d60cba4a", tls_remote_need=<value optimized out>, compatible=0x7f87c1daa9c8) at cram-md5.c:153 http://bugs.bacula.org/view.php?id=2 0x000000000040f0ea in authenticate_file_daemon (jcr=0x7f88ec280ef8) at authenticate.c:211 http://bugs.bacula.org/view.php?id=3 0x000000000041caaa in connect_to_file_daemon (jcr=0x7f88ec280ef8, retry_interval=10, max_retry_time=300, verbose=1) at fd_cmds.c:117 http://bugs.bacula.org/view.php?id=4 0x000000000041216c in do_backup (jcr=0x7f88ec280ef8) at backup.c:380 http://bugs.bacula.org/view.php?id=5 0x0000000000423014 in job_thread (arg=<value optimized out>) at job.c:314 http://bugs.bacula.org/view.php?id=6 0x00000000004248b1 in jobq_server (arg=0x6750c0) at jobq.c:450 http://bugs.bacula.org/view.php?id=7 0x00007f89021e19ca in start_thread () from /lib/libpthread.so.0 http://bugs.bacula.org/view.php?id=8 0x00007f8900faa70d in clone () from /lib/libc.so.6 http://bugs.bacula.org/view.php?id=9 0x0000000000000000 in ?? () As you can see, the pointer is NULL (jcr->file_bsock). (gdb) list 454 int32_t BSOCK::recv() 455 { 456 int32_t nbytes; 457 int32_t pktsiz; 458 459 msg[0] = 0; // Here it fails since *this = NULL. 460 msglen = 0; 461 if (errors || is_terminated()) { 462 return BNET_HARDEOF; 463 } (gdb) bt bacula-dir is executed like this: /usr/local/bacula/sbin/bacula-dir -u bacula -g bacula -v -c /usr/local/bacula/etc/bacula-dir.conf Perhaps this is solved in the latest release that was released yesterday? ====================================================================== ---------------------------------------------------------------------- (0006016) kern (administrator) - 2011-10-31 08:59 http://bugs.bacula.org/view.php?id=1776#c6016 ---------------------------------------------------------------------- The traceback that you should does not correspond to Bacula version 5.0.3, so it must either be another version or you have modified the code. Can you clarify? Where did you get your binaries? If you built them yourselves, please show me what compile options you used. If you are running the debugger manually, try the following commands: up up (you should be in authenticate ...) print jcr ---------------------------------------------------------------------- (0006017) ba...@ip... (reporter) - 2011-10-31 09:05 http://bugs.bacula.org/view.php?id=1776#c6017 ---------------------------------------------------------------------- We built it our self and I attached the file from configure with build options. >From GDB: http://bugs.bacula.org/view.php?id=2 0x000000000040f0ea in authenticate_file_daemon (jcr=0x7f88ec280ef8) at authenticate.c:211 211 auth_success = cram_md5_respond(fd, client->password, &tls_remote_need, &compatible); (gdb) print jcr $1 = (JCR *) 0x7f88ec280ef8 (gdb) ---------------------------------------------------------------------- (0006018) kern (administrator) - 2011-10-31 09:46 http://bugs.bacula.org/view.php?id=1776#c6018 ---------------------------------------------------------------------- Zut. I should have asked for "print *jcr" sorry. I want to see what it has in all the fields, and track it back to where the FD socket pointer is zeroed. I'll look at your attachments. ---------------------------------------------------------------------- (0006020) kern (administrator) - 2011-10-31 09:55 http://bugs.bacula.org/view.php?id=1776#c6020 ---------------------------------------------------------------------- After looking at your config.out, providing there were no errors or warnings in the compile, your build should be OK. I am running on the same system + compiler, so that pretty much rules out a compiler/build problem. Can you tell me how many jobs were running at the same time e.g. the output from a "status dir" in bconsole or something similar? Can you explain why your line numbers in authenticate.c are different from mine? If you cannot explain that, it doesn't make much sense for me to continue looking at this, because it means you have a different code base. ---------------------------------------------------------------------- (0006021) ba...@ip... (reporter) - 2011-10-31 09:56 http://bugs.bacula.org/view.php?id=1776#c6021 ---------------------------------------------------------------------- (gdb) print *jcr $21 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, _use_count = 2, m_JobType = 66, m_JobLevel = 73, my_thread_killable = true, link = { next = 0x7f88ec282ff8, prev = 0x7f88ec27eba8}, my_thread_id = 140221049648896, dir_bsock = 0x0, store_bsock = 0x145ac78, file_bsock = 0x12384e8, daemon_free_jcr = 0x421130 <dird_free_jcr(JCR*)>, msg_queue = 0x7f88ec280e48, msg_queue_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, dequeuing_msgs = false, job_end_push = {<SMARTALLOC> = {<No data fields>}, items = 0x0, num_items = 0, max_items = 0, num_grow = 1, cur_item = 0, own_items = false}, VolumeName = 0x7f88ec281940 "Pool-fs06-7d-1104", errmsg = 0x7f88ec280c10 "", Job = "de-1941-v3.d.ipeer.se.2011-10-30_23.55.36_31", '\000' <repeats 83 times>, event = '\000' <repeats 127 times>, eventType = 0, JobId = 8691, VolSessionId = 54, VolSessionTime = 1319925962, JobFiles = 0, JobErrors = 0, JobWarnings = 0, JobBytes = 0, ReadBytes = 0, FileId = 0, JobStatus = 82, JobPriority = 10, sched_time = 1320015336, start_time = 1320015336, run_time = 0, end_time = 0, wait_time_sum = 0, wait_time = 0, client_name = 0x7f88ec282510 "de-1941-v3.d.ipeer.se-fd", JobIds = 0x0, RestoreBootstrap = 0x0, stime = 0x7f88ec282ac0 "2011-10-30 00:47:27", sd_auth_key = 0x12f6918 "KOHF-GLNO-ODKC-EHPD-CEHC-MLIO-JKKO-BINE", jcr_msgs = 0x7f88ec282898, ClientId = 0, where = 0x0, RegexWhere = 0x0, where_bregexp = 0x0, cached_pnl = 0, cached_path = 0x0, prefix_links = false, gui = false, authenticated = false, Python_job = 0x0, Python_events = 0x0, cached_attribute = false, attr = 0x0, db = 0x7f8788228408, db_batch = 0x0, batch_started = false, HasBase = false, nb_base_files = 0, nb_base_files_used = 0, ar = 0x0, id_list = 0x0, accurate = false, plugin_ctx_list = 0x0, plugin_ctx = 0x0, plugin = 0x0, plugin_sp = 0x0, plugin_options = 0x0, cmd_plugin = false, comment = 0x7f88ec281a90 "", SD_msg_chan = 140219600627456, term_wait = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, work_item = 0x0, sd_msg_thread_done = false, ua = 0x0, job = 0x7f87881482e8, verify_job = 0x0, rstorage = 0x0, rstore = 0x0, wstorage = 0x7f88ec281f68, wstore = 0x7f878803c018, client = 0x7f8788148118, pool = 0x7f8788040f38, rpool = 0x0, full_pool = 0x0, inc_pool = 0x0, diff_pool = 0x0, run_pool_override = false, run_full_pool_override = false, run_inc_pool_override = false, run_diff_pool_override = false, sd_canceled = false, fileset = 0x7f8788016ac8, catalog = 0x7f878803ec58, messages = 0x7f878803ff28, SDJobFiles = 0, SDJobBytes = 0, SDErrors = 0, SDJobStatus = 70, FDJobStatus = 0, ExpectedFiles = 0, MediaId = 2207, FileIndex = 0, fname = 0x7f88ec281be0 'U' <repeats 200 times>..., jr = {JobId = 8691, Job = "de-1941-v3.d.ipeer.se.2011-10-30_23.55.36_31", '\000' <repeats 83 times>, Name = "de-1941-v3.d.ipeer.se", '\000' <repeats 106 times>, JobType = 66, JobLevel = 73, JobStatus = 67, ClientId = 35, PoolId = 23, FileSetId = 1, PriorJobId = 0, SchedTime = 1320015336, StartTime = 1320015336, EndTime = 0, RealEndTime = 0, JobTDate = 0, VolSessionId = 0, VolSessionTime = 0, JobFiles = 0, JobErrors = 0, JobMissingFiles = 0, JobBytes = 0, ReadBytes = 0, PurgedFiles = 0, HasBase = 0, FirstIndex = 0, LastIndex = 0, StartFile = 0, EndFile = 0, StartBlock = 0, EndBlock = 0, cSchedTime = '\000' <repeats 49 times>, cStartTime = '\000' <repeats 49 times>, cEndTime = '\000' <repeats 49 times>, cRealEndTime = '\000' <repeats 49 times>, limit = 0, rec_addr = 0, FileIndex = 0}, previous_jr = { JobId = 0, Job = '\000' <repeats 127 times>, Name = '\000' <repeats 127 times>, JobType = 0, JobLevel = 0, JobStatus = 0, ClientId = 0, PoolId = 0, FileSetId = 0, PriorJobId = 0, SchedTime = 0, StartTime = 0, EndTime = 0, RealEndTime = 0, JobTDate = 0, VolSessionId = 0, VolSessionTime = 0, JobFiles = 0, JobErrors = 0, JobMissingFiles = 0, JobBytes = 0, ReadBytes = 0, PurgedFiles = 0, HasBase = 0, FirstIndex = 0, LastIndex = 0, StartFile = 0, EndFile = 0, StartBlock = 0, EndBlock = 0, cSchedTime = '\000' <repeats 49 times>, cStartTime = '\000' <repeats 49 times>, cEndTime = '\000' <repeats 49 times>, cRealEndTime = '\000' <repeats 49 times>, limit = 0, rec_addr = 0, FileIndex = 0}, previous_job = 0x0, mig_jcr = 0x0, FSCreateTime = "2011-06-10 23:05:02", '\000' <repeats 30 times>, since = ", since=2011-10-30 00:47:27", '\000' <repeats 22 times>, {RestoreJobId = 0, MigrateJobId = 0}, client_uname = 0x7f88ec2829f0 "5.0.3 (04Aug10) Linux,Cross-compile,Win64", pool_source = 0x7f88ec281d30 "Job resource", rpool_source = 0x0, rstore_source = 0x0, wstore_source = 0x7f88ec2822c0 "Job resource", catalog_source = 0x7f88ec281fe0 "Client resource", replace = 0, NumVols = 0, reschedule_count = 0, FDVersion = 0, spool_size = 0, spool_data = false, acquired_resource_locks = true, term_wait_inited = true, fn_printed = false, write_part_after_job = true, needs_sd = false, cloned = false, unlink_bsr = false, VSS = false, Encrypt = false, stats_enabled = false, no_maxtime = false, keep_sd_auth_key = false, use_accurate_chksum = false} (gdb) http://bugs.bacula.org/view.php?id=2 0x000000000040f0ea in authenticate_file_daemon (jcr=0x7f88ec280ef8) at authenticate.c:211 211 auth_success = cram_md5_respond(fd, client->password, &tls_remote_need, &compatible); (gdb) print fd $18 = (BSOCK *) 0x12384e8 (gdb) print *fd $19 = {read_seqno = 1, msg = 0x1238630 "3D5gz8+jM/Zd3E/HHG1qwA", errmsg = 0x11c7a10 'U' <repeats 200 times>..., res = 0x7f8788148118, m_spool_fd = 0x0, tls = 0x0, src_addr = 0x0, in_msg_no = 1, out_msg_no = 2, msglen = 23, timer_start = 0, timeout = 518400, m_fd = 1216, b_errno = 0, m_blocking = 1, errors = 0, m_suppress_error_msgs = false, client_addr = {sa_family = 2, sa_data = "#\216?G\t\000\000\000\000\000\000\000"}, peer_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, m_next = 0x0, m_jcr = 0x7f88ec280ef8, m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, m_who = 0x10d6348 "Client: de-1941-v3.d.ipeer.se-fd", m_host = 0x10950e8 "de-1941-v3.d.ipeer.se", m_port = 9102, m_tid = 0x0, m_data_end = 0, m_timed_out = false, m_terminated = false, m_duped = false, m_spool = false, m_use_locking = false} (gdb) ---------------------------------------------------------------------- (0006022) ba...@ip... (reporter) - 2011-10-31 09:57 http://bugs.bacula.org/view.php?id=1776#c6022 ---------------------------------------------------------------------- oh, and the print *jcr was from the requested stack point. 0000002 0x000000000040f0ea in authenticate_file_daemon (jcr=0x7f88ec280ef8) at authenticate.c:211 211 auth_success = cram_md5_respond(fd, client->password, &tls_remote_need, &compatible); Issue History Date Modified Username Field Change ====================================================================== 2011-10-31 08:23 ba...@ip...New Issue 2011-10-31 08:59 kern Note Added: 0006016 2011-10-31 08:59 kern Status new => feedback 2011-10-31 09:03 ba...@ip...File Added: config.out 2011-10-31 09:05 ba...@ip...Note Added: 0006017 2011-10-31 09:05 ba...@ip...Status feedback => new 2011-10-31 09:46 kern Note Added: 0006018 2011-10-31 09:55 kern Note Added: 0006020 2011-10-31 09:56 ba...@ip...Note Added: 0006021 2011-10-31 09:57 ba...@ip...Note Added: 0006022 ====================================================================== |