From: <bac...@li...> - 2007-12-10 21:12:53
The following issue has been set as RELATED TO issue 0001005.
======================================================================
http://bugs.bacula.org/view.php?id=897
======================================================================
Reported By:                ebollengier
Assigned To:                ebollengier
======================================================================
Project:                    bacula
Issue ID:                   897
Category:                   Director
Reproducibility:            have not tried
Severity:                   tweak
Priority:                   normal
Status:                     closed
Resolution:                 fixed
Fixed in Version:           SVN (pls. put version in Build field)
======================================================================
Date Submitted:             07-19-2007 04:06 UTC
Last Modified:              12-10-2007 21:12 UTC
======================================================================
Summary:                    The ClientRunScript on RunsOnFailure case isn't logged into bacula.log
Description:
When a job is aborted with a failure, the ClientRunScript with the RunsOnFailure option set doesn't appear in the log.
=> you don't know whether the script has been executed...
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0001005 ClientRunAfterJob runscript errors aren...
======================================================================

----------------------------------------------------------------------
 kern - 07-22-07 08:42
----------------------------------------------------------------------
Eric, could you test this again? It may be fixed by my latest changes to the SVN, which were not for the same problem but concerned improper termination of the job in the FD when the script failed.

----------------------------------------------------------------------
 ebollengier - 07-22-07 14:08
----------------------------------------------------------------------
I have run a new test, and it fails too. I have added a test to the runscript regress script; it must fail now.
In the RUN_FD_FAILED test, it runs
 /bin/echo touching /tmp/RUN_FD_FAILED
and
 /bin/touch /tmp/RUN_FD_FAILED

You will not see either of them in the log, but if you run ls /tmp/RUN_FD_FAILED, the file will be present (good).

I will try to see how to fix this. I think that the FD connection is closed too early.

----------------------------------------------------------------------
 ebollengier - 07-23-07 01:55
----------------------------------------------------------------------
My new test shows a very similar error (same effect, but with another cause). The regress script fails during the before-job runscript execution, so the backup isn't launched.

From backup.c:

   if (!send_runscripts_commands(jcr)) {
      goto bail_out;
   }
   ...
   /* Send backup command */
   bnet_fsend(fd, backupcmd);
   if (!response(jcr, fd, OKbackup, "backup", DISPLAY_ERROR)) {
      goto bail_out;
   }

IMHO, we don't wait for the end of the client session in error cases (with something like wait_for_job_termination()), so we miss the last messages.

I will continue to look into it.

----------------------------------------------------------------------
 kern - 07-23-07 03:22
----------------------------------------------------------------------
From what I see in the code in the FD, it is normal that the Director stops running. When the scripts are sent off to the FD, it executes them. One of them fails, and FailOnError (or whatever) is set, so the FD sends back a failure message to the Director rather than an OK. The response() subroutine then fails the Director when the return is NotOK. If you want a different behavior, you most likely need to look at the FD and ensure that it always sends back an OK.

By the way, I recently simplified the code around response() in dird/fd_cmds.c.

The termination sequence in the FD is rather abrupt when something goes wrong. It simply sends back an error status, and if the Dir is in response() it will probably ignore error messages.
If you want, we can probably modify that, but you will need to explain it carefully, as any change in this code can have very subtle effects.

----------------------------------------------------------------------
 ebollengier - 07-23-07 03:30
----------------------------------------------------------------------
In dird/backup.c, I have replaced wait_for_storage_daemon_termination() with the more general wait_for_job_termination() (which waits for both SD and FD termination), and it works.

I want to add a new regression test covering all the errors we can have, to see what will change.

Since it's not critical, we can just explain it in the documentation for the 2.2 release. But knowing what happens with runscripts is *very* important for production use.

--- build/src/dird/backup.c     (revision 5228)
+++ build/src/dird/backup.c     (working copy)
@@ -244,11 +244,14 @@
 bail_out:
    set_jcr_job_status(jcr, JS_ErrorTerminated);
    Dmsg1(400, "wait for sd. use=%d\n", jcr->use_count());
    /* Cancel SD */
    if (jcr->store_bsock) {
       jcr->store_bsock->fsend("cancel Job=%s\n", jcr->Job);
    }
-   wait_for_storage_daemon_termination(jcr);
+
+   wait_for_job_termination(jcr);
+
    Dmsg1(400, "after wait for sd. use=%d\n", jcr->use_count());
    return false;
 }

----------------------------------------------------------------------
 kern - 08-08-07 04:23
----------------------------------------------------------------------
I think, as we discussed via email, that the best approach is to correct this after version 2.2.0 is released, since it requires some design changes that may have unexpected consequences.

----------------------------------------------------------------------
 kern - 09-11-07 15:01
----------------------------------------------------------------------
I'm closing this because I think we can handle it on the regular development list since, from the last entry you posted, it looks like you have a fix.

----------------------------------------------------------------------
 ebollengier - 11-06-07 05:46
----------------------------------------------------------------------
The problem is that the connection is closed very shortly after an FD error, i.e.:

   if (do_something() == error) {
      dir->bsend("XXX error");   // this causes the director to close the connection
      goto bail_out;
   }
   ...
   dir->bsend(EndJob);           // not sent in many cases
   run_scripts();
   ...
   terminate_dir();

So, we can't get the run_script output in this case. I think it's quite hard to fix; it needs deep changes. As a first step, I will document this in the manual.

A possible workaround is to log job output on the FD (maybe we can add a message type for runscript?):

   append = /tmp/log = runscript

To be very clean, I think that the EndJob message MUST be sent in all cases. So, before closing the FD connection, the director has to wait for EndJob.

----------------------------------------------------------------------
 kern - 11-11-07 15:00
----------------------------------------------------------------------
This is fixed in the current SVN and will appear in version 3.0.0. It is unlikely to appear in a 2.2.x version unless we are able to do sufficient testing ...
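
Editorial sketch (not part of the tracker notes): the ordering problem described in the 11-06-07 note can be modelled in a few lines, assuming the FD's messages form an ordered stream and the Director either stops reading at the first error or keeps reading until EndJob. The names ReadPolicy, director_log and failed_job_stream, and the message strings, are hypothetical; they are not taken from the Bacula source or from bug_897_1005.patch.

```cpp
#include <string>
#include <vector>

// Hypothetical reading policies for the Director side of the FD connection.
enum class ReadPolicy { StopOnError, WaitForEndJob };

// Hypothetical message stream an FD might produce for a failed job:
// an error report, then the RunsOnFailure script output, then the end marker.
std::vector<std::string> failed_job_stream() {
    return {"error: backup step failed",
            "runscript: touching /tmp/RUN_FD_FAILED",
            "EndJob"};
}

// Returns the messages the Director would get to log under the given policy.
std::vector<std::string> director_log(const std::vector<std::string> &fd_stream,
                                      ReadPolicy policy) {
    std::vector<std::string> log;
    for (const std::string &msg : fd_stream) {
        if (msg == "EndJob") {
            break;                      // end marker: stop reading, nothing to log
        }
        log.push_back(msg);
        // Under StopOnError the Director hangs up as soon as it sees an error,
        // so anything the FD sends afterwards is never read.
        if (policy == ReadPolicy::StopOnError && msg.rfind("error:", 0) == 0) {
            break;
        }
    }
    return log;
}
```

In this model, StopOnError logs only the error line, while WaitForEndJob also logs the runscript line, which is the behavior the note asks for when it says EndJob must be sent, and waited for, in all cases.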

----------------------------------------------------------------------
 ebollengier - 12-10-07 21:10
----------------------------------------------------------------------
fixed in patches/testing/bug_897_1005.patch

Issue History
Date Modified   Username       Field                    Change
======================================================================
07-19-07 04:06  ebollengier    New Issue
07-19-07 04:06  ebollengier    Status                   new => assigned
07-19-07 04:06  ebollengier    Assigned To               => ebollengier
07-22-07 08:42  kern           Note Added: 0002609
07-22-07 14:08  ebollengier    Note Added: 0002610
07-23-07 01:55  ebollengier    Note Added: 0002611
07-23-07 03:22  kern           Note Added: 0002612
07-23-07 03:30  ebollengier    Note Added: 0002613
08-08-07 04:23  kern           Note Added: 0002660
08-08-07 04:23  kern           Status                   assigned => confirmed
09-11-07 15:01  kern           Note Added: 0002745
09-11-07 15:01  kern           Status                   confirmed => closed
09-11-07 15:01  kern           Resolution               open => suspended
11-02-07 07:25  kern           Status                   closed => feedback
11-06-07 05:45  ebollengier    Note Added: 0002932
11-06-07 05:46  ebollengier    Note Edited: 0002932
11-11-07 15:00  kern           Note Added: 0002942
11-11-07 15:00  kern           Status                   feedback => closed
11-11-07 15:00  kern           Resolution               suspended => fixed
11-11-07 15:00  kern           Fixed in Version          => 3.0.0
12-10-07 21:10  ebollengier    Status                   closed => resolved
12-10-07 21:10  ebollengier    Fixed in Version         3.0.0 => SVN (pls. put version in Build field)
12-10-07 21:10  ebollengier    Note Added: 0003017
12-10-07 21:10  ebollengier    Status                   resolved => feedback
12-10-07 21:10  ebollengier    Resolution               fixed => reopened
12-10-07 21:11  ebollengier    File Added: bug_897_1005.patch
12-10-07 21:11  ebollengier    Status                   feedback => closed
12-10-07 21:12  ebollengier    Resolution               reopened => fixed
12-10-07 21:12  ebollengier    Relationship added       related to 0001005
======================================================================