Migrated from http://devel.opensaf.org/ticket/3085
AMF uses opensaf_reboot as a panic operation. Under the hood the script calls the command "reboot -f" which basically does sync() followed by reboot().
First issue is that the reboot command itself can fail if there is e.g. a hard drive failure. fsck on reboot might possibly fix the problem.
Second issue is that sync() can hang forever if 1) there is a bug or corruption in the file system, 2) a network file system server is not responding.
It is suggested that the opensaf_reboot command be time supervised and that, after a timeout expires, it fall back to reboot() or "echo b > /proc/sysrq-trigger".
Out-of-memory situations should also be considered. The supervision mechanism should be safe in the sense that no forks or memory allocations are needed to reboot.
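As an illustration of the "no forks or memory allocation" requirement, the "echo b > /proc/sysrq-trigger" fallback can be done from C with plain system calls. The following is a minimal sketch, not code from OpenSAF; the function name sysrq_reboot and its trigger_path parameter are hypothetical, and the path is a parameter only so the routine can be tried against an ordinary file.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of OpenSAF): request an immediate reboot
 * by writing "b" to the sysrq trigger file. Only open(), write() and
 * close() are used, so no fork and no heap allocation is involved.
 * Returns 0 if the byte was written, -1 otherwise. */
static int sysrq_reboot(const char *trigger_path)
{
	int fd = open(trigger_path, O_WRONLY);
	if (fd < 0)
		return -1;

	ssize_t n = write(fd, "b", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

On a real node the call would be sysrq_reboot("/proc/sysrq-trigger"), which needs root privileges and a kernel built with sysrq support. The "b" request reboots at once without syncing or unmounting disks, which is why it is only suitable as a last resort after the timeout.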
Changed 2 months ago by nagendra
From the ticket description, this looks to me like an enhancement. Does anybody else have comments?
Changed 2 months ago by rameshb
In my opinion, this ticket is a candidate for enhancement-ticket.
Also, let us be conscious of the inputs that go into the "opensaf_reboot" script, because the content needed to support a graceful shutdown depends on the characteristics of the target system and may not be generic enough to apply to all systems.
So let this ticket collect and consolidate the data, so that the update to the "opensaf_reboot" script is generic enough in the MW context.
Changed 2 months ago by nagendra
Let me understand it a bit better.
You are proposing that reboot has two jobs to do:
1. the syncing part, and after the sync is over,
2. it reboots the machine.
opensaf_reboot should be supervised: this comes with the supposition that reboot will not return until it finishes step 1 (syncing). Once the sync completes, the reboot command returns before proceeding to step 2.
Can there be some way to start a script from opensaf_reboot so that, when it times out, it uses "echo b > /proc/sysrq-trigger"?
If 4. is required, can the watchdog help here? This comes with the assumption that when sync hangs, the system hangs. Can there be a case where the system can respond to the watchdog while sync hangs?
Changed 2 months ago by hafe
Replying to nagendra:
I am not proposing that, this is what is currently done by invoking "reboot -f". The sync part can hang.
Yes
Yes
yes (I think, no question asked)
Yes, but as I say under low mem conditions it would be best to secure the C function opensaf_reboot()
I have considered stopping the watchdog kicker but there is no general interface for that
Changed 2 months ago by nagendra
Replying to hafe:
I am not able to follow. When opensaf_reboot is called, it calls a script which, once it times out, reboots the machine. Are you saying that it will consume some memory and that in a low-memory condition it may not work? If yes, then opensaf_reboot can't be used under low memory.
I did not get it. If sync hangs, the system will also hang and the watchdog will kick in. Did you see this happening in your system?
If the watchdog is working, then maybe the watchdog could be used.
Changed 2 months ago by hafe
Replying to nagendra:
opensaf_reboot() calls system(), which does fork and exec. fork can fail with ENOMEM, and then opensaf_reboot() will fail. In the best case there is a syslog entry, and the system will hang.
Another scenario is embedded systems with overcommit disabled, which increases the risk of a failed fork and a "zombie" system over which AMF has lost control.
No because the scheduler is still working and the watchdog does not test the file system (which it could).
That is a system integration issue that might/might not be possible. This ticket is an opensaf change that will benefit all users.
Changed 8 weeks ago by mathi
It was discussed in the TLC that even though this is an integration topic, we could provide value add by extending the support during reboot scenarios to cover situations such as sync-hang.
The following options are available to supervise reboot:
1) Use an alarm(OPENSAF_REBOOT_TIMEOUT); the alarm could be started before triggering the opensaf_reboot script. Upon timeout, the signal handler for this alarm could invoke reboot(). One downside of this signal-based approach is that it is not thread safe.
2) Start a timer, post a message to the mailbox, and process the timeout serially (by calling reboot()). The downside is that it may cause problems when 'opensafd stop' gets invoked during the OS reboot procedure.
3) Keep opensaf_reboot blocking on a select after triggering the opensaf_reboot script. I don't think our services are coded for this, because currently they go back to their poll after invoking opensaf_reboot.
4) What other options are possible here...
Changed 8 weeks ago by hafe
Replying to mathi:
reboot() system call is not available under strict LSB requirements, not syscall() either...
Anders W and I discussed having the main thread do fork+exec of the opensaf_reboot script, sleep(OPENSAF_REBOOT_TIMEOUT) and then do "echo b > /proc/sysrq-trigger" (from C code...). This of course requires that the kernel is configured with "/proc/sysrq-trigger", which seems to be the case in most kernels nowadays. If not, it will be the same as today, no added value.
Some distinction needs to be made between panic reboot of the local node and reboots of other nodes using PLM. Ideally I would like to have two different scripts, one for panic reboot of the local node and one for reboots of other nodes (using PLM or whatever).
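The fork+exec, sleep, then sysrq sequence described in this comment can be sketched as follows. This is not the code from the eventual changeset; supervised_reboot and its parameters are hypothetical names, the script is started without arguments for brevity, and the trigger path is a parameter so the routine can be tried against an ordinary file.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: start the reboot script, give it timeout_sec
 * seconds to take the node down, then fall back to the sysrq trigger.
 * On a real node this only returns if the fallback also failed. */
static int supervised_reboot(const char *script, unsigned timeout_sec,
			     const char *trigger_path)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: run the script; _exit() if exec fails. */
		execl(script, script, (char *)NULL);
		_exit(127);
	}

	/* Parent: wait only if the script was actually started. If fork
	 * failed (e.g. ENOMEM), go straight to the fallback. */
	if (pid > 0) {
		unsigned left = timeout_sec;
		while (left > 0)
			left = sleep(left); /* resume if interrupted */
	}

	/* Fallback: equivalent of "echo b > /proc/sysrq-trigger". */
	int fd = open(trigger_path, O_WRONLY);
	if (fd < 0)
		return -1;
	ssize_t n = write(fd, "b", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

Unlike system(), a failed fork here does not leave the node hanging, since the parent still reaches the fallback, which needs no further fork or allocation.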
Changed 8 weeks ago by mathi
Sounds ok. But note that,
when PLM is enabled, reboots of both local and remote nodes are done through PLM.
So there is no real necessity yet to split the opensaf_reboot script. Let's not introduce additional scripts for the reboot interface; instead, make the differentiation of PLM vs. no-PLM in the opensaf_reboot script itself, the way we are doing now.
Changed 7 weeks ago by anwi
https://sourceforge.net/mailarchive/message.php?msg_id=30918816
changeset: 4314:9922d5378faf
tag: tip
user: Anders Widell anders.widell@ericsson.com
date: Thu Jun 20 12:42:37 2013 +0200
summary: osaf: Add time supervision of opensaf_reboot [#437]
Node ID 9922d5378fafdd9b4773b96be40c7bee33ee6858
Related
Tickets:
#437
changeset: 4323:ff2940708f4c
branch: opensaf-4.3.x
tag: tip
parent: 4321:5eda40762ad8
user: Anders Widell anders.widell@er..
date: Thu Jun 20 12:42:37 2013 +0200
summary: osaf: Add time supervision of opensaf_reboot [#437]
Node ID ff2940708f4cdef35ce8bc4deacff1237a2570fd
Related
Tickets:
#437