#437 osaf: opensaf_reboot is not safe

4.3.1
fixed
None
enhancement
osaf
-
major
2013-06-27
2013-05-31
No

Migrated from http://devel.opensaf.org/ticket/3085

AMF uses opensaf_reboot as a panic operation. Under the hood the script calls the command "reboot -f" which basically does sync() followed by reboot().

The first issue is that the reboot command itself can fail if there is, for example, a hard drive failure. An fsck on reboot might fix the problem.

The second issue is that sync() can hang forever if 1) there is a bug or corruption in the file system, or 2) a network file system server is not responding.

It is suggested that the opensaf_reboot command be time supervised and that, after a timeout expires, it fall back to reboot() or "echo b > /proc/sysrq-trigger".

Out-of-memory situations should also be considered. The supervision mechanism should be safe in the sense that no forks or memory allocations are needed to reboot.
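
As a rough illustration of the "no forks, no memory allocation" requirement, the C equivalent of "echo b > /proc/sysrq-trigger" can be written with only open(), write() and close(). This is a minimal sketch, not the code from the changeset: the function name sysrq_trigger_write is hypothetical, and the path is a parameter so that the helper can be tried against an ordinary file instead of the real trigger.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not an OpenSAF function): write the single byte 'b'
 * to the given trigger file. Only open/write/close are used, so no child
 * process is created and nothing is allocated on the heap.
 * Returns 0 on success and -1 on failure. */
int sysrq_trigger_write(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, "b", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}
```

Called with "/proc/sysrq-trigger" as root on a kernel that has SysRq enabled, this would reboot the machine immediately without syncing, so it should only be pointed at a scratch file when experimenting.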

Related

Tickets: #1066
Tickets: #437

Discussion

  • Anders Widell

    Anders Widell - 2013-05-31

    Changed 2 months ago by nagendra

    From the ticket description, this looks like an enhancement to me. Any other comments?
    Changed 2 months ago by rameshb

    In my opinion, this ticket is a candidate for an enhancement ticket.

    Also, let us be careful about the inputs that go into the "opensaf_reboot" script, because what is needed to support a graceful shutdown depends on the characteristics of the target system, and it may not be generic enough to apply to all systems.

    So let this ticket collect and consolidate the data, so that the update to the "opensaf_reboot" script is generic enough in the MW context.
    follow-up: ↓ 4 Changed 2 months ago by nagendra

    Let me understand this a bit better.
    You are proposing that reboot has two tasks to do:
    1. the syncing part and, after the sync is over,
    2. the reboot of the machine.

    1. On some machines, you have found that the system hangs while syncing (because of a system issue) and never gets to the reboot. Am I right?
    2. Are you proposing that opensaf_reboot should be able to monitor such situations?
    3. opensaf_reboot should be supervised: this comes with the supposition that reboot will not return until it finishes step 1 (syncing). Once it has synced, the reboot command returns before proceeding to step 2.

    4. Can there be some way to start a script from opensaf_reboot that, when it times out, uses "echo b > /proc/sysrq-trigger"?

    5. If 4. is required, can a watchdog help here? This comes with the assumption that when sync hangs, the system hangs. Can there be a case where the system still responds to the watchdog while sync hangs?
      in reply to: ↑ 3 ; follow-up: ↓ 5 Changed 2 months ago by hafe

    Replying to nagendra:

    Let me understand this a bit better.
    You are proposing that reboot has two tasks to do:
    1. the syncing part and, after the sync is over,
    2. the reboot of the machine.
    

    I am not proposing that, this is what is currently done by invoking "reboot -f". The sync part can hang.

    1. On some machines, you have found that the system hangs while syncing (because of a system issue) and never gets to the reboot. Am I right?
    

    Yes

    2. Are you proposing that opensaf_reboot should be able to monitor such situations?
    

    Yes

    3. opensaf_reboot should be supervised: this comes with the supposition that reboot will not return until it finishes step 1 (syncing). Once it has synced, the reboot command returns before proceeding to step 2.
    

    yes (I think, no question asked)

    4. Can there be some way to start a script from opensaf_reboot that, when it times out, uses "echo b > /proc/sysrq-trigger"?
    

    Yes, but as I said, under low-memory conditions it would be best to secure the C function opensaf_reboot().

    5. If 4. is required, can a watchdog help here? This comes with the assumption that when sync hangs, the system hangs. Can there be a case where the system still responds to the watchdog while sync hangs?
    

    I have considered stopping the watchdog kicker, but there is no general interface for that.
    in reply to: ↑ 4 ; follow-up: ↓ 6 Changed 2 months ago by nagendra

    Replying to hafe:

    Replying to nagendra:
    
        Let me understand this a bit better.
        You are proposing that reboot has two tasks to do:
        1. the syncing part and, after the sync is over,
        2. the reboot of the machine.
    
    
    I am not proposing that, this is what is currently done by invoking "reboot -f". The sync part can hang.
    
    
        1. On some machines, you have found that the system hangs while syncing (because of a system issue) and never gets to the reboot. Am I right?
    
    
    Yes
    
        2. Are you proposing that opensaf_reboot should be able to monitor such situations?
    
    
    Yes
    
        3. opensaf_reboot should be supervised: this comes with the supposition that reboot will not return until it finishes step 1 (syncing). Once it has synced, the reboot command returns before proceeding to step 2.
    
    
    yes (I think, no question asked)
    
    
        4. Can there be some way to start a script from opensaf_reboot that, when it times out, uses "echo b > /proc/sysrq-trigger"?
    
    
    Yes, but as I said, under low-memory conditions it would be best to secure the C function opensaf_reboot().
    

    I do not follow. When opensaf_reboot is called, it calls a script which, once it times out, reboots the machine. Are you saying that this consumes some memory and may therefore not work under low-memory conditions? If so, then opensaf_reboot can't be used under low memory.

        5. If 4. is required, can a watchdog help here? This comes with the assumption that when sync hangs, the system hangs. Can there be a case where the system still responds to the watchdog while sync hangs?
    
    
    I have considered stopping the watchdog kicker, but there is no general interface for that.
    

    I do not follow. If sync hangs, the system will also hang and the watchdog will kick in. Did you see this happening on your system?
    If the watchdog is working, then maybe the watchdog could be used.

    in reply to: ↑ 5 Changed 2 months ago by hafe

    Replying to nagendra:

        Yes, but as I said, under low-memory conditions it would be best to secure the C function opensaf_reboot().
    
    I do not follow. When opensaf_reboot is called, it calls a script which, once it times out, reboots the machine. Are you saying that this consumes some memory and may therefore not work under low-memory conditions? If so, then opensaf_reboot can't be used under low memory.
    

    opensaf_reboot() calls system(), which does fork and exec. fork can fail with ENOMEM, and then opensaf_reboot() will fail, in the best case with a syslog entry, and the system will hang.

    Another scenario is embedded systems with overcommit disabled, which increases the risk of a failed fork and a "zombie" system over which AMF has lost control.

            5. If 4. is required, can a watchdog help here? This comes with the assumption that when sync hangs, the system hangs. Can there be a case where the system still responds to the watchdog while sync hangs?
    
    
        I have considered stopping the watchdog kicker, but there is no general interface for that.
    
    I do not follow. If sync hangs, the system will also hang and the watchdog will kick in. Did you see this happening on your system?
    

    No, because the scheduler is still working and the watchdog does not test the file system (which it could).

    If the watchdog is working, then maybe the watchdog could be used.
    

    That is a system integration issue that may or may not be possible. This ticket is an OpenSAF change that will benefit all users.

    follow-up: ↓ 8 Changed 8 weeks ago by mathi

    It was discussed in the TLC that, even though this is an integration topic, we could add value by extending the support during reboot scenarios to cover situations such as a sync hang.
    The following options are available to supervise reboot:

    1) Use an alarm(OPENSAF_REBOOT_TIMEOUT); the alarm could be started before triggering the opensaf_reboot script. Upon timeout, the signal handler for this alarm could invoke reboot(). One downside of this signal is that it is not thread safe.

    2) Start a timer, post a message to the mailbox, and process the timeout (call reboot()) serially. The downside is that it may cause problems when 'opensafd stop' gets invoked during the OS reboot procedure.

    3) Keep opensaf_reboot blocking on a select after triggering the opensaf_reboot script. I don't think our services are coded for this, because currently they go back to their poll after invoking opensaf_reboot.

    4) What other options are possible here...
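
Option 1 above could look roughly like the following. This is only a sketch under stated assumptions: the names arm_reboot_supervision and on_timeout are hypothetical, the timeout is passed in rather than taken from OPENSAF_REBOOT_TIMEOUT, and instead of calling reboot() the handler writes 'b' to a descriptor opened in advance (for example one opened on /proc/sysrq-trigger), since write() is async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Descriptor the handler writes to; opened by the caller before arming. */
static volatile sig_atomic_t fallback_fd = -1;

/* SIGALRM handler: uses only write(), which is async-signal-safe. */
static void on_timeout(int signo)
{
    (void)signo;
    if (fallback_fd >= 0) {
        ssize_t n = write(fallback_fd, "b", 1);
        (void)n; /* nothing sensible can be done on failure here */
    }
}

/* Hypothetical helper: install the handler and start the alarm.
 * The caller would then trigger the opensaf_reboot script as usual.
 * Returns 0 on success and -1 if the handler could not be installed. */
int arm_reboot_supervision(int fd, unsigned timeout_sec)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_timeout;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    fallback_fd = fd;
    alarm(timeout_sec);
    return 0;
}
```

In a multithreaded process SIGALRM may be delivered to any thread that does not block it, which is one way to read the thread-safety concern mentioned in option 1.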
    in reply to: ↑ 7 Changed 8 weeks ago by hafe

    Replying to mathi:

    It was discussed in the TLC that, even though this is an integration topic, we could add value by extending the support during reboot scenarios to cover situations such as a sync hang.
    The following options are available to supervise reboot:
    
    1) Use an alarm(OPENSAF_REBOOT_TIMEOUT); the alarm could be started before triggering the opensaf_reboot script. Upon timeout, the signal handler for this alarm could invoke reboot(). One downside of this signal is that it is not thread safe.
    
    2) Start a timer, post a message to the mailbox, and process the timeout (call reboot()) serially. The downside is that it may cause problems when 'opensafd stop' gets invoked during the OS reboot procedure.
    
    3) Keep opensaf_reboot blocking on a select after triggering the opensaf_reboot script. I don't think our services are coded for this, because currently they go back to their poll after invoking opensaf_reboot.
    
    4) What other options are possible here...
    

    The reboot() system call is not available under strict LSB requirements, and neither is syscall()...

    Anders W and I discussed having the main thread do fork+exec of the opensaf_reboot script, sleep(OPENSAF_REBOOT_TIMEOUT) and then do "echo b > /proc/sysrq-trigger" (from C code...). This of course requires that the kernel is configured with "/proc/sysrq-trigger", which seems to be the case in most kernels nowadays. If not, it will be the same as today, with no added value.
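
The fork+exec, sleep, then write sequence described above might be sketched as follows. This is an illustration only and not the code in the changeset: supervised_reboot is a hypothetical name, the script path, trigger path and timeout are parameters instead of the real opensaf_reboot script, /proc/sysrq-trigger and OPENSAF_REBOOT_TIMEOUT, and opening the trigger before the fork and skipping the sleep when fork fails are choices made here, not something stated in the ticket.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: start the reboot script, wait for the timeout and,
 * if this process is still running, write 'b' to the trigger file.
 * Returns 0 if the fallback write succeeded and -1 otherwise. */
int supervised_reboot(const char *script, const char *trigger,
                      unsigned timeout_sec)
{
    /* Open the trigger first so the fallback needs no further resources. */
    int fd = open(trigger, O_WRONLY | O_CLOEXEC);

    pid_t pid = fork();
    if (pid == 0) {
        execl(script, script, (char *)NULL);
        _exit(127); /* exec failed */
    }

    /* If fork failed (e.g. ENOMEM) there is nothing to wait for,
     * so go straight to the fallback. */
    if (pid > 0) {
        unsigned left = timeout_sec;
        while ((left = sleep(left)) > 0)
            ; /* resume the sleep if a signal interrupted it */
    }

    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "b", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}
```

If the script manages to reboot the node, the process never gets past the sleep; the write is only reached when the script hangs, fails, or could not be started.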

    Some differentiation needs to be made between a panic reboot of the local node and reboots of other nodes using PLM. Ideally I would like to have two different scripts, one for panic reboot of the local node and one for reboots of other nodes (using PLM or whatever).
    Changed 8 weeks ago by mathi

    Sounds OK. But note that
    when PLM is enabled, reboots of both local and remote nodes are done through PLM.
    So there is no real need yet to split the opensaf_reboot script. Let's not introduce additional scripts for the reboot interface; instead, make the PLM vs. no-PLM differentiation in the opensaf_reboot script itself, the way we do today.
    Changed 7 weeks ago by anwi

    owner set to anwi
    status changed from new to accepted
    
     
  • Anders Widell

    Anders Widell - 2013-05-31
    • summary: leap: opensaf_reboot is not safe --> osaf: opensaf_reboot is not safe
     
  • Anders Widell

    Anders Widell - 2013-05-31
    • status: accepted --> review
     
  • Anders Widell

    Anders Widell - 2013-06-20
    • status: review --> fixed
     
  • Anders Widell

    Anders Widell - 2013-06-20

    changeset: 4314:9922d5378faf
    tag: tip
    user: Anders Widell anders.widell@ericsson.com
    date: Thu Jun 20 12:42:37 2013 +0200
    summary: osaf: Add time supervision of opensaf_reboot [#437]

    Node ID 9922d5378fafdd9b4773b96be40c7bee33ee6858

     

    Related

    Tickets: #437

  • Anders Widell

    Anders Widell - 2013-06-27

    changeset: 4323:ff2940708f4c
    branch: opensaf-4.3.x
    tag: tip
    parent: 4321:5eda40762ad8
    user: Anders Widell anders.widell@er..
    date: Thu Jun 20 12:42:37 2013 +0200
    summary: osaf: Add time supervision of opensaf_reboot [#437]

    Node ID ff2940708f4cdef35ce8bc4deacff1237a2570fd

     

    Related

    Tickets: #437

  • Anders Widell

    Anders Widell - 2013-06-27
    • Milestone: 4.4.FC --> 4.3.1
     
