#2000 msg: Cluster reset happened due to msgd crashing on both controllers

5.1.RC2
fixed
None
defect
msg
d
5.1.FC
major
2016-09-13
2016-09-06
Ritu Raj
No

Environment details

OS : SUSE 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 4 nodes ( 2 controllers and 2 payloads with headless feature disabled & 1PBE enabled with 30K objects )

Summary :

Cluster reset happened because the assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed in msgd

Steps followed & Observed behaviour

  1. Invoked failover
  2. After a few successful failovers, the new active controller rebooted because the assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed in msgd. This happened while the previous active controller was rejoining the cluster in the standby role, so the cluster was reset.
    [Timeline: Sep 6 00:13:02 sofo-s2]

Sep 6 00:13:02 sofo-s2 osafimmd[3985]: NO MDS event from svc_id 24 (change:5, dest:13)
Sep 6 00:13:02 sofo-s2 osafmsgd[4145]: osaf_extended_name.c:139: osaf_extended_name_length: Assertion 'length < SA_MAX_UNEXTENDED_NAME_LENGTH' failed.
Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: NO 'safComp=MQD,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: ER safComp=MQD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 6 00:13:02 sofo-s2 osafamfnd[4046]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 6 00:13:02 sofo-s2 opensaf_reboot: Rebooting local node; timeout=60

Notes:
1. Syslog attached
2. msgnd & msgd traces were not enabled

2 Attachments

Discussion

  • Ritu Raj

    Ritu Raj - 2016-09-07

    I attached the bt and the msgd trace file; below is a snippet of the bt:

    2 0x00007f44089ef197 in osafassert_fail (file=0x7f4408a41987 "osaf_extended_name.c", line=139, func=0x7f4408a419f0 <FUNCTION.2883> "osaf_extended_name_length",
    __assertion=0x7f4408a41960 "length < SA_MAX_UNEXTENDED_NAME_LENGTH") at sysf_def.c:281

    3 0x00007f44089ead1e in osaf_extended_name_length (name=0x67a72c) at osaf_extended_name.c:139

    4 0x00007f44089fe7ff in osaf_encode_sanamet (ub=0x7fff9f4f09d0, name=0x67a72c) at hj_enc.c:403

    5 0x00007f44089eb275 in ncs_edp_sanamet (hdl=0x6654c0, edu_tkn=0x0, ptr=0x67a72c, ptr_data_len=0x7fff9f4eee14, buf_env=0x7fff9f4f0130, op=EDP_OP_TYPE_ENC, o_err=0x7fff9f4f0238) at saf_edu.c:62

    6 0x00007f44089f8ca1 in ncs_edu_run_edp (edu_hdl=0x6654c0, edu_tkn=0x0, rule=0x7fff9f4ef190, edp=0x404f40 ncs_edp_sanamet@plt, ptr=0x67a72c, dcnt=0x7fff9f4eee14, buf_env=0x7fff9f4f0130,
    optype=EDP_OP_TYPE_ENC, o_err=0x7fff9f4f0238) at hj_edu.c:499

    7 0x00007f44089f99b2 in ncs_edu_prfm_enc_on_non_ptr (edu_hdl=0x6654c0, edu_tkn=0x0, hdl_node=0x0, rule=0x7fff9f4ef190, ptr=0x67a72c, ptr_data_len=0x7fff9f4ef364, buf_env=0x7fff9f4f0130, o_err=0x7fff9f4f0238)
    at hj_edu.c:972

    8 0x00007f44089f9302 in ncs_edu_perform_exec_action_on_non_ptr (edu_hdl=0x6654c0, edu_tkn=0x0, hdl_node=0x0, rule=0x7fff9f4ef190, optype=EDP_OP_TYPE_ENC, ptr=0x67a72c, ptr_data_len=0x7fff9f4ef364,
    buf_env=0x7fff9f4f0130, o_err=0x7fff9f4f0238) at hj_edu.c:805

    9 0x00007f44089f92a0 in ncs_edu_perform_exec_action (edu_hdl=0x6654c0, edu_tkn=0x0, hdl_node=0x0, rule=0x7fff9f4ef190, optype=EDP_OP_TYPE_ENC, ptr=0x67a72c, ptr_data_len=0x7fff9f4ef364,
    buf_env=0x7fff9f4f0130, o_err=0x7fff9f4f0238) at hj_edu.c:780

    10 0x00007f44089f9041 in ncs_edu_exec_rule (edu_hdl=0x6654c0, edu_tkn=0x0, hdl_node=0x0, rule=0x7fff9f4ef190, ptr=0x67a72c, ptr_data_len=0x7fff9f4ef364, buf_env=0x7fff9f4f0130, optype=EDP_OP_TYPE_ENC,
    o_err=0x7fff9f4f0238) at hj_edu.c:627

    11 0x00007f44089fa8db in ncs_edu_run_rules_for_enc (edu_hdl=0x6654c0, edu_tkn=0x0, hdl_node=0x0, prog=0x7fff9f4ef150, ptr=0x67a72c, ptr_data_len=0x7fff9f4ef364, buf_env=0x7fff9f4f0130, o_err=0x7fff9f4f0238,
    instr_count=4) at hj_edu.c:1666

     

    Last edit: Ritu Raj 2016-09-07
    • Zoran Milinkovic

      Hi Ritu,

      If 'length' in a non-extended SaNameT has the value 256, you will see asserts like the one reported in this ticket.
      The extended names feature does not support a non-extended SaNameT with a length of 256.
      To safely use lengths greater than 255, use the osaf_extended_* functions.

      Please check your test and confirm the statement above.

      Thanks,
      Zoran
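
The length rule described above can be illustrated with a minimal sketch. It does not use the real OpenSAF headers: `DemoNameT`, `DEMO_MAX_UNEXTENDED_NAME_LENGTH` and both helper functions are hypothetical stand-ins, and the limit of 256 is assumed from the statement that a length of 256 is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SA_MAX_UNEXTENDED_NAME_LENGTH (value assumed). */
#define DEMO_MAX_UNEXTENDED_NAME_LENGTH 256

/* Hypothetical stand-in for a non-extended SaNameT: a length plus a fixed buffer. */
typedef struct {
    uint16_t length;
    uint8_t value[DEMO_MAX_UNEXTENDED_NAME_LENGTH];
} DemoNameT;

/* Same condition as the failed assertion: only lengths 0..255 are accepted. */
static bool demo_name_length_is_valid(const DemoNameT *name)
{
    return name->length < DEMO_MAX_UNEXTENDED_NAME_LENGTH;
}

/* Stores a string in the name, refusing anything that would not fit the rule. */
static bool demo_name_set(DemoNameT *name, const char *str)
{
    size_t len = strlen(str);

    if (len >= DEMO_MAX_UNEXTENDED_NAME_LENGTH) {
        return false; /* too long for the non-extended form */
    }
    memset(name, 0, sizeof(*name)); /* no stale bytes left in the buffer */
    memcpy(name->value, str, len);
    name->length = (uint16_t)len;
    return true;
}
```

A caller could check `demo_name_length_is_valid()` before handing a name to an encoder and report an error instead of asserting. Whether msgd should do that or always go through the osaf_extended_* functions is a design decision this sketch does not settle.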

  • A V Mahesh (AVM)

    There is no osaf_extended_name support in MQSv; please check
    why an osaf_extended_name context is showing up here.

     
  • A V Mahesh (AVM)

    • Component: msg --> osaf
    • Milestone: 4.7.2 --> 5.1.RC1
     
  • A V Mahesh (AVM)

    This looks like an issue in the LEAP changes related to osaf_extended_name.

     
  • A V Mahesh (AVM)

    • summary: msg: Cluster reset happened due to msgd crashing on both controllers --> osaf: Cluster reset happened due to msgd crashing on both controllers
     
  • Anders Widell

    Anders Widell - 2016-09-08
    • Component: osaf --> msg
     
  • A V Mahesh (AVM)

    • summary: osaf: Cluster reset happened due to msgd crashing on both controllers --> msg: Cluster reset happened due to msgd crashing on both controllers
    • status: unassigned --> review
    • assigned_to: A V Mahesh (AVM)
     
  • A V Mahesh (AVM)

    • status: review --> fixed
     
  • A V Mahesh (AVM)

    changeset: 8064:99410ba8cc21
    parent: 8061:da089e8f337c
    user: Ramesh <ramesh.betham@oracle.com>
    date: Tue Sep 13 15:01:43 2016 +0530
    summary: msg: memset ilist_info and track_info to avoid garbage [#2000]

    changeset: 8065:019e617955ef
    branch: opensaf-5.1.x
    tag: tip
    parent: 8063:59a5226122ed
    user: Ramesh <ramesh.betham@oracle.com>
    date: Tue Sep 13 15:02:23 2016 +0530
    summary: msg: memset ilist_info and track_info to avoid garbage [#2000]
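
The changeset summary says the fix clears ilist_info and track_info so that they no longer carry garbage. The sketch below shows why that matters for the assertion in this ticket. It is not the actual patch: the real structure layouts are not shown here, so `SketchNameT`, `SketchTrackInfo` and all three functions are hypothetical, and the limit of 256 is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SA_MAX_UNEXTENDED_NAME_LENGTH (value assumed). */
#define SKETCH_MAX_UNEXTENDED_NAME_LENGTH 256

/* Hypothetical stand-in for a non-extended SaNameT. */
typedef struct {
    uint16_t length;
    uint8_t value[SKETCH_MAX_UNEXTENDED_NAME_LENGTH];
} SketchNameT;

/* Hypothetical payload that embeds a name, loosely modelled on track_info. */
typedef struct {
    uint32_t queue_count;
    SketchNameT queue_name;
} SketchTrackInfo;

/* Simulates leftover memory contents: every byte set, so length reads 0xFFFF. */
static void sketch_fill_with_garbage(SketchTrackInfo *info)
{
    memset(info, 0xFF, sizeof(*info));
}

/* The idea behind the fix: zero the whole structure before filling it in. */
static void sketch_track_info_init(SketchTrackInfo *info)
{
    memset(info, 0, sizeof(*info));
}

/* Mirrors the condition the encoder asserts on for the embedded name. */
static bool sketch_track_info_encodable(const SketchTrackInfo *info)
{
    return info->queue_name.length < SKETCH_MAX_UNEXTENDED_NAME_LENGTH;
}
```

With garbage in the structure, the embedded length can be any 16-bit value, so the check fails only some of the time. That would fit a crash that appeared only after a few successful failovers, though the ticket does not confirm this.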

     

  • Anders Widell

    Anders Widell - 2016-09-13
    • Milestone: 5.1.RC1 --> 5.1.RC2
     
