
#1815 mds: suspected message loss in large cluster deployments

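Milestone: future
Status: assigned
Owner: nobody
Labels: None
Type: defect
Component: mds
Part: -
Severity: major
Blocker: False
Updated: 2017-08-28
Created: 2016-05-09
Creator: Gary Lee
Private: No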

CLM callbacks to amfd have been observed to get 'lost' in large clusters.
The loss appears to occur in MDS, when the callbacks are sent at around
the same time that amfd is calling avd_imm_config_get().

avd_imm_config_get() seems to generate a large amount of traffic through MDS.

Related

Tickets: #1815
Wiki: ChangeLog-5.0.1

Discussion

  • Gary Lee

    Gary Lee - 2016-05-09
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,4 @@
    -On large clusters (75 tested in this case), avd_imm_config_get() can take up to 3 minutes to complete.
    +On large clusters (75 nodes tested in this case), avd_imm_config_get() can take up to 3 minutes to complete.
     The default heartbeat period of 60s between amfnd and amfd does not allow enough time
     for avd_imm_config_get() to finish. The same thread is handling both heartbeat
     and reading of IMM.
    
     
  • Gary Lee

    Gary Lee - 2016-05-09
    • status: accepted --> review
     
  • Gary Lee

    Gary Lee - 2016-05-13
    • status: review --> unassigned
    • assigned_to: Gary Lee --> nobody
    • Component: amf --> mds
    • Part: d --> -
     
  • Gary Lee

    Gary Lee - 2016-05-13
    • summary: amf: heartbeat timeout on large clusters --> mds: suspected message loss in large cluster deployments
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,6 @@
    -On large clusters (75 nodes tested in this case), avd_imm_config_get() can take up to 3 minutes to complete.
    -The default heartbeat period of 60s between amfnd and amfd does not allow enough time
    -for avd_imm_config_get() to finish. The same thread is handling both heartbeat
    -and reading of IMM.
    +It has been observed that CLM callbacks to amfd can become 'lost'
    +in a large cluster. It seems to be occurring in MDS, when the callbacks are
    +sent around the same time as amfd is calling avd_imm_config_get().
    +
    +It seems avd_imm_config_get() generates a large
    +amount of traffic through MDS.
    
     
  • Gary Lee

    Gary Lee - 2016-05-13

    Worked around the problem by calling avd_clm_track_start() only after avd_imm_config_get() has completed:

    changeset: 7633:073d00b1cb8f
    branch: opensaf-5.0.x
    tag: tip
    parent: 7630:bf51ccd1c73d
    user: Gary Lee <gary.lee@dektech.com.au>
    date: Fri May 13 15:41:44 2016 +1000
    summary: amf: start clm tracking after reading IMM config [#1815]

    changeset: 7632:95b231ff0240
    parent: 7629:61b071011bda
    user: Gary Lee <gary.lee@dektech.com.au>
    date: Fri May 13 15:38:20 2016 +1000
    summary: amf: start clm tracking after reading IMM config [#1815]
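
The ordering change in the workaround can be illustrated with a toy model. This is only a sketch, not OpenSAF code: `ToyTransport`, `read_config` and `track_callback` are hypothetical stand-ins for MDS, avd_imm_config_get() and the CLM track callback, and the rule "a message delivered while the config read is in progress is dropped" is an assumption made purely for illustration, since the ticket does not establish why the callback is lost.

```python
class ToyTransport:
    """Hypothetical stand-in for MDS (not the real API)."""

    def __init__(self):
        self.busy = False      # True while the simulated config read runs
        self.delivered = []    # messages that actually reached the receiver

    def deliver(self, message):
        # Assumption for illustration only: anything sent while the
        # transport is busy is silently dropped.
        if self.busy:
            return False
        self.delivered.append(message)
        return True


def read_config(transport, during_read=None):
    """Stand-in for avd_imm_config_get(): keeps the transport busy."""
    transport.busy = True
    try:
        if during_read is not None:
            during_read()      # something else happens mid-read
    finally:
        transport.busy = False


def track_callback(transport):
    """Stand-in for the CLM track callback sent to amfd."""
    return transport.deliver("clm_track_callback")


def startup_tracking_first(transport):
    """Original ordering: the callback arrives during the config read."""
    outcome = []
    read_config(transport,
                during_read=lambda: outcome.append(track_callback(transport)))
    return outcome[0]


def startup_tracking_after_config(transport):
    """Workaround ordering: tracking starts once the read has finished."""
    read_config(transport)
    return track_callback(transport)
```

In this model `startup_tracking_first(ToyTransport())` reports that the callback was not delivered, while `startup_tracking_after_config(ToyTransport())` reports that it was. The workaround avoids the overlap rather than fixing whatever causes the loss.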

     



    Last edit: Gary Lee 2016-05-13
  • Gary Lee

    Gary Lee - 2016-05-13
    • Milestone: 4.7.2 --> 5.0.1
     
  • Gary Lee

    Gary Lee - 2016-05-13

    In the osafclmd trace file, amfd is initialising with clmd:

    Apr 27 22:29:47.391760 osafclmd [9595:clms_evt.c:0252] >> clms_client_new: MDS dest 2010fde2420f8
    Apr 27 22:29:47.391765 osafclmd [9595:clms_evt.c:0278] << clms_client_new: client_id 1

    Later on, amfd requests CLM tracking:

    Apr 27 22:29:47.399803 osafamfd [9609:clma_mds.c:0118] >> clma_enc_track_start_msg
    Apr 27 22:29:47.399810 osafamfd [9609:clma_mds.c:0134] << clma_enc_track_start_msg
    Apr 27 22:29:47.399814 osafamfd [9609:clma_mds.c:0407] << clma_mds_enc
    Apr 27 22:29:47.399832 osafamfd [9609:clma_mds.c:1296] << clma_mds_msg_async_send
    Apr 27 22:29:47.399837 osafamfd [9609:clma_api.c:0455] << clma_send_mds_msg_get_clusternotificationbuf_4

    The callback is sent by clmd, but never received at amfd:

    Apr 27 22:29:47.465842 osafclmd [9595:clms_mds.c:0421] << clms_enc_cluster_ntf_buf_msg
    Apr 27 22:29:47.465844 osafclmd [9595:clmsv_enc_dec.c:0071] >> clmsv_encodeSaNameT
    Apr 27 22:29:47.465847 osafclmd [9595:clmsv_enc_dec.c:0088] << clmsv_encodeSaNameT
    Apr 27 22:29:47.465850 osafclmd [9595:clms_mds.c:0593] << clms_enc_track_cbk_msg
    Apr 27 22:29:47.465862 osafclmd [9595:clms_mds.c:1525] TR rc 1
    Apr 27 22:29:47.465865 osafclmd [9595:clms_mds.c:1527] << clms_mds_msg_send
    Apr 27 22:29:47.465868 osafclmd [9595:clms_evt.c:1764] TR clms_mds_msg_send() sent to 1

    Because amfd never learns about the other nodes from CLM, it rejects all node-up messages:

    Apr 27 22:30:37 SC-1 osafamfd[9609]: WA avd_msg_sanity_chk: invalid node ID (2230f)
    Apr 27 22:30:37 SC-1 osafamfd[9609]: WA avd_msg_sanity_chk: invalid node ID (2230f)
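
The knock-on effect described above can be sketched as follows. This is a toy model under stated assumptions, not the real avd_msg_sanity_chk(): the functions `msg_sanity_chk` and `handle_node_ups` are hypothetical, and the only behaviour modelled is the one the comment describes, namely that a node-up from a node ID amfd has not learned from CLM is rejected.

```python
def msg_sanity_chk(known_node_ids, node_id):
    """Hypothetical check: accept only node IDs already learned from CLM."""
    return node_id in known_node_ids


def handle_node_ups(known_node_ids, node_up_ids):
    """Split incoming node-up messages into accepted and rejected node IDs."""
    accepted, rejected = [], []
    for node_id in node_up_ids:
        if msg_sanity_chk(known_node_ids, node_id):
            accepted.append(node_id)
        else:
            # Corresponds to the "invalid node ID" warning in the log above.
            rejected.append(node_id)
    return accepted, rejected


# The track callback was lost, so amfd knows no other nodes:
lost_callback = handle_node_ups(set(), [0x2230F])

# The track callback arrived, so the node is known:
got_callback = handle_node_ups({0x2230F}, [0x2230F])
```

With an empty set of known nodes, every node-up ends up in the rejected list, which matches the repeated warnings for node ID 2230f.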

     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 5.0.1 --> 5.0.2
     
  • A V Mahesh (AVM)

    • status: unassigned --> assigned
    • assigned_to: A V Mahesh (AVM)
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
  • A V Mahesh (AVM)

    • assigned_to: A V Mahesh (AVM) --> nobody
    • Blocker: --> False
     
