
#1467 cpsv: two apps on same node simultaneously failed to write checkpoint and app hangs

4.5.2
fixed
None
defect
ckpt
-
4.5
major
2015-09-14
2015-08-31
No

Steps to reproduce:

Step 1:

  • Add sleep(30) to the cpnd_evt_proc_nd2nd_ckpt_active_data_access_req() function in
    osaf/services/saf/cpsv/cpnd/cpnd_evt.c at line 3037, then build, install and bring up OpenSAF
    on all 4 nodes (SC-1, SC-2, PL-3, SC-4).

static uint32_t cpnd_evt_proc_nd2nd_ckpt_active_data_access_req(CPND_CB *cb, CPND_EVT *evt, CPSV_SEND_INFO *sinfo)
{
	sleep(30); /* delay added for this reproduction only */
	uint32_t rc = NCSCC_RC_SUCCESS;

Step 2: Build the CPSV test applications test_3opens_app_A.c and test_3opens_app_B.c (attached to the SR) on all 4 nodes SC-1, SC-2, PL-3, SC-4.

SC-1:# gcc test_3opens_app_A.c -o node_A -lSaCkpt;
SC-1:# gcc test_3opens_app_B.c -o node_B -lSaCkpt;

Step 3: Bring up OpenSAF on all 4 nodes SC-1, SC-2, PL-3, SC-4.

Step 4: Run the checkpoint application ./node_A on all 4 nodes SC-1, SC-2, PL-3, PL-4
and do not press the <Enter> key.

SC-1:# ./node_A
0 saCkptCheckpointOpen returned checkpointHandle 626e60
1 saCkptCheckpointOpen returned checkpointHandle 626fe0
2 saCkptCheckpointOpen returned checkpointHandle 627270
3 saCkptCheckpointOpen returned checkpointHandle 6273f0
4 saCkptCheckpointOpen returned checkpointHandle 627570
CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
saCkptCheckpointWrite Press <Enter> key to continue...
====================================================

Step 5: Run the checkpoint application ./node_B on only 2 nodes, SC-1 and SC-2,
and do not press the <Enter> key.
====================================================
SC-1:# ./node_B
0 saCkptCheckpointOpen returned checkpointHandle 626e60
1 saCkptCheckpointOpen returned checkpointHandle 626fe0
2 saCkptCheckpointOpen returned checkpointHandle 627270
3 saCkptCheckpointOpen returned checkpointHandle 6273f0
4 saCkptCheckpointOpen returned checkpointHandle 627570
CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
saCkptCheckpointWrite Press <Enter> key to continue...
====================================================

Step 6: On SC-1 only, press the <Enter> key for the ./node_A and ./node_B applications in quick succession so that both write simultaneously.
You will then see that the ./node_B application failed to write the checkpoint.
====================================================
SC-1: # ./node_A
0 saCkptCheckpointOpen returned checkpointHandle 626e60
1 saCkptCheckpointOpen returned checkpointHandle 626fe0
2 saCkptCheckpointOpen returned checkpointHandle 627270
3 saCkptCheckpointOpen returned checkpointHandle 6273f0
4 saCkptCheckpointOpen returned checkpointHandle 627570
CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
saCkptCheckpointWrite Press <Enter> key to continue...

1 saCkptCheckpointWrite checkpointHandle 626e60
2 saCkptCheckpointWrite checkpointHandle 626e60
3 saCkptCheckpointWrite checkpointHandle 626e60
4 saCkptCheckpointWrite checkpointHandle 626e60
222 saCkptCheckpointWrite checkpointHandle 626e60
saCkptCheckpointRead Waiting to Read from Checkpoint ....
saCkptCheckpointRead Press <Enter> key to continue...

SC-1:# ./node_B
0 saCkptCheckpointOpen returned checkpointHandle 626e60
1 saCkptCheckpointOpen returned checkpointHandle 626fe0
2 saCkptCheckpointOpen returned checkpointHandle 627270
3 saCkptCheckpointOpen returned checkpointHandle 6273f0
4 saCkptCheckpointOpen returned checkpointHandle 627570
CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....

====================================================

1 Attachments

Related

Tickets: #1467
Wiki: ChangeLog-4.5.2
Wiki: ChangeLog-4.6.1

Discussion

  • A V Mahesh (AVM)

    Bug Analysis :

    checkpointHandle is a pointer to the checkpoint handle, allocated in the
    address space of the invoking process (CPA/application). The CPA stores it in the
    memory area that the CPA/application process uses to access the checkpoint in
    subsequent invocations of the Checkpoint Service API functions.
    In the case of saCkptCheckpointOpenAsync() and saCkptCheckpointWrite(),
    this handle is returned in the corresponding response message.

    Even though saCkptCheckpointWrite() is a synchronous request, the CPND uses checkpointHandle to track
    the CPND<--->CPND messaging that it starts on behalf of a saCkptCheckpointWrite() request.

    If the checkpoint is an SA_CKPT_CHECKPOINT_COLLOCATED & SA_CKPT_WR_ALL_REPLICAS checkpoint,
    is opened on multiple nodes, and saCkptCheckpointWrite() is being requested by
    multiple CPAs on the same node, then the local CPND has to update all the other CPNDs
    that have opened the same checkpoint. To track the pending invocations/events, it adds a
    temporary cpnd_evt entry keyed by checkpointHandle. After a successful write response has been received
    from all the CPNDs, the local CPND uses this cpnd_evt entry to respond to the CPA/application with the result
    and then deletes the entry from the local CPND (it is temporary and only tracks the response messages of peer CPNDs).

    In the current code the CPA uses the return value of malloc() as the checkpointHandle, that is, as the
    unique reference key. malloc() returns a virtual address that is specific to the calling process,
    so malloc() can return the same pointer value in separate CPA processes.
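
    The point about per-process virtual addresses can be checked with a small, hedged sketch. It is
    not OpenSAF code: the function name is hypothetical, and it uses fork() only as a convenient way to
    get two processes that each call malloc() in their own address space. Whether the two values actually
    match depends on the allocator and platform, so the function reports the outcome instead of assuming it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if parent and child obtained the same
 * pointer value from malloc(), 0 if the values differ, -1 on error.
 */
int same_malloc_value_in_two_processes(size_t size)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: allocate in its own address space and report the value. */
        close(fds[0]);
        uintptr_t value = (uintptr_t)malloc(size);
        ssize_t n = write(fds[1], &value, sizeof(value));
        _exit(n == (ssize_t)sizeof(value) ? 0 : 1);
    }

    /* Parent: allocate independently, then read the child's pointer value. */
    close(fds[1]);
    void *p = malloc(size);
    uintptr_t mine = (uintptr_t)p;
    uintptr_t theirs = 0;
    ssize_t n = read(fds[0], &theirs, sizeof(theirs));
    close(fds[0]);
    waitpid(pid, NULL, 0);
    free(p);

    if (n != (ssize_t)sizeof(theirs))
        return -1;
    return mine == theirs ? 1 : 0;
}
```

    A result of 1 from same_malloc_value_in_two_processes(64) means two distinct processes hold the
    same numeric pointer value, even though each value refers to memory in a different address space.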

    If two checkpoint applications (2 processes) try to write data into two
    different checkpoints or into the same checkpoint, both processes may pass the same
    checkpointHandle value. Because both CPAs then give the CPND the same checkpointHandle
    as the key reference, the CPND can misbehave: the key is ambiguous.
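
    The ambiguity described above can be illustrated with a minimal sketch. The structure and function
    names here (pending_write, pending_table, pending_add, pending_find, client_id) are hypothetical and
    do not reflect the real cpnd_evt layout; the sketch only assumes a tracking table that is looked up
    by the handle value alone.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8

/* Hypothetical stand-in for the temporary tracking entry; not the real cpnd_evt layout. */
struct pending_write {
    uint64_t ckpt_handle; /* key: pointer value received from the CPA */
    uint32_t client_id;   /* hypothetical id of the CPA process awaiting the reply */
    int in_use;
};

struct pending_table {
    struct pending_write slots[MAX_PENDING];
};

/* Stores a new entry in the first free slot; returns 0 on success, -1 if full.
 * It does not check whether the same handle is already pending. */
int pending_add(struct pending_table *t, uint64_t handle, uint32_t client_id)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].ckpt_handle = handle;
            t->slots[i].client_id = client_id;
            t->slots[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Finds an entry by handle alone; returns the first match or NULL. */
struct pending_write *pending_find(struct pending_table *t, uint64_t handle)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slots[i].in_use && t->slots[i].ckpt_handle == handle)
            return &t->slots[i];
    }
    return NULL;
}
```

    If two clients both register handle 0x626e60, pending_find() returns only the first entry, so a
    response meant for the second client cannot be told apart from one meant for the first.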

    Solution :

    As malloc() is a standard call and the Checkpoint Service API Specification says
    checkpointHandle should be a pointer allocated in the address space of the invoking
    process/CPA, the CPND will return try again for the saCkptCheckpointWrite() call
    if multiple saCkptCheckpointWrite() requests come to the CPND with the SAME
    checkpointHandle (same virtual memory address) from different CPAs at the same time.
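
    A minimal sketch of the idea in this solution, under stated assumptions: the tracker type, the
    function names and the status codes below are hypothetical and are not taken from the patch. The
    sketch only shows a write request being refused with a "try again" result while another request
    with the same handle value is still pending.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 8

/* Hypothetical result codes for this sketch; not the OpenSAF or AIS definitions. */
enum write_status {
    WRITE_ACCEPTED,
    WRITE_TRY_AGAIN,
    WRITE_NO_SPACE
};

/* Hypothetical table of handle values with a write currently in flight. */
struct write_tracker {
    uint64_t handles[MAX_TRACKED];
    int in_use[MAX_TRACKED];
};

/* Registers a write for the handle, or refuses it if the same handle is pending. */
enum write_status tracker_begin_write(struct write_tracker *t, uint64_t handle)
{
    size_t free_slot = MAX_TRACKED;

    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (t->in_use[i]) {
            if (t->handles[i] == handle)
                return WRITE_TRY_AGAIN; /* same key already being tracked */
        } else if (free_slot == MAX_TRACKED) {
            free_slot = i; /* remember the first free slot */
        }
    }
    if (free_slot == MAX_TRACKED)
        return WRITE_NO_SPACE;

    t->handles[free_slot] = handle;
    t->in_use[free_slot] = 1;
    return WRITE_ACCEPTED;
}

/* Removes the tracking entry once all peer responses have been handled. */
void tracker_end_write(struct write_tracker *t, uint64_t handle)
{
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (t->in_use[i] && t->handles[i] == handle) {
            t->in_use[i] = 0;
            return;
        }
    }
}
```

    With this check the key stays unambiguous at any moment: a second request with the same handle
    value is accepted only after tracker_end_write() has removed the first entry.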

    Patch will be published soon.

     
  • A V Mahesh (AVM)

    • status: accepted --> fixed
    • Version: --> 4.5
     
  • A V Mahesh (AVM)

    changeset: 6802:a9b393bb1c66
    tag: tip
    parent: 6799:93e72338e78f
    user: A V Mahesh mahesh.valla@oracle.com
    date: Mon Sep 14 12:14:30 2015 +0530
    summary: cpsv: prevented multiple write request with same checkpointHandle at cpnd [#1467]

    changeset: 6801:bd1ac2c47a02
    branch: opensaf-4.6.x
    parent: 6798:de765110343d
    user: A V Mahesh mahesh.valla@oracle.com
    date: Mon Sep 14 12:13:57 2015 +0530
    summary: cpsv: prevented multiple write request with same checkpointHandle at cpnd [#1467]

    changeset: 6800:47e4745bd82a
    branch: opensaf-4.5.x
    parent: 6797:84dee6e50f1e
    user: A V Mahesh mahesh.valla@oracle.com
    date: Mon Sep 14 12:12:59 2015 +0530
    summary: cpsv: prevented multiple write request with same checkpointHandle at cpnd [#1467]

     


