From: Mohan K. <mohan@GetHighAvailability.com> - 2022-03-03 11:34:15
Hi Sergio,

We are not able to reproduce the issue on version 5.22.01 using the steps you shared. Could you please send us the immd, immnd, ckptd, ckptnd, syslog and mds.log files from all the nodes of the cluster?

Thanks & Regards
Mohan Kanakam | 91-8333082448
Senior Software Engineer
High Availability Solutions
www.GetHighAvailability.com
Get High Availability Today!
NJ, USA: 1 508-507-6507 | Hyderabad, India: 91 798-992-5293

From: Mohan Kanakam [mailto:mohan@GetHighAvailability.com]
Sent: 28 February 2022 20:55
To: 'Sérgio Marques'
Cc: 'Nagendra Kumar'
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to async call.

Hi Sergio,

Thanks for the information. We will try to reproduce and get back to you.

Thanks & Regards
Mohan Kanakam

From: Sérgio Marques [mailto:ser...@al...]
Sent: 28 February 2022 20:46
To: Mohan Kanakam
Cc: 'Nagendra Kumar'
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to async call.

Hi Mohan,

I believe I have finally found a way for you to reproduce the problem. Please try the following steps:

1. Start 2 controllers, with SC-2 Active and SC-1 Standby, and 2 payloads, PL-1 and PL-2.
2. On PL-2, create a checkpoint and a section, and write to it.
3. On PL-1, create exactly the same checkpoint as on PL-2 and try to create the same section as before. You will receive SA_AIS_ERR_EXIST. Do a SectionOverwrite.
4. On SC-1, perform an si-swap (amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF) and reboot the SC-1 and PL-1 nodes.
5. Wait for SC-1 and PL-1 to rejoin the cluster.
6. On SC-2, perform an si-swap (amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF) and then list the checkpoint using immlist.

Thanks and regards,
Sérgio Marques

From: Mohan Kanakam <mohan@GetHighAvailability.com>
Sent: 18 February 2022 13:40
To: Sérgio Marques <ser...@al...>
Cc: 'Nagendra Kumar' <nagendra@GetHighAvailability.com>; ope...@li...
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to async call.

Warning: This email originated from outside Altice Portugal. Please do not click on links or open attachments unless you know the sender and know that the content is safe.

Hi Sergio,

Thanks for testing and sharing the results. We tried to reproduce the issue in our lab setup, but unfortunately we were not able to. These are the steps we followed:

1. Start 2 controllers, with SC-1 Active and SC-2 Standby, and a PL-3 payload.
2. Create checkpoints from applications running on the payload.
3. Reboot SC-1 (Active). SC-2 becomes Active, and SC-1 rejoins as Standby.
4. Perform an si-swap. SC-2 becomes Standby and SC-1 becomes Active.
5. Reboot SC-1 again.
6. While it is rebooting, run immlist on the checkpoints created.

At this point we got the output of immlist. Can you please confirm whether this is the way to reproduce it? Did the issue continue after the rebooted controller rejoined the cluster, i.e. did immlist work once the rebooted controller had rejoined? I was thinking that this could be a transient issue.

Can you please share immd, immnd, amfd, amfnd, ckptd, ckptnd, mds.log and syslog from all the nodes?

Thanks & Regards
Mohan Kanakam

From: Sérgio Marques [mailto:ser...@al...]
Sent: 17 February 2022 22:53
To: mohan@GetHighAvailability.com
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to async call.

Hi Mohan,

I made a small change to your patch so that it would compile.
Where you have “sinfo->ctxt->length”, I changed it to “sinfo->ctxt.length”. It resolves the problem: there are no longer any “MDS_SND_RCV: Invalid Sync CTXT Len” events being logged in mds.log. Thanks!

Unfortunately, this does not resolve another issue that we also have and were hoping this patch would fix as well. We set up a cluster with 2 controller and 2 payload nodes, then created a checkpoint like the following one:

[root@OLT2T4-UNICOM-2~]# immlist safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
Name                                 Type         Value(s)
========================================================================
safCkpt                              SA_STRING_T  safCkpt=CKPT_BACKPLANE_CONTROL
saCkptCheckpointUsedSize             SA_UINT64_T  2024 (0x7e8)
saCkptCheckpointSize                 SA_UINT64_T  2024 (0x7e8)
saCkptCheckpointRetDuration          SA_TIME_T    9223372036854775807 (0x7fffffffffffffff, Sat Jan 27 10:50:44 1990)
saCkptCheckpointNumWriters           SA_UINT32_T  7 (0x7)
saCkptCheckpointNumSections          SA_UINT32_T  22 (0x16)
saCkptCheckpointNumReplicas          SA_UINT32_T  2 (0x2)
saCkptCheckpointNumReaders           SA_UINT32_T  7 (0x7)
saCkptCheckpointNumOpeners           SA_UINT32_T  7 (0x7)
saCkptCheckpointNumCorruptSections   SA_UINT32_T  0 (0x0)
saCkptCheckpointMaxSections          SA_UINT32_T  22 (0x16)
saCkptCheckpointMaxSectionSize       SA_UINT64_T  92 (0x5c)
saCkptCheckpointMaxSectionIdSize     SA_UINT64_T  1 (0x1)
saCkptCheckpointCreationTimestamp    SA_TIME_T    1645097377000000000 (0x16d48f5929030a00, Thu Feb 17 11:29:37 2022)
saCkptCheckpointCreationFlags        SA_UINT32_T  2 (0x2)
SaImmAttrImplementerName             SA_STRING_T  safCheckPointService
SaImmAttrClassName                   SA_STRING_T  SaCkptCheckpoint
SaImmAttrAdminOwnerName              SA_STRING_T  <Empty>

After swapping (amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF) and rebooting the active controller node for the second time, immlist starts returning SA_AIS_ERR_NO_RESOURCES:

[root@OLT2T4-UNICOM-2~]# immlist safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
error - saImmOmAccessorGet_2 FAILED: SA_AIS_ERR_NO_RESOURCES (18)

The checkpoint can be found using immfind, but it can neither be listed with immlist nor accessed via the libSaCkpt.so library.

[root@OLT2T4-UNICOM-2~]# immfind safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safReplica=safNode=CC-1\,safCluster=myClmCluster,safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safReplica=safNode=CC-2\,safCluster=myClmCluster,safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService

If I only perform the swap command, without the reboot, the issue is not reproduced. I don’t have this issue with OpenSAF version 4.5.2. Do you have any idea what could cause this, and how we should debug it?

Many thanks and regards,
Sérgio Marques

From: Mohan Kanakam <moh...@us...>
Sent: 17 February 2022 10:51
To: [opensaf:tickets] <33...@ti...>
Subject: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to async call.

Hi Sergio,

Can you please test the attached patch for your scenario and share your observations?

Thanks

Attachments:
* mds_error.patch <https://sourceforge.net/p/opensaf/tickets/_discuss/thread/04984c7ecf/8052/attachment/mds_error.patch> (703 Bytes; application/octet-stream)

_____

[tickets:#3306] <https://sourceforge.net/p/opensaf/tickets/3306/> ckpt: checkpoint node director responding to async call.

Status: accepted
Milestone: 5.22.04
Created: Thu Feb 17, 2022 10:46 AM UTC by Mohan Kanakam
Last Updated: Thu Feb 17, 2022 10:46 AM UTC
Owner: Mohan Kanakam

During section create, one ckptnd sends an async request (a normal MDS send) to another ckptnd. However, the receiving ckptnd responds to the request on the assumption that it received a sync request and has to respond to the sender ckptnd.
In some cases ckptnd does need to respond, because it has received a sync request; in other cases it receives an async request and need not respond to it. We are getting the following message in the mds log when creating the section:

sc1-VirtualBox osafckptnd 27692 mds.log [meta sequenceId="2"] MDS_SND_RCV: Invalid Sync CTXT Len
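
The behaviour described in the ticket can be illustrated with a minimal C sketch. It is not taken from the OpenSAF sources or from mds_error.patch: the types `sync_ctxt_t` and `send_info_t`, the function `should_respond`, and the convention that a zero context length marks an async request are all hypothetical stand-ins chosen for illustration. The sketch also shows the point behind the compile fix mentioned earlier in the thread: when the context is embedded in the struct by value rather than held as a pointer, its members are reached with `.`, as in `sinfo->ctxt.length`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a sync-send context (not the real MDS type). */
typedef struct {
    uint8_t length;    /* assumption: 0 when the request was sent async */
    uint8_t data[16];  /* opaque context bytes, unused in this sketch */
} sync_ctxt_t;

/* Hypothetical stand-in for the per-message send information. */
typedef struct {
    sync_ctxt_t ctxt;  /* embedded by value, so its members use '.' */
} send_info_t;

/* Returns true only when the request carries a sync context to reply to. */
static bool should_respond(const send_info_t *sinfo)
{
    /* sinfo is a pointer, hence '->'; ctxt is an embedded struct, hence '.'.
     * Writing sinfo->ctxt->length would not compile with this layout. */
    return sinfo->ctxt.length != 0;
}
```

Under these assumptions, a receiver would call `should_respond(&info)` before building a reply: a request whose `ctxt.length` is 0 is treated as async and gets no response, while a request with a non-zero length is answered.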