From: Anders B. <and...@er...> - 2014-12-08 07:34:01
Ok with me.

/AndersBj

-----Original Message-----
From: Neelakanta Reddy [mailto:red...@or...]
Sent: 8 December 2014 08:43
To: Anders Björnerstedt
Cc: ope...@li...
Subject: Re: [PATCH 0 of 1] Review Request for imm:Donot reply to OM client if the class create/delete times out in waiting on reply from PBE [#1091]

Hi AndersBj,

Since the OM-client getting ERR_TIMEOUT sooner than expected is not a problem, the behavior reported in the ticket is as per the current design.
Moving the status of the ticket to wontfix.

/Neel.

On Friday 05 December 2014 07:47 PM, Anders Bjornerstedt wrote:
> Thanks Neel for the thorough analysis.
>
> My conclusion is that we don't need to do anything.
>
> The OM-application gets ERR_TIMEOUT.
> While that is not the best kind of reply to get, since the application
> then does not know if the request was executed or not, it is an
> allowed reply if it only occurs exceptionally and in particular when
> the file system layer is not appropriate for this functionality.
>
> The fact that the om-client here gets the ERR_TIMEOUT sooner than
> expected ('expected' here means the syncr-timeout) is not really
> any problem.
> The same client logic is executed as if the syncr-timeout had expired
> and I cannot see that the client will have any problems with the
> "early" timeout.
> I also don't see any point in adding logic to force the om-client to
> wait the full syncr-timeout when the outcome is already known.
>
> If there is anything that could be improved it would be to have PBE-A
> return OK to imm-ram in this case.
> PBE-A says it is ignoring the timeout, which is ok because the slave
> PBE-B is designed to do-or-die with the request. Same do-or-die logic
> holds for PBE-A. Since PBE-A succeeds in persistifying and since
> imm-ram succeeds in applying the class-create, I don't see why I made
> it return TRY_AGAIN in this case.
> TRY_AGAIN here signals more uncertainty than needed since we are
> certain that the class-create did get processed in imm-ram. If the
> class-create did not get successfully processed in either or both PBEs
> then either or both should be exiting. Finally if the class-create is
> taking too long to complete in either PBE then the imm should be
> restarting them (the hung PBE logic). A restart of a PBE when there
> are pending PRT operations must lead to it being restarted with
> regeneration of the imm.db file.
>
> So to repeat.
> A return of TRY_AGAIN is not wrong but not ideal.
> An "early" return of TRY_AGAIN is still not wrong and one could even
> argue is a tiny bit better than a later TRY_AGAIN.
>
> But a return of OK would be better and should be possible ... I think.
>
> /AndersBj
>
>
> Neelakanta Reddy wrote:
>> Hi AndersBj,
>>
>> I analyzed the logs; the following is the cause of the problem:
>>
>> PBE-A:
>>
>> 1. Class create request from PL-3:
>>
>> Nov 13 16:13:47.877377 osafimmnd [16876:immnd_evt.c:8594] >> immnd_evt_proc_fevs_rcv
>> Nov 13 16:13:47.877422 osafimmnd [16876:immnd_evt.c:8612] T2 REMOTE FEVS received. Messages from me still pending:0
>> Nov 13 16:13:47.877473 osafimmnd [16876:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_CLASS_CREATE (27) from 0
>> Nov 13 16:13:47.877499 osafimmnd [16876:immnd_evt.c:5110] TR We expect there to be a PBE
>> Nov 13 16:13:47.877523 osafimmnd [16876:ImmModel.cc:2833] >> classCreate: cont:0x7fffe6414338 connp:0x7fffe6414330 nodep:0x7fffe6414334
>> Nov 13 16:13:47.877602 osafimmnd [16876:ImmModel.cc:2862] T5 CREATE CLASS 'testMA_verifyCreateCallback_101' category:1
>>
>> 2. PBE-A sends admin operation towards PBE-B for class create
>>
>> Nov 13 16:13:47.878221 osafimmnd [16876:immnd_evt.c:5163] T2 MAKING PBE-IMPLEMENTER PERSISTENT CLASS CREATE upcall
>> Nov 13 16:13:47.878385 osafimmpbed [16904:imma_proc.c:1328] TR ** Event type:25
>> Nov 13 16:13:47.878430 osafimmpbed [16904:imma_proc.c:0360] T3 PBE-OI received PBE admin operation
>> Nov 13 16:13:47.878481 osafimmpbed [16904:imma_proc.c:1224] >> imma_proc_free_pointers
>> Nov 13 16:13:47.878507 osafimmpbed [16904:imma_proc.c:1314] << imma_proc_free_pointers
>> Nov 13 16:13:47.878553 osafimmpbed [16904:immpbe_daemon.cc:2273] TR ##@-PBE MAIN thead continues after poll ret: 1
>> Nov 13 16:13:47.878581 osafimmpbed [16904:imma_oi_api.c:0467] >> saImmOiDispatch
>> Nov 13 16:13:47.878619 osafimmpbed [16904:imma_db.c:0187] >> imma_oi_ccb_record_find
>> Nov 13 16:13:47.878641 osafimmpbed [16904:imma_db.c:0198] << imma_oi_ccb_record_find
>> Nov 13 16:13:47.878660 osafimmpbed [16904:imma_proc.c:1842] >> imma_process_callback_info
>> Nov 13 16:13:47.878709 osafimmpbed [16904:imma_proc.c:1886] TR PBE Admin OP callback
>> Nov 13 16:13:47.878730 osafimmpbed [16904:immpbe_daemon.cc:0391] >> saImmOiAdminOperationCallback
>> Nov 13 16:13:47.878755 osafimmpbed [16904:immpbe_daemon.cc:0420] TR paramName: className paramType: 9
>>
>> Nov 13 16:13:47.878780 osafimmpbed [16904:imma_om_api.c:3461] >> admin_op_invoke_common
>> Nov 13 16:13:47.878808 osafimmpbed [16904:imma_om_api.c:3593] TR immInvocations:6
>> Nov 13 16:13:47.878834 osafimmpbed [16904:imma_om_api.c:3614] TR PARAM:className
>> Nov 13 16:13:47.878858 osafimmpbed [16904:imma_om_api.c:3614] TR PARAM:ccbId
>>
>> 3. PBE-A admin operation times out after 15 seconds
>>
>> Nov 13 16:14:02.954402 osafimmpbed [16904:imma_om_api.c:3663] TR Fevs send RETURNED:5
>> Nov 13 16:14:02.954466 osafimmpbed [16904:imma_om_api.c:3812] << admin_op_invoke_common
>>
>> Nov 13 16:14:02 SLES-64BIT-SLOT1 osafimmpbed: WA Primary PBE failed to create class towards slave PBE. Library or immsv replied Rc:5 - ignoring
>>
>>
>> PBE-B:
>>
>> 1. Class create admin operation at PBE-B
>>
>> Nov 13 16:13:48.167916 osafimmpbed [12841:imma_proc.c:1314] << imma_proc_free_pointers
>> Nov 13 16:13:48.167976 osafimmpbed [12841:immpbe_daemon.cc:2150] TR ##@-PBE RUNTIME thread continues after poll ret: 1
>> Nov 13 16:13:48.168013 osafimmpbed [12841:imma_oi_api.c:0467] >> saImmOiDispatch
>> Nov 13 16:13:48.168049 osafimmpbed [12841:imma_db.c:0187] >> imma_oi_ccb_record_find
>> Nov 13 16:13:48.168066 osafimmpbed [12841:imma_db.c:0198] << imma_oi_ccb_record_find
>> Nov 13 16:13:48.168078 osafimmpbed [12841:imma_proc.c:1842] >> imma_process_callback_info
>> Nov 13 16:13:48.168096 osafimmpbed [12841:immpbe_daemon.cc:0391] >> saImmOiAdminOperationCallback
>> Nov 13 16:13:48.168111 osafimmpbed [12841:immpbe_daemon.cc:0420] TR paramName: ccbId paramType: 4
>> Nov 13 16:13:48.168122 osafimmpbed [12841:immpbe_daemon.cc:0420] TR paramName: className paramType: 9
>>
>> 2. Completion of the admin operation took approximately 30 seconds.
>>
>> Nov 13 16:14:18.643500 osafimmpbed [12841:immpbe_daemon.cc:0491] TR Begin PBE transaction for class create OK
>> Nov 13 16:14:18.643597 osafimmpbed [12841:immpbe_dump.cc:0815] >> classToPBE
>> Nov 13 16:14:18.643615 osafimmpbed [12841:imma_om_api.c:4648] >> saImmOmClassDescriptionGet_2
>> Nov 13 16:14:18.643626 osafimmpbed [12841:imma_om_api.c:4657] >> saImmOmClassDescriptionGet_2
>> Nov 13 16:14:18.643641 osafimmpbed [12841:imma_om_api.c:4727] TR ClassName: testMA_verifyCreateCallback_101
>>
>> The time is taken by the following sqlite operation when NFS is used
>> at PBE-B:
>>
>> rc = sqlite3_exec(dbHandle, "BEGIN EXCLUSIVE TRANSACTION", NULL, NULL, &execErr);
>>
>> When NFS is not used, the problem reported in the ticket is not observed.
>>
>> Conclusion:
>>
>> a. Since the processing of the admin operation is delayed in PBE-B,
>> PbeRtReqContinuation times out and TIMEOUT is returned to the class
>> create API.
>> b. The TIMEOUT happens because the PBE file is placed on shared NFS;
>> the problem is not reproducible without NFS.
>> c. The solution proposed in the patch does not send the TIMEOUT to
>> the class create OM-API; it lets the API time out by itself.
>>
>> In the above case the class is written to both PBE-A and PBE-B.
>>
>> /Neel.
>>
>> On Wednesday 03 December 2014 02:34 PM, Anders Björnerstedt wrote:
>>> Also looking at the ticket itself, the problem it reports is that
>>> it apparently gets ERR_TIMEOUT "too soon".
>>> But that in itself is not a valid complaint.
>>>
>>> There is no API rule saying that the only case of getting
>>> ERR_TIMEOUT must be the syncr-timeout.
>>> ERR_TIMEOUT simply means something caused a timeout of the request
>>> and the API user then does not know if the operation succeeded or
>>> not.
>>>
>>> The ticket may still be of value (point to a valid problem) in that
>>> the cause of the timeout needs investigation.
>>>
>>> /AndersBj
>>>
>>>
>>> -----Original Message-----
>>> From: Anders Björnerstedt
>>> Sent: 3 December 2014 09:51
>>> To: 'red...@or...'
>>> Cc: ope...@li...
>>> Subject: RE: [PATCH 0 of 1] Review Request for imm:Donot reply to OM
>>> client if the class create/delete times out in waiting on reply from
>>> PBE [#1091]
>>>
>>> Hi Neel,
>>>
>>> I have to NACK this solution.
>>>
>>> Timeout from the PBE means the immsv does not know what happened.
>>> The PBE may have (should have) persistified the class-create/delete.
>>> If it fails under self control it will exit and be restarted.
>>> If it is hanging before or after persistification, it will be
>>> terminated and restarted.
>>>
>>> That is how it is designed for CLASS_CREATE, CLASS_DELETE,
>>> PRTO_CREATE, PRTO_DELETE, PRTA_UPDATE.
>>>
>>> You should instead try to find out why there was a timeout.
>>> Why did the PBE hang in this case?
>>>
>>> This incident was with 2PBE.
>>> Possibly one of the PBEs was hanging on some issue that we have
>>> fixed now?
>>>
>>> So please analyze what the cause of the hanging was (if possible)
>>> and eliminate it if it has not already been eliminated.
>>>
>>> /AndersBj
>>>
>>> -----Original Message-----
>>> From: red...@or...
>>> [mailto:red...@or...]
>>> Sent: 3 December 2014 09:44
>>> To: Anders Björnerstedt
>>> Cc: ope...@li...
>>> Subject: [PATCH 0 of 1] Review Request for imm:Donot reply to OM
>>> client if the class create/delete times out in waiting on reply from
>>> PBE [#1091]
>>>
>>> Summary: imm:Donot reply to OM client if the class create/delete
>>> times out in waiting on reply from PBE [#1091]
>>> Review request for Trac Ticket(s): 1091
>>> Peer Reviewer(s): AndersBj
>>> Affected branch(es): default
>>> Development branch: default
>>>
>>> --------------------------------
>>> Impacted area        Impact y/n
>>> --------------------------------
>>> Docs                     n
>>> Build system             n
>>> RPM/packaging            n
>>> Configuration files      n
>>> Startup scripts          n
>>> SAF services             y
>>> OpenSAF services         n
>>> Core libraries           n
>>> Samples                  n
>>> Tests                    n
>>> Other                    n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>
>>> changeset 4995e02013c0c4abf41695f8272fcabbcecf6b7c
>>> Author: Neelakanta Reddy<red...@or...>
>>> Date: Wed, 03 Dec 2014 14:04:40 +0530
>>>
>>> imm:Donot reply to OM client if the class create/delete times
>>> out in waiting on reply from PBE [#1091]
>>>
>>> If the class create/delete times out waiting on a reply from the
>>> PBE, do not reply and let the OM create/delete API time out. The
>>> class has been created/deleted in imm-ram and will not be reverted
>>> even if PBE persistification fails.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>> osaf/services/saf/immsv/immnd/ImmModel.cc | 21 ++++++++++++++++++++-
>>> 1 files changed, 20 insertions(+), 1 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> Delay the PBE class create admin operation result for more than 6
>>> seconds.
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> If the class create/delete times out waiting on a reply from the PBE,
>>> then the OM API must time out. The IMMND server should not send a
>>> timeout after 6 seconds.
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> Ack from AndersBj
>>>
>>> Arch        Built   Started   Linux distro
>>> -------------------------------------------
>>> mips          n        n
>>> mips64        n        n
>>> x86           n        n
>>> x86_64        y        y
>>> powerpc       n        n
>>> powerpc64     n        n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any
>>> checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank
>>>     entries that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your
>>>     headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build
>>>     tests. Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be
>>>     removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>     like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>     cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>     too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>     Instead you should place your content in a public tree to be
>>>     pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>     commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear
>>>     indication of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>     comments and change requests that were proposed in the initial
>>>     review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing
>>>     the threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any
>>>     results for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch
>>>     series do not contain the patch that updates the Doxygen manual.
>>>
>>
>
>
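
The point made repeatedly in this thread, that an OM client receiving ERR_TIMEOUT does not know whether the class-create was executed, can be illustrated with a small self-contained C sketch. Everything in it is hypothetical: fake_imm_t, fake_class_create, fake_class_get, create_class_checked and the RC_* codes are stand-ins invented for illustration, and are not the IMM API nor the code of the patch under review. The sketch also assumes a client that, on timeout, looks the class up before deciding whether the create took effect; the thread does not prescribe that strategy.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical return codes, loosely modelled on the replies discussed
 * in the thread. These are NOT the real SA_AIS_* definitions. */
typedef enum { RC_OK, RC_ERR_TIMEOUT, RC_ERR_NOT_EXIST } rc_t;

/* Toy stand-in for imm-ram: remembers at most one class name. */
typedef struct {
    char class_name[64];
    bool has_class;
    bool pbe_slow; /* simulate a PBE whose reply arrives too late */
} fake_imm_t;

/* Simulated class-create. As in the analysed case, the class is always
 * applied in "imm-ram"; only the reply differs when the PBE is slow. */
rc_t fake_class_create(fake_imm_t *imm, const char *name) {
    assert(imm != NULL && name != NULL);
    strncpy(imm->class_name, name, sizeof imm->class_name - 1);
    imm->class_name[sizeof imm->class_name - 1] = '\0';
    imm->has_class = true;
    return imm->pbe_slow ? RC_ERR_TIMEOUT : RC_OK;
}

/* Simulated lookup, standing in for a class-description query. */
rc_t fake_class_get(const fake_imm_t *imm, const char *name) {
    assert(imm != NULL && name != NULL);
    if (imm->has_class && strcmp(imm->class_name, name) == 0)
        return RC_OK;
    return RC_ERR_NOT_EXIST;
}

/* Client-side logic (an assumption, not taken from the thread):
 * a timeout leaves the outcome unknown, so query before deciding. */
bool create_class_checked(fake_imm_t *imm, const char *name) {
    rc_t rc = fake_class_create(imm, name);
    if (rc == RC_OK)
        return true;
    if (rc == RC_ERR_TIMEOUT)
        return fake_class_get(imm, name) == RC_OK;
    return false;
}
```

In this toy model the client logic is the same whether the timeout reply arrives early or late, which matches the observation in the thread that an "early" timeout causes the client no additional problems.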