There are a number of problems that have to be fixed:
1.
When a new application stream is created, IMM may return BAD HANDLE when the RT object for the stream is requested. This code is incorrectly sent back to the log agent, which reports BAD HANDLE to the log client. The correct AIS return code to the client in this case is TRY AGAIN.
See proc_stream_open_msg() in lgs_evt.cc. Note that there is a check for an invalid OI handle, but it does not work in this situation. The check must be done on the return code from the function that creates the RT object.
2.
Fix the lgs_imm_impl_reinit_nonblocking() function so that it can be called in several places without starting a recovery thread more than once. This is needed because of the following fixes.
3.
Call lgs_imm_impl_reinit_nonblocking() if BAD_HANDLE is returned in almost all places where any of the saImmOiRtObject functions are used. An exception is, e.g., initialization before the log server is started.
Note: Do not finalize the OI and start recovery on ERR_TIMEOUT, as has been suggested. A TIMEOUT error should also not be part of any try-again loop. In that situation the state is undefined and other action should be taken, e.g. a restart of immnd.
4.
It must be possible to stop the background thread that does the OI restore. This is needed if a request to change state to standby is received (AMF callback). If this is not done, there is a risk that the standby becomes OI.
5.
If a stream create request fails with an ALREADY_EXIST error when the stream RT object is to be created, the creation of the stream shall not fail. Instead, the existing RT object shall be deleted and a new RT object created. This is safe because the check for whether the stream already exists is done before RT object creation is requested, so "exists" here means that a “stray” object exists. A “stray” object may be left behind if RT object creation fails with TIMEOUT_ERROR: in that case it is undefined whether an RT object was created or not, but the stream itself is not created.
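
The return code handling described in items 1, 3 and 5 can be sketched roughly as below. This is only an illustration, not the code in the commit: the `Rc` enum is a local stand-in for the SA_AIS_* codes, and `ImmHooks` and `CreateStreamRtObject` are hypothetical names where the real server would call the saImmOiRtObject functions and lgs_imm_impl_reinit_nonblocking(). Passing a timeout back unchanged is also an assumption; the ticket only says it must not trigger recovery or a retry loop.

```cpp
#include <functional>

// Local stand-ins for the SA_AIS_* return codes (names are for this sketch only).
enum class Rc { kOk, kTimeout, kTryAgain, kBadHandle, kExist };

// Hypothetical hooks standing in for the IMM OI calls and the recovery starter.
struct ImmHooks {
  std::function<Rc()> create_rt_object;
  std::function<Rc()> delete_rt_object;
  std::function<void()> start_oi_recovery;
};

// Returns the code to send back to the log agent for a stream open request.
Rc CreateStreamRtObject(const ImmHooks& imm) {
  Rc rc = imm.create_rt_object();
  if (rc == Rc::kExist) {
    // A stray object, e.g. left by an earlier create that timed out:
    // delete it and create a fresh one instead of failing the request.
    rc = imm.delete_rt_object();
    if (rc == Rc::kOk) rc = imm.create_rt_object();
  }
  if (rc == Rc::kBadHandle) {
    // The OI handle is no longer valid: start recovery and let the client
    // retry rather than reporting BAD HANDLE to it.
    imm.start_oi_recovery();
    return Rc::kTryAgain;
  }
  // All other codes, including kTimeout, are passed on as they are
  // (assumption): no retry loop and no OI recovery for a timeout.
  return rc;
}
```

The check is made on the code returned by the create call itself, which is the point of item 1: a handle that looked valid before the call can still be rejected by IMM.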
Note:
This will not solve the problem of a “silently” lost OI. That may happen if a static OI operation returns a TIMEOUT error and no further static operation is done. In this case nothing will happen that gives a BAD HANDLE return code to trigger an OI recovery. This is not a log problem; it is a generic problem that may affect any service using an OI.
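
For items 2 and 4, the "start at most once" and "can be stopped" behaviour of the recovery thread could look roughly like the sketch below. `OiRecovery` and `try_restore` are hypothetical names used only for illustration; this is not the actual lgs_imm_impl_reinit_nonblocking() implementation, and the 10 ms retry interval is arbitrary.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical wrapper around a background OI restore thread.
class OiRecovery {
 public:
  // `try_restore` stands in for one attempt to become OI again and
  // returns true once the OI has been restored.
  explicit OiRecovery(std::function<bool()> try_restore)
      : try_restore_(std::move(try_restore)) {}
  ~OiRecovery() { Stop(); }

  // May be called from several places; starts a thread only if none is running.
  void StartNonBlocking() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (running_.load()) return;             // recovery already in progress
    if (thread_.joinable()) thread_.join();  // reap a thread that has finished
    stop_ = false;
    running_ = true;
    thread_ = std::thread([this] {
      while (!stop_.load() && !try_restore_()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
      running_ = false;
    });
  }

  // Called e.g. when a request to change state to standby is received.
  // Returns only after the thread has terminated.
  void Stop() {
    std::lock_guard<std::mutex> lock(mutex_);
    stop_ = true;
    if (thread_.joinable()) thread_.join();
  }

  bool running() const { return running_.load(); }

 private:
  std::function<bool()> try_restore_;
  std::mutex mutex_;
  std::thread thread_;
  std::atomic<bool> stop_{false};
  std::atomic<bool> running_{false};
};
```

Because Stop() joins the thread, no further restore attempt is made after it returns. An attempt that succeeded just before the stop request would still have to be handled by the caller.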
Diff:
commit a8f5bb5bb6fe348942bce37b2388f25520fdbd16
Author: Lennart Lund lennart.lund@ericsson.com
Date: Thu Apr 5 14:28:20 2018 +0200