A case has been seen where syslog gets filled with thousands of messages like the one below:
May 3 15:37:48 SC-1 osaflogd[7643]: ncs_sel_obj_rmv_ind: recv failed - Socket operation on non-socket
Probably the wrong file descriptor is being used here when this happens. When looking at the code, there are some obvious improvements that can be made:
- Whenever the file descriptors raise_obj and/or rmv_obj are closed, the file descriptors in the data structure should be overwritten with -1 to indicate that the file descriptor is no longer valid. Relying on subsequent system calls to fail with EBADF is not a good idea, since the file descriptor may be re-cycled. This might be what has happened in the syslog entry above.
- The function ncs_sel_obj_rmv_ind() should check if either file descriptor is less than zero, and if so, return immediately without trying to operate on the file descriptors. It may log to syslog in this case, but in order to avoid spamming the log it should make sure to log only once. This can be achieved by e.g. logging if the file descriptor is -1, and then changing it to -2 so that the next call will not log to syslog.
- If, after implementing the changes suggested above, recv() still fails for any reason other than EAGAIN, EWOULDBLOCK or EINTR, we should call osaf_abort() to generate a core dump. Errors like "Socket operation on non-socket" are an indication of a bug.
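The three suggestions above could look roughly like the following. This is only a sketch under stated assumptions, not the code that was pushed: the struct sel_obj_sketch, the sketch_* functions and the counter sketch_log_count are hypothetical names, fprintf() stands in for syslog, and abort() stands in for osaf_abort().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the selection object; NOT the OpenSAF definition. */
typedef struct {
  int raise_obj;
  int rmv_obj;
} sel_obj_sketch;

enum { FD_CLOSED = -1, FD_CLOSED_LOGGED = -2 };

static int sketch_log_count = 0; /* counts log lines, for illustration only */

static int sketch_create(sel_obj_sketch *o) {
  int sv[2];
  assert(o != NULL);
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
  o->raise_obj = sv[0];
  o->rmv_obj = sv[1];
  return 0;
}

/* Suggestion 1: overwrite the stored descriptors with -1 once closed. */
static void sketch_destroy(sel_obj_sketch *o) {
  assert(o != NULL);
  if (o->raise_obj >= 0) close(o->raise_obj);
  if (o->rmv_obj >= 0) close(o->rmv_obj);
  o->raise_obj = FD_CLOSED;
  o->rmv_obj = FD_CLOSED;
}

/* Returns the number of bytes removed (0 or 1), or -1 for an invalid object. */
static int sketch_rmv_ind(sel_obj_sketch *o) {
  char c;
  assert(o != NULL);
  /* Suggestion 2: never touch invalid descriptors, and log only once. */
  if (o->raise_obj < 0 || o->rmv_obj < 0) {
    if (o->raise_obj == FD_CLOSED || o->rmv_obj == FD_CLOSED) {
      fprintf(stderr, "sketch_rmv_ind: called on closed selection object\n");
      sketch_log_count++;
      if (o->raise_obj == FD_CLOSED) o->raise_obj = FD_CLOSED_LOGGED;
      if (o->rmv_obj == FD_CLOSED) o->rmv_obj = FD_CLOSED_LOGGED;
    }
    return -1;
  }
  for (;;) {
    ssize_t n = recv(o->rmv_obj, &c, 1, MSG_DONTWAIT);
    if (n >= 0) return (int)n; /* 0 means the peer end was closed */
    if (errno == EINTR) continue;
    if (errno == EAGAIN || errno == EWOULDBLOCK) return 0; /* nothing pending */
    /* Suggestion 3: any other error indicates a bug, so dump core. */
    abort(); /* stand-in for osaf_abort() */
  }
}
```

With this shape, a call on a destroyed object logs exactly once (the stored values go from -1 to -2) and returns without issuing any system call on a possibly recycled descriptor.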
The primary reason I see for this scenario is that the fds were not handled properly within the process execution flow.
Chances of happening:
When the fd is closed twice in a particular flow, i.e. after the closed fd has been reallocated for some other use, the initial flow closes this fd again. Similar behavior was explained in ticket #147.
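To illustrate the double-close scenario, here is a small stand-alone sketch (the names reuse_result and demo_fd_reuse are made up for this example, and it assumes a single-threaded Linux/POSIX process). It closes a socket without resetting the stored number, lets open() recycle that number for /dev/null, and then shows both effects: recv() on the stale number fails with ENOTSOCK ("Socket operation on non-socket"), and the second close() destroys the unrelated descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct {
  int reused;        /* new descriptor got the same number as the stale one */
  int recv_errno;    /* errno from recv() on the stale number */
  int victim_closed; /* second close() destroyed the unrelated descriptor */
} reuse_result;

static reuse_result demo_fd_reuse(void) {
  reuse_result r = {0, 0, 0};
  int sv[2];
  char c;

  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return r;
  /* Use the lower number so that it is the lowest free one after close(). */
  int stale = sv[0] < sv[1] ? sv[0] : sv[1];
  int peer = sv[0] < sv[1] ? sv[1] : sv[0];

  close(stale); /* first close; the stored number is NOT reset to -1 */

  /* Some other flow opens a file; POSIX hands out the lowest free number. */
  int other = open("/dev/null", O_RDONLY);
  r.reused = (other == stale);

  /* The initial flow still uses the stale number as if it were a socket. */
  if (recv(stale, &c, 1, MSG_DONTWAIT) < 0) r.recv_errno = errno;

  /* The initial flow closes "its" descriptor a second time. */
  close(stale);
  r.victim_closed =
      (other >= 0 && fcntl(other, F_GETFD) == -1 && errno == EBADF);

  close(peer);
  if (!r.reused && other >= 0) close(other);
  return r;
}
```

Resetting the stored value to -1 right after the first close would turn both the stale recv() and the second close() into detectable errors instead of silent damage to another flow.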
Yes, in this situation it is better to call osaf_abort() to generate a core dump.
Fix pushed to default, 4.4 and 4.3.
changeset: 5429:19bbcda1b15a
tag: tip
parent: 5426:89f247c08c4e
user: Ramesh ramesh.betham@oracle.com
date: Fri Jun 20 18:41:01 2014 +0530
summary: base: Corrected handling of raise_obj, rmv_obj file descriptors of Selection object [#928]
changeset: 5428:3ddbecc11a98
branch: opensaf-4.4.x
parent: 5414:dba5f3bbbf6f
user: Ramesh ramesh.betham@oracle.com
date: Fri Jun 20 18:41:01 2014 +0530
summary: base: Corrected handling of raise_obj, rmv_obj file descriptors of Selection object [#928]
changeset: 5427:4c1bea3021ba
branch: opensaf-4.3.x
parent: 5413:02e77b43ee5b
user: Ramesh ramesh.betham@oracle.com
date: Fri Jun 20 18:36:24 2014 +0530
summary: base: Corrected handling of raise_obj, rmv_obj file descriptors of Selection object [#928]
Clean-up of the SEL_OBJ macros on the "default" branch is pending and will be pushed subsequently.
Thanks,
Ramesh.
Related tickets: #928
The following ER messages show that similar corrections are needed in "rda" as well.
...............
Jun 10 09:47:58 SC-1 osaflogd[7691]: ER recv: PCSRDA_RC_IPC_RECV_FAILED: rc=88-Socket operation on non-socket
Jun 10 09:47:58 SC-1 osaflogd[7691]: ncs_sel_obj_rmv_ind: recv failed - Socket operation on non-socket
...............
Thanks,
Ramesh.
The fix has already been pushed to the 4.3, 4.4 and default branches.
The SEL_OBJ macros cleanup will be done in the default (4.5) branch.