PVs stuck in Initial sampling
Brought to you by:
slac-mshankar
When adding a batch of PVs, some seem to get "stuck" and sit in the Initial sampling state indefinitely (>2 hours). While in this state no CA search requests are seen. If I cancel and re-add these PVs, the process completes successfully, so I don't think this is the CAJ search bug.
I don't notice any correlation with record type.
I see this with CAJ 1.1.14 and the new pre-release build.
I'm unsure how to troubleshoot this further.
A possible clue. One of the PVs which is persistently "stuck" is:
Which is all NaN
So I think I had several issues going on here, some of which are now cleaned up. One issue was that my deployment phase was probably mixing .class files from different builds. Removing this directory each time seems to help with reproducibility.
Yesterday I did another test of adding 51k PVs, all of which exist. This morning about 11k were listed as Being Archived with the remainder in Initial Sampling (specifically Current workflow state==START).
This time I can see a lot of CA search requests being sent for .ADEL and .MDEL of PVs with RTYP bi and similar, which don't have these fields.
Ok, so this appears to be a deadlock in the mgmt process. The two relevant threads are listed below. I'll try to attach the full dump as well.
The deadlock involves a lock held by eventbus to prevent events from being handled concurrently (cf. SynchronizedEventSubscriber).
The other aspect is the hazelcast event queue being full (cf. LinkedBlockingQueue).
The thread (t@75) which should be de-queueing from the HZ event queue is blocking in a callback while attempting to add to the eventbus queue. At the same time, the eventbus thread is in a callback which is trying to add to the HZ queue.
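To illustrate the pattern described above, here is a minimal sketch of the same shape of deadlock in plain Java. It is not the Archiver Appliance, eventbus, or Hazelcast code: `hzQueue` and `busLock` are hypothetical stand-ins for the bounded Hazelcast event queue and the lock that serializes eventbus handlers, and timed `offer`/`tryLock` calls are used instead of the blocking ones so that the demo terminates rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class QueueLockDeadlockSketch {

    /** Returns true if both threads ended up waiting on each other. */
    static boolean reproduce() throws InterruptedException {
        // Stand-in for the bounded Hazelcast event queue (capacity 1, so it is full at once).
        LinkedBlockingQueue<String> hzQueue = new LinkedBlockingQueue<>(1);
        // Stand-in for the lock that prevents concurrent eventbus handling.
        ReentrantLock busLock = new ReentrantLock();
        hzQueue.put("event-1"); // the queue is now full

        CountDownLatch busLockHeld = new CountDownLatch(1);
        CountDownLatch dispatcherGaveUp = new CountDownLatch(1);
        AtomicBoolean busHandlerStuck = new AtomicBoolean(false);
        AtomicBoolean dispatcherStuck = new AtomicBoolean(false);

        // Eventbus side: holds the bus lock and, inside its callback, adds to the full queue.
        Thread busHandler = new Thread(() -> {
            busLock.lock();
            try {
                busLockHeld.countDown();
                // A plain put() would block here forever; the timed offer just reports failure.
                boolean added = hzQueue.offer("event-2", 500, TimeUnit.MILLISECONDS);
                busHandlerStuck.set(!added);
                dispatcherGaveUp.await(); // keep the lock until the other side has timed out
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                busLock.unlock();
            }
        });

        // Queue side: the only consumer, but its callback needs the bus lock before it can
        // return and de-queue the next event.
        Thread dispatcher = new Thread(() -> {
            try {
                busLockHeld.await();
                // A plain lock() would block here forever.
                boolean locked = busLock.tryLock(200, TimeUnit.MILLISECONDS);
                dispatcherStuck.set(!locked);
                if (locked) {
                    busLock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                dispatcherGaveUp.countDown();
            }
        });

        busHandler.start();
        dispatcher.start();
        busHandler.join();
        dispatcher.join();
        return busHandlerStuck.get() && dispatcherStuck.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlock pattern reproduced: " + reproduce());
    }
}
```

With untimed `put()` and `lock()` both threads would wait indefinitely, which matches what the thread dump shows: the only consumer of the queue cannot make progress until the lock holder does, and the lock holder cannot make progress until the queue drains.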
We introduced throttling into the archive PV workflow; that seems to help a lot. This issue has been closed since early 2015. Please let me know if the problem persists.
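For readers wondering what throttling a batch of requests can look like, the following is a generic sketch only. It does not show the actual change made to the archive PV workflow, which is not described in this thread. It assumes a semaphore-based limit on in-flight work; `submitAll`, `startWorkflow`, and `maxInFlight` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledSubmitSketch {

    /** Hypothetical placeholder for starting one PV's archive workflow. */
    static void startWorkflow(String pvName) {
        try {
            Thread.sleep(5); // simulate some work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Starts one workflow per PV with at most maxInFlight running at a time.
     * Returns the peak number of concurrently running workflows that was observed.
     */
    static int submitAll(List<String> pvNames, int maxInFlight) throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newCachedThreadPool();
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (String pv : pvNames) {
            permits.acquire(); // the submitter waits here while maxInFlight workflows are active
            pool.execute(() -> {
                try {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    startWorkflow(pv);
                } finally {
                    inFlight.decrementAndGet();
                    permits.release(); // lets the submitter start the next PV
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> pvs = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            pvs.add("TEST:PV:" + i); // made-up PV names
        }
        System.out.println("peak concurrency: " + submitAll(pvs, 4));
    }
}
```

The point of a limit like this is that a large batch (such as the 51k PVs above) generates events at a bounded rate, so a bounded event queue downstream is less likely to fill up.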