EPICS Archiver Appliance / Bugs / #12 PVs stuck in Initial sampling

Michael Davidsaver - 2014-12-23

A possible clue. One of the PVs which is presistently "stuck" is:

$ cainfo BR-BI{}X17-I
BR-BI{}X17-I
State: connected
Host: 10.0.152.183:48382
Access: read, write
Native data type: DBF_FLOAT
Request type: DBR_FLOAT
Element count: 36

Which is all NaN

$ caget BR-BI{}X17-I
BR-BI{}X17-I 36 nan nan nan nan nan ...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Davidsaver - 2014-12-30

So I think I had several issues going on here, some of which are now cleaned up. One issue was the my deployment phase was probably mixing .class files from different builds. Removing this directory each time seems to help with reproducability.

Yesterday I did another test of adding 51k PVs, all of which exist. This morning about 11k were listed as Being Archived with the remainder in Initial Sampling (specifically Current workflow state==START).

This time I can see a lot of CA search requests being sent for .ADEL and .MDEL of PVs with RTYP bi and similar, which don't have there fields.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ok, so this appears to be a deadlock in the mgmt process. The two relevant threads are listed below. I'll try to attach the full dump as well.

The deadlock involves a lock held by eventbus to prevent events from being handled concurrently (cf. SynchronizedEventSubscriber).

The other aspect is the hazelcast event queue being full (cf. LinkedBlockingQueue).

The thread (t@75) which should be de-queueing from the HZ event queue is blocking in a callback while attempting to add to the eventbus queue. At the same time, the eventbus thread is in a callback which is trying to add to the HZ queue.

"hz.appliance0.event-2" - Thread t@75
java.lang.Thread.State: BLOCKED
    at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:46)
    - waiting to lock <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber) owned by "MainThread" t@1214
    at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
    at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
    at com.google.common.eventbus.EventBus.post(EventBus.java:275)
    at org.epics.archiverappliance.config.DefaultConfigService$9.onMessage(DefaultConfigService.java:734)
    at com.hazelcast.topic.impl.TopicService.dispatchEvent(TopicService.java:135)
    at com.hazelcast.spi.impl.EventServiceImpl$LocalEventDispatcher.run(EventServiceImpl.java:658)
    at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:189)
    at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:173)

Locked ownable synchronizers:
    - None

"MainThread" - Thread t@1214
java.lang.Thread.State: TIMED_WAITING
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <1db41937> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
    at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:384)
    at com.hazelcast.util.executor.StripedExecutor$Worker.schedule(StripedExecutor.java:156)
    at com.hazelcast.util.executor.StripedExecutor$Worker.access$200(StripedExecutor.java:132)
    at com.hazelcast.util.executor.StripedExecutor.execute(StripedExecutor.java:112)
    at com.hazelcast.spi.impl.EventServiceImpl.executeLocal(EventServiceImpl.java:320)
    at com.hazelcast.spi.impl.EventServiceImpl.publishEvent(EventServiceImpl.java:304)
    at com.hazelcast.topic.impl.TopicService.publishEvent(TopicService.java:152)
    at com.hazelcast.topic.impl.TopicProxySupport.publishInternal(TopicProxySupport.java:94)
    at com.hazelcast.topic.impl.TopicProxy.publishInternal(TopicProxy.java:25)
    at com.hazelcast.topic.impl.TopicProxy.publish(TopicProxy.java:34)
    at org.epics.archiverappliance.config.DefaultConfigService.publishEventIntoCluster(DefaultConfigService.java:797)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.google.common.eventbus.EventSubscriber.handleEvent(EventSubscriber.java:74)
    at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:47)
    - locked <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber)
    at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
    at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
    at com.google.common.eventbus.EventBus.post(EventBus.java:275)
    at org.epics.archiverappliance.mgmt.archivepv.ArchivePVState.run(ArchivePVState.java:65)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Locked ownable synchronizers:
    - locked <9c6f8f8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

mgmt-dump.txt

Ok, so this appears to be a deadlock in the mgmt process. The two relevant threads are listed below. I'll try to attach the full dump as well.

The deadlock involves a lock held by eventbus to prevent events from being handled concurrently (cf. SynchronizedEventSubscriber).

The other aspect is the hazelcast event queue being full (cf. LinkedBlockingQueue).

"hz.appliance0.event-2" - Thread t@75
java.lang.Thread.State: BLOCKED
    at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:46)
    - waiting to lock <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber) owned by "MainThread" t@1214
    at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
    at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
    at com.google.common.eventbus.EventBus.post(EventBus.java:275)
    at org.epics.archiverappliance.config.DefaultConfigService$9.onMessage(DefaultConfigService.java:734)
    at com.hazelcast.topic.impl.TopicService.dispatchEvent(TopicService.java:135)
    at com.hazelcast.spi.impl.EventServiceImpl$LocalEventDispatcher.run(EventServiceImpl.java:658)
    at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:189)
    at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:173)

Locked ownable synchronizers:
    - None

"MainThread" - Thread t@1214
java.lang.Thread.State: TIMED_WAITING
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <1db41937> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
    at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:384)
    at com.hazelcast.util.executor.StripedExecutor$Worker.schedule(StripedExecutor.java:156)
    at com.hazelcast.util.executor.StripedExecutor$Worker.access$200(StripedExecutor.java:132)
    at com.hazelcast.util.executor.StripedExecutor.execute(StripedExecutor.java:112)
    at com.hazelcast.spi.impl.EventServiceImpl.executeLocal(EventServiceImpl.java:320)
    at com.hazelcast.spi.impl.EventServiceImpl.publishEvent(EventServiceImpl.java:304)
    at com.hazelcast.topic.impl.TopicService.publishEvent(TopicService.java:152)
    at com.hazelcast.topic.impl.TopicProxySupport.publishInternal(TopicProxySupport.java:94)
    at com.hazelcast.topic.impl.TopicProxy.publishInternal(TopicProxy.java:25)
    at com.hazelcast.topic.impl.TopicProxy.publish(TopicProxy.java:34)
    at org.epics.archiverappliance.config.DefaultConfigService.publishEventIntoCluster(DefaultConfigService.java:797)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.google.common.eventbus.EventSubscriber.handleEvent(EventSubscriber.java:74)
    at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:47)
    - locked <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber)
    at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
    at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
    at com.google.common.eventbus.EventBus.post(EventBus.java:275)
    at org.epics.archiverappliance.mgmt.archivepv.ArchivePVState.run(ArchivePVState.java:65)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Locked ownable synchronizers:
    - locked <9c6f8f8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Murali Shankar - 2015-10-28

We introduced throttling into the archive PV workflow; that seems to help a lot. This issue has been closed since early 2015. Please let me know if otherwise

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Murali Shankar - 2015-10-28

status: open --> closed

assigned_to: Murali Shankar
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

PVs stuck in Initial sampling

Group

Searches

Help

#12 PVs stuck in Initial sampling

Discussion