Menu

#12 PVs stuck in Initial sampling

v1.0_(example)
closed
None
5
2015-10-28
2014-12-22
No

When adding a batch of PVs some seem to get "stuck" and sit in the Initial sampling state indefinitely (>2 hours). While in this state no CA search requests are seen. If I cancel and re-added these PVs then process completes successfully, so I don't think this is the CAJ search bug.

I don't notice any correlations with recordtype.

I see this with CAJ 1.1.14 and the new pre-release build.

I'm unsure how to troubleshoot this further.

Discussion

  • Michael Davidsaver

    A possible clue. One of the PVs which is presistently "stuck" is:

    $ cainfo BR-BI{}X17-I
    BR-BI{}X17-I
    State: connected
    Host: 10.0.152.183:48382
    Access: read, write
    Native data type: DBF_FLOAT
    Request type: DBR_FLOAT
    Element count: 36

    Which is all NaN

    $ caget BR-BI{}X17-I
    BR-BI{}X17-I 36 nan nan nan nan nan ...

     
  • Michael Davidsaver

    So I think I had several issues going on here, some of which are now cleaned up. One issue was the my deployment phase was probably mixing .class files from different builds. Removing this directory each time seems to help with reproducability.

    Yesterday I did another test of adding 51k PVs, all of which exist. This morning about 11k were listed as Being Archived with the remainder in Initial Sampling (specifically Current workflow state==START).

    This time I can see a lot of CA search requests being sent for .ADEL and .MDEL of PVs with RTYP bi and similar, which don't have there fields.

     
  • Michael Davidsaver

    Ok, so this appears to be a deadlock in the mgmt process. The two relevant threads are listed below. I'll try to attach the full dump as well.

    The deadlock involves a lock held by eventbus to prevent events from being handled concurrently (cf. SynchronizedEventSubscriber).

    The other aspect is the hazelcast event queue being full (cf. LinkedBlockingQueue).

    The thread (t@75) which should be de-queueing from the HZ event queue is blocking in a callback while attempting to add to the eventbus queue. At the same time, the eventbus thread is in a callback which is trying to add to the HZ queue.

    "hz.appliance0.event-2" - Thread t@75
    java.lang.Thread.State: BLOCKED
        at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:46)
        - waiting to lock <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber) owned by "MainThread" t@1214
        at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
        at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
        at com.google.common.eventbus.EventBus.post(EventBus.java:275)
        at org.epics.archiverappliance.config.DefaultConfigService$9.onMessage(DefaultConfigService.java:734)
        at com.hazelcast.topic.impl.TopicService.dispatchEvent(TopicService.java:135)
        at com.hazelcast.spi.impl.EventServiceImpl$LocalEventDispatcher.run(EventServiceImpl.java:658)
        at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:189)
        at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:173)
    
    Locked ownable synchronizers:
        - None
    
    "MainThread" - Thread t@1214
    java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1db41937> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:384)
        at com.hazelcast.util.executor.StripedExecutor$Worker.schedule(StripedExecutor.java:156)
        at com.hazelcast.util.executor.StripedExecutor$Worker.access$200(StripedExecutor.java:132)
        at com.hazelcast.util.executor.StripedExecutor.execute(StripedExecutor.java:112)
        at com.hazelcast.spi.impl.EventServiceImpl.executeLocal(EventServiceImpl.java:320)
        at com.hazelcast.spi.impl.EventServiceImpl.publishEvent(EventServiceImpl.java:304)
        at com.hazelcast.topic.impl.TopicService.publishEvent(TopicService.java:152)
        at com.hazelcast.topic.impl.TopicProxySupport.publishInternal(TopicProxySupport.java:94)
        at com.hazelcast.topic.impl.TopicProxy.publishInternal(TopicProxy.java:25)
        at com.hazelcast.topic.impl.TopicProxy.publish(TopicProxy.java:34)
        at org.epics.archiverappliance.config.DefaultConfigService.publishEventIntoCluster(DefaultConfigService.java:797)
        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.google.common.eventbus.EventSubscriber.handleEvent(EventSubscriber.java:74)
        at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:47)
        - locked <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber)
        at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
        at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
        at com.google.common.eventbus.EventBus.post(EventBus.java:275)
        at org.epics.archiverappliance.mgmt.archivepv.ArchivePVState.run(ArchivePVState.java:65)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    
    Locked ownable synchronizers:
        - locked <9c6f8f8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
    
     
  • Michael Davidsaver

    Ok, so this appears to be a deadlock in the mgmt process. The two relevant threads are listed below. I'll try to attach the full dump as well.

    The deadlock involves a lock held by eventbus to prevent events from being handled concurrently (cf. SynchronizedEventSubscriber).

    The other aspect is the hazelcast event queue being full (cf. LinkedBlockingQueue).

    The thread (t@75) which should be de-queueing from the HZ event queue is blocking in a callback while attempting to add to the eventbus queue. At the same time, the eventbus thread is in a callback which is trying to add to the HZ queue.

    "hz.appliance0.event-2" - Thread t@75
    java.lang.Thread.State: BLOCKED
        at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:46)
        - waiting to lock <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber) owned by "MainThread" t@1214
        at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
        at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
        at com.google.common.eventbus.EventBus.post(EventBus.java:275)
        at org.epics.archiverappliance.config.DefaultConfigService$9.onMessage(DefaultConfigService.java:734)
        at com.hazelcast.topic.impl.TopicService.dispatchEvent(TopicService.java:135)
        at com.hazelcast.spi.impl.EventServiceImpl$LocalEventDispatcher.run(EventServiceImpl.java:658)
        at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:189)
        at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:173)
    
    Locked ownable synchronizers:
        - None
    
    "MainThread" - Thread t@1214
    java.lang.Thread.State: TIMED_WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <1db41937> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:384)
        at com.hazelcast.util.executor.StripedExecutor$Worker.schedule(StripedExecutor.java:156)
        at com.hazelcast.util.executor.StripedExecutor$Worker.access$200(StripedExecutor.java:132)
        at com.hazelcast.util.executor.StripedExecutor.execute(StripedExecutor.java:112)
        at com.hazelcast.spi.impl.EventServiceImpl.executeLocal(EventServiceImpl.java:320)
        at com.hazelcast.spi.impl.EventServiceImpl.publishEvent(EventServiceImpl.java:304)
        at com.hazelcast.topic.impl.TopicService.publishEvent(TopicService.java:152)
        at com.hazelcast.topic.impl.TopicProxySupport.publishInternal(TopicProxySupport.java:94)
        at com.hazelcast.topic.impl.TopicProxy.publishInternal(TopicProxy.java:25)
        at com.hazelcast.topic.impl.TopicProxy.publish(TopicProxy.java:34)
        at org.epics.archiverappliance.config.DefaultConfigService.publishEventIntoCluster(DefaultConfigService.java:797)
        at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.google.common.eventbus.EventSubscriber.handleEvent(EventSubscriber.java:74)
        at com.google.common.eventbus.SynchronizedEventSubscriber.handleEvent(SynchronizedEventSubscriber.java:47)
        - locked <2e324b96> (a com.google.common.eventbus.SynchronizedEventSubscriber)
        at com.google.common.eventbus.EventBus.dispatch(EventBus.java:322)
        at com.google.common.eventbus.EventBus.dispatchQueuedEvents(EventBus.java:304)
        at com.google.common.eventbus.EventBus.post(EventBus.java:275)
        at org.epics.archiverappliance.mgmt.archivepv.ArchivePVState.run(ArchivePVState.java:65)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    
    Locked ownable synchronizers:
        - locked <9c6f8f8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
    
     
  • Murali Shankar

    Murali Shankar - 2015-10-28

    We introduced throttling into the archive PV workflow; that seems to help a lot. This issue has been closed since early 2015. Please let me know if otherwise

     
  • Murali Shankar

    Murali Shankar - 2015-10-28
    • status: open --> closed
    • assigned_to: Murali Shankar
     

Log in to post a comment.