Dapr 1.15.6
This update includes bug fixes:
- Fix Actor memory leak
- Fix Workflow state store contention
- Update Max Concurrent Workflow Operations
- Remove client-side rate limiter from injector
- Allow non-stream Workflow Schedule
- Scheduler connect after Placement dissemination
- Fix Scheduler Deadlock under high load
- Fix Placement reconnection during leader shutdown
- Fix Kafka PubSub intermittently dropping messages while retrying
Fix Actor memory leak
Problem
When running Actor or Workflow workloads, the daprd process would consume more and more memory over time.
Impact
Running Actors or Workflows for long enough would cause the daprd process to consume all available memory, leading to an OOM crash.
Root cause
The Actor locking mechanism would not release lock object memory after use when the daprd was calling a remote Actor.
Solution
Defer Actor message locking to only the daprd which is hosting that Actor ID to ensure the lock object memory is always released after use.
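The sketch below (Go, illustrative names, not Dapr's actual implementation) shows the shape of the fix: per-Actor lock entries are only created on the daprd hosting the Actor ID, and each entry is removed from the table once its last holder releases it, so the lock map cannot grow without bound.

```go
package main

import "sync"

// refLock is a per-Actor lock plus a reference count of current holders.
type refLock struct {
	mu   sync.Mutex
	refs int
}

type lockTable struct {
	mu    sync.Mutex
	locks map[string]*refLock
}

func newLockTable() *lockTable {
	return &lockTable{locks: make(map[string]*refLock)}
}

// acquire locks the given Actor ID and returns a release func that frees the
// lock object once no caller holds it anymore.
func (t *lockTable) acquire(actorID string) (release func()) {
	t.mu.Lock()
	l, ok := t.locks[actorID]
	if !ok {
		l = &refLock{}
		t.locks[actorID] = l
	}
	l.refs++
	t.mu.Unlock()

	l.mu.Lock()
	return func() {
		l.mu.Unlock()
		t.mu.Lock()
		l.refs--
		if l.refs == 0 {
			delete(t.locks, actorID) // lock object memory released after use
		}
		t.mu.Unlock()
	}
}

// invoke only takes a local lock when this daprd hosts the Actor ID; calls to
// remote Actors are locked by the hosting daprd instead.
func invoke(t *lockTable, actorID string, isLocalActor bool, call func()) {
	if !isLocalActor {
		call()
		return
	}
	release := t.acquire(actorID)
	defer release()
	call()
}

func main() {
	t := newLockTable()
	invoke(t, "counter-1", true, func() {})
}
```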
Fix Workflow state store contention
Problem
Running Workflows in high-throughput scenarios caused contention that blocked or slowed Workflow state store operations.
Impact
Degraded performance of Workflow operations, leading to increased latency and potential timeouts.
Root cause
Circular Placement locking of Actor Workflow state operations.
Solution
Remove Placement locking in Workflow state operations as it is already locked at the Actor level.
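A minimal sketch of the resulting locking model, assuming a per-instance lock map; this is illustrative rather than Dapr's code: state operations serialize only on the individual Workflow instance, with no outer Placement lock.

```go
package main

import "sync"

type workflowState struct {
	instanceLocks sync.Map // workflow instance ID -> *sync.Mutex
}

func (w *workflowState) lockFor(instanceID string) *sync.Mutex {
	l, _ := w.instanceLocks.LoadOrStore(instanceID, &sync.Mutex{})
	return l.(*sync.Mutex)
}

// saveState persists state while holding only the per-instance (Actor-level)
// lock; no coarse Placement lock, so unrelated instances proceed in parallel.
func (w *workflowState) saveState(instanceID string, persist func()) {
	mu := w.lockFor(instanceID)
	mu.Lock()
	defer mu.Unlock()
	persist()
}

func main() {
	var w workflowState
	w.saveState("order-42", func() {})
}
```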
Update Max Concurrent Workflow Operations
Problem
Workflows would experience degraded performance in high throughput scenarios.
Impact
Workflows would be limited in the number of concurrent operations they could handle, leading to increased latency and potential timeouts.
Root cause
The default value for maxConcurrentWorkflowInvocations and maxConcurrentActivityInvocations was set to 1000, which could lead to performance issues in high-throughput scenarios.
Solution
Increase the default value for maxConcurrentWorkflowInvocations and maxConcurrentActivityInvocations to the maximum int32 value, allowing more concurrent operations to be processed without contention.
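For illustration only, the following sketch shows why the new default behaves as effectively unbounded: a semaphore-style limiter with capacity 1000 blocks once 1000 invocations are in flight, while a capacity of max int32 practically never blocks. The limiter type here is hypothetical; only the setting names come from this note.

```go
package main

import (
	"fmt"
	"math"
)

// limiter caps the number of in-flight invocations using a buffered channel.
type limiter struct{ slots chan struct{} }

func newLimiter(capacity int) *limiter {
	return &limiter{slots: make(chan struct{}, capacity)}
}

func (l *limiter) do(work func()) {
	l.slots <- struct{}{}        // blocks when the cap is reached
	defer func() { <-l.slots }() // frees the slot when the invocation finishes
	work()
}

func main() {
	previous := newLimiter(1000)          // old default: 1000 concurrent invocations
	updated := newLimiter(math.MaxInt32)  // new default: effectively unbounded
	previous.do(func() { fmt.Println("bounded at 1000 in-flight invocations") })
	updated.do(func() { fmt.Println("effectively never blocks") })
}
```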
Remove client-side rate limiter from injector
Problem
The injector was using a client-side rate limiter that could cause throttling issues in high-throughput production environments.
Impact
Degraded performance and potential service disruptions when the injector was under heavy load.
Root cause
The client-side rate limiter in the injector was unnecessarily restricting operations, causing bottlenecks in high-traffic scenarios.
Solution
Remove the client-side rate limiter from the injector to allow for better performance and eliminate unnecessary throttling in production environments.
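The note does not name the limiter that was removed; assuming it was the default client-side limiter of the Kubernetes Go client the injector uses, the sketch below shows one common way such throttling is disabled. Treat it as an illustration of the idea, not the actual Dapr change.

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newUnthrottledClient builds an in-cluster Kubernetes client without the
// default client-side rate limiting.
func newUnthrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// A negative QPS disables client-go's client-side rate limiting, leaving
	// throttling to the API server's own flow control.
	cfg.QPS = -1
	return kubernetes.NewForConfig(cfg)
}

func main() {
	_, _ = newUnthrottledClient()
}
```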
Allow non-stream Workflow Schedule
Problem
Attempting to schedule a Workflow on a daprd with no Workflow listener stream would hang.
Impact
Attempting to schedule a Workflow on a daprd with no Workflow listener stream would hang indefinitely.
Root cause
The daprd would wait until a Workflow listener stream was established before processing the schedule request.
Solution
Allow scheduling of Workflows without requiring a Workflow listener stream to be established first. This allows for immediate processing of the schedule request.
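A hedged sketch of the behavior change, with hypothetical names: the schedule request is accepted and queued immediately, and queued work is delivered once a listener stream connects, instead of the request blocking until a stream exists.

```go
package main

import "fmt"

type scheduler struct {
	queued      []string
	streamReady bool
}

// Schedule accepts the Workflow start request even when no listener stream is
// connected yet. Previously this call would block until a stream registered.
func (s *scheduler) Schedule(instanceID string) {
	s.queued = append(s.queued, instanceID)
	if s.streamReady {
		s.deliver()
	}
}

// ConnectStream registers a listener stream and drains any queued work.
func (s *scheduler) ConnectStream() {
	s.streamReady = true
	s.deliver()
}

func (s *scheduler) deliver() {
	for _, id := range s.queued {
		fmt.Println("delivering workflow", id)
	}
	s.queued = nil
}

func main() {
	var s scheduler
	s.Schedule("order-1") // returns immediately even with no stream
	s.ConnectStream()     // queued work is delivered once the stream connects
}
```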
Scheduler connect after Placement dissemination
Problem
Connecting a Workflow stream when there is work queued in the system would cause a large number of errors.
Impact
Connecting a Workflow stream when there is work queued in the system would cause a large number of errors, leading to degraded performance and potential timeouts on existing Workflows.
Root cause
Daprd would connect to the Scheduler to receive Workflow work before Placement had finished disseminating.
Solution
Ensure that the daprd connects to the Scheduler only after Placement has disseminated, preventing errors and ensuring that the daprd is ready to process Workflow work without contention.
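A sketch of the ordering gate, using an illustrative channel-based signal rather than Dapr's actual API: the Scheduler connection waits until Placement dissemination has completed.

```go
package main

import (
	"context"
	"fmt"
)

type runtime struct {
	placementReady chan struct{} // closed once Placement dissemination completes
}

func (r *runtime) onPlacementDisseminated() {
	close(r.placementReady)
}

// connectScheduler blocks until Placement has disseminated, so queued Workflow
// work can be processed as soon as the Scheduler stream delivers it.
func (r *runtime) connectScheduler(ctx context.Context) error {
	select {
	case <-r.placementReady:
	case <-ctx.Done():
		return ctx.Err()
	}
	fmt.Println("placement disseminated; connecting to Scheduler")
	return nil
}

func main() {
	r := &runtime{placementReady: make(chan struct{})}
	go r.onPlacementDisseminated()
	_ = r.connectScheduler(context.Background())
}
```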
Fix Scheduler Deadlock under high load
Problem
Under high load, such as high Workflow throughput, a Scheduler would deadlock.
Impact
Workflows and the Jobs API would fail to trigger.
Root cause
The Scheduler would not retry or handle transient etcd errors related to high contention.
Solution
Add retry logic to the Scheduler to handle transient etcd errors, preventing deadlocks and ensuring that Workflows and Jobs can be triggered successfully under high load.
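The note does not list which etcd errors are treated as transient, so the sketch below only shows the general retry-with-backoff shape of the fix; isTransient and errTransient are placeholders.

```go
package main

import (
	"context"
	"errors"
	"time"
)

var errTransient = errors.New("transient etcd error") // stand-in for contention errors

func isTransient(err error) bool { return errors.Is(err, errTransient) }

// withRetry retries op on transient errors with a simple doubling backoff
// instead of letting the failure wedge the caller.
func withRetry(ctx context.Context, op func() error) error {
	backoff := 10 * time.Millisecond
	for {
		err := op()
		if err == nil || !isTransient(err) {
			return err
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	attempts := 0
	_ = withRetry(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return errTransient // fails twice, then succeeds
		}
		return nil
	})
}
```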
Fix Placement reconnection during leader shutdown
Problem
Daprd would fail to find the Placement leader host.
Impact
Actors (or Workflows) would fail to continue to function.
Root cause
The daprd DNS record set used to round-robin the Placement hosts would not be refreshed in the event the leader host was shut down.
Solution
Add a retry mechanism to the daprd to refresh the DNS record set for the Placement hosts, ensuring that it can always find the current leader host even during leader shutdown events.
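A sketch of the reconnect behavior with illustrative names: the Placement DNS name is re-resolved on every attempt, so a stale record for a stopped leader is not reused.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// connectPlacement retries until one of the currently resolved Placement
// hosts accepts a connection, re-resolving DNS on every attempt.
func connectPlacement(ctx context.Context, dnsName string, dial func(addr string) error) error {
	for {
		addrs, err := net.DefaultResolver.LookupHost(ctx, dnsName)
		if err == nil {
			for _, addr := range addrs {
				if dial(addr) == nil {
					return nil
				}
			}
		}
		select {
		case <-time.After(time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// The service name below is illustrative.
	_ = connectPlacement(ctx, "dapr-placement-server.dapr-system.svc",
		func(addr string) error { fmt.Println("dialing", addr); return nil })
}
```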
Fix Kafka PubSub intermittently dropping messages while retrying
Problem
Kafka PubSub would intermittently drop messages while retrying during rebalance or any other event requiring re-initialization of the consumer.
Impact
Messages sent to Kafka PubSub could be lost during retries, leading to potential data loss and inconsistencies in message processing.
Root cause
The message processing loop continued in the consumer even after re-initialization, causing messages not to be processed when the consumer was not ready to handle them.
Solution
Correctly break out of the processing loop in the Kafka consumer in the event of rebalancing or other re-initialization events.
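A sketch of the consumer-loop shape this fix points at, using the common sarama consumer-group pattern (the library the Dapr Kafka component builds on): stop processing when the claim's message channel closes or the session context is cancelled, e.g. during a rebalance, instead of continuing with a consumer that can no longer handle messages. The handler body is illustrative, not the component's code.

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

type handler struct{}

var _ sarama.ConsumerGroupHandler = handler{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (handler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for {
		select {
		case msg, ok := <-claim.Messages():
			if !ok {
				// Channel closed: the claim is gone (rebalance); break out
				// rather than keep looping on a dead consumer.
				return nil
			}
			log.Printf("processing offset %d", msg.Offset)
			session.MarkMessage(msg, "")
		case <-session.Context().Done():
			// Session ending (rebalance / shutdown): stop so unacked messages
			// are redelivered instead of silently dropped.
			return nil
		}
	}
}

func main() {}
```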