Dapr 1.16.2
This update includes bug fixes:
- HTTP API default CORS behavior
- Scheduler External etcd with multiple client endpoints
- Placement not cleaning internal state after host that had actors disconnects
- Blocked Placement dissemination during high churn
- Blocked Placement dissemination with high Scheduler dataset
- Fix panic during actor deactivation
- OpenTelemetry environment variables support
- Fixing goavro bug due to codec state mutation
- APP_API_TOKEN not passed in gRPC metadata for app callbacks
- Fixed Pulsar OAuth token renewal
- Fix Scheduler connection during non-graceful network interruptions
- Prevent infinite loop when workflow state is corrupted or destroyed
HTTP API default CORS behavior
Problem
In the 1.16.0 release, a change to the default CORS behavior of the Dapr HTTP API was introduced: CORS headers were now added to all HTTP responses by default, and this new behavior could not be disabled.
Impact
This caused problems in scenarios where CORS is handled outside of the Dapr sidecar, because the sidecar always added CORS headers.
Solution
Revert part of the behavior introduced in 1.16.0 by changing the default value of the allowed-origins flag to an empty string, disabling the CORS filter by default.
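As a rough illustration (not the sidecar's actual code), a Go sketch of the restored default: the CORS filter is only wired in when allowed-origins has a value, and an empty value leaves responses untouched.

```go
package main

import (
	"log"
	"net/http"
)

// withCORS only adds CORS headers when allowedOrigins is non-empty; the empty
// default (restored in this release) leaves responses untouched. This is a
// hypothetical simplification of the sidecar's HTTP pipeline, not its code.
func withCORS(next http.Handler, allowedOrigins string) http.Handler {
	if allowedOrigins == "" {
		return next // CORS filter disabled by default
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Access-Control-Allow-Origin", allowedOrigins)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Empty allowed-origins: no CORS headers are added to any response.
	log.Fatal(http.ListenAndServe(":8080", withCORS(mux, "")))
}
```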
Scheduler External etcd with multiple client endpoints
Problem
Using the Scheduler in non-embedded mode with multiple etcd client endpoints did not work.
Impact
It was not possible to use multiple etcd endpoints for high availability with an external etcd database for scheduler.
Root Cause
The Scheduler etcd client endpoints CLI flag was typed as a string array rather than a string slice, so a comma-separated value was parsed as a single string instead of a slice of endpoints.
Solution
Changed the type of the etcd client endpoints CLI flag to be a string slice.
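For illustration, the difference between the two pflag flag types (the flag names below are made up, not the Scheduler's actual flag): a string array keeps a comma-separated value as a single element, while a string slice splits it into individual endpoints.

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

func main() {
	fs := pflag.NewFlagSet("scheduler", pflag.ExitOnError)

	// StringArray keeps each flag value as one string: a comma-separated
	// value stays a single element, which is how the endpoints were mis-parsed.
	asArray := fs.StringArray("endpoints-array", nil, "etcd client endpoints (array)")

	// StringSlice splits comma-separated values into individual elements,
	// which is the type the flag was changed to.
	asSlice := fs.StringSlice("endpoints-slice", nil, "etcd client endpoints (slice)")

	_ = fs.Parse([]string{
		"--endpoints-array=http://etcd-1:2379,http://etcd-2:2379",
		"--endpoints-slice=http://etcd-1:2379,http://etcd-2:2379",
	})

	fmt.Println(len(*asArray)) // 1 — the whole value treated as a single endpoint
	fmt.Println(len(*asSlice)) // 2 — each endpoint parsed separately
}
```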
Placement not cleaning internal state after host that had actors disconnects
Problem
An actor host that had hosted actors was not properly cleaned up from Placement after the sidecar was scaled down and the placement stream was closed.
Impact
This resulted in the Placement server iterating over namespaces that no longer exist on every tick of the dissemination ticker.
Root Cause
The function requiresUpdateInPlacementTables should not set isActorHost back to false once it has been set to true, because once a host has actors the Placement server keeps internal state for it, and cleanup logic must run when the host disconnects.
Solution
Update the logic in requiresUpdateInPlacementTables so that isActorHost is never reset once set, ensuring cleanup runs when the host disconnects.
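A heavily simplified sketch of the intended latch behavior; the real requiresUpdateInPlacementTables has a different signature and more inputs.

```go
package placement

// Illustrative only, not the actual Placement code: once a host has reported
// actor types, the flag stays true even if a later heartbeat reports none,
// so the cleanup logic still runs when the host disconnects.
func isActorHostAfterUpdate(wasActorHost bool, reportedActorTypes []string) bool {
	return wasActorHost || len(reportedActorTypes) > 0
}
```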
Blocked Placement dissemination during high churn
Problem
Placement would disseminate the actor table very slowly, or fail to disseminate it at all, in high daprd churn scenarios.
Impact
New actors or workflows would fail to activate, and existing actors or workflows would fail.
Root Cause
Placement used a small queue size (100) which, when exhausted, would cause a deadlock. Placement would also wait for the channel queue to be fully consumed before disseminating, slowing down the dissemination process.
Solution
Increase the queue size to 10000 and change the dissemination logic to not wait for a fully consumed queue before disseminating.
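A minimal sketch of the pattern with made-up event names, not the Placement implementation: a larger buffered channel absorbs bursts of membership changes, and the disseminator drains only what is currently queued on each tick instead of waiting for the queue to empty.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Raised from 100 in the fix; with a small buffer, producers block once
	// the queue fills up under high churn, stalling dissemination.
	const queueSize = 10000
	events := make(chan string, queueSize)

	// Producer: simulated host join/leave events during churn.
	go func() {
		for i := 0; ; i++ {
			events <- fmt.Sprintf("membership-change-%d", i)
			time.Sleep(time.Millisecond)
		}
	}()

	// Disseminator: on each tick, drain whatever is queued right now rather
	// than waiting for the queue to be fully consumed before disseminating.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for tick := 0; tick < 3; tick++ {
		<-ticker.C
		pending := len(events)
		for i := 0; i < pending; i++ {
			<-events
		}
		fmt.Printf("disseminating after draining %d queued events\n", pending)
	}
}
```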
Blocked Placement dissemination with high Scheduler dataset
Problem
Disseminations would hang for long periods of time when the Scheduler dataset was large.
Impact
Dissemination could take hours to complete, causing reminders not to be delivered for long periods of time.
Root Cause
The migration of reminders from the state store to the Scheduler performs a full decoded scan of the Scheduler database, which could take a long time when there were many entries. Dissemination was blocked for the duration of this scan.
Solution
Limit the maximum time spent doing the migration to 3 seconds.
Expose a new global.reminders.skipMigration="true" Helm chart value which skips the migration entirely.
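A sketch of the time-capping idea, using hypothetical function names rather than Dapr's actual migration API: the migration runs under a 3-second context deadline, and dissemination proceeds once the deadline is hit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// migrateReminders stands in for the state-store-to-Scheduler reminder
// migration; the name and signature are illustrative, not Dapr's API.
func migrateReminders(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second): // pretend the full scan is slow
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Cap the time spent on migration so dissemination is not blocked for long.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	if err := migrateReminders(ctx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("reminder migration timed out after 3s; continuing dissemination")
	}
}
```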
Fix panic during actor deactivation
Problem
Daprd could panic during actor deactivation.
Impact
Daprd sidecar would crash, resulting in downtime for the application.
Root Cause
A race between releasing and claiming the actor lock's cached memory meant a stale lock could be used during deactivation, closing it twice and causing a panic.
Solution
Tie the lock's lifecycle to the actor's lifecycle, so the lock is claimed together with the actor and only released when the actor is fully deactivated.
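A simplified sketch of the idea, not the actual actor runtime types: when the deactivation channel lives on the actor itself and is closed through a sync.Once, a stale reference can no longer close it twice.

```go
package actors

import "sync"

// Hypothetical simplification: the deactivation state lives on the actor, so
// a separately cached, stale lock can no longer be released a second time.
type actor struct {
	closeOnce sync.Once
	done      chan struct{}
}

func newActor() *actor {
	return &actor{done: make(chan struct{})}
}

// deactivate closes the actor exactly once, even under concurrent calls;
// closing an already-closed channel is what previously caused the panic.
func (a *actor) deactivate() {
	a.closeOnce.Do(func() {
		close(a.done)
	})
}
```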
OpenTelemetry environment variables support
Problem
OpenTelemetry OTEL_* environment variables were not fully respected, and dapr.io/env annotation parsing broke when values contained =.
Impact
OpenTelemetry resource attributes could not be reliably applied to the Dapr sidecar, degrading trace correlation with application containers, especially on Kubernetes. Configuring OTEL_RESOURCE_ATTRIBUTES via annotations did not work.
Root Cause
- Resource creation used manual logic instead of the OpenTelemetry SDK’s environment-based resource detection.
- The injector's environment variable parsing treated = as a hard delimiter, breaking values that include =.
Solution
- Adopt the OpenTelemetry SDK's env-based resource detection so OTEL_* variables (including OTEL_RESOURCE_ATTRIBUTES) are honored.
- Fix dapr.io/env parsing to allow values containing =.
- Keep the Dapr app ID as the default service name when not overridden.
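A minimal sketch of env-based resource detection with the OpenTelemetry Go SDK, using a placeholder app ID; Dapr's actual wiring differs, but the option ordering shows how a default service name can still be overridden by OTEL_* variables.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func main() {
	// The app ID ("my-app-id" is a placeholder) is the default service name;
	// WithFromEnv is applied after it, so OTEL_RESOURCE_ATTRIBUTES and
	// OTEL_SERVICE_NAME can override or extend the defaults.
	res, err := resource.New(context.Background(),
		resource.WithAttributes(semconv.ServiceNameKey.String("my-app-id")),
		resource.WithFromEnv(),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res.Attributes())
}
```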
Fixing goavro bug due to codec state mutation
Problem
The goavro library had a bug where the codec state was mutated during decoding, causing the decoder to panic.
Impact
The goavro library would panic, causing the application to crash.
Root Cause
The goavro library did not correctly handle the codec state, causing it to panic when the codec state was mutated during decoding.
Solution
Update the goavro library to v2.14.1 to fix the bug, and take a more defensive approach by restoring the previous behavior of always creating a new codec.
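A small example of the defensive pattern with goavro (the schema and values are illustrative): a fresh codec is created for each encode and decode, so no shared codec state can be mutated.

```go
package main

import (
	"fmt"
	"log"

	"github.com/linkedin/goavro/v2"
)

const schema = `{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}`

func main() {
	// Defensive pattern: build a fresh codec for each use instead of sharing
	// one whose internal state could be mutated during decoding.
	encCodec, err := goavro.NewCodec(schema)
	if err != nil {
		log.Fatal(err)
	}
	avroData, err := encCodec.BinaryFromNative(nil, map[string]interface{}{"name": "dapr"})
	if err != nil {
		log.Fatal(err)
	}

	decCodec, err := goavro.NewCodec(schema) // new codec, no shared mutable state
	if err != nil {
		log.Fatal(err)
	}
	native, _, err := decCodec.NativeFromBinary(avroData)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(native)
}
```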
APP_API_TOKEN not passed in gRPC metadata for app callbacks
Problem
When APP_API_TOKEN was configured, the token was not being passed in gRPC metadata for app callbacks including:
- PubSub subscriptions
- Bindings
- Jobs
This meant that applications using the gRPC protocol could not authenticate incoming requests from Dapr when using the app API token security feature.
Impact
Applications that configured APP_API_TOKEN to secure their endpoints could not validate that incoming gRPC requests were from their Dapr sidecar. This broke the app API token authentication feature for gRPC applications.
Root Cause
The gRPC subscription delivery, binding, and job callback code paths were directly calling the app's gRPC client without going through the channel layer abstraction. The channel layer is responsible for injecting the APP_API_TOKEN in the dapr-api-token metadata header, but these direct calls bypassed this mechanism.
Solution
Centralized the APP_API_TOKEN injection logic in a helper function (AddAppTokenToContext) in the gRPC channel layer. Updated all gRPC app callback code paths (pubsub subscriptions, bindings, and job callbacks) to use this helper, ensuring the token is consistently added to the outgoing gRPC context metadata. Added comprehensive integration tests to verify token passing for all callback scenarios in both HTTP and gRPC protocols.
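A simplified sketch of the helper's idea, with an illustrative name and token source rather than Dapr's exact implementation: the token is appended to the outgoing gRPC metadata under the dapr-api-token key before calling the app. Each gRPC app callback path would wrap its context this way before invoking the app client.

```go
package channel

import (
	"context"
	"os"

	"google.golang.org/grpc/metadata"
)

// addAppTokenToContext mirrors the idea of the helper described above: attach
// the app API token to the outgoing gRPC metadata so the application can
// authenticate calls from the sidecar. Names here are illustrative.
func addAppTokenToContext(ctx context.Context) context.Context {
	token := os.Getenv("APP_API_TOKEN")
	if token == "" {
		return ctx
	}
	return metadata.AppendToOutgoingContext(ctx, "dapr-api-token", token)
}
```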
Fixed Pulsar OAuth token renewal
Problem
The Pulsar pubsub component was not renewing the OAuth token when it expired.
Impact
Applications using the Pulsar pubsub component could not receive or publish messages once the OAuth token expired.
Root Cause
There was a bug in the component code that was preventing the OAuth token from being renewed when it expired.
Solution
Fixed the bug in the component code, ensuring the OAuth token is renewed when it expires, and added a test to verify the token renewal functionality. Fixed in https://github.com/dapr/components-contrib/pull/4079
Fix Scheduler connection during non-graceful network interruptions
Problem
A catastrophic failure of the Scheduler connection during a non-graceful network interruption would not cause the Dapr runtime to attempt to reconnect to the Scheduler.
Impact
A true host network interruption (e.g. unplugging the network cable) would cause the Dapr runtime to only recover its connection to the Scheduler after roughly 2 hours.
Root Cause
The gRPC KeepAlive parameters were not set correctly, causing the gRPC client to not detect broken connections in a timely manner.
Solution
The server and client KeepAlive parameters are now set to 3 second intervals with a 5 second timeout.
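A client-side sketch of the keepalive settings described above; the target address and credentials are placeholders, and the server side would set matching keepalive.ServerParameters.

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Probe the connection every 3 seconds and declare it dead if no ack
	// arrives within 5 seconds, so a non-graceful network interruption is
	// detected quickly instead of lingering for hours.
	conn, err := grpc.NewClient("scheduler:50006", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                3 * time.Second,
			Timeout:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```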
Prevent infinite loop when workflow state is corrupted or destroyed
Problem
Dapr workflows could enter an infinite reminder loop when the workflow state in the actor state store is corrupted or destroyed.
Impact
Dapr workflows would enter an infinite loop of reminder calls.
Root Cause
When a workflow reminder is triggered, the workflow state is loaded from the actor state store. If the state is corrupted or destroyed, the workflow would not be able to progress and would keep re-triggering the same reminder indefinitely.
Solution
Do not retry the reminder if the workflow state cannot be loaded, and instead log an error and exit the workflow execution.
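A sketch of the control flow only, with hypothetical function names rather than the actual workflow engine API: a failed state load is logged and the reminder is not retried.

```go
package workflow

import (
	"context"
	"log"
)

// handleReminder illustrates the shape of the fix (names are made up): if the
// workflow state cannot be loaded, log the error and return nil so the
// reminder is not retried, instead of returning an error that would
// re-trigger the same reminder indefinitely.
func handleReminder(ctx context.Context, loadState, execute func(context.Context) error) error {
	if err := loadState(ctx); err != nil {
		log.Printf("workflow state corrupted or missing, abandoning execution: %v", err)
		return nil // do not reschedule or retry the reminder
	}
	return execute(ctx)
}
```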