Dapr 1.17.3
This update contains bug fixes and security fixes:
- Actor method invocation returns 200 with empty body over h2c
- Security: Fixes gRPC authorization bypass - CVE-2026-33186
- Service invocation and actor responses forward stale Content-Length header
- Security: Fixes TIFF image OOM denial of service - CVE-2026-33809
- False positive injection failure metrics for non-Dapr pods
- Placement dissemination timeout cascades across all replicas
- Daprd placement reconnect hangs for 20 seconds on stale DNS
- Scheduler instance silently stops participating after cluster scale-up
- Windows sidecar container fails to start on AKS due to missing OSVersion in image manifest
Actor method invocation returns 200 with empty body over h2c
Problem
When using the h2c (HTTP/2 cleartext) app protocol, actor method invocations could return HTTP 200 with correct headers (including Content-Length) but an empty body.
Impact
Applications using --app-protocol h2c with actors could receive empty response bodies from actor method calls, despite the actor handler returning data. This caused silent data loss that was difficult to diagnose because the HTTP status code and headers appeared correct.
Root Cause
Dapr v1.17.2 introduced pipe-based streaming for service invocation response bodies to avoid buffering large payloads in memory. The response headers (including Content-Length) are captured when the pipe is ready, and the body streams lazily through an io.Pipe via io.Copy.
With HTTP/2, response body reads are tied to the request context. When the context is cancelled — either by the resiliency policy runner's defer cancel() after InvokeMethod returns, or by placement dissemination cancelling actor claims — the HTTP/2 stream is reset (RST_STREAM). The goroutine performing io.Copy from the HTTP/2 response body then fails, writing 0 bytes to the pipe. The pipe closes normally (EOF), and ProtoWithData() reads an empty body. The caller receives 200 OK with the original Content-Length header but no data.
HTTP/1.1 is unaffected because TCP buffer reads do not check the request context.
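The pipe behavior at the heart of this bug can be reproduced in miniature with the standard library. This is an illustrative sketch, not Dapr's code: copyToPipe and failingBody are invented names, and failingBody stands in for an HTTP/2 response body whose stream was reset.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// copyToPipe simulates the sidecar goroutine that streams the app's
// response body into an io.Pipe. When the body read fails, closing
// the pipe normally hands the reader a clean EOF and an empty body;
// closing with the error surfaces the failure instead.
func copyToPipe(body io.Reader, propagate bool) ([]byte, error) {
	pr, pw := io.Pipe()
	go func() {
		_, err := io.Copy(pw, body)
		if propagate {
			pw.CloseWithError(err) // CloseWithError(nil) behaves like Close()
		} else {
			pw.Close() // old behavior: the copy error is swallowed
		}
	}()
	return io.ReadAll(pr)
}

// failingBody mimics an HTTP/2 response body whose stream was reset.
type failingBody struct{}

func (failingBody) Read([]byte) (int, error) {
	return 0, errors.New("stream error: RST_STREAM")
}

func main() {
	data, err := copyToPipe(failingBody{}, false)
	fmt.Printf("old: %d bytes, err=%v\n", len(data), err) // 0 bytes, nil error
	data, err = copyToPipe(failingBody{}, true)
	fmt.Printf("new: %d bytes, err=%v\n", len(data), err) // error is visible
}
```

The silent-EOF path is exactly the "200 OK with empty body" symptom: the reader sees a successful, zero-length stream.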
Solution
Two changes in the HTTP app channel:
- Pipe error propagation: io.Copy errors are now captured and propagated through the pipe via pw.CloseWithError(err) instead of silently closing with EOF. If context cancellation causes the HTTP/2 body read to fail, callers receive an error rather than an empty body.
- h2c context detachment: For HTTP/2 (h2c) transports only, the HTTP request to the app now uses a context detached from the caller's context (context.WithoutCancel). This prevents resiliency timeout cancellation and placement dissemination from resetting the HTTP/2 stream while body data is in flight. The detached context is cancelled when the pipe reader is closed, preventing goroutine leaks. HTTP/1.1 behavior is unchanged.
Security: gRPC authorization bypass
Problem
An upstream dependency (google.golang.org/grpc) used by Dapr introduced a vulnerability that could allow gRPC authorization bypass under certain conditions (CVE-2026-33186).
Impact
Users running affected versions could be exposed to unauthorized gRPC requests.
Root Cause
The issue originated in an upstream library.
Solution
This release upgrades the affected dependency to a version that resolves CVE-2026-33186.
Users are strongly encouraged to upgrade to this release.
Service invocation and actor responses forward stale Content-Length header
Problem
When a Dapr sidecar forwarded HTTP responses from service invocation or actor method calls, the Content-Length header from the upstream application was passed through verbatim to the caller, even though the sidecar rebuilds the response body from an internal protobuf representation.
If the upstream application sent an incorrect Content-Length (or the body size changed during serialization), the caller received a Content-Length header that did not match the actual response body.
Impact
Callers reading the response body with standard HTTP clients (such as Go's io.ReadAll or Python's aiohttp) received unexpected EOF errors or truncated bodies when the forwarded Content-Length exceeded the actual body size.
When the forwarded Content-Length was smaller than the actual body, clients silently truncated the response.
This affected both service invocation (direct messaging) and actor method invocation via HTTP.
Root Cause
The InternalMetadataToHTTPHeader utility function converted all internal gRPC metadata headers to HTTP response headers, including Content-Length.
Since the sidecar reconstructs the response body from a protobuf message (not by proxying the original HTTP stream), the original Content-Length from the upstream application became stale.
Go's http.ResponseWriter honored this pre-set Content-Length header instead of computing the correct value from the data actually written.
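Go's handling of a pre-set Content-Length can be demonstrated with a toy upstream app. This is an illustrative sketch (fetchShortBody is an invented name, not Dapr code): the handler declares a Content-Length larger than the body it writes, and the client sees a short read.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchShortBody spins up a toy upstream app whose handler pre-sets a
// stale Content-Length. net/http honors the pre-set header verbatim
// instead of computing it from the bytes actually written, so the
// client's body read ends early with an error (typically
// io.ErrUnexpectedEOF) even though the received bytes are valid.
func fetchShortBody() ([]byte, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Length", "100") // stale: body is only 10 bytes
		w.Write([]byte("short body"))
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchShortBody()
	fmt.Printf("read %d bytes, err=%v\n", len(body), err)
}
```

Deleting the stale header before writing the response lets the server compute the correct value, which is what the fix does on both paths.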
Solution
Two sets of changes prevent stale Content-Length headers from propagating:
- Internal metadata path: Added Content-Length to the skip list in InternalMetadataToHTTPHeader, alongside Content-Type which was already skipped. This prevents the stale upstream Content-Length from being forwarded to the HTTP response writer. Go's http.ResponseWriter now computes the correct Content-Length automatically from the actual response body. Updated the HTTP channel's constructRequest to read Content-Length directly from internal metadata for outgoing requests, since it is no longer present in the forwarded HTTP headers.
- HTTP channel response path: The HTTP app channel now strips Content-Length from upstream response headers before forwarding them through the response pipe. If the upstream app declared a Content-Length larger than the actual body, the resulting io.ErrUnexpectedEOF from Go's HTTP client is treated as normal completion, since the received data is valid.
Security: TIFF image OOM denial of service
Problem
An upstream dependency (golang.org/x/image) used by Dapr contained a vulnerability that could cause an out-of-memory crash when decoding a maliciously crafted TIFF image (CVE-2026-33809).
Impact
A malicious 8-byte TIFF file with an IFD offset of 0xFFFFFFFF could cause golang.org/x/image/tiff.Decode to allocate up to ~4GB of memory, leading to an out-of-memory crash.
Any Dapr component or application path that processes untrusted TIFF image input through this library could be exploited for denial of service.
Root Cause
The issue originated in the upstream golang.org/x/image/tiff library.
The buffer.fill() function did not validate the IFD offset before allocating memory, allowing a crafted offset to trigger an unbounded allocation.
Solution
This release upgrades golang.org/x/image from v0.25.0 to v0.38.0, which resolves CVE-2026-33809.
False positive injection failure metrics for non-Dapr pods
Problem
When a pod without Dapr annotations (e.g. infrastructure pods like Vault or Nginx) was created by a service account not in the injector's allowed list, the injector logged an error (service account '...' not on the list of allowed controller accounts) and incremented the dapr_injector_sidecar_injection_failed_total metric, even though the pod was never meant to be Dapr-enabled.
Impact
Infrastructure and non-Dapr workloads deployed by service accounts not in the injector's allowedServiceAccounts list caused false-positive error logs and inflated the sidecar_injection_failed_total metric with reason="pod_patch". This triggered spurious alerts in monitoring systems for pods that had nothing to do with Dapr.
Root Cause
The injector checked service account authorization before checking whether the pod had the dapr.io/enabled annotation. Non-Dapr pods were rejected at the authorization step, producing error logs and failure metrics, instead of being silently allowed.
Solution
The injector now checks the dapr.io/enabled annotation before checking service account authorization. Pods without the annotation set to "true" are immediately allowed with no patch, skipping all injection and authorization logic. This ensures non-Dapr pods never produce error logs or increment failure metrics regardless of which service account creates them.
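The reordering can be sketched as a minimal admission function. The names pod, admit, and allowedServiceAccounts are assumptions for illustration, not the injector's actual types.

```go
package main

import "fmt"

type pod struct {
	annotations    map[string]string
	serviceAccount string
}

// admit sketches the fixed decision order: the dapr.io/enabled
// annotation is checked first, so pods that never asked for Dapr are
// allowed untouched regardless of which service account created them.
// Only Dapr-enabled pods reach the authorization check.
func admit(p pod, allowedServiceAccounts map[string]bool) (patch bool, err error) {
	if p.annotations["dapr.io/enabled"] != "true" {
		return false, nil // non-Dapr pod: no patch, no error, no metric
	}
	if !allowedServiceAccounts[p.serviceAccount] {
		return false, fmt.Errorf("service account %q not on the list of allowed controller accounts", p.serviceAccount)
	}
	return true, nil
}

func main() {
	allowed := map[string]bool{"replicaset-controller": true}

	// Infrastructure pod (e.g. Vault) from an unlisted account:
	// previously an error plus a failure metric; now silently allowed.
	patch, err := admit(pod{serviceAccount: "vault-sa"}, allowed)
	fmt.Println(patch, err)

	// A Dapr-enabled pod from an unlisted account still fails authorization.
	patch, err = admit(pod{annotations: map[string]string{"dapr.io/enabled": "true"}, serviceAccount: "vault-sa"}, allowed)
	fmt.Println(patch, err != nil)
}
```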
Placement dissemination timeout cascades across all replicas
Problem
When a single slow or unresponsive daprd sidecar fails to respond during placement table dissemination, the dissemination timeout disconnects all connected sidecars in the namespace, not just the slow one. This causes a cascading failure where healthy replicas lose their placement tables and must reconnect and re-disseminate.
During rolling updates, rapid sequential connect/disconnect cycles generate a "version storm" where the placement server churns through many dissemination versions faster than sidecars can process them, leading to repeated timeouts that affect all replicas.
Impact
Any deployment with multiple replicas using actors or workflows is affected. A single slow sidecar (due to GC pressure, network latency, or resource contention) can cause all replicas in the namespace to lose actor routing for the duration of the timeout cycle. During rolling updates, the version storm can prevent new replicas from receiving placement tables for extended periods, causing actor invocations to fail.
Root Cause
Three issues in the placement server's dissemination logic:
- Nuclear timeout: handleTimeout closed ALL streams on timeout, regardless of whether they had responded to the current dissemination phase. Healthy streams that had already acknowledged LOCK/UPDATE were disconnected alongside the slow one.
- No phase advancement on disconnect: When a stream disconnected during an active dissemination round, the streamsInTargetState counter was not adjusted. If the disconnected stream was the last one that hadn't responded, the round would stall until the timeout fired, even though all remaining streams had already responded.
- Orphaned store entries: During rolling updates, stream close events could arrive at the disseminator after the dissemination round had already completed, leaving stale host entries in the placement table.
Solution
- Selective timeout: handleTimeout now only closes streams that have NOT reached the current dissemination phase. Streams that responded successfully survive the timeout and participate in the next round. Waiting connections that were queued during the timed-out round are added to the new round instead of being cancelled.
- Phase advancement on disconnect: handleCloseStream now decrements streamsInTargetState when a counted stream disconnects and calls advancePhase() if removing the stream completes the current phase. This prevents rounds from stalling when a slow stream disconnects.
- Delete batching: Stream deletions during active dissemination are queued and processed when the round completes, combining multiple deletes into a single dissemination round. handleAdd also processes pending deletes before adding new streams, coalescing delete+add pairs during rolling updates.
- Orphan cleanup: After each dissemination round completes, the store is scanned for entries whose streams are no longer active. These orphaned entries are cleaned up in the next round.
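The selective-timeout idea can be sketched with assumed types (stream and handleTimeout are illustrative names, not the placement server's actual code): on timeout, only streams that have not acknowledged the current phase are closed, while acknowledged streams survive into the next round.

```go
package main

import "fmt"

// stream records the last dissemination phase a sidecar acknowledged.
type stream struct {
	id    int
	phase int
}

// handleTimeout partitions streams by whether they reached the
// current phase: those that responded are kept; only the slow ones
// are disconnected, so one laggard no longer takes down the namespace.
func handleTimeout(streams []stream, currentPhase int) (kept, closed []stream) {
	for _, s := range streams {
		if s.phase >= currentPhase {
			kept = append(kept, s) // acknowledged LOCK/UPDATE in time
		} else {
			closed = append(closed, s) // slow: disconnect only this one
		}
	}
	return kept, closed
}

func main() {
	// Streams 1 and 2 acknowledged phase 2; stream 3 never did.
	streams := []stream{{1, 2}, {2, 2}, {3, 1}}
	kept, closed := handleTimeout(streams, 2)
	fmt.Println(len(kept), "kept,", len(closed), "closed") // 2 kept, 1 closed
}
```

Under the old "nuclear" behavior, all three streams would have been closed, forcing healthy replicas to reconnect and re-disseminate.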
Daprd placement reconnect hangs for 20 seconds on stale DNS
Problem
When a placement server pod is restarted (due to rolling update, eviction, or crash), daprd sidecars that were connected to the old pod attempt to reconnect using cached DNS entries. The cached IP address points to the terminated pod, and each connection attempt hangs for 20 seconds (TCP dial timeout) before trying the next address.
Impact
With a 3-node placement cluster, daprd can try up to 3 stale IPs sequentially, causing a 60-second reconnect delay in the worst case. During this time, actor invocations and workflow operations fail because the sidecar has no placement table.
Root Cause
The DNS round-robin connector cached the IP addresses from the initial DNS lookup and only re-resolved when the cache was exhausted (all IPs tried). After a placement pod restart, the old IP remained in the cache. Since gRPC dials are lazy (they succeed immediately and fail on the first RPC), the stale IP was not detected until the sidecar tried to open a placement stream, which then hung for the full TCP dial timeout.
Solution
The DNS connector now re-resolves DNS on every Connect call instead of caching IPs across calls.
Kubernetes headless service DNS always returns the current set of pod IPs, so each reconnect attempt immediately gets fresh addresses.
The round-robin index is preserved across lookups to maintain even distribution.
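The fixed connector behavior can be sketched as follows, with assumed names (connector, resolve): DNS is re-resolved on every Connect, and only the round-robin index survives between calls.

```go
package main

import "fmt"

// connector sketches the fixed DNS round-robin behavior: resolve is
// invoked on every Connect (e.g. a headless-service DNS lookup), so a
// restarted pod's stale IP disappears as soon as DNS reflects the new
// endpoints. Only the round-robin index persists across lookups, to
// keep connections evenly distributed.
type connector struct {
	resolve func() []string
	next    int
}

func (c *connector) Connect() string {
	addrs := c.resolve() // fresh lookup on every attempt
	addr := addrs[c.next%len(addrs)]
	c.next++
	return addr
}

func main() {
	addrs := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	c := &connector{resolve: func() []string { return addrs }}
	fmt.Println(c.Connect(), c.Connect()) // 10.0.0.1 10.0.0.2

	// A placement pod restarts: DNS now returns a new IP for pod 3.
	addrs = []string{"10.0.0.1", "10.0.0.2", "10.0.0.9"}
	fmt.Println(c.Connect()) // 10.0.0.9: no stale dial, no 20s hang
}
```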
Scheduler instance silently stops participating after cluster scale-up
Problem
After a Scheduler pod restart, rolling update, or cluster scale-up event, one or more Scheduler instances can silently stop participating in the cluster. The affected instance's pod remains running and passes health checks, but its cron engine exits and never restarts. The instance stops publishing its address to the WatchHosts API, so daprd sidecars never discover it. Jobs, Actor Reminders or Workflows assigned to the affected instance's partitions are never triggered.
From a user's perspective, workflows, scheduled jobs, and actor reminders randomly stop firing after a Scheduler pod restart. The issue is intermittent and depends on the exact timing of the restart relative to other cluster activity.
Impact
Any Dapr deployment running the Scheduler in a multi-instance (HA) configuration is affected. The issue is triggered when a Scheduler instance joins or rejoins the cluster, causing a leadership quorum change (e.g., partition count changes from 2 to 3). The likelihood increases with the number of connected daprd sidecars, as more WatchHosts subscribers means the host broadcast takes longer, widening the race window.
When the bug is triggered:
- The affected Scheduler instance owns partitions but cannot deliver jobs on them. All workflow activity timers, scheduled jobs, and actor reminders assigned to those partitions stop firing and are logged as UNDELIVERABLE.
- Daprd sidecars that call WatchHosts may receive an incomplete host list (missing the affected instance), or may never receive a response at all, leaving the sidecar stuck on the scheduler-watch-hosts readiness gate indefinitely and preventing the application from becoming ready.
- The remaining healthy instances cannot take over the affected instance's partitions because its leadership key remains in etcd (the lease keep-alive continues independently). The cluster is stuck in a state where all instances are running but quorum can never converge on the correct partition count.
The only recovery is to restart all Scheduler pods simultaneously, which forces a fresh leadership election.
Root Cause
When the Scheduler's internal cron module reaches leadership quorum after a partition change, it calls runEngine to start the cron engine.
Before starting the engine, runEngine sends the updated cluster host addresses to an internal unbuffered Go channel (WatchLeadership) so that the WatchHosts API can broadcast them to connected sidecars.
If the channel consumer is busy (for example, broadcasting a previous host update to many WatchHosts subscribers), the send blocks. Meanwhile, if another Scheduler instance joins and causes a second quorum change, the elected context is cancelled while the send is still blocked. The cancellation aborts runEngine before the cron engine ever starts, and the cron module exits silently without restarting.
Because the cron module has exited, it never calls Reelect to update its leadership key with the new partition total.
The other Scheduler instances see this stale key and cannot reach quorum agreement, preventing the entire cluster from converging.
Solution
The internal channel consumer in the Scheduler's cron wrapper has been replaced with a non-blocking event loop (events/loop).
The loop's Enqueue method never blocks.
If the current segment is full, it allocates a new one.
This means the channel send from the cron library always completes immediately, regardless of how busy the consumer is with broadcasting.
Since the send no longer blocks, the context cancellation race that caused the silent exit can no longer occur.
The cron loop continues to call Reelect after each quorum change, leadership keys are updated with the correct partition total, and all instances converge normally.
Windows sidecar container fails to start on AKS due to missing OSVersion in image manifest
Problem
Starting with Dapr v1.16.9, the Dapr sidecar container (daprd) fails to start on AKS Windows nodes with the error:
hcs::CreateComputeSystem daprd: The container operating system does not match the host operating system.
Impact
All Windows-based Dapr sidecar deployments on AKS are broken from v1.16.9 onward. The daprd container enters CrashLoopBackOff and never starts. Linux deployments are unaffected.
Root Cause
In v1.16.9, the docker-manifest-create target in docker/docker.mk was changed from using docker manifest create / docker manifest push to docker buildx imagetools create.
The docker manifest commands automatically read os.version from each source image's config and include it in the manifest list entries.
The docker buildx imagetools create command does not carry the os.version field through to the manifest list.
Without os.version on the Windows manifest entries, the Windows container runtime cannot distinguish between the two windows/amd64 images (Server 2019 and Server 2022) and pulls the wrong variant for the host OS build.
Solution
Reverted the docker-manifest-create target to use docker manifest create and docker manifest push, restoring the os.version field in the manifest list entries for Windows images.