Dapr 1.16.11
This update includes a Go version bump and critical bug fixes:
- Go version updated to 1.25.8
- Scheduler crashes with "catastrophic state machine error" during leadership changes
- Scheduler instance silently stops participating after cluster scale-up
- Windows sidecar container fails to start on AKS due to missing OSVersion in image manifest
Go version updated to 1.25.8
Problem
Dapr 1.16.10 was built with Go 1.25.7.
Go 1.25.8 includes security fixes to the html/template, net/url, and os packages, as well as bug fixes to the go command, the compiler, and the os package.
Impact
Users running Dapr built with Go 1.25.7 may be exposed to known vulnerabilities that have been patched in Go 1.25.8.
Solution
Updated the Go version from 1.25.7 to 1.25.8.
Scheduler crashes with "catastrophic state machine error" during leadership changes
Problem
The Dapr Scheduler process crashes with a fatal error during leadership quorum changes under high job throughput:
level=fatal msg="Fatal error running scheduler: catastrophic state machine error: lost inner loop reference"
Once a Scheduler instance crashes, the remaining instances experience further quorum instability, which can trigger the same crash on those instances as well, leading to a cascading failure across the entire Scheduler cluster. All workflow executions, scheduled jobs, and actor reminders stop firing until the Scheduler cluster recovers.
Impact
Any Dapr deployment running Scheduler in a multi-instance (HA) configuration is affected. The crash is triggered when a Scheduler instance restarts, whether due to a Kubernetes pod eviction, OOM kill, node maintenance, rolling update, or transient etcd connectivity issue, while the cluster is under high job throughput. This is especially severe for deployments using Dapr Workflows, as the Scheduler is responsible for dispatching all workflow activity and orchestration timers.
Affected versions: v1.16.0 through v1.16.10.
Root Cause
The Scheduler's internal cron engine uses a router that manages per-job counter loops.
When a job completes or is deleted, the counter loop emits a CloseJob event to the router, which removes the counter from its internal map.
During a leadership quorum change (e.g., partition count changes from 3 to 2 when a Scheduler instance goes down), the cron engine restarts.
In the narrow window between the old engine shutting down and the new engine starting up, stale CloseJob events from the old engine's counter loops can arrive at the new engine's router.
Because the new router starts with an empty counter map, these stale events reference counters that do not exist.
Previously, this condition returned a fatal error ("catastrophic state machine error: lost inner loop reference"), which propagated through the router's event loop, cancelled the router's parent context, and terminated the entire Scheduler process.
Solution
The router now treats a CloseJob event for a missing counter as a benign no-op: the stale event is dropped with a debug-level log message and the router continues processing normally.
This is safe because:
- The counter was already cleaned up — there is nothing to close.
- The event is from a previous engine lifecycle and is no longer relevant.
- All other `CloseJob` code paths (counter exists, counter reused with new index) are unaffected.
This fix is in the go-etcd-cron dependency ([diagridio/go-etcd-cron#127](https://github.com/diagridio/go-etcd-cron/issues/127)).
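The behavior change can be sketched as follows. This is a minimal illustration of the fix, not the actual go-etcd-cron code: `router`, `closeJobEvent`, and `handleCloseJob` are hypothetical names standing in for the real types.

```go
package main

import "fmt"

// closeJobEvent stands in for the event a counter loop emits when a job
// completes or is deleted.
type closeJobEvent struct{ jobName string }

// router sketches the per-job counter map described above.
type router struct {
	counters map[string]struct{}
}

// handleCloseJob drops a CloseJob for an unknown counter as a benign
// no-op instead of returning a fatal error, mirroring the fix.
func (r *router) handleCloseJob(ev closeJobEvent) error {
	if _, ok := r.counters[ev.jobName]; !ok {
		// Stale event from a previous engine lifecycle: the counter was
		// already cleaned up, so there is nothing to close.
		fmt.Printf("debug: dropping CloseJob for unknown counter %q\n", ev.jobName)
		return nil
	}
	delete(r.counters, ev.jobName)
	return nil
}

func main() {
	r := &router{counters: map[string]struct{}{"job-a": {}}}
	// Known counter: removed normally.
	_ = r.handleCloseJob(closeJobEvent{jobName: "job-a"})
	// Stale event targeting the new router's empty map: no-op, no crash.
	_ = r.handleCloseJob(closeJobEvent{jobName: "job-from-old-engine"})
	fmt.Println("router still running, counters left:", len(r.counters))
}
```

Previously the missing-counter branch returned the fatal "lost inner loop reference" error, which tore down the whole process; returning nil keeps the event loop alive.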
Scheduler instance silently stops participating after cluster scale-up
Problem
After a Scheduler pod restart, rolling update, or cluster scale-up event, one or more Scheduler instances can silently stop participating in the cluster. The affected instance's pod remains running and passes health checks, but its cron engine exits and never restarts. The instance stops publishing its address to the WatchHosts API, so daprd sidecars never discover it. Jobs, actor reminders, and workflows assigned to the affected instance's partitions are never triggered.
From a user's perspective, workflows, scheduled jobs, and actor reminders randomly stop firing after a Scheduler pod restart. The issue is intermittent and depends on the exact timing of the restart relative to other cluster activity.
Impact
Any Dapr deployment running the Scheduler in a multi-instance (HA) configuration is affected. The issue is triggered when a Scheduler instance joins or rejoins the cluster, causing a leadership quorum change (e.g., partition count changes from 2 to 3). The likelihood increases with the number of connected daprd sidecars: more WatchHosts subscribers mean the host broadcast takes longer, widening the race window.
When the bug is triggered:
- The affected Scheduler instance owns partitions but cannot deliver jobs on them. All workflow activity timers, scheduled jobs, and actor reminders assigned to those partitions stop firing and are logged as `UNDELIVERABLE`.
- Daprd sidecars that call `WatchHosts` may receive an incomplete host list (missing the affected instance), or may never receive a response at all, leaving the sidecar stuck on the `scheduler-watch-hosts` readiness gate indefinitely and preventing the application from becoming ready.
- The remaining healthy instances cannot take over the affected instance's partitions because its leadership key remains in etcd (the lease keep-alive continues independently). The cluster is stuck in a state where all instances are running but quorum can never converge on the correct partition count.
The only recovery is to restart all Scheduler pods simultaneously, which forces a fresh leadership election.
Root Cause
When the Scheduler's internal cron module reaches leadership quorum after a partition change, it calls runEngine to start the cron engine.
Before starting the engine, runEngine sends the updated cluster host addresses to an internal unbuffered Go channel (WatchLeadership) so that the WatchHosts API can broadcast them to connected sidecars.
If the channel consumer is busy (for example, broadcasting a previous host update to many WatchHosts subscribers), the send blocks. Meanwhile, if another Scheduler instance joins and causes a second quorum change, the elected context is cancelled while the send is still blocked. The blocked send is then aborted, runEngine returns without starting the engine, and the cron module exits.
Because the cron module has exited, it never calls Reelect to update its leadership key with the new partition total.
The other Scheduler instances see this stale key and cannot reach quorum agreement, preventing the entire cluster from converging.
Solution
The internal channel consumer in the Scheduler's cron wrapper has been replaced with a non-blocking event loop (events/loop).
The loop's Enqueue method never blocks: if the current queue segment is full, it allocates a new one rather than waiting for the consumer.
This means the channel send from the cron library always completes immediately, regardless of how busy the consumer is with broadcasting.
Since the send no longer blocks, the context cancellation race that caused the silent exit can no longer occur.
The cron loop continues to call Reelect after each quorum change, leadership keys are updated with the correct partition total, and all instances converge normally.
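The non-blocking consumer pattern can be sketched as follows. This is a minimal illustration of the idea, assuming hypothetical names (`eventLoop`, `Enqueue`); the actual events/loop implementation differs in detail.

```go
package main

import (
	"fmt"
	"sync"
)

// eventLoop sketches a non-blocking consumer: Enqueue appends to a
// growable queue under a mutex and returns immediately, while a single
// goroutine drains the queue and runs the (possibly slow) handler.
type eventLoop struct {
	mu      sync.Mutex
	queue   [][]string    // pending host updates
	wake    chan struct{} // capacity 1: "there is work" signal
	handler func([]string)
}

func newEventLoop(handler func([]string)) *eventLoop {
	l := &eventLoop{wake: make(chan struct{}, 1), handler: handler}
	go l.run()
	return l
}

// Enqueue never blocks: the producer (the cron library) always completes
// immediately, no matter how busy the consumer is with broadcasting.
func (l *eventLoop) Enqueue(hosts []string) {
	l.mu.Lock()
	l.queue = append(l.queue, hosts)
	l.mu.Unlock()
	select {
	case l.wake <- struct{}{}:
	default: // consumer already signalled
	}
}

func (l *eventLoop) run() {
	for range l.wake {
		for {
			l.mu.Lock()
			if len(l.queue) == 0 {
				l.mu.Unlock()
				break
			}
			next := l.queue[0]
			l.queue = l.queue[1:]
			l.mu.Unlock()
			l.handler(next) // slow broadcast no longer blocks producers
		}
	}
}

func main() {
	done := make(chan struct{})
	loop := newEventLoop(func(hosts []string) {
		fmt.Println("broadcasting", hosts)
		if len(hosts) == 3 {
			close(done)
		}
	})
	// Three rapid quorum changes: none of these calls can block.
	loop.Enqueue([]string{"a"})
	loop.Enqueue([]string{"a", "b"})
	loop.Enqueue([]string{"a", "b", "c"})
	<-done
}
```

Since the producer side can no longer block on a busy consumer, the cancellation race described in the root cause has no window in which to occur.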
Windows sidecar container fails to start on AKS due to missing OSVersion in image manifest
Problem
Starting with Dapr v1.16.9, the Dapr sidecar container (daprd) fails to start on AKS Windows nodes with the error:
hcs::CreateComputeSystem daprd: The container operating system does not match the host operating system.
Impact
All Windows-based Dapr sidecar deployments on AKS are broken from v1.16.9 onward. The daprd container enters CrashLoopBackOff and never starts. Linux deployments are unaffected.
Root Cause
In v1.16.9, the docker-manifest-create target in docker/docker.mk was changed from using docker manifest create / docker manifest push to docker buildx imagetools create.
The docker manifest commands automatically read os.version from each source image's config and include it in the manifest list entries. The docker buildx imagetools create command does not carry the os.version field through to the manifest list.
Without os.version on the Windows manifest entries, the Windows container runtime cannot distinguish between the two windows/amd64 images (Server 2019 and Server 2022) and pulls the wrong variant for the host OS build.
Solution
Reverted the docker-manifest-create target to use docker manifest create and docker manifest push, restoring the os.version field in the manifest list entries for Windows images.