The Distributed Application Runtime (Dapr) provides APIs that simplify microservice architecture development and increase developer productivity. Whether your communication pattern is service-to-service invocation or pub/sub messaging, Dapr helps you write resilient and secured microservices....
This is the release candidate 1.17.7-rc.1
Dapr 1.17.7
This update contains the following bug fixes:
Workflow GetWorkItems gRPC stream torn down when the history payload exceeds the max body size
Problem
When the proto-encoded WorkItem that the orchestrator sends to a connected SDK worker on the GetWorkItems gRPC stream grew larger than the dapr API gRPC server's MaxSendMsgSize (which is the same as --max-body-size, default 4 MiB), the underlying stream.Send returned ResourceExhausted and the entire stream was cancelled.
Every other workflow that happened to be pending on the same stream was cancelled along with it, and the SDK reconnected only to repeat the same failure on the same offending workflow.
Impact
Any long-running workflow whose accumulated history (PastEvents + NewEvents + propagated history) crossed the configured --max-body-size could trigger a stream tear-down loop.
Visible symptoms included:
- The offending workflow appeared frozen with no diagnostic in its history.
- Other workflows that shared the worker's stream were repeatedly cancelled mid-execution and replayed.
- The SDK logged repeated reconnects to the dapr sidecar.
- An activity dispatched with a very large PropagatedHistory could exhibit the same tear-down on the activity work-item path.
Root Cause
Neither the orchestrator nor the durabletask gRPC executor measured the size of the WorkItem proto before pushing it onto the stream.
Once the message reached stream.Send, gRPC enforced MaxSendMsgSize and aborted the entire GetWorkItems server stream with ResourceExhausted.
Because the failure was at the transport layer, the runtime had no place to record a structured signal back to the user, and there was no terminal state for an orchestration that could not legally be dispatched.
Solution
The orchestrator now precomputes the proto size of the WorkItem it is about to dispatch and compares it to a 95% safety threshold of --max-body-size (the headroom covers the engine's WorkflowStarted event injection plus gRPC framing overhead).
If the threshold would be crossed:
- For a workflow dispatch, runWorkflow short-circuits before the work item is handed to the durabletask scheduler.
- For an activity dispatch, callActivity short-circuits before the activity actor is invoked, and the parent workflow is stalled.
Either path appends an ExecutionStalled event to the workflow's history with the new StalledReason value PAYLOAD_SIZE_EXCEEDED and transitions the workflow into the existing STALLED state.
The orchestrator's stallable lock is held until the actor is deactivated, so the next activation re-evaluates: if the operator has purged or terminated the workflow, or restarted daprd with a larger --max-body-size, the workflow resumes; otherwise it re-stalls without disturbing other instances on the stream.
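The threshold comparison described above can be sketched as follows. This is only an illustration, not the runtime's code: the name exceedsThreshold is hypothetical, the strict "greater than" comparison is an assumption, and the sketch assumes the encoded size has already been measured and that the limit is positive.

```go
package main

// safetyPercent mirrors the 95% safety threshold described in the notes.
const safetyPercent = 95

// exceedsThreshold reports whether a work item whose proto encoding is
// encodedSize bytes would cross 95% of maxBodySize (both in bytes).
// Hypothetical helper: the strict ">" comparison is an assumption.
func exceedsThreshold(encodedSize, maxBodySize int64) bool {
	threshold := maxBodySize * safetyPercent / 100
	return encodedSize > threshold
}
```

With the default 4 MiB limit (4194304 bytes), this sketch puts the threshold at 3984588 bytes, leaving roughly 205 KiB of headroom.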
Workflows orphaned or not purged after scheduler pod restart under load
Problem
When a scheduler pod is killed during workflow execution under load, some workflows become orphaned: they remain in RUNNING state with no further execution, or they reach a terminal state but are never purged despite a configured retention policy.
dapr workflow history shows nothing abnormal; execution simply stops.
dapr workflow list reports the affected completed workflows as much older than the configured retention window.
Impact
Any deployment running workflows with a multi-replica scheduler is affected when scheduler pods restart during load.
This is most visible during routine operations such as Kubernetes rolling updates, node drains, or OOM-driven scheduler restarts.
Root Cause
The actor state-store transaction that persists workflow state was not coordinated with the gRPC call that registers the corresponding wake-up reminder in the scheduler service.
These are two independent operations against two different systems with no atomic boundary between them.
When a scheduler pod was killed mid-RPC, the state save had typically completed and the reminder Create was lost.
The reminder failure policy retries an already-persisted reminder forever; it cannot recover a reminder whose Create RPC never reached durable storage.
For completed workflows, the retention path was particularly fragile: the workflow's firing reminder was deleted before the retention reminder was created.
If the retention Create then failed, no reminder remained to drive a retry, leaving the workflow terminal-but-not-purged.
Solution
Three changes close the loss windows:
- In-process retry on reminder creation.
Every reminder Create now retries with bounded exponential backoff (up to 60 seconds total) before returning to the caller.
Retries reuse the same reminder Name; the scheduler's overwrite-by-name semantics keep them idempotent.
A typical scheduler-pod failover completes in seconds, so the retry transparently heals the failure without surfacing it to the workflow.
- Retention reminder created before deletion.
In the completion path, the retention reminder is now registered before the workflow's own reminders are deleted.
If the retention Create still fails after the in-process retry, the firing reminder remains alive and its failure-policy retry brings execution back to the completion path.
- Idempotent retention recovery on re-fire.
When a reminder fires for a workflow whose state is already terminal but whose inbox is empty, the runtime now re-issues the retention reminder Create.
The retention reminder name is deterministic, so this is a safe overwrite rather than a duplicate.
This recovers workflows whose completion was persisted in a prior run but whose retention reminder Create was lost.
The retention reminder's due time is now anchored to the workflow's actual completion time rather than the moment of the Create call, so retries converge on a single reminder at a stable due time instead of pushing retention back on every retry.
Workflow inbox accumulates duplicate completion events under pod migration, driving an SDK spin loop
Problem
When the workflow actor on one pod was cancelled mid-flight (typically during a rolling deployment) after dispatching an activity but before its state save committed, the activity actor still completed normally and posted its TaskCompleted event back to the workflow actor's inbox.
On the next workflow activation, the orchestrator re-yielded the same ScheduleTask because its replay state did not yet reflect the dispatch, so the activity actor ran a second time and posted a second TaskCompleted for the same taskScheduledId.
The same shape applied to TaskFailed, TimerFired, and child-workflow completions delivered through the inbox.
The language SDK's process_event handlers for these event kinds silently return when no matching pending task is found, producing zero new actions, so dapr re-fired the wake-up reminder against the same un-cleared inbox and the cycle repeated.
Impact
Any deployment running workflows whose hosting pods are restarted during load is affected.
This is most visible during routine operations such as Kubernetes rolling updates or node drains.
Visible symptoms include:
- A workflow appears stuck in RUNNING while its persisted history grows steadily with full activity payloads.
- Sidecar logs show repeated dropping duplicate event: executionStarted warnings on the dapr side, paired with thousands of Ignoring unexpected taskCompleted event with ID = N warnings on the SDK side for the same instance.
- An activity executes more times than the workflow function calls it, because the activity actor re-runs each time the orchestrator re-yields the schedule.
Root Cause
Two layers were missing safeguards.
First, the workflow actor's addWorkflowEvent (the inbox-write boundary called by the activity actor and by sub-workflow completion delivery) did not deduplicate task-resolution events.
A redelivered completion was appended to the inbox, persisted, and a new wake-up reminder was created, even when the same resolution was already committed to history or queued in the inbox from an earlier delivery.
Second, the orchestrator's callActivities did not check whether the activity it was about to dispatch had already resolved.
When the orchestrator re-yielded a ScheduleTask because its replay state was missing the corresponding TaskScheduled (e.g. after a partial save was lost on cancellation), the activity actor was invoked again, ran the activity body again, and posted yet another TaskCompleted to the inbox.
The two layers compounded: the inbox grew because the dispatch produced new completions, the orchestrator re-ran because the inbox grew, and the SDK silently spun on the unmatched events.
Solution
Two complementary checks were added in the workflow actor, both backed by a shared dedup helper:
- Inbox-write dedup in addWorkflowEvent.
A TaskCompleted / TaskFailed / TimerFired / ChildWorkflowInstance{Completed,Failed} whose correlator (taskScheduledId or timerId) already appears in either state.History or state.Inbox is dropped before it reaches state.AddToInbox, the transactional save, and the new-event reminder.
EventRaised and ExecutionTerminated are intentionally excluded: EventRaised is a user signal that may legitimately repeat, and ExecutionTerminated is idempotent.
- Dispatch-skip in callActivities.
Before invoking the activity actor for a TaskScheduled, the workflow actor checks whether a matching TaskCompleted or TaskFailed for the same taskScheduledId is already in state.History or state.Inbox.
If it is, the dispatch is suppressed; the orchestrator's stale re-yield no longer triggers a second activity run.
The underlying engine in durabletask-go was hardened in lockstep: runtimestate.AddEvent now also rejects a resolution event whose correlator is already present, providing defence in depth for any caller that bypasses the actor's inbox-write path.
The Stalled-clear logic runs only on a successful add, so a duplicate-rejection error preserves a prior stalled state.
After upgrading, persisted histories from older daprd versions that already accumulated duplicates are silently truncated on next workflow load (the duplicate entries are not re-added to the in-memory OldEvents), so the upgrade is one-way for that state.
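The dedup rule can be illustrated with a small sketch. The types and names below (event, eventKind, isDuplicate) are hypothetical and much simpler than the real history events. Treating TaskCompleted and TaskFailed as one correlation class, separate from timers, is an assumption of this sketch.

```go
package main

type eventKind int

const (
	taskCompleted eventKind = iota
	taskFailed
	timerFired
	eventRaised // user signal: intentionally never deduplicated
)

// event is a simplified stand-in for a workflow history event.
type event struct {
	Kind         eventKind
	CorrelatorID int32 // taskScheduledId or timerId
}

// resolutionClass groups kinds that resolve the same pending item.
// It returns "" for kinds that are excluded from dedup.
func resolutionClass(k eventKind) string {
	switch k {
	case taskCompleted, taskFailed:
		return "task"
	case timerFired:
		return "timer"
	default:
		return ""
	}
}

// isDuplicate reports whether e resolves something that is already
// resolved in the history or already queued in the inbox.
func isDuplicate(e event, history, inbox []event) bool {
	class := resolutionClass(e.Kind)
	if class == "" {
		return false
	}
	for _, list := range [][]event{history, inbox} {
		for _, seen := range list {
			if resolutionClass(seen.Kind) == class && seen.CorrelatorID == e.CorrelatorID {
				return true
			}
		}
	}
	return false
}
```

The same predicate serves both checks in this sketch: the inbox write drops an event when it returns true, and the dispatch path skips the activity when a resolution for its taskScheduledId is already present.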
Sentry fails to start with "unsupported key type" when the issuer key is Ed25519 or RSA
Problem
Operators who downgraded a control plane from 1.18 back to 1.17 saw dapr-sentry crash on startup with:
fatal: error creating CA: failed to get CA bundle: failed to verify CA bundle: unsupported key type ed25519.PrivateKey
The same failure mode also rejected RSA-keyed issuer bundles. The crash is hit before sentry serves any traffic, so every sidecar that depends on sentry for its identity certificate stops being able to obtain or rotate one.
Impact
Any 1.17 control plane whose dapr-trust-bundle secret was generated by, or migrated through, a newer Dapr release that issues Ed25519 (or RSA) issuer keys is affected. In practice this includes:
- Downgrade from Dapr 1.18 to 1.17 against the same cluster.
- Existing 1.17 deployments where the issuer key was rotated or replaced with an Ed25519 / RSA key by the operator.
Sentry crash-loops, no new mTLS identities are issued, and existing certificates are not rotated. Sidecars whose certs have not yet expired keep working; sidecars that come up fresh, restart, or hit cert expiry start failing to obtain identities.
Root Cause
dapr/kit's crypto/pem.EncodePrivateKey (used by sentry to re-encode the issuer key it just decoded from the trust bundle) only matched *ecdsa.PrivateKey and *ed25519.PrivateKey in its type switch. ed25519.PrivateKey is itself a []byte alias rather than a struct, so the *ed25519.PrivateKey case never matched a real Ed25519 key. RSA private keys were never listed at all.
When sentry called EncodePrivateKey on an Ed25519 or RSA issuer key it fell through to the default branch and returned unsupported key type %T, which the CA initialiser surfaced as a fatal error.
Solution
dapr/kit's EncodePrivateKey now matches ed25519.PrivateKey (value form) and *rsa.PrivateKey alongside *ecdsa.PrivateKey. All three round-trip through PKCS#8 unchanged. Dapr 1.17.7 picks up this fix by bumping github.com/dapr/kit to v0.17.1, which also includes table-driven roundtrip tests for ECDSA P-256, RSA-2048, and Ed25519 to guard the regression.
No operator action is required beyond upgrading sentry to 1.17.7. Existing trust bundles are read as-is; the issuer key is not regenerated.
Kafka in-flight pub/sub messages abandoned during graceful shutdown
Problem
When a sidecar received SIGTERM, Kafka pub/sub subscriptions tore down their consumer group session before the messages already fetched from the broker had been delivered to the application.
The contrib retry loop observed context canceled, the runtime logged Too many failed attempts at processing Kafka message ... Error: context canceled, and the broker handed the same offsets to whichever consumer won the rebalance.
Impact
Any deployment running Kafka pub/sub through a multi-replica subscriber was affected on rolling restarts, node drains, or any other graceful-shutdown event.
Visible symptoms included:
- Repeated Too many failed attempts at processing Kafka message and kafka: tried to use a consumer group that was closed errors during shutdown.
- The application processed the same message twice across pods (once via a partial in-flight call that got cancelled, again after rebalance).
- Latency-sensitive workloads (e.g. financial transactions) experienced retry-driven tail latency on every pod restart.
Root Cause
The runtime's Subscription.Stop() set its closed flag immediately on entry, which caused the handler closure to reject any further deliveries from contrib with errors.New("subscription is closed").
Contrib treated that as an error and retried inside an already-closing session, eventually giving up and surrendering the partition to the rebalance.
The "in-flight" definition was also too narrow: only handlers already inside the closure counted, while messages that contrib had pulled from the broker but not yet handed to the handler were considered absent and got the rejection path.
Solution
A new pubsub.PausableSubscriber capability lets the runtime ask a component to stop fetching from the broker without tearing down the consumer group session.
On graceful shutdown the runtime now:
- Pauses the underlying component (Kafka's implementation calls Sarama's PauseAll, which stops broker fetches while keeping the session and partition assignments alive).
- Leaves closed=false during a bounded drain window so handlers continue delivering buffered messages to the application via postman.
- Polls an inflight counter with a stable-quiet predicate (100 ms of consecutive zero readings on the paused path) so the drain does not seal in the sub-millisecond gap between handler return and the next claim-buffer read.
- Caps the drain at 30 seconds so a misbehaving application that keeps returning RETRY cannot block StopAllSubscriptionsForever and prevent the block-shutdown timer from starting.
- On ceiling hit, force-cancels the subscription context so stuck handlers' HTTP/gRPC calls error out via context propagation rather than running indefinitely.
- Falls back to the previous close-first behavior for non-pausable components and non-graceful Stop calls.
The components-contrib Kafka component additionally gates consumerGroup.Close() on the last subscription exiting (so multi-topic pubsubs no longer race a sibling subscription's reload into a closed group) and demotes the Too many failed attempts log to debug when the cause is shutdown rather than real retry exhaustion.
Kafka bulk subscriber partial batches flushed early after a count-based flush
Problem
When a Kafka bulk subscriber's buffer filled to maxMessagesCount and was flushed before its maxAwaitDurationMs window had elapsed, the await ticker continued firing on its original schedule.
Any subsequent partial batch was then flushed within (often well under) one period of the count-based flush instead of waiting for a fresh maxAwaitDurationMs window from the moment the buffer was last drained.
Impact
Any deployment using Kafka bulk pub/sub subscriptions with both maxMessagesCount and maxAwaitDurationMs configured was affected.
Visible symptoms included:
- Partial batches delivered to the application much sooner than the configured maxAwaitDurationMs after a count-based flush.
- Effective batch sizes lower than expected during steady-state traffic, because the await window was shortened by however much of the original window had already elapsed before the count threshold was hit.
- Workloads tuned to amortize per-batch overhead (large bulk handlers, batched downstream writes) seeing more invocations than the configuration implied.
Root Cause
In ConsumeClaim, the bulk path used a single time.Ticker constructed from maxAwaitDurationMs to trigger time-based flushes.
When the count threshold (len(messages) >= maxMessagesCount) was reached and flushBulkMessages was called, the ticker was not reset.
The next tick still fired at its original wall-clock schedule, so a partial batch arriving just after a count-flush was eligible for flush after only the residual portion of the original ticker period rather than a full maxAwaitDurationMs.
Solution
After a count-based flush in ConsumeClaim, the await ticker is now reset to a fresh maxAwaitDurationMs window via ticker.Reset, anchoring the next time-based flush to the moment of the count-flush.
Go 1.23+ guarantees that Ticker.Reset discards any tick that was queued before the call, so no stale tick can fire immediately after the reset and short-circuit the new window.
Partial batches now consistently wait a full maxAwaitDurationMs after the most recent flush, regardless of whether that flush was triggered by the count threshold or the timer.