Timeouts#
How to configure per-actor timeouts in your AsyncActor specs to prevent runaway executions.
What is actorTimeout?#
actorTimeout is the maximum wall-clock time your actor has to process a single message. If the runtime doesn't respond within this duration, the sidecar cancels the message and routes it to the error queue (x-sump).
Set it in the AsyncActor spec under resiliency.actorTimeout:
```yaml
apiVersion: asya.sh/v1alpha1
kind: AsyncActor
metadata:
  name: llm-inference
  namespace: prod
spec:
  actor: llm-inference
  resiliency:
    actorTimeout: 5m  # 5 minutes
```
Format: Duration string (`30s`, `2m`, `1h`). Default: `5m` (the sidecar default when not specified in the AsyncActor spec).
What happens when a timeout fires?#
When the sidecar's deadline expires before the runtime responds:
- Runtime is cancelled — the sidecar stops waiting for a response
- Envelope routed to x-sump — the error queue receives the message with a timeout error
- Pod crashes — the sidecar exits with status code 1 to prevent zombie processing
The pod crash forces Kubernetes to restart the container, ensuring a clean slate for the next message. This prevents scenarios where the runtime continues executing after the sidecar has given up.
Log entry (sidecar):
```
[ERROR] Runtime timeout: context deadline exceeded
[INFO] Routing to x-sump: timeout_error
[FATAL] Crashing pod to prevent zombie processing
```
Timeout vs SLA deadline#
Asya has two timeout mechanisms:
| Type | Scope | Configured in | Enforced by | On expiry |
|---|---|---|---|---|
| actorTimeout | Single actor call | AsyncActor `resiliency.actorTimeout` | Sidecar | Routes to x-sump, crashes pod |
| SLA deadline | Entire pipeline | Gateway `status.deadline_at` | Sidecar | Routes to x-sink with `phase=failed`, `reason=Timeout` |
actorTimeout is per-actor: each actor in a multi-actor pipeline gets the full timeout budget.
SLA deadline is pipeline-wide: the gateway sets status.deadline_at when creating the envelope; each actor checks it before calling the runtime. If the deadline has passed, the sidecar routes the envelope directly to x-sink without calling the runtime.
Example: 3-actor pipeline#
```yaml
# AsyncActor: actor-a
resiliency:
  actorTimeout: 2m

# AsyncActor: actor-b
resiliency:
  actorTimeout: 5m

# AsyncActor: actor-c
resiliency:
  actorTimeout: 1m
```
With a 10-minute SLA set by the gateway:
- actor-a has up to 2 minutes to process
- actor-b has up to 5 minutes to process
- actor-c has up to 1 minute to process
- If the cumulative time exceeds 10 minutes, remaining actors see the SLA expired and route to x-sink
Effective timeout calculation#
The sidecar enforces the minimum of:
- `actorTimeout` from the AsyncActor spec
- Remaining time until the SLA deadline (if set)

```
effective_timeout = min(actorTimeout, remaining_SLA_time)
```
If the SLA deadline is 30 seconds away but actorTimeout is 5 minutes, the runtime has only 30 seconds.
Log entry (sidecar):
```
[DEBUG] Computed timeout: actor=5m0s, remaining_SLA=42s, effective=42s
```
Setting actorTimeout#
Short timeouts for fast actors#
For actors that should complete quickly (e.g., data validation, routing logic):
```yaml
resiliency:
  actorTimeout: 30s
```
Benefits:
- Fail fast on unexpected hangs
- Free up queue consumers quickly
- Prevent resource waste
Long timeouts for AI workloads#
For actors that call LLM APIs or run heavy inference:
```yaml
resiliency:
  actorTimeout: 10m
```
Benefits:
- Allow sufficient time for model initialization
- Accommodate slow streaming responses
- Handle bursty LLM API latency
Unlimited timeout (not recommended)#
Omit resiliency.actorTimeout entirely:
```yaml
resiliency: {}
```
The actor has unlimited time unless an SLA deadline is set. Discouraged — without a timeout, a hung runtime can block a queue consumer indefinitely.
Best practices#
1. Always set a timeout#
Even if you expect an actor to complete in seconds, set a generous timeout (e.g., 5 minutes) to catch unexpected hangs.
```yaml
resiliency:
  actorTimeout: 5m
```
2. Align timeout with workload#
Match the timeout to the actor's expected latency:
| Actor type | Typical timeout |
|---|---|
| Data validation, routing | 30s - 1m |
| Database queries, API calls | 1m - 3m |
| LLM inference (streaming) | 3m - 10m |
| Batch processing, model training | 10m - 1h |
3. Add headroom for variability#
Set the timeout to 2-3x the expected p95 latency to accommodate:
- LLM API rate limits and retries
- Slow model initialization on cold starts
- Bursty network latency

```yaml
# Expected p95: 2 minutes
# Set timeout: 5 minutes (2.5x headroom)
resiliency:
  actorTimeout: 5m
```
4. Monitor timeout metrics#
The sidecar exposes Prometheus metrics for timeout events:
```
asya_actor_runtime_errors_total{error_type="timeout"}
```
If timeouts are frequent, either:
- Increase the timeout (if the workload is legitimately slow)
- Investigate why the runtime is hanging (if timeouts are unexpected)
Debugging timeout issues#
Symptoms of timeout problems#
- Frequent pod restarts — Kubernetes CrashLoopBackOff due to sidecar crashes
- Messages stuck in x-sump — timeout errors accumulating in the error queue
- Progress stops mid-pipeline — later actors never receive envelopes
Common causes#
| Cause | Solution |
|---|---|
| Timeout too short for workload | Increase actorTimeout to match expected latency |
| Runtime code hangs (infinite loop, deadlock) | Add defensive timeouts in user code; review logic |
| LLM API rate limit / slow endpoint | Implement backoff in handler; increase timeout |
| Model initialization too slow | Cache model in memory (class handler with __init__) |
| SLA deadline too tight | Increase SLA at gateway or reduce pipeline depth |
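The "cache model in memory" fix from the table above can be sketched as a class-based handler. The handler interface shown here is illustrative, not Asya's actual runtime API, and the model loader is a stand-in:

```python
class InferenceHandler:
    """Illustrative class-based handler.

    Load the model once in __init__ so cold-start initialization runs
    at pod start instead of eating into the per-message actorTimeout
    budget.
    """

    def __init__(self):
        # Expensive one-time setup happens here, not per message.
        self.model = self._load_model()

    def _load_model(self):
        # Stand-in for loading a real model from disk or a registry.
        return lambda prompt: f"echo: {prompt}"

    def __call__(self, payload: dict) -> dict:
        # Per-message work only; stays well inside actorTimeout.
        return {"result": self.model(payload["prompt"])}

handler = InferenceHandler()
print(handler({"prompt": "hi"}))  # {'result': 'echo: hi'}
```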
Inspecting timeout logs#
Sidecar logs (where the timeout is enforced):
```
kubectl logs -n prod deployment/llm-inference -c asya-sidecar
```
Look for:
```
[ERROR] Runtime timeout: context deadline exceeded
[INFO] Routing to x-sump: timeout_error
```
Runtime logs (what the handler was doing):
```
kubectl logs -n prod deployment/llm-inference -c asya-runtime
```
Look for the last log entry before the timeout — that's where the runtime was stuck.
Timeout propagation in multi-actor pipelines#
Timeouts do NOT propagate across actors. Each actor enforces its own actorTimeout independently.
Example:
```yaml
# actor-a
resiliency:
  actorTimeout: 2m

# actor-b
resiliency:
  actorTimeout: 5m
```
If actor-a takes 1m 50s and routes to actor-b, actor-b gets the full 5 minutes — not the remaining 10 seconds from actor-a's timeout.
However, the SLA deadline DOES propagate:
- Gateway sets `status.deadline_at` when creating the envelope
- Each actor checks the deadline before calling the runtime
- If the deadline has passed, the actor routes to x-sink (`phase=failed`, `reason=Timeout`)
This ensures the entire pipeline respects the SLA even if individual actors have generous timeouts.
Timeout vs retry policies#
actorTimeout controls when the sidecar gives up on a single message.
Runtime timeouts are not retried. When actorTimeout fires, the sidecar sends the envelope to x-sump and crashes the pod (os.Exit(1)) to prevent zombie processing. The pod restart ensures a clean state, but the timed-out message is terminal.
Retry policies (configured via resiliency.policies and resiliency.rules) apply to handler errors (exceptions raised by user code), not to runtime timeouts. For example:
```yaml
resiliency:
  actorTimeout: 5m
  policies:
    llm_retry:
      maxAttempts: 3
      initialDelay: 10s
      backoff: exponential
  rules:
    - errors: ["TimeoutError"]
      policy: llm_retry
```
With this configuration:
1. Sidecar waits up to 5 minutes for the runtime to respond
2. If the handler raises a TimeoutError (e.g., from an HTTP client), the retry policy matches and re-enqueues
3. If the sidecar's own deadline fires (runtime does not respond at all), the envelope goes to x-sump and the pod crashes -- no retry
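The distinction hinges on where the timeout surfaces. This hypothetical handler (the `attempt` parameter and the simulated failure are purely illustrative) shows the retryable case: a `TimeoutError` raised by user code is a handler error, so the `llm_retry` rule applies.

```python
def handler(payload: dict, *, attempt: int = 1) -> dict:
    """Illustrative handler for the retryable case.

    Raising TimeoutError from user code (e.g. an HTTP client's read
    timeout) is a handler error, so the llm_retry rule above matches
    and re-enqueues (up to maxAttempts=3 with exponential backoff).
    If the handler instead hung silently past the 5m actorTimeout,
    the sidecar would route the envelope to x-sump and crash the
    pod, with no retry.
    """
    if attempt < 3:
        # Simulate a slow upstream: the client-side timeout fires.
        raise TimeoutError("upstream LLM API timed out")
    return {"result": "ok on attempt 3"}
```

A retry loop driven by the policy would call the handler again after each `TimeoutError` until it succeeds or `maxAttempts` is exhausted.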
See the retry policy documentation for details on configuring error-specific retry behavior.
See also#
| Topic | Document |
|---|---|
| Sidecar timeout enforcement | docs/reference/components/core-sidecar.md |
| Runtime behavior on timeout | docs/reference/components/core-runtime.md |
| Retry policies for timeout errors | Error handling |
| SLA configuration | docs/reference/components/core-gateway.md |
Platform configuration: To configure SLA, gateway backstop, and transport-level timeouts, see setup/guide-timeouts.md.