Autoscaling#
Configure KEDA-based autoscaling for AsyncActor workloads, including queue-depth scaling, GPU workloads, cost optimization, and advanced scaling modifiers.
How KEDA Works#
KEDA (Kubernetes Event-driven Autoscaling) monitors external event sources (queue depth, custom metrics) and scales Kubernetes Deployments accordingly.
Components:
- KEDA Operator: Watches ScaledObjects
- Metrics Server: Exposes metrics to HPA
- ScaledObject: Defines scaling triggers and targets
Asya Integration#
The Crossplane composition creates a KEDA ScaledObject for each AsyncActor:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: text-processor
spec:
  scaleTargetRef:
    name: text-processor   # Deployment to scale
  minReplicaCount: 0       # Scale to zero when idle
  maxReplicaCount: 50      # Max replicas
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-text-processor
        queueLength: "5"   # Target: 5 messages per replica
        awsRegion: us-east-1
```
Formula: desiredReplicas = ceil(queueDepth / queueLength)
Example: 100 messages, queueLength=5 → 20 replicas
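The formula can be sketched in a few lines of Python (the function name is illustrative, not part of KEDA):

```python
import math

def desired_replicas(queue_depth: int, queue_length: int) -> int:
    """KEDA's effective scaling formula: one replica per queue_length messages."""
    return math.ceil(queue_depth / queue_length)

print(desired_replicas(100, 5))  # 100 messages, queueLength=5 -> 20 replicas
```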
Configuration#
Scaling configured in AsyncActor spec:
```yaml
spec:
  scaling:
    enabled: true          # Enable KEDA autoscaling
    minReplicaCount: 0     # Minimum pods (0 for scale-to-zero)
    maxReplicaCount: 100   # Maximum pods
    queueLength: 5         # Target messages per replica
    cooldownPeriod: 60     # Seconds before scaling down (default: 60)
    pollingInterval: 10    # How often KEDA checks queue depth (default: 10)
```
Parameters:
- enabled: Enable/disable KEDA autoscaling (default: false)
- minReplicaCount: Minimum pods (default: 0 for scale-to-zero)
- maxReplicaCount: Maximum pods (default: 50)
- queueLength: Target messages per replica (default: 5)
- cooldownPeriod: Delay before scaling down in seconds (default: 60)
- pollingInterval: Queue check frequency in seconds (default: 10)
Advanced Scaling Configuration#
For fine-grained KEDA behavior, use the scaling.advanced sub-object:
```yaml
spec:
  scaling:
    minReplicaCount: 0
    maxReplicaCount: 20
    advanced:
      restoreToOriginalReplicaCount: true
      formula: "queue"
      target: "10"
      activationTarget: "1"
      metricType: AverageValue
```
Parameters:
| Field | Type | Description |
|---|---|---|
| restoreToOriginalReplicaCount | bool | When true, replicas are restored to their value from before the ScaledObject was created when the ScaledObject is deleted |
| formula | string | Composite metric formula combining multiple metrics (KEDA scalingModifiers.formula). Requires target. Formulas must reference trigger names; Asya compositions name the primary trigger queue, so use queue to reference it |
| target | string | Target value for the composite formula (required with formula) |
| activationTarget | string | Minimum metric value before scaling activates (avoids scaling at near-zero load) |
| metricType | AverageValue \| Value \| Utilization | Metric aggregation method for the composite formula |
Formula trigger reference: KEDA validates formula identifiers at admission time using expr-lang. Formulas must reference trigger names defined in triggers[N].name. Asya compositions set name: queue on the primary trigger, so queue is always a valid reference:
```yaml
advanced:
  formula: "queue"
  target: "5"
  activationTarget: "1"
  metricType: AverageValue
```
Notes:
- formula, target, activationTarget, and metricType map to spec.advanced.scalingModifiers in the KEDA ScaledObject
- restoreToOriginalReplicaCount maps to spec.advanced.restoreToOriginalReplicaCount
- target is required when formula is set; the XRD enforces this with a oneOf validation constraint
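The field mapping in the notes above can be sketched as a small translation function. This is an illustrative Python sketch of the mapping, not the actual Crossplane composition code:

```python
def render_advanced(advanced: dict) -> dict:
    """Translate Asya's scaling.advanced fields into a KEDA spec.advanced
    fragment, following the mapping described above (sketch only)."""
    out: dict = {}
    if "restoreToOriginalReplicaCount" in advanced:
        out["restoreToOriginalReplicaCount"] = advanced["restoreToOriginalReplicaCount"]
    if "formula" in advanced:
        # target is required whenever formula is set (enforced by the XRD)
        modifiers = {"formula": advanced["formula"], "target": advanced["target"]}
        if "activationTarget" in advanced:
            modifiers["activationTarget"] = advanced["activationTarget"]
        if "metricType" in advanced:
            modifiers["metricType"] = advanced["metricType"]
        out["scalingModifiers"] = modifiers
    return out

print(render_advanced({"formula": "queue", "target": "5", "metricType": "AverageValue"}))
```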
GPU Workloads#
```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
Ensure the GPU node group exists and the NVIDIA device plugin is installed in your cluster.
With minReplicaCount: 0, GPU actors scale to zero between bursts — no idle GPU cost.
Cost Optimization#
queueLength trades cost for speed:
| queueLength | Pods (100 messages) | Throughput | Cost |
|---|---|---|---|
| 5 | 20 | High | High |
| 10 | 10 | Medium | Medium |
| 20 | 5 | Low | Low |
Set based on per-message processing time and your latency budget.
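The table above follows directly from the scaling formula; a short illustrative calculation:

```python
import math

QUEUE_DEPTH = 100  # messages waiting, as in the table above

for queue_length in (5, 10, 20):
    pods = math.ceil(QUEUE_DEPTH / queue_length)
    print(f"queueLength={queue_length:>2} -> {pods:>2} pods for {QUEUE_DEPTH} messages")
```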
Spot instances (AWS): GPU actors tolerate interruption well because messages re-queue on pod
termination. Use spot/preemptible nodes for actors with minReplicaCount: 0.
Benefits#
Scale to zero:
- 0 messages → 0 pods → $0 cost
- Queue fills → Spin up to maxReplicaCount in seconds
Independent scaling:
- Each actor scales based on its own queue depth
- Data-loader scales differently than LLM inference
Cost optimization:
- Only run GPU pods when needed
- No warm pools, no idle resources
Handle bursts:
- Automatic response to traffic spikes
- Gradual scale-down when load decreases
Scaling Scenarios#
Idle Workload#
- Queue: 0 messages
- Replicas: 0 (minReplicaCount=0)
- Cost: $0
Low Load#
- Queue: 10 messages, queueLength=5
- Replicas: 2
- Processing: ~5 messages per replica
High Load#
- Queue: 250 messages, queueLength=5
- Replicas: 50 (capped at maxReplicaCount)
- Processing: ~5 messages per replica
Burst#
- Queue suddenly: 500 messages
- KEDA scales up: 0 → 50 in ~30-60 seconds
- After processing: Queue drains → Scale down to 0
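The replica counts in these scenarios follow from the scaling formula clamped to the configured bounds; a minimal sketch (function and parameter names are illustrative):

```python
import math

def replicas(queue_depth: int, queue_length: int = 5,
             min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Desired replicas from queue depth, clamped to min/max bounds."""
    desired = math.ceil(queue_depth / queue_length)
    return max(min_replicas, min(desired, max_replicas))

print(replicas(0))    # idle: 0 replicas
print(replicas(10))   # low load: 2 replicas
print(replicas(250))  # high load: capped at maxReplicaCount (50)
```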
Transport-Specific Triggers#
SQS#
```yaml
triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-actor
      queueLength: "5"
      awsRegion: us-east-1
```
RabbitMQ#
```yaml
triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq:5672
      queueName: asya-actor
      queueLength: "5"
```
Monitoring Autoscaling#
```shell
# Watch HPA status
kubectl get hpa -w

# View ScaledObject
kubectl get scaledobject text-processor -o yaml

# View the KEDA external metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```
See: Observability for autoscaling metrics.