Autoscaling#
Configure KEDA-based autoscaling for AsyncActor workloads, including queue-depth scaling, GPU workloads, cost optimization, and advanced scaling modifiers.
How KEDA Works#
KEDA (Kubernetes Event-driven Autoscaling) monitors external event sources (queue depth, custom metrics) and scales Kubernetes Deployments accordingly.
Components:
- KEDA Operator: Watches ScaledObjects
- Metrics Server: Exposes metrics to HPA
- ScaledObject: Defines scaling triggers and targets
Asya Integration#
The Crossplane composition creates a KEDA ScaledObject for each AsyncActor:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: text-processor
spec:
  scaleTargetRef:
    name: text-processor   # Deployment to scale
  minReplicaCount: 0       # Scale to zero when idle
  maxReplicaCount: 50      # Max replicas
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-text-processor
        queueLength: "5"   # Target: 5 messages per replica
        awsRegion: us-east-1
```
Formula: desiredReplicas = ceil(queueDepth / queueLength)
Example: 100 messages, queueLength=5 → 20 replicas
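The formula can be sketched in a few lines of Python (the function name is illustrative, not part of KEDA):

```python
import math

def desired_replicas(queue_depth: int, queue_length: int) -> int:
    """KEDA's effective scaling formula: one replica per queue_length messages."""
    return math.ceil(queue_depth / queue_length)

print(desired_replicas(100, 5))  # 100 messages, queueLength=5 -> 20 replicas
```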
Configuration#
Scaling configured in AsyncActor spec:
```yaml
spec:
  scaling:
    enabled: true          # Enable KEDA autoscaling
    minReplicaCount: 0     # Minimum pods (0 for scale-to-zero)
    maxReplicaCount: 100   # Maximum pods
    queueLength: 5         # Target messages per replica
    cooldownPeriod: 60     # Seconds before scaling down (default: 60)
    pollingInterval: 10    # How often KEDA checks queue depth (default: 10)
```
Parameters:
- enabled: Enable/disable KEDA autoscaling (default: false)
- minReplicaCount: Minimum pods (default: 0 for scale-to-zero)
- maxReplicaCount: Maximum pods (default: 50)
- queueLength: Target messages per replica (default: 5)
- cooldownPeriod: Delay before scaling down in seconds (default: 60)
- pollingInterval: Queue check frequency in seconds (default: 10)
Advanced Scaling Configuration#
For fine-grained KEDA behavior, use the scaling.advanced sub-object:
```yaml
spec:
  scaling:
    minReplicaCount: 0
    maxReplicaCount: 20
    advanced:
      restoreToOriginalReplicaCount: true
      formula: "queue"
      target: "10"
      activationTarget: "1"
      metricType: AverageValue
```
Parameters:
| Field | Type | Description |
|---|---|---|
| restoreToOriginalReplicaCount | bool | When true, replicas are restored to their value from before the ScaledObject was created when the ScaledObject is deleted |
| formula | string | Composite metric formula combining multiple metrics (KEDA scalingModifiers.formula). Requires target. Formulas must reference trigger names; Asya compositions name the primary trigger queue, so use queue to reference it |
| target | string | Target value for the composite formula (required with formula) |
| activationTarget | string | Minimum metric value before scaling activates (avoids scaling at near-zero load) |
| metricType | AverageValue \| Value \| Utilization | Metric aggregation method for the composite formula |
Formula trigger reference: KEDA validates formula identifiers at admission time using expr-lang. Formulas must reference trigger names defined in triggers[N].name. Asya compositions set name: queue on the primary trigger, so queue is always a valid reference:
```yaml
advanced:
  formula: "queue"
  target: "5"
  activationTarget: "1"
  metricType: AverageValue
```
Notes:
- formula, target, activationTarget, and metricType map to spec.advanced.scalingModifiers in the KEDA ScaledObject
- restoreToOriginalReplicaCount maps to spec.advanced.restoreToOriginalReplicaCount
- target is required when formula is set; the XRD enforces this with a oneOf validation constraint
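The field mapping in the notes above can be sketched as a small translation function. This is an illustrative Python sketch of the mapping, not the actual Crossplane composition code:

```python
def render_advanced(advanced: dict) -> dict:
    """Translate Asya's scaling.advanced fields into a KEDA spec.advanced
    fragment, following the mapping described above (sketch only)."""
    out: dict = {}
    if "restoreToOriginalReplicaCount" in advanced:
        out["restoreToOriginalReplicaCount"] = advanced["restoreToOriginalReplicaCount"]
    if "formula" in advanced:
        # target is required whenever formula is set (enforced by the XRD)
        modifiers = {"formula": advanced["formula"], "target": advanced["target"]}
        if "activationTarget" in advanced:
            modifiers["activationTarget"] = advanced["activationTarget"]
        if "metricType" in advanced:
            modifiers["metricType"] = advanced["metricType"]
        out["scalingModifiers"] = modifiers
    return out

print(render_advanced({"formula": "queue", "target": "5", "metricType": "AverageValue"}))
```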
GPU Workloads#
```yaml
spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
Ensure the GPU node group exists and the NVIDIA device plugin is installed in your cluster.
With minReplicaCount: 0, GPU actors scale to zero between bursts — no idle GPU cost.
Cost Optimization#
queueLength trades cost for speed:
| queueLength | Pods (100 messages) | Throughput | Cost |
|---|---|---|---|
| 5 | 20 | High | High |
| 10 | 10 | Medium | Medium |
| 20 | 5 | Low | Low |
Set based on per-message processing time and your latency budget.
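The table above follows directly from the scaling formula; a short illustrative calculation:

```python
import math

QUEUE_DEPTH = 100  # messages waiting, as in the table above

for queue_length in (5, 10, 20):
    pods = math.ceil(QUEUE_DEPTH / queue_length)
    print(f"queueLength={queue_length:>2} -> {pods:>2} pods for {QUEUE_DEPTH} messages")
```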
Spot instances (AWS): GPU actors tolerate interruption well because messages re-queue on pod
termination. Use spot/preemptible nodes for actors with minReplicaCount: 0.
Benefits#
Scale to zero:
- 0 messages → 0 pods → $0 cost
- Queue fills → Spin up to maxReplicaCount in seconds
Independent scaling:
- Each actor scales based on its own queue depth
- Data-loader scales differently than LLM inference
Cost optimization:
- Only run GPU pods when needed
- No warm pools, no idle resources
Handle bursts:
- Automatic response to traffic spikes
- Gradual scale-down when load decreases
Scaling Scenarios#
Idle Workload#
- Queue: 0 messages
- Replicas: 0 (minReplicaCount=0)
- Cost: $0
Low Load#
- Queue: 10 messages, queueLength=5
- Replicas: 2
- Processing: ~5 messages per replica
High Load#
- Queue: 250 messages, queueLength=5
- Replicas: 50 (capped at maxReplicaCount)
- Processing: ~5 messages per replica
Burst#
- Queue suddenly: 500 messages
- KEDA scales up: 0 → 50 in ~30-60 seconds
- After processing: Queue drains → Scale down to 0
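The replica counts in these scenarios follow from the scaling formula clamped to the configured bounds; a minimal sketch (function and parameter names are illustrative):

```python
import math

def replicas(queue_depth: int, queue_length: int = 5,
             min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Desired replicas from queue depth, clamped to min/max bounds."""
    desired = math.ceil(queue_depth / queue_length)
    return max(min_replicas, min(desired, max_replicas))

print(replicas(0))    # idle: 0 replicas
print(replicas(10))   # low load: 2 replicas
print(replicas(250))  # high load: capped at maxReplicaCount (50)
```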
Transport-Specific Triggers#
SQS#
```yaml
triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-actor
      queueLength: "5"
      awsRegion: us-east-1
```
RabbitMQ#
```yaml
triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq:5672
      queueName: asya-actor
      queueLength: "5"
```
Monitoring Autoscaling#
```shell
# Watch HPA status
kubectl get hpa -w

# View ScaledObject
kubectl get scaledobject text-processor -o yaml

# View the KEDA external metrics API
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```
See: Observability for autoscaling metrics.