Configure KEDA-based autoscaling for AsyncActor workloads, including queue-depth scaling, GPU workloads, cost optimization, and advanced scaling modifiers.


How KEDA Works#

KEDA (Kubernetes Event-driven Autoscaling) monitors external metrics (queue depth, custom metrics) and scales Kubernetes Deployments accordingly.

Components:

  • KEDA Operator: Watches ScaledObjects
  • Metrics Server: Exposes metrics to HPA
  • ScaledObject: Defines scaling triggers and targets

Asya Integration#

The Crossplane composition creates a KEDA ScaledObject for each AsyncActor:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: text-processor
spec:
  scaleTargetRef:
    name: text-processor   # Deployment to scale
  minReplicaCount: 0       # Scale to zero when idle
  maxReplicaCount: 50      # Max replicas
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-text-processor
      queueLength: "5"      # Target: 5 messages per replica
      awsRegion: us-east-1

Formula: desiredReplicas = ceil(queueDepth / queueLength)

Example: 100 messages, queueLength=5 → 20 replicas
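The formula can be sketched in a few lines of Python (a hypothetical helper that mirrors, but simplifies, the HPA calculation KEDA drives; real scaling also depends on polling and HPA behavior):

```python
import math

def desired_replicas(queue_depth: int, queue_length: int,
                     min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Approximate the KEDA replica calculation for a single queue trigger."""
    if queue_depth == 0:
        return min_replicas          # idle queue -> scale to zero when min_replicas == 0
    desired = math.ceil(queue_depth / queue_length)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(100, 5))  # -> 20
print(desired_replicas(250, 5))  # -> 50 (right at the max_replicas ceiling)
print(desired_replicas(0, 5))    # -> 0
```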

Configuration#

Scaling is configured in the AsyncActor spec:

spec:
  scaling:
    enabled: true            # Enable KEDA autoscaling
    minReplicaCount: 0       # Minimum pods (0 for scale-to-zero)
    maxReplicaCount: 100     # Maximum pods
    queueLength: 5           # Target messages per replica
    cooldownPeriod: 60       # Seconds before scaling down (default: 60s)
    pollingInterval: 10      # How often KEDA checks queue depth (default: 10s)

Parameters:

  • enabled: Enable/disable KEDA autoscaling (default: false)
  • minReplicaCount: Minimum pods (default: 0 for scale-to-zero)
  • maxReplicaCount: Maximum pods (default: 50)
  • queueLength: Target messages per replica (default: 5)
  • cooldownPeriod: Delay before scaling down in seconds (default: 60)
  • pollingInterval: Queue check frequency in seconds (default: 10)

Advanced Scaling Configuration#

For fine-grained KEDA behavior, use the scaling.advanced sub-object:

spec:
  scaling:
    minReplicaCount: 0
    maxReplicaCount: 20
    advanced:
      restoreToOriginalReplicaCount: true
      formula: "queue"
      target: "10"
      activationTarget: "1"
      metricType: AverageValue

Parameters:

  • restoreToOriginalReplicaCount (bool): When the ScaledObject is deleted, the workload is restored to the replica count it had before the ScaledObject was created
  • formula (string): Composite metric formula combining multiple metrics (KEDA scalingModifiers.formula). Requires target. The formula must reference trigger names — Asya compositions name the primary trigger queue, so use queue to reference it
  • target (string): Target value for the composite formula (required with formula)
  • activationTarget (string): Minimum metric value before scaling activates (avoids scaling at near-zero load)
  • metricType (AverageValue | Value | Utilization): Metric aggregation method for the composite formula

Formula trigger reference: KEDA validates formula identifiers at admission time using expr-lang. Formulas must reference trigger names defined in triggers[N].name. Asya compositions set name: queue on the primary trigger, so queue is always a valid reference:

advanced:
  formula: "queue"
  target: "5"
  activationTarget: "1"
  metricType: AverageValue

Notes:

  • formula, target, activationTarget, and metricType map to spec.advanced.scalingModifiers in the KEDA ScaledObject
  • restoreToOriginalReplicaCount maps to spec.advanced.restoreToOriginalReplicaCount
  • target is required when formula is set; the XRD enforces this with a oneOf validation constraint
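Given those mappings, the generated ScaledObject would carry a section along these lines (a sketch of KEDA's spec.advanced schema, not the literal Asya composition output):

```yaml
spec:
  advanced:
    restoreToOriginalReplicaCount: true
    scalingModifiers:
      formula: "queue"          # references the trigger named "queue"
      target: "10"
      activationTarget: "1"
      metricType: AverageValue
```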

GPU Workloads#

spec:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Ensure the GPU node group exists and the NVIDIA device plugin is installed in your cluster.

With minReplicaCount: 0, GPU actors scale to zero between bursts — no idle GPU cost.

Cost Optimization#

queueLength trades cost for speed:

queueLength   Pods (100 messages)   Throughput   Cost
5             20                    High         High
10            10                    Medium       Medium
20            5                     Low          Low

Set based on per-message processing time and your latency budget.
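The rows above follow directly from the replica formula; a quick sketch (hypothetical Python, same ceiling math) reproduces them:

```python
import math

backlog = 100  # messages waiting in the queue
for queue_length in (5, 10, 20):
    pods = math.ceil(backlog / queue_length)
    print(f"queueLength={queue_length:>2} -> {pods:>2} pods")
```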

Spot instances (AWS): GPU actors tolerate interruption well because messages re-queue on pod termination. Use spot/preemptible nodes for actors with minReplicaCount: 0.

Benefits#

Scale to zero:

  • 0 messages → 0 pods → $0 cost
  • Queue fills → Spin up to maxReplicaCount in seconds

Independent scaling:

  • Each actor scales based on its own queue depth
  • Data-loader scales differently than LLM inference

Cost optimization:

  • Only run GPU pods when needed
  • No warm pools, no idle resources

Handle bursts:

  • Automatic response to traffic spikes
  • Gradual scale-down when load decreases

Scaling Scenarios#

Idle Workload#

  • Queue: 0 messages
  • Replicas: 0 (minReplicaCount=0)
  • Cost: $0

Low Load#

  • Queue: 10 messages, queueLength=5
  • Replicas: 2
  • Processing: ~5 messages per replica

High Load#

  • Queue: 250 messages, queueLength=5
  • Replicas: 50 (ceil(250/5) = 50, exactly at the maxReplicaCount ceiling)
  • Processing: ~5 messages per replica

Burst#

  • Queue suddenly: 500 messages
  • KEDA scales up: 0 → 50 in ~30-60 seconds
  • After processing: Queue drains → Scale down to 0

Transport-Specific Triggers#

SQS#

triggers:
- type: aws-sqs-queue
  metadata:
    queueURL: https://sqs.us-east-1.amazonaws.com/.../asya-actor
    queueLength: "5"
    awsRegion: us-east-1

RabbitMQ#

triggers:
- type: rabbitmq
  metadata:
    host: amqp://rabbitmq:5672
    queueName: asya-actor
    queueLength: "5"
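The AMQP URL typically carries credentials, so you may prefer not to embed it in the trigger metadata; KEDA's rabbitmq scaler can resolve it from the target workload's environment instead (a sketch, assuming a RABBITMQ_HOST variable is set on the Deployment):

```yaml
triggers:
- type: rabbitmq
  metadata:
    hostFromEnv: RABBITMQ_HOST   # amqp:// URL read from the Deployment's env
    queueName: asya-actor
    queueLength: "5"
```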

Monitoring Autoscaling#

# Watch HPA status
kubectl get hpa -w

# View ScaledObject
kubectl get scaledobject text-processor -o yaml

# View KEDA metrics
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1

See: Observability for autoscaling metrics.