# vs Ray Serve

## TL;DR
Ray Serve is a Python-native model serving framework built on the Ray distributed runtime. It excels at multi-model composition, GPU-aware scheduling, and low-latency inference — all within its own cluster abstraction. Asya is a Kubernetes-native actor mesh where each step is an independent pod scaling on queue depth via KEDA. Ray Serve owns the cluster; Asya delegates everything to Kubernetes.
Choose Ray Serve when you need tight GPU multiplexing and sub-50ms inter-model latency. Choose Asya when you need independent per-step scaling, scale-to-zero, dynamic routing, and want to stay on pure Kubernetes without a second cluster runtime.
## At a Glance
| Dimension | Ray Serve | 🎭 |
|---|---|---|
| Runtime | Ray cluster (head + worker nodes) | Kubernetes pods + message queues |
| Unit of deployment | `@serve.deployment` Python class | `AsyncActor` CRD + pure Python function |
| Scaling | Ray Autoscaler (replica + node level) | KEDA ScaledObject per actor (queue depth) |
| Scale to zero | ❌ Not supported (head node always runs) | ✅ Native — every actor scales to 0 |
| GPU support | First-class (`ray_actor_options={"num_gpus": 1}`) | K8s `resources.limits` + GPU node pools |
| Inter-step communication | In-process or Ray object store (shared memory) | Message queue (SQS, RabbitMQ, Pub/Sub) |
| Latency | Microseconds (shared memory) | Milliseconds–seconds (queue round-trip) |
| Fault tolerance | Ray reconstructs lost objects; actor restart | Queue redelivery; pod restart |
| Multi-model composition | Deployment graph / `bind()` API | Envelope routing (`route.next`) |
| Dynamic routing | ⚠️ Application code in driver | ✅ Actor rewrites `route.next` at runtime |
| Protocol surface | HTTP/gRPC ingress | A2A, MCP, HTTP gateway |
| Language | Python (primarily) | Handler: Python; sidecar: Go; infra: YAML |
| Operational dependency | Ray head node, GCS, dashboard | Kubernetes, KEDA, queue backend |
| Maturity | 🟢 Mature (Ray 2.x, widespread adoption) | 🟡 Alpha (production at Delivery Hero) |
## Architecture

### Ray Serve
Ray Serve runs on top of a Ray cluster — a head node plus autoscaled worker nodes. Each deployment is a Ray actor class replicated across workers. Calls between deployments travel through Ray's object store (shared-memory plasma, or distributed via GCS), giving near-zero serialization overhead for in-cluster communication.
```
Client --> HTTP/gRPC Ingress
               |
    Ray Head Node (GCS, Dashboard, Autoscaler)
               |
      +--------+--------+
      |        |        |
   Worker 1  Worker 2  Worker 3
    (GPU)     (GPU)     (CPU)
   [ModelA]  [ModelA]  [PostProc]
   [ModelB]  [ModelB]
```
A deployment graph (composed with `bind()`) lets you chain models:

```python
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class Embedder:
    def __init__(self):
        self.model = load_model("bge-large")

    async def __call__(self, request):
        data = await request.json()
        return self.model.encode(data["text"])

@serve.deployment(num_replicas=1)
class Reranker:
    def __init__(self, embedder):
        self.embedder = embedder

    async def __call__(self, request):
        embedding = await self.embedder.remote(request)
        return self.rerank(embedding)

app = Reranker.bind(Embedder.bind())
serve.run(app)
```
### Asya
Asya has no cluster runtime. Each actor is a standard Kubernetes Deployment with an injected sidecar. Actors communicate exclusively through message queues. KEDA watches each queue independently and scales each actor from 0 to N.
```
Client --> Gateway (A2A / MCP / HTTP)
               |
Queue A ──> Actor A (Embedder, GPU, 0-5 pods)
               |
Queue B ──> Actor B (Reranker, CPU, 0-10 pods)
               |
Queue C ──> x-sink (persist result)
```
The same pipeline as two pure functions and one manifest:
```python
# embedder/handler.py
def embed(payload: dict) -> dict:
    payload["embedding"] = model.encode(payload["text"])
    return payload
```

```python
# reranker/handler.py
def rerank(payload: dict) -> dict:
    payload["ranked"] = do_rerank(payload["embedding"])
    return payload
```

```yaml
# asyncactor.yaml
apiVersion: asya.sh/v1alpha1
kind: AsyncActor
metadata:
  name: embedder
spec:
  image: embedder:latest
  handler: handler.embed
  scaling:
    minReplicaCount: 0
    maxReplicaCount: 5
    queueLength: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
  resiliency:
    actorTimeout: 120s
    policies:
      default:
        maxAttempts: 3
        backoff: exponential
```
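The `backoff: exponential` policy above implies redelivery delays that double with each failed attempt. A minimal sketch of such a schedule — the base delay of 1s and the 60s cap are illustrative assumptions, not Asya defaults:

```python
def backoff_schedule(max_attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff delays (seconds) for each retry attempt.

    base and cap are assumed values for illustration, not Asya config.
    """
    return [min(cap, base * 2 ** attempt) for attempt in range(max_attempts)]

# With maxAttempts: 3 as in the manifest above
print(backoff_schedule(3))  # [1.0, 2.0, 4.0]
```

With `maxAttempts: 3`, a message that keeps failing is redelivered with growing delays before it is finally dropped or dead-lettered, depending on the queue backend's configuration.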
## Developer Experience
| Aspect | Ray Serve | 🎭 |
|---|---|---|
| Getting started | `pip install ray[serve]`; run locally | Requires K8s cluster, KEDA, queue backend |
| Local iteration | `serve.run(app)` — instant | `python handler.py` for logic; K8s for integration |
| Deployment | `serve deploy config.yaml` or KubeRay operator | `kubectl apply -f asyncactor.yaml` (Crossplane) |
| Monitoring | Ray Dashboard (built-in) | Grafana + per-actor queue metrics |
| Model updates | Rolling update via Ray config | Rolling update via K8s Deployment |
| Multi-tenancy | Single Ray cluster, namespace isolation | Native K8s namespaces, RBAC, network policies |
| Secrets / config | Environment or Ray runtime env | K8s Secrets, ConfigMaps, external-secrets-operator |
## When to Choose Ray Serve
- Low-latency model composition — your pipeline chains 3-5 models and needs sub-50ms end-to-end; Ray's shared-memory object store avoids serialization overhead.
- GPU multiplexing — you want to pack multiple models onto a single GPU with fractional allocation (`num_gpus=0.5`); Ray handles placement and memory management.
- Python-centric team — your team is all Python, models are all PyTorch/TensorFlow, and you want everything in one language with minimal YAML.
- Batch inference pipelines — Ray's `@serve.batch` decorator handles dynamic batching natively, accumulating requests to fill GPU utilization.
- Tight coupling between models — models share tensors or embeddings in-memory; serializing to a queue would be wasteful.
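Ray's `@serve.batch` implements dynamic batching for you; the sketch below illustrates the underlying mechanism in plain `asyncio` — collect requests until the batch is full or a short wait expires, then process them together. The `MicroBatcher` name and its parameters are illustrative, not Ray APIs:

```python
import asyncio

class MicroBatcher:
    """Accumulate requests until the batch fills or a timeout expires,
    then run the handler once over the whole batch."""

    def __init__(self, handler, max_batch_size=4, wait_timeout=0.01):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.wait_timeout = wait_timeout
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        # Each caller gets a future resolved when its batch is processed.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run_once(self):
        # Block for the first item, then drain more within wait_timeout.
        batch = [await self.queue.get()]
        deadline = asyncio.get_running_loop().time() + self.wait_timeout
        while len(batch) < self.max_batch_size:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(self.queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        items = [item for item, _ in batch]
        for (_, fut), result in zip(batch, self.handler(items)):
            fut.set_result(result)

async def main():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run_once())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(3)))
    await worker
    return results

print(asyncio.run(main()))  # [0, 2, 4]
```

The GPU win comes from the handler running one forward pass over `items` instead of three; Ray adds replica routing and backpressure on top of this idea.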
## When to Choose Asya
- Independent per-step scaling — each pipeline stage has a different resource profile (CPU preprocessing, GPU inference, CPU postprocessing) and needs to scale on its own queue depth, not a shared autoscaler.
- Scale to zero — GPU pods cost nothing when idle. Ray Serve requires at least one head node running at all times; Asya actors scale to zero pods when their queue is empty.
- Dynamic routing — actors rewrite `route.next` at runtime based on LLM confidence, content type, or business rules. No static DAG to rebuild.
- Separation of concerns — data scientists write pure Python functions; platform engineers own the `AsyncActor` manifest (scaling, retries, timeouts). The handler has zero infrastructure imports.
- Agentic workflows — human-in-the-loop pause/resume, multi-turn conversations, A2A protocol support, MCP tool exposure. These are built into the Asya gateway.
- Multi-transport / multi-cloud — the same actor runs on SQS, RabbitMQ, or Pub/Sub by changing one field in the manifest. No code changes.
- K8s-native operations — you already have Kubernetes with RBAC, network policies, namespaces, GitOps (ArgoCD/Flux), and observability. Asya adds no new cluster runtime — it is Kubernetes.
- Queue-native resilience — if a pod dies mid-inference, the message goes back to the queue. No custom checkpointing, no object store reconstruction. The queue is the checkpoint.
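The dynamic-routing point above can be sketched as a plain handler that rewrites its envelope's route at runtime — low-confidence results detour through a human-review actor before the sink. The envelope shape here is a simplified assumption for illustration, not the exact Asya wire format:

```python
def classify(payload: dict) -> dict:
    """Illustrative handler: choose the downstream actors at runtime
    by rewriting route.next in the envelope (simplified envelope shape)."""
    envelope = {"payload": payload, "route": {"next": []}}
    if payload.get("confidence", 0.0) < 0.8:
        # Not confident enough: detour through human review first.
        envelope["route"]["next"] = ["human-review", "x-sink"]
    else:
        envelope["route"]["next"] = ["x-sink"]
    return envelope

print(classify({"confidence": 0.95})["route"]["next"])  # ['x-sink']
print(classify({"confidence": 0.42})["route"]["next"])  # ['human-review', 'x-sink']
```

Because the route lives in the message rather than in a compiled graph, changing the decision logic is a handler redeploy — no DAG rebuild.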
## Further Reading
- Motivation — why async-first, queue-based AI pipelines
- Actor Mesh — choreography vs orchestration
- Scale to Zero — per-actor KEDA scaling