TL;DR#

Ray Serve is a Python-native model serving framework built on the Ray distributed runtime. It excels at multi-model composition, GPU-aware scheduling, and low-latency inference — all within its own cluster abstraction. Asya is a Kubernetes-native actor mesh where each step is an independent pod scaling on queue depth via KEDA. Ray Serve owns the cluster; Asya delegates everything to Kubernetes.

Choose Ray Serve when you need tight GPU multiplexing and sub-50ms inter-model latency. Choose Asya when you need independent per-step scaling, scale-to-zero, dynamic routing, and want to stay on pure Kubernetes without a second cluster runtime.

At a Glance#

| Dimension | Ray Serve | Asya 🎭 |
|---|---|---|
| Runtime | Ray cluster (head + worker nodes) | Kubernetes pods + message queues |
| Unit of deployment | @serve.deployment Python class | AsyncActor CRD + pure Python function |
| Scaling | Ray Autoscaler (replica + node level) | KEDA ScaledObject per actor (queue depth) |
| Scale to zero | ❌ Not supported (head node always runs) | ✅ Native — every actor scales to 0 |
| GPU support | First-class (ray_actor_options={"num_gpus": 1}) | K8s resources.limits + GPU node pools |
| Inter-step communication | In-process or Ray object store (shared memory) | Message queue (SQS, RabbitMQ, Pub/Sub) |
| Latency | Microseconds (shared memory) | Milliseconds–seconds (queue round-trip) |
| Fault tolerance | Ray reconstructs lost objects; actor restart | Queue redelivery; pod restart |
| Multi-model composition | Deployment graph / bind() API | Envelope routing (route.next) |
| Dynamic routing | ⚠️ Application code in driver | ✅ Actor rewrites route.next at runtime |
| Protocol surface | HTTP/gRPC ingress | A2A, MCP, HTTP gateway |
| Language | Python (primarily) | Handler: Python; sidecar: Go; infra: YAML |
| Operational dependency | Ray head node, GCS, dashboard | Kubernetes, KEDA, queue backend |
| Maturity | 🟢 Mature (Ray 2.x, widespread adoption) | 🟡 Alpha (production at Delivery Hero) |

Architecture#

Ray Serve#

Ray Serve runs on top of a Ray cluster — a head node plus autoscaled worker nodes. Each deployment is a Ray actor class replicated across workers. Calls between deployments travel through Ray's object store (shared-memory plasma, or distributed via GCS), giving near-zero serialization overhead for in-cluster communication.

```
Client --> HTTP/gRPC Ingress
              |
         Ray Head Node (GCS, Dashboard, Autoscaler)
              |
     +--------+--------+
     |        |        |
  Worker 1  Worker 2  Worker 3
  (GPU)     (GPU)     (CPU)
  [ModelA]  [ModelA]  [PostProc]
  [ModelB]  [ModelB]
```

Deployment graph (or bind()) lets you compose models:

```python
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class Embedder:
    def __init__(self):
        self.model = load_model("bge-large")  # model loading elided

    async def __call__(self, request):
        return self.model.encode(request.json()["text"])

@serve.deployment(num_replicas=1)
class Reranker:
    def __init__(self, embedder):
        self.embedder = embedder

    async def __call__(self, request):
        embedding = await self.embedder.remote(request)
        return self.rerank(embedding)

    def rerank(self, embedding):
        ...  # reranking logic elided

app = Reranker.bind(Embedder.bind())
serve.run(app)
```

Asya#

Asya has no cluster runtime. Each actor is a standard Kubernetes Deployment with an injected sidecar. Actors communicate exclusively through message queues. KEDA watches each queue independently and scales each actor from 0 to N.

```
Client --> Gateway (A2A / MCP / HTTP)
              |
         Queue A ──> Actor A (Embedder, GPU, 0-5 pods)
                         |
                    Queue B ──> Actor B (Reranker, CPU, 0-10 pods)
                                    |
                               Queue C ──> x-sink (persist result)
```

The same pipeline as two pure functions plus an AsyncActor manifest per actor (the embedder's is shown):

```python
# embedder/handler.py
def embed(payload: dict) -> dict:
    payload["embedding"] = model.encode(payload["text"])
    return payload
```

```python
# reranker/handler.py
def rerank(payload: dict) -> dict:
    payload["ranked"] = do_rerank(payload["embedding"])
    return payload
```

```yaml
# asyncactor.yaml
apiVersion: asya.sh/v1alpha1
kind: AsyncActor
metadata:
  name: embedder
spec:
  image: embedder:latest
  handler: handler.embed
  scaling:
    minReplicaCount: 0
    maxReplicaCount: 5
    queueLength: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
  resiliency:
    actorTimeout: 120s
    policies:
      default:
        maxAttempts: 3
        backoff: exponential
```
Developer Experience#

| Aspect | Ray Serve | Asya 🎭 |
|---|---|---|
| Getting started | pip install ray[serve]; run locally | Requires K8s cluster, KEDA, queue backend |
| Local iteration | serve.run(app) — instant | python handler.py for logic; K8s for integration |
| Deployment | serve deploy config.yaml or KubeRay operator | kubectl apply -f asyncactor.yaml (Crossplane) |
| Monitoring | Ray Dashboard (built-in) | Grafana + per-actor queue metrics |
| Model updates | Rolling update via Ray config | Rolling update via K8s Deployment |
| Multi-tenancy | Single Ray cluster, namespace isolation | Native K8s namespaces, RBAC, network policies |
| Secrets / config | Environment or Ray runtime env | K8s Secrets, ConfigMaps, external-secrets-operator |
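Because an Asya handler is a pure function with zero infrastructure imports, the "python handler.py for logic" row amounts to plain unit testing. A sketch, using a stand-in model object (the real handler's model import is assumed):

```python
class FakeModel:
    """Stand-in for the real embedding model so handler logic can be
    exercised without a GPU, a queue, or Kubernetes."""
    def encode(self, text: str) -> list:
        return [float(len(text))]

model = FakeModel()

# Same shape as embedder/handler.py from the Architecture section.
def embed(payload: dict) -> dict:
    payload["embedding"] = model.encode(payload["text"])
    return payload

print(embed({"text": "hello"}))  # {'text': 'hello', 'embedding': [5.0]}
```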

When to Choose Ray Serve#

  • Low-latency model composition — your pipeline chains 3-5 models and needs sub-50ms end-to-end; Ray's shared-memory object store avoids serialization overhead.

  • GPU multiplexing — you want to pack multiple models onto a single GPU with fractional allocation (num_gpus=0.5); Ray handles placement and memory management.

  • Python-centric team — your team is all Python, models are all PyTorch/TensorFlow, and you want everything in one language with minimal YAML.

  • Batch inference pipelines — Ray's @serve.batch decorator handles dynamic batching natively, accumulating requests to fill GPU utilization.

  • Tight coupling between models — models share tensors or embeddings in-memory; serializing to a queue would be wasteful.
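The last two points combine well: a fractional-GPU deployment (ray_actor_options={"num_gpus": 0.5}) can use @serve.batch to keep that slice busy. Conceptually, dynamic batching accumulates concurrent requests until a size or time limit is hit, then runs one model call for the whole batch. A minimal asyncio sketch of that idea (not Ray's implementation):

```python
import asyncio

class DynamicBatcher:
    """Accumulate single requests into batches, mimicking the idea
    behind Ray Serve's @serve.batch (not Ray's actual code)."""

    def __init__(self, batch_fn, max_batch_size=4, wait_timeout_s=0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.wait_timeout_s = wait_timeout_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        # Each caller gets a future resolved when its batch completes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        while True:
            # Block for the first item, then fill the batch until the
            # size limit or the wait deadline, whichever comes first.
            batch = [await self.queue.get()]
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.wait_timeout_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = self.batch_fn([item for item, _ in batch])  # one "GPU" call
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def main():
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results

print(asyncio.run(main()))  # [0, 2, 4, 6, 8, 10, 12, 14]
```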

When to Choose Asya#

  • Independent per-step scaling — each pipeline stage has a different resource profile (CPU preprocessing, GPU inference, CPU postprocessing) and needs to scale on its own queue depth, not a shared autoscaler.

  • Scale to zero — GPU pods cost nothing when idle. Ray Serve requires at least one head node running at all times; Asya actors scale to zero pods when their queue is empty.

  • Dynamic routing — actors rewrite route.next at runtime based on LLM confidence, content type, or business rules. No static DAG to rebuild.

  • Separation of concerns — data scientists write pure Python functions; platform engineers own the AsyncActor manifest (scaling, retries, timeouts). The handler has zero infrastructure imports.

  • Agentic workflows — human-in-the-loop pause/resume, multi-turn conversations, A2A protocol support, MCP tool exposure. These are built into the Asya gateway.

  • Multi-transport / multi-cloud — the same actor runs on SQS, RabbitMQ, or Pub/Sub by changing one field in the manifest. No code changes.

  • K8s-native operations — you already have Kubernetes with RBAC, network policies, namespaces, GitOps (ArgoCD/Flux), and observability. Asya adds no new cluster runtime — it is Kubernetes.

  • Queue-native resilience — if a pod dies mid-inference, the message goes back to the queue. No custom checkpointing, no object store reconstruction. The queue is the checkpoint.
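To make the dynamic-routing point concrete, a handler might rewrite the envelope's next hop from a business rule. The field names ("route", "next") and the actor names below are assumptions for this sketch, not a confirmed Asya API:

```python
def classify(payload: dict) -> dict:
    """Hypothetical routing sketch: long documents are sent to a GPU
    summarizer actor; short ones skip straight to the sink."""
    if len(payload.get("text", "")) > 500:
        payload["route"] = {"next": "summarizer"}  # assumed actor name
    else:
        payload["route"] = {"next": "x-sink"}  # sink actor from the diagram
    return payload
```

The same function works unchanged whether the queue behind it is SQS, RabbitMQ, or Pub/Sub, since the transport lives in the manifest, not the handler.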

Further Reading#