# AI/ML Serving
KServe, KAITO, KubeAI, vLLM/SGLang, LLM-d
## TL;DR
Model serving tools provide inference endpoints — they load a model into GPU memory, handle batching, and expose a predict/completions API. Asya is the pipeline layer that connects these endpoints into multi-step workflows with queue-based routing and independent scaling. An Asya actor calls a KServe or vLLM endpoint; it does not replace one.
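To make the relationship concrete, here is a minimal sketch of an actor body that calls an OpenAI-compatible completions endpoint (such as one served by vLLM). The envelope field names, the service URL, and the handler signature are illustrative assumptions — Asya's actual SDK surface may differ:

```python
import json
import urllib.request

# Assumed in-cluster vLLM service URL; replace with your endpoint.
VLLM_URL = "http://vllm.default.svc:8000/v1/completions"

def build_request(envelope: dict) -> dict:
    """Translate a pipeline envelope into an OpenAI-style completion request."""
    return {
        "model": envelope.get("model", "llama-3-8b"),
        "prompt": envelope["payload"]["prompt"],
        "max_tokens": envelope.get("max_tokens", 256),
    }

def handle(envelope: dict) -> dict:
    """Actor body: call the serving endpoint, attach the result, pass it on."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(envelope)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        completion = json.load(resp)
    envelope["payload"]["completion"] = completion["choices"][0]["text"]
    return envelope
```

The serving layer owns batching and GPU memory; the actor only translates envelopes into requests and results back into envelopes.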
## Comparison Table
| Dimension | KServe | KAITO | KubeAI | vLLM / SGLang | LLM-d | Asya 🎭 |
|---|---|---|---|---|---|---|
| Primary purpose | Model serving on K8s with autoscaling | One-click LLM deployment on AKS | Lightweight LLM gateway on K8s | High-throughput LLM inference engine | Disaggregated LLM serving for K8s | Multi-step AI pipeline routing |
| What it deploys | InferenceService CRD (predictor + transformer + explainer) | GPU node provisioning + model deployment via Workspace CRD | Model CRD with OpenAI-compatible endpoint | Single inference server process | Prefill/decode disaggregated serving pods | AsyncActor CRD (stateless pod + sidecar + queue) |
| Scaling trigger | Knative (RPS/concurrency) or HPA | Azure node autoprovisioner | Pending request count | N/A (single process, external LB) | KPA or custom metrics | KEDA (queue depth per actor) |
| Scale to zero | ✅ Yes (Knative serverless) | ❌ No (GPU nodes stay provisioned) | ✅ Yes (based on pending requests) | ❌ No (process must be running) | ⚠️ Depends on autoscaler config | ✅ Yes (per actor, per queue) |
| GPU management | Tolerations + node selectors | Provisions GPU nodes automatically (Azure) | Node selectors, multi-GPU | Direct CUDA access, tensor parallelism | Disaggregated prefill/decode across GPUs | Delegates to K8s scheduler; actors specify resource requests |
| Batching | Built-in request batching | Inherited from inference runtime | Proxy-level batching | Continuous batching, PagedAttention | Continuous batching with KV-cache routing | No batching (one envelope = one invocation) |
| Protocol | REST/gRPC (V1/V2 inference protocol) | OpenAI-compatible API | OpenAI-compatible API | OpenAI-compatible API | OpenAI-compatible API | Envelope protocol over message queues |
| Multi-step pipelines | ⚠️ Transformer chain (limited, same InferenceService) | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Core capability: actors chained via queue routing |
| Dynamic routing | ❌ No | ❌ No | ❌ No | ❌ No | ⚠️ Routing between prefill/decode nodes | ✅ Actors rewrite route.next at runtime |
| Supported models | Any (custom containers) | Curated catalog (Llama, Mistral, Falcon, etc.) | Curated catalog (Ollama, vLLM backends) | Any HuggingFace-compatible LLM | LLMs via vLLM backend | Any workload (LLM, vision, audio, custom code) |
| Queue integration | ❌ No (synchronous request/response) | ❌ No | ❌ No | ❌ No | ⚠️ NATS-based internal routing | ✅ Native: SQS, RabbitMQ, GCP Pub/Sub |
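The "envelope protocol over message queues" row can be made concrete with a sketch of what actually travels on a queue. The field names here are illustrative, not Asya's documented schema:

```python
import json

# Hypothetical envelope: a payload plus an explicit route that actors may rewrite.
envelope = {
    "id": "req-001",
    "route": {
        "next": "llm-scorer",       # queue the next actor reads from
        "trace": ["preprocessor"],  # actors already visited
    },
    "payload": {"prompt": "Summarize the report."},
}

# One envelope = one actor invocation: the message is serialized onto the next
# actor's queue (SQS, RabbitMQ, Pub/Sub) rather than held open over HTTP.
message_body = json.dumps(envelope)
```

Because the route is data inside the message, any actor can rewrite `route.next` before re-enqueueing — which is what makes dynamic routing a runtime decision rather than a deploy-time topology.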
## When to Use What
Use KServe / KAITO / KubeAI / vLLM / LLM-d when:
- You need a model inference endpoint — load a model, expose a predict or completions API
- Your concern is serving throughput — continuous batching, PagedAttention, tensor parallelism
- You want managed model lifecycle — canary rollouts, A/B testing, model versioning
- You need an OpenAI-compatible API for drop-in replacement of hosted LLM providers
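As a reference point for the serving side, a minimal KServe `InferenceService` looks roughly like this (a sketch against the v1beta1 API; the model URI and name are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: text-classifier
spec:
  predictor:
    minReplicas: 0   # Knative scale-to-zero when idle
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/classifier
```

This is the unit the comparison table calls "what it deploys": one model, one endpoint, autoscaled by request traffic.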
Use Asya when:
- You need to chain multiple steps — preprocess, call an LLM endpoint, postprocess, score, route
- Steps have different scaling profiles — a CPU formatter at 50 replicas feeding a GPU scorer at 3
- You want queue-based decoupling — callers enqueue and move on; no connection held open during inference
- You need dynamic routing — an LLM judge routes high-confidence results to storage, low-confidence to human review
- Your pipeline mixes multiple model endpoints (KServe for vision, vLLM for text) into a single workflow
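The dynamic-routing bullet can be sketched as a judge actor that rewrites `route.next` based on a confidence score. Queue names and envelope fields are assumptions for illustration:

```python
# Hypothetical routing sketch; Asya's actual route schema may differ.
CONFIDENCE_THRESHOLD = 0.9

def route_by_confidence(envelope: dict) -> dict:
    """Judge actor: send high-confidence results to storage, the rest to review."""
    score = envelope["payload"]["confidence"]
    if score >= CONFIDENCE_THRESHOLD:
        envelope["route"]["next"] = "storage-writer"
    else:
        envelope["route"]["next"] = "human-review"
    return envelope
```

Because `storage-writer` and `human-review` are just queues, each downstream consumer scales independently on its own queue depth — the CPU-bound writer and the human-paced review path never contend for the same replicas.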