# KServe, KAITO, KubeAI, vLLM/SGLang, LLM-d

## TL;DR

Model serving tools provide inference endpoints — they load a model into GPU memory, handle batching, and expose a predict/completions API. Asya is the pipeline layer that connects these endpoints into multi-step workflows with queue-based routing and independent scaling. An Asya actor calls a KServe or vLLM endpoint; it does not replace one.
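The TL;DR in practice: an actor step that calls an OpenAI-compatible serving endpoint. This is a minimal sketch; the in-cluster service URL and model name are assumptions, and only the `/v1/completions` request shape comes from the OpenAI-compatible API these tools expose.

```python
# Sketch of a pipeline step calling a vLLM/KServe-style endpoint.
# The service URL and model name below are hypothetical placeholders.
import json
import urllib.request

VLLM_URL = "http://vllm.default.svc:8000/v1/completions"  # assumed in-cluster address

def build_completion_request(prompt: str, model: str = "llama-3-8b") -> bytes:
    """Build an OpenAI-compatible /v1/completions payload."""
    return json.dumps({"model": model, "prompt": prompt, "max_tokens": 128}).encode()

def call_endpoint(prompt: str) -> str:
    """One actor step: POST to the serving endpoint, return the generated text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=build_completion_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

An actor would run code like this inside its handler and put the result on the next queue; the endpoint itself stays a plain KServe or vLLM deployment.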

## Comparison Table

| Dimension | KServe | KAITO | KubeAI | vLLM / SGLang | LLM-d | Asya 🎭 |
|---|---|---|---|---|---|---|
| Primary purpose | Model serving on K8s with autoscaling | One-click LLM deployment on AKS | Lightweight LLM gateway on K8s | High-throughput LLM inference engine | Disaggregated LLM serving for K8s | Multi-step AI pipeline routing |
| What it deploys | InferenceService CRD (predictor + transformer + explainer) | GPU node provisioning + model deployment via Workspace CRD | Model CRD with OpenAI-compatible endpoint | Single inference server process | Prefill/decode disaggregated serving pods | AsyncActor CRD (stateless pod + sidecar + queue) |
| Scaling trigger | Knative (RPS/concurrency) or HPA | Azure node autoprovisioner | Pending request count | N/A (single process, external LB) | KPA or custom metrics | KEDA (queue depth per actor) |
| Scale to zero | ✅ Yes (Knative serverless) | ❌ No (GPU nodes stay provisioned) | ✅ Yes (based on pending requests) | ❌ No (process must be running) | ⚠️ Depends on autoscaler config | ✅ Yes (per actor, per queue) |
| GPU management | Tolerations + node selectors | Provisions GPU nodes automatically (Azure) | Node selectors, multi-GPU | Direct CUDA access, tensor parallelism | Disaggregated prefill/decode across GPUs | Delegates to K8s scheduler; actors specify resource requests |
| Batching | Built-in request batching | Inherited from inference runtime | Proxy-level batching | Continuous batching, PagedAttention | Continuous batching with KV-cache routing | No batching (one envelope = one invocation) |
| Protocol | REST/gRPC (V1/V2 inference protocol) | OpenAI-compatible API | OpenAI-compatible API | OpenAI-compatible API | OpenAI-compatible API | Envelope protocol over message queues |
| Multi-step pipelines | ⚠️ Transformer chain (limited, same InferenceService) | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Core capability: actors chained via queue routing |
| Dynamic routing | ❌ No | ❌ No | ❌ No | ❌ No | ⚠️ Routing between prefill/decode nodes | ✅ Actors rewrite `route.next` at runtime |
| Supported models | Any (custom containers) | Curated catalog (Llama, Mistral, Falcon, etc.) | Curated catalog (Ollama, vLLM backends) | Any HuggingFace-compatible LLM | LLMs via vLLM backend | Any workload (LLM, vision, audio, custom code) |
| Queue integration | ❌ No (synchronous request/response) | ❌ No | ❌ No | ❌ No | ⚠️ NATS-based internal routing | ✅ Native: SQS, RabbitMQ, GCP Pub/Sub |
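The "one envelope = one invocation" row can be sketched with in-process queues standing in for SQS, RabbitMQ, or Pub/Sub. The envelope fields (`payload`, `route`) are illustrative assumptions, not Asya's actual wire format.

```python
# Sketch of queue-based decoupling: the caller enqueues and moves on,
# and each actor invocation consumes exactly one envelope.
# queue.Queue stands in for an external broker (SQS/RabbitMQ/Pub/Sub).
import queue

queues = {"preprocess": queue.Queue(), "score": queue.Queue(), "store": queue.Queue()}

def enqueue(name: str, envelope: dict) -> None:
    queues[name].put(envelope)

def preprocess_actor() -> None:
    """One envelope = one invocation: pop, transform, forward to the next queue."""
    env = queues["preprocess"].get()
    env["payload"] = env["payload"].strip().lower()
    enqueue(env["route"]["next"], env)

# The caller holds no connection open during processing.
enqueue("preprocess", {"payload": "  Hello World  ", "route": {"next": "score"}})
preprocess_actor()
print(queues["score"].get()["payload"])  # -> hello world
```

Because each step only touches its own queue, a slow GPU actor and a fast CPU actor can scale independently on queue depth, which is what the KEDA row in the table refers to.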

## When to Use What

Use KServe / KAITO / KubeAI / vLLM/SGLang / LLM-d when:

- You need a model inference endpoint — load a model, expose a predict or completions API
- Your concern is serving throughput — continuous batching, PagedAttention, tensor parallelism
- You want managed model lifecycle — canary rollouts, A/B testing, model versioning
- You need an OpenAI-compatible API for drop-in replacement of hosted LLM providers

Use Asya when:

- You need to chain multiple steps — preprocess, call an LLM endpoint, postprocess, score, route
- Steps have different scaling profiles — a CPU formatter at 50 replicas feeding a GPU scorer at 3
- You want queue-based decoupling — callers enqueue and move on; no connection held open during inference
- You need dynamic routing — an LLM judge routes high-confidence results to storage, low-confidence to human review
- Your pipeline mixes multiple model endpoints (KServe for vision, vLLM for text) into a single workflow
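The dynamic-routing bullet can be sketched as a judge actor that rewrites `route.next` at runtime. The envelope shape, field names, and threshold are assumptions for illustration; only the routing behavior itself comes from the comparison above.

```python
# Hypothetical sketch of dynamic routing: a judge actor rewrites
# route.next based on a confidence score carried in the envelope.
def judge_actor(envelope: dict, threshold: float = 0.8) -> dict:
    """Send high-confidence results to storage, the rest to human review."""
    score = envelope["payload"]["confidence"]
    envelope["route"]["next"] = "store" if score >= threshold else "human-review"
    return envelope

high = judge_actor({"payload": {"confidence": 0.93}, "route": {"next": None}})
low = judge_actor({"payload": {"confidence": 0.41}, "route": {"next": None}})
print(high["route"]["next"], low["route"]["next"])  # -> store human-review
```

The same pattern generalizes: any actor can inspect its result and pick the next queue, which is what none of the serving tools in the table offer on their own.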