# Observability
This guide shows how to set up monitoring for Asya deployments using Prometheus and Grafana.
## Overview

Asya components expose Prometheus metrics. Only the sidecar exposes them; the runtime relies on other observability mechanisms.

All sidecar metrics use the namespace `asya_actor` (configurable via `ASYA_METRICS_NAMESPACE`).
## Metrics Reference

### Message Counters

- `asya_actor_messages_received_total{queue, transport}` - messages received from the queue
- `asya_actor_messages_processed_total{queue, status}` - successfully processed messages (`status`: `success`, `empty_response`, `end_consumed`)
- `asya_actor_messages_sent_total{destination_queue, message_type}` - messages sent to queues (`message_type`: `routing`, `sink`, `sump`)
- `asya_actor_messages_failed_total{queue, reason}` - failed messages by reason (`reason`: `parse_error`, `runtime_error`, `transport_error`, `validation_error`, `route_mismatch`, `error_queue_send_failed`)
### Duration Histograms

- `asya_actor_processing_duration_seconds{queue}` - total processing time (queue receive → queue send)
- `asya_actor_runtime_execution_duration_seconds{queue}` - runtime execution time only
- `asya_actor_queue_receive_duration_seconds{queue, transport}` - time to receive from the queue
- `asya_actor_queue_send_duration_seconds{destination_queue, transport}` - time to send to the queue
### Size and State

- `asya_actor_message_size_bytes{direction}` - message size in bytes (`direction`: `received`, `sent`)
- `asya_actor_active_messages` - messages currently being processed (gauge)
- `asya_actor_runtime_errors_total{queue, error_type}` - runtime errors by type
### Queue Depth (from KEDA)

- `keda_scaler_metrics_value{scaledObject}` - current queue depth (exposed by KEDA, not Asya)
- `keda_scaler_active{scaledObject}` - active scalers (1 = active, 0 = inactive)
### Autoscaling (from kube-state-metrics)

- `kube_horizontalpodautoscaler_status_current_replicas` - current pod count
- `kube_horizontalpodautoscaler_status_desired_replicas` - desired pod count
### Custom Metrics

Configurable via the `ASYA_CUSTOM_METRICS` environment variable (a JSON array). See the Sidecar documentation for details.
## Prometheus Configuration

The sidecar exposes metrics on `:8080/metrics` (configurable via `ASYA_METRICS_ADDR`).
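For a quick sanity check without a full Prometheus stack, the text exposition format served on `/metrics` can be parsed directly. A minimal sketch (the sample payload below is illustrative, not captured from a real sidecar):

```python
import re

# Illustrative sample of the Prometheus text exposition format as the
# sidecar serves it on :8080/metrics (values are made up).
SAMPLE = """\
# TYPE asya_actor_messages_processed_total counter
asya_actor_messages_processed_total{queue="asya-my-actor",status="success"} 1042
asya_actor_messages_processed_total{queue="asya-my-actor",status="empty_response"} 7
# TYPE asya_actor_active_messages gauge
asya_actor_active_messages 3
"""

# One sample line: metric name, optional {labels}, then the value.
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][\w:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_metrics(text):
    """Return {(name, sorted_labels_tuple): float} for each sample line."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip HELP/TYPE comments and blank lines
        m = LINE_RE.match(line)
        if m:
            labels = tuple(sorted(m.group('labels').split(','))) if m.group('labels') else ()
            samples[(m.group('name'), labels)] = float(m.group('value'))
    return samples

metrics = parse_metrics(SAMPLE)
```

This is only a smoke-test helper; in production, let Prometheus scrape the endpoint as configured below.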
### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: asya-actors
spec:
  selector:
    matchLabels:
      asya.sh/actor: "*" # Matches all AsyncActors
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
```
### Scrape Config (Standard Prometheus)

```yaml
scrape_configs:
  - job_name: asya-actors
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_asya_sh_actor]
        action: keep
        regex: .+
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: asya-sidecar
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8080
        target_label: __address__
```
**Note:** The operator does NOT automatically create ServiceMonitors; you must configure Prometheus scraping manually.
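The final relabel rule rewrites each discovered pod address to the sidecar's metrics port 8080, dropping any port the service discovery attached. The substitution can be sketched in Python to see what it does (the sample addresses are illustrative):

```python
import re

# The relabeling regex from the scrape config: capture the host,
# optionally consume an existing :port, then append :8080.
pattern = re.compile(r'([^:]+)(?::\d+)?')
replacement = r'\1:8080'

def rewrite(addr):
    """Mimic the relabel rule: Prometheus anchors the regex at the start."""
    return pattern.sub(replacement, addr, count=1)

print(rewrite('10.1.2.3'))       # -> 10.1.2.3:8080
print(rewrite('10.1.2.3:9090'))  # -> 10.1.2.3:8080
```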
## Grafana Dashboards

### Example Queries

Actor throughput (messages per second):

```promql
rate(asya_actor_messages_processed_total{queue="asya-my-actor"}[5m])
```

P95 processing latency:

```promql
histogram_quantile(0.95, rate(asya_actor_processing_duration_seconds_bucket{queue="asya-my-actor"}[5m]))
```

P95 runtime latency (handler execution only):

```promql
histogram_quantile(0.95, rate(asya_actor_runtime_execution_duration_seconds_bucket{queue="asya-my-actor"}[5m]))
```

Error rate (errors per second):

```promql
rate(asya_actor_messages_failed_total{queue="asya-my-actor"}[5m])
```

Error rate by reason:

```promql
sum by (reason) (rate(asya_actor_messages_failed_total{queue="asya-my-actor"}[5m]))
```

Queue depth (from KEDA):

```promql
keda_scaler_metrics_value{scaledObject="my-actor"}
```

Active replicas vs desired:

```promql
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="my-actor"}
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler="my-actor"}
```

Messages in flight:

```promql
asya_actor_active_messages{queue="asya-my-actor"}
```
## Alerting

### Example Prometheus Alerts

High error rate:

```yaml
- alert: AsyaActorHighErrorRate
  expr: |
    (
      rate(asya_actor_messages_failed_total{queue=~"asya-.*"}[5m])
      /
      rate(asya_actor_messages_received_total{queue=~"asya-.*"}[5m])
    ) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate for queue {{ $labels.queue }}"
    description: "Error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
```
Queue backing up:

```yaml
- alert: AsyaQueueBackingUp
  expr: keda_scaler_metrics_value{scaledObject=~".*"} > 1000
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Queue {{ $labels.scaledObject }} depth exceeds 1000 messages"
    description: "Current queue depth: {{ $value }}"
```
Actor scaling to max:

```yaml
- alert: AsyaActorAtMaxReplicas
  expr: |
    kube_horizontalpodautoscaler_status_current_replicas
    ==
    kube_horizontalpodautoscaler_spec_max_replicas
  for: 15m
  labels:
    severity: info
  annotations:
    summary: "Actor {{ $labels.horizontalpodautoscaler }} at max replicas"
    description: "Consider increasing maxReplicaCount if queue continues to grow"
```
High processing latency:

```yaml
- alert: AsyaHighLatency
  expr: |
    histogram_quantile(0.95,
      rate(asya_actor_processing_duration_seconds_bucket{queue=~"asya-.*"}[5m])
    ) > 60
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High P95 latency for queue {{ $labels.queue }}"
    description: "P95 latency is {{ $value }}s (threshold: 60s)"
```
Runtime errors:

```yaml
- alert: AsyaRuntimeErrors
  expr: rate(asya_actor_runtime_errors_total{queue=~"asya-.*"}[5m]) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Runtime errors detected for queue {{ $labels.queue }}"
    description: "Error rate: {{ $value }} errors/second"
```
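If you run the Prometheus Operator, rules like the ones above are deployed as a `PrometheusRule` resource rather than a raw rules file. A sketch wrapping the first alert (the resource name and group name are placeholders for your setup):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: asya-alerts # placeholder name
spec:
  groups:
    - name: asya # placeholder group name
      rules:
        - alert: AsyaActorHighErrorRate
          expr: |
            (
              rate(asya_actor_messages_failed_total{queue=~"asya-.*"}[5m])
              /
              rate(asya_actor_messages_received_total{queue=~"asya-.*"}[5m])
            ) > 0.1
          for: 5m
          labels:
            severity: warning
```

Depending on your Prometheus installation, the resource may also need labels matching the operator's `ruleSelector`.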
## Gateway Metrics

The gateway does NOT currently expose Prometheus metrics. Available operational data:

- Structured JSON logs - request/response logging with trace context
- PostgreSQL task state - query the `tasks` table for status, timestamps, and error details
- Health endpoint - `GET /health` for liveness/readiness probes

Prometheus metric instrumentation is planned as a future enhancement.
## Operator Metrics

Exposed via controller-runtime:

- `controller_runtime_reconcile_total{controller="asyncactor"}` - total reconciliations
- `controller_runtime_reconcile_errors_total{controller="asyncactor"}` - failed reconciliations
- `controller_runtime_reconcile_time_seconds{controller="asyncactor"}` - reconciliation duration
## Distributed Tracing

### Configuration

Set `OTEL_EXPORTER_OTLP_ENDPOINT` on the sidecar and gateway to enable tracing:

- Sidecar: set via `spec.tracing.endpoint` in the AsyncActor CR
- Gateway: set via `tracing.endpoint` in the gateway Helm values
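On the actor side, the setting is a fragment of the AsyncActor spec. A sketch (the endpoint below is a placeholder OTLP address; adjust it to your collector):

```yaml
# Fragment of an AsyncActor spec enabling trace export.
spec:
  tracing:
    endpoint: http://tempo:4317 # placeholder OTLP gRPC endpoint
```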
### Playground Setup

Enable `sampleTracing.enabled: true` in the playground chart to deploy Grafana Tempo. The Tempo datasource is auto-provisioned in Grafana.
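As a Helm values override, that setting looks like:

```yaml
# playground chart values
sampleTracing:
  enabled: true
```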
### Querying Traces

In Grafana Explore, select the Tempo datasource and use TraceQL:

```traceql
{resource.service.name="my-actor"}
{span.asya.actor="text-processor" && status=error}
```
## Logging

Use standard Kubernetes logging tools:

- Fluentd
- Loki
- CloudWatch (AWS)

Logs are structured JSON for easy parsing:
```json
{
  "level": "info",
  "msg": "Processing message",
  "message_id": "5e6fdb2d-1d6b-4e91-baef-73e825434e7b",
  "actor": "text-processor",
  "timestamp": "2025-11-18T12:00:00Z"
}
```
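These JSON lines are straightforward to post-process outside of a log aggregator. A minimal sketch filtering a log stream by actor and level (the sample lines and their field values are illustrative):

```python
import json

# Illustrative log lines in the structured JSON format shown above.
LOG_LINES = [
    '{"level": "info", "msg": "Processing message", "actor": "text-processor", "timestamp": "2025-11-18T12:00:00Z"}',
    '{"level": "error", "msg": "Runtime error", "actor": "text-processor", "timestamp": "2025-11-18T12:00:05Z"}',
    '{"level": "info", "msg": "Processing message", "actor": "other-actor", "timestamp": "2025-11-18T12:00:06Z"}',
]

def filter_logs(lines, actor=None, level=None):
    """Yield parsed log records matching the given actor and/or level."""
    for line in lines:
        record = json.loads(line)
        if actor is not None and record.get("actor") != actor:
            continue
        if level is not None and record.get("level") != level:
            continue
        yield record

errors = list(filter_logs(LOG_LINES, actor="text-processor", level="error"))
```

The same filters map directly onto Loki's LogQL JSON parsing or a Fluentd grep filter when run at aggregation time.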