Observability for Model Serving

Observability helps you ensure that your deployed AI models are running efficiently and reliably within AI Factory.

Model Serving in AI Factory uses KServe on Kubernetes to serve models. KServe provides built-in ways to monitor:

  • Model serving status and availability
  • Resource usage (CPU, Memory, GPU)
  • Inference performance and throughput

Key monitoring capabilities

KServe InferenceService status

You can inspect model serving status directly via Kubernetes:

kubectl get inferenceservice -n <namespace>

Common status fields include:

  • Ready / NotReady
  • URL endpoint
  • Current replicas
  • Allocated resources (GPU, CPU, Memory)

For detailed inspection:

kubectl describe inferenceservice <name> -n <namespace>
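
For scripting or quick health checks, you can read individual status fields with a JSONPath query. This is a minimal sketch; status.url and status.conditions are standard InferenceService status fields, but the exact contents depend on your KServe version:

# Print the serving endpoint URL
kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}'

# Print the readiness conditions reported by KServe
kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.conditions}'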

GPU utilization monitoring

If your models are deployed on GPU nodes, monitor GPU usage to optimize resource allocation.

To check node-level CPU and memory usage (GPU usage is not reported by kubectl top):

kubectl top node

For GPU-specific metrics, run nvidia-smi on the GPU node (where the NVIDIA tooling is available):

nvidia-smi
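
If the serving image includes the NVIDIA driver utilities, you can also run nvidia-smi inside a GPU-backed pod. This is a sketch and assumes the container actually ships nvidia-smi, which is not guaranteed for every runtime image:

# Show GPU utilization and memory usage from inside the serving pod
kubectl exec -it <pod-name> -n <namespace> -- nvidia-smi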

Prometheus and Grafana integration

If Prometheus is configured in your cluster, KServe exposes model serving metrics that Prometheus can scrape.

Add the following annotations to the InferenceService to enable scraping:

serving.kserve.io/enable-prometheus-scraping: "true"
prometheus.kserve.io/port: "8000"
prometheus.kserve.io/path: "/v1/metrics"
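
These annotations belong under metadata.annotations of the InferenceService. The manifest below is an illustrative sketch only; the name, model format, and storageUri are hypothetical placeholders, not values defined by AI Factory:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                  # hypothetical name
  namespace: <namespace>
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                  # hypothetical model format
      storageUri: <model-storage-uri>  # placeholder for your model location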

You can build Grafana dashboards to monitor:

  • Inference requests per second
  • Latency and error rates
  • GPU utilization trends
  • Pod restarts and health

Logs and debugging

You can access detailed logs from model serving pods:

kubectl logs -f <pod-name> -n <namespace>
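
To find the pods that back a given InferenceService, you can filter on the label KServe applies to them and then follow the model server container. A sketch, assuming KServe's default labeling and container naming (the serving.kserve.io/inferenceservice label and the kserve-container container); verify these against your deployment:

# List the pods backing an InferenceService
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<name>

# Follow logs from the model server container in one of those pods
kubectl logs -f <pod-name> -c kserve-container -n <namespace>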

Use logs to:

  • Debug model loading or initialization issues
  • Review inference request behavior
  • Monitor response handling and error conditions

Hybrid Manager observability integration

When running AI Factory in conjunction with Hybrid Manager (HCP), you can integrate model serving observability into the broader HCP observability stack.

See Hybrid Manager Model Serving Observability for additional guidance on using HCP observability tools with Model Serving.


Best practices

  • Enable Prometheus scraping on InferenceServices for production.
  • Regularly monitor GPU usage to optimize capacity.
  • Set proper resource requests/limits in InferenceService specs (see the sketch after this list).
  • Leverage Grafana dashboards for real-time observability.
  • Use log inspection to support debugging and tuning.
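
For the resource requests/limits recommendation, the snippet below shows one way to express them on the predictor, including a GPU. A minimal sketch with placeholder values; tune the figures to your model and hardware, and note that GPU requests and limits must match:

spec:
  predictor:
    model:
      modelFormat:
        name: huggingface        # hypothetical model format
      storageUri: <model-storage-uri>
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"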


By applying these observability practices, you can operate your AI Factory Model Serving workloads with greater confidence, proactively identifying and addressing issues before they impact users.

