Observability for Model Serving

Observability helps you ensure that your deployed AI models are running efficiently and reliably within AI Factory.

Model Serving in AI Factory uses KServe on Kubernetes to serve models. KServe provides built-in ways to monitor:

  • Model serving status and availability
  • Resource usage (CPU, Memory, GPU)
  • Inference performance and throughput

Key monitoring capabilities

KServe InferenceService status

You can inspect model serving status directly via Kubernetes:

kubectl get inferenceservice -n <namespace>

Common status fields include:

  • Ready / NotReady
  • URL endpoint
  • Current replicas
  • Allocated resources (GPU, CPU, Memory)

For detailed inspection:

kubectl describe inferenceservice <name> -n <namespace>
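
For scripting or quick health checks, you can read individual status fields with a JSONPath query. This is a minimal sketch; status.url and status.conditions are standard InferenceService status fields, but the exact contents depend on your KServe version:

# Print the serving endpoint URL
kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.url}'

# Print the readiness conditions reported by KServe
kubectl get inferenceservice <name> -n <namespace> -o jsonpath='{.status.conditions}'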

GPU utilization monitoring

If your models are deployed on GPU nodes, monitor GPU usage to optimize resource allocation.

To check node-level CPU and memory usage (GPU usage is not reported by kubectl top):

kubectl top node

For GPU-specific metrics, run nvidia-smi on the GPU node (where the NVIDIA tooling is available):

nvidia-smi
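
If the serving image includes the NVIDIA driver utilities, you can also run nvidia-smi inside a GPU-backed pod. This is a sketch and assumes the container actually ships nvidia-smi, which is not guaranteed for every runtime image:

# Show GPU utilization and memory usage from inside the serving pod
kubectl exec -it <pod-name> -n <namespace> -- nvidia-smi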

Prometheus and Grafana integration

If Prometheus is configured in your cluster, KServe exposes model serving metrics that Prometheus can scrape.

Add the following annotations to the InferenceService to enable scraping:

serving.kserve.io/enable-prometheus-scraping: "true"
prometheus.kserve.io/port: "8000"
prometheus.kserve.io/path: "/v1/metrics"
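
These annotations belong under metadata.annotations of the InferenceService. The manifest below is an illustrative sketch only; the name, model format, and storageUri are hypothetical placeholders, not values defined by AI Factory:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                  # hypothetical name
  namespace: <namespace>
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                  # hypothetical model format
      storageUri: <model-storage-uri>  # placeholder for your model location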

You can build Grafana dashboards to monitor:

  • Inference requests per second
  • Latency and error rates
  • GPU utilization trends
  • Pod restarts and health

Logs and debugging

You can access detailed logs from model serving pods:

kubectl logs -f <pod-name> -n <namespace>
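
To find the pods that back a given InferenceService, you can filter on the label KServe applies to them and then follow the model server container. A sketch, assuming KServe's default labeling and container naming (the serving.kserve.io/inferenceservice label and the kserve-container container); verify these against your deployment:

# List the pods backing an InferenceService
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<name>

# Follow logs from the model server container in one of those pods
kubectl logs -f <pod-name> -c kserve-container -n <namespace>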

Use logs to:

  • Debug model loading or initialization issues
  • Review inference request behavior
  • Monitor response handling and error conditions

Hybrid Manager observability integration

When running AI Factory in conjunction with Hybrid Manager (HCP), you can integrate model serving observability into the broader HCP observability stack.

See Hybrid Manager Model Serving Observability for additional guidance on using HCP observability tools with Model Serving.


Best practices

  • Enable Prometheus scraping on InferenceServices for production.
  • Regularly monitor GPU usage to optimize capacity.
  • Set proper resource requests/limits in InferenceService specs (see the sketch after this list).
  • Leverage Grafana dashboards for real-time observability.
  • Use log inspection to support debugging and tuning.
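
For the resource requests/limits recommendation, the snippet below shows one way to express them on the predictor, including a GPU. A minimal sketch with placeholder values; tune the figures to your model and hardware, and note that GPU requests and limits must match:

spec:
  predictor:
    model:
      modelFormat:
        name: huggingface        # hypothetical model format
      storageUri: <model-storage-uri>
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"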


By applying these observability practices, you can operate your AI Factory Model Serving workloads with greater confidence, proactively identifying and addressing issues before they impact users.

