Observability for Model Serving
Observability helps you ensure that your deployed AI models are running efficiently and reliably within AI Factory.
Model Serving in AI Factory uses KServe on Kubernetes to serve models. KServe provides built-in capabilities for monitoring:
- Model serving status and availability
- Resource usage (CPU, Memory, GPU)
- Inference performance and throughput
Key monitoring capabilities
KServe InferenceService status
You can inspect model serving status directly via Kubernetes:
kubectl get inferenceservice -n <namespace>
Common status fields include:
- Ready / NotReady
- URL endpoint
- Current replicas
- Allocated resources (GPU, CPU, Memory)
For detailed inspection:
kubectl describe inferenceservice <name> -n <namespace>
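For scripted checks, the sketch below waits for readiness and then pulls specific status fields; the resource name, namespace, and timeout are placeholders:

```shell
# Block until the InferenceService reports a Ready condition (names are placeholders)
kubectl wait --for=condition=Ready inferenceservice/<name> -n <namespace> --timeout=300s

# Extract just the serving URL and the Ready condition from the status
kubectl get inferenceservice <name> -n <namespace> \
  -o jsonpath='{.status.url}{"\n"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
```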
GPU utilization monitoring
If your models are deployed on GPU nodes, monitor GPU usage to optimize resource allocation.
Example:
kubectl top node
For deeper GPU-specific metrics, run nvidia-smi where the NVIDIA drivers and tools are available (on the GPU node or inside a GPU-attached container):
nvidia-smi
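The commands below sketch two common ways to check GPU allocation and utilization. They assume GPU nodes advertise the nvidia.com/gpu resource and that the serving image includes nvidia-smi, which depends on your runtime; node and pod names are placeholders:

```shell
# Compare allocatable vs. requested GPUs on a node
kubectl describe node <gpu-node-name> | grep -A 10 "Allocated resources"

# Run nvidia-smi inside a model serving pod that has a GPU attached
kubectl exec -it <pod-name> -n <namespace> -- nvidia-smi
```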
Prometheus and Grafana integration
If Prometheus is configured in your cluster, AI Factory model serving exposes metrics through KServe.
Add the following annotations to your InferenceService to enable Prometheus scraping:
serving.kserve.io/enable-prometheus-scraping: "true"
prometheus.kserve.io/port: "8000"
prometheus.kserve.io/path: "/v1/metrics"
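A minimal sketch of where these annotations sit in an InferenceService manifest; the name, namespace, model format, and storage location below are illustrative placeholders, not required values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model           # illustrative name
  namespace: example-namespace  # illustrative namespace
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                           # illustrative model format
      storageUri: gs://example-bucket/model     # illustrative storage location
```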
You can build Grafana dashboards to monitor:
- Inference requests per second
- Latency and error rates
- GPU utilization trends
- Pod restarts and health
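The queries below are a sketch of what such dashboards might use, assuming the NVIDIA DCGM exporter and kube-state-metrics are installed; request-rate and latency metric names vary by serving runtime and are not shown here:

```promql
# Average GPU utilization per GPU (NVIDIA DCGM exporter)
avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)

# Container restarts for model serving pods over the last hour (kube-state-metrics)
sum by (pod) (increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[1h]))
```

Request-per-second and latency panels can be built from whichever metrics your serving runtime exposes on the scrape path configured above.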
Logs and debugging
You can access detailed logs from model serving pods:
kubectl logs -f <pod-name> -n <namespace>
Use logs to:
- Debug model loading or initialization issues
- Review inference request behavior
- Monitor response handling and error conditions
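The following sketch narrows log collection to the model server container and to a previously crashed instance; kserve-container is the default container name in KServe predictor pods but may differ with custom runtimes:

```shell
# Follow logs from the model server container inside the predictor pod
kubectl logs -f <pod-name> -n <namespace> -c kserve-container

# Review logs from the previous container instance after a crash or restart
kubectl logs <pod-name> -n <namespace> --previous
```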
Hybrid Manager observability integration
When running AI Factory in conjunction with Hybrid Manager (HCP), you can integrate model serving observability into the broader HCP observability stack.
See Hybrid Manager Model Serving Observability for additional guidance on using HCP observability tools with Model Serving.
Best practices
- Enable Prometheus scraping on InferenceServices for production.
- Regularly monitor GPU usage to optimize capacity.
- Set appropriate resource requests and limits in InferenceService specs (see the sketch after this list).
- Leverage Grafana dashboards for real-time observability.
- Use log inspection to support debugging and tuning.
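As a sketch of the requests/limits recommendation above; the model name, format, storage location, and sizing values are illustrative and should be tuned to your workload:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model  # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                           # illustrative model format
      storageUri: hf://example-org/example-model    # illustrative model location
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"
```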
By applying these observability practices, you can operate your AI Factory Model Serving workloads with greater confidence, proactively identifying and addressing issues before they impact users.