Monitor deployed models with KServe
This guide explains how to monitor deployed AI models using KServe on Kubernetes.
Monitoring your models helps ensure reliability, performance, and efficient use of resources, whether you are running on general Kubernetes or using Hybrid Manager AI Factory.
For AI Factory users, Hybrid Manager provides additional monitoring and observability features; see Model Serving in Hybrid Manager.
Goal
Monitor deployed models, check model status and serving endpoints, and retrieve resource usage information.
Estimated time
5–10 minutes.
What you will accomplish
- List deployed InferenceServices (models).
- Retrieve model endpoint and runtime details.
- Understand how to observe model performance.
- Prepare to integrate model metrics into observability pipelines.
What this unlocks
- Confidence that models are correctly deployed and serving.
- Ability to troubleshoot or scale model deployments.
- Foundation for using Hybrid Manager AI Factory observability for model serving.
Prerequisites
- A deployed InferenceService on KServe.
- A defined ClusterServingRuntime.
- kubectl configured for your Kubernetes cluster.
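Before you start, you can confirm that KServe is installed by checking for its CustomResourceDefinitions. This is a quick sanity check that assumes a standard KServe installation:
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io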
Steps
1. List deployed InferenceServices
To list deployed models and see key details:
kubectl get InferenceService \
  -o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
  --namespace=default
Key columns:
- NAME: Name of the InferenceService.
- MODEL: Model format name (from ClusterServingRuntime).
- URL: Service endpoint for inference requests.
- RUNTIME: ClusterServingRuntime used.
- GPUs: Number of GPUs allocated.
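The output should look similar to the following. The name, URL, runtime, and GPU count shown here are placeholders and will reflect your own deployment:
NAME       MODEL      URL                                          RUNTIME      GPUs
my-model   my-model   http://my-model.default.svc.cluster.local    my-runtime   1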
2. Retrieve runtime details
To view ClusterServingRuntime details, including the serving port and resource allocations (ClusterServingRuntime resources are cluster-scoped, so no namespace flag is needed):
kubectl get ClusterServingRuntimes \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory
Key columns:
- NAME: Name of the runtime.
- IMAGE: Model server image used.
- PORT: Inference port (commonly 8000 for NIM).
- CPUs: CPU resources allocated.
- MEMORY: Memory allocated.
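For reference, the fields those columns read from sit in the ClusterServingRuntime spec roughly as follows. This is a trimmed, hypothetical example; the runtime name, image, port, and resource values are placeholders:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: my-runtime
spec:
  supportedModelFormats:
    - name: my-model-format
  containers:
    - name: kserve-container
      image: example.registry/model-server:latest
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "8"
          memory: 32Gi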
3. Observe model metrics
If you enabled Prometheus scraping via InferenceService annotations:
- serving.kserve.io/enable-prometheus-scraping: "true"
Prometheus can then scrape metrics at /v1/metrics on port 8000 of the model service.
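The annotation is set in the InferenceService metadata, for example as in the following minimal sketch. The service name, namespace, model format, and runtime are placeholders:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  namespace: default
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: my-model-format
      runtime: my-runtime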
Metrics typically include:
- Request latency
- Throughput (requests per second)
- Error rates
- GPU utilization (if GPUs used)
You can visualize these metrics in tools such as Grafana.
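To spot-check the metrics endpoint directly, you can port-forward to the predictor pod and query it. This is a minimal sketch; it assumes the pod name found in step 4 below and that the runtime serves on port 8000, as is common for NIM:
# Forward local port 8000 to the model container in the predictor pod
kubectl port-forward pod/<predictor-pod-name> 8000:8000 --namespace=default
# In a second terminal, fetch the Prometheus metrics
curl http://localhost:8000/v1/metrics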
4. Check pod status
For debugging, you can also view model pods directly:
kubectl get pods --namespace=default
Look for pods with names matching:
<inference-service-name>-predictor-*
Check pod status and logs if needed:
kubectl logs <pod-name> --namespace=default
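If several models run in the same namespace, you can narrow the listing with the label KServe applies to predictor pods and, when reading logs, target the model server container. This sketch assumes the standard serving.kserve.io/inferenceservice label and the default kserve-container container name, which can vary by runtime:
# List only the pods belonging to one InferenceService
kubectl get pods \
  --selector=serving.kserve.io/inferenceservice=<inference-service-name> \
  --namespace=default
# Stream logs from the model server container
kubectl logs <pod-name> \
  --container=kserve-container \
  --namespace=default \
  --follow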
Next steps
- Update GPU resources for a deployed model (Coming soon)
- Deploy additional NVIDIA NIM models
- Model Serving in Hybrid Manager (Coming soon)