Configure a ClusterServingRuntime
This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models: the container image, resource settings, environment variables, and supported model formats.
For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving; see Model Serving in Hybrid Manager.
Goal
Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.
Estimated time
5–10 minutes.
What you will accomplish
- Define a ClusterServingRuntime YAML manifest.
- Apply it to your Kubernetes cluster.
- Enable reusable serving configuration for one or more models.
What this unlocks
- Supports consistent deployment of models using a standard runtime definition.
- Allows for centralized control over serving images and resource profiles.
- Required step for deploying NVIDIA NIM containers with KServe.
Prerequisites
- Kubernetes cluster with KServe installed.
- Access to container image registry with the desired model server image.
- NVIDIA GPU node pool configured (if using GPU-based models).
- (If required) Kubernetes secret configured for API keys (e.g., NVIDIA NGC); see the example after this list.
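The example runtime in step 1 reads its API key from a secret named nvidia-nim-secrets under the key NGC_API_KEY. A minimal sketch of creating that secret, assuming your NGC API key is available in the NGC_API_KEY shell variable:

```shell
# Create the secret that the runtime's secretKeyRef references.
# The secret name and key must match the ClusterServingRuntime manifest.
kubectl create secret generic nvidia-nim-secrets \
  --namespace default \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
```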
Steps
1. Create ClusterServingRuntime YAML
Create a file named ClusterServingRuntime.yaml.
Example:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
  namespace: default
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm
```
Key fields explained:
- containers.image: The model server container (e.g., the NVIDIA NIM image).
- resources: CPU, memory, and GPU requirements (a GPU example is sketched after this list).
- NGC_API_KEY: Secret reference used to authenticate against NVIDIA NGC.
- supportedModelFormats: Logical model format name that an InferenceService uses to select this runtime.
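The example manifest requests only CPU and memory. If you are serving a GPU-backed model on the NVIDIA GPU node pool from the prerequisites, a minimal sketch of the resources block, assuming the NVIDIA device plugin is installed and a single GPU suffices:

```yaml
resources:
  limits:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"   # assumed count; size to your model
  requests:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"
```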
2. Apply the ClusterServingRuntime
Run:
```shell
kubectl apply -f ClusterServingRuntime.yaml
```
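If the apply succeeds, kubectl prints a confirmation along these lines (the resource name comes from your manifest):

```
clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created
```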
3. Verify deployed ClusterServingRuntime
Run:
```shell
kubectl get ClusterServingRuntime
```
Expected output:
```
NAME                                     AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3   1m
```
You can inspect full details with:
```shell
kubectl get ClusterServingRuntime <name> -o yaml
```
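To confirm which model formats the runtime advertises before wiring up an InferenceService, one option is a jsonpath query; the runtime name below comes from the example manifest:

```shell
# Print the logical model format names this runtime supports.
kubectl get ClusterServingRuntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'
```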
4. Reference runtime in InferenceService
When you create your InferenceService, reference this runtime:
```yaml
runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct
```
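These two fields sit under spec.predictor.model in the InferenceService spec. For context, a minimal sketch of a complete manifest; the service name llama-3-1-8b-instruct is an illustrative assumption:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct   # hypothetical name
  namespace: default
spec:
  predictor:
    model:
      # Must match the ClusterServingRuntime name from step 1.
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
      modelFormat:
        # Must match a supportedModelFormats entry in the runtime.
        name: nvidia-nim-llama-3.1-8b-instruct
```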
See Deploy an NVIDIA NIM container with KServe.
Notes
- Runtimes are reusable: you can deploy multiple models referencing the same ClusterServingRuntime (see the sketch after this list for checking which services use a runtime).
- Use meaningful name and version fields in supportedModelFormats for traceability.
- You can update a runtime by editing and re-applying the YAML.
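Because several InferenceServices can share one runtime, it helps to check which services reference a runtime before changing it. One way, sketched here, assumes each service sets the runtime under spec.predictor.model.runtime:

```shell
# Print each InferenceService alongside the runtime it references.
kubectl get inferenceservices -A \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.predictor.model.runtime}{"\n"}{end}'
```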
Next steps
- Deploy an NVIDIA NIM container with KServe
- Update GPU resources for a deployed model
- Monitor deployed models with KServe