Deploy an NVIDIA NIM container with KServe
This guide explains how to deploy an NVIDIA NIM container using KServe on a Kubernetes cluster.
The steps here apply to general Kubernetes environments. Hybrid Manager AI Factory adds lifecycle management, observability, and simplified integration on top of them. Learn more in the Model Serving in Hybrid Manager section.
Goal
Deploy an NVIDIA NIM container using KServe to create a network-accessible inference service that can be consumed by applications.
Estimated time
15–30 minutes depending on cluster setup.
What you will accomplish
- Define and deploy a ClusterServingRuntime for an NVIDIA NIM container.
- Deploy an InferenceService that uses this runtime.
- Validate your deployment and retrieve the model endpoint.
What this unlocks
- Ability to serve NVIDIA NIM models via standard inference protocols (HTTP/gRPC).
- A path to integrating these models with applications or tools such as Griptape (Gen AI Builder) or AIDB Knowledge Bases.
- Foundation for using Hybrid Manager AI Factory model-serving capabilities.
Prerequisites
- Kubernetes cluster with KServe installed.
- GPU node pool configured (with NVIDIA device plugin).
- NVIDIA NIM container image available in a private registry or NGC.
- A Kubernetes secret containing your NGC API key (the manifests below reference a secret named nvidia-nim-secrets; see the example command after this list).
- kubectl configured for your cluster.
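If the NGC API key secret does not exist yet, you can create it with kubectl. This is a minimal sketch: the secret name nvidia-nim-secrets and key NGC_API_KEY match what the runtime manifest below expects, and YOUR_NGC_API_KEY is a placeholder for your actual key. The manifests also reference an image pull secret named edb-cred for the registry hosting the NIM image.

# Create the NGC API key secret referenced by the ClusterServingRuntime below.
# YOUR_NGC_API_KEY is a placeholder; adjust the namespace to match your deployment.
kubectl create secret generic nvidia-nim-secrets \
  --namespace default \
  --from-literal=NGC_API_KEY=YOUR_NGC_API_KEY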
For background concepts, see the Model Serving in Hybrid Manager section.
Steps
1. Create ClusterServingRuntime
Define ClusterServingRuntime.yaml (replace your-registry with the registry that hosts your NIM image):

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
  namespace: default
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: your-registry/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm
Apply the runtime:
kubectl apply -f ClusterServingRuntime.yaml
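As an optional sanity check, confirm the runtime was registered:

# ClusterServingRuntimes are cluster-scoped, so no namespace flag is required.
kubectl get clusterservingruntimes nvidia-nim-llama-3.1-8b-instruct-1.3.3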
2. Create InferenceService
Define InferenceService.yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
  name: llama-3-1-8b-instruct-1xgpu
  namespace: default
spec:
  predictor:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    nodeSelector:
      nvidia.com/gpu: "true"
    imagePullSecrets:
      - name: edb-cred
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
Deploy the InferenceService:
kubectl apply -f InferenceService.yaml
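The first startup can take a while because the NIM container pulls the image and downloads model weights. One way to wait for the predictor to become ready is shown below; the 30-minute timeout is an arbitrary choice, so adjust it for your model size and network.

# Watch the InferenceService until READY reports True.
kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu --namespace default --watch

# Or block until the Ready condition is met (timeout is an assumption; tune as needed).
kubectl wait inferenceservice/llama-3-1-8b-instruct-1xgpu \
  --for=condition=Ready --timeout=30m --namespace default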
3. Verify deployed models
List active InferenceServices:
kubectl get InferenceService \
  -o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
  --namespace=default
NAME                          MODEL                              URL                                                            RUNTIME                                  GPUs
llama-3-1-8b-instruct-1xgpu   nvidia-nim-llama-3.1-8b-instruct   http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local   nvidia-nim-llama-3.1-8b-instruct-1.3.3   1
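With the internal URL from the output above, you can send a quick test request from inside the cluster. This sketch assumes the NIM container exposes the OpenAI-compatible /v1/chat/completions endpoint and that the served model name is meta/llama-3.1-8b-instruct; both depend on the NIM image you deployed, so adjust as needed.

# Launch a temporary curl pod and call the model through the cluster-internal URL.
# If the request does not connect, check which port the predictor service exposes.
kubectl run nim-test --rm -i --restart=Never --image=curlimages/curl --command -- \
  curl -s http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'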
4. Retrieve runtime details
Check the runtime's image, port, and resources. ClusterServingRuntimes are cluster-scoped, so no namespace flag is needed:

kubectl get ClusterServingRuntimes \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory
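If the InferenceService does not become ready, the predictor pod logs usually show the cause (for example, NGC authentication failures or insufficient shared memory). The label selector below is the one KServe typically applies to predictor pods, and kserve-container is the container name from the runtime definition; verify both on your cluster if the selector matches nothing.

# Tail the predictor pod logs for this InferenceService.
kubectl logs -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu \
  --namespace default --container kserve-container --follow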
Next steps
- Monitor deployed models (Coming soon)
- Update GPU resources for a deployed model (Coming soon)