Deploy an NVIDIA NIM container with KServe

This guide explains how to deploy an NVIDIA NIM container using KServe on a Kubernetes cluster.

While the steps here apply to general Kubernetes environments, Hybrid Manager AI Factory adds lifecycle management, observability, and simplified integration on top of them. Learn more in the Model Serving in Hybrid Manager section.

Goal

Deploy an NVIDIA NIM container using KServe to create a network-accessible inference service that can be consumed by applications.

Estimated time

15–30 minutes depending on cluster setup.

What you will accomplish

  • Define and deploy a ClusterServingRuntime for an NVIDIA NIM container.
  • Deploy an InferenceService that uses this runtime.
  • Validate your deployment and retrieve the model endpoint.

What this unlocks

  • Ability to serve NVIDIA NIM models via standard inference protocols (HTTP/gRPC).
  • Preparation for integrating these models with applications or tools such as Griptape (Gen AI Builder) or AIDB Knowledge Bases.
  • Foundation for using Hybrid Manager AI Factory model-serving capabilities.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • GPU node pool configured (with NVIDIA device plugin).
  • NVIDIA NIM container image available in a private registry or NGC.
  • Kubernetes secret containing your NGC API key (a sketch for creating the required secrets follows this list).
  • kubectl configured for your cluster.
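
If these secrets do not already exist, the sketch below shows one way to create them with kubectl. The names nvidia-nim-secrets and edb-cred match the manifests used later in this guide; the NGC API key value is a placeholder you must replace, and the docker-registry example assumes you pull directly from NGC (nvcr.io) rather than a private mirror.

# Secret referenced by the runtime's NGC_API_KEY environment variable
kubectl create secret generic nvidia-nim-secrets \
  --namespace=default \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>

# Image pull secret for the registry hosting the NIM image (NGC shown; adjust for a private registry)
kubectl create secret docker-registry edb-cred \
  --namespace=default \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<your-ngc-api-key>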

Steps

1. Create ClusterServingRuntime

Define ClusterServingRuntime.yaml:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: your-registry/nim/meta/llama-3.1-8b-instruct:1.3.3   # replace with the path to the NIM image in your registry
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Apply the runtime:

kubectl apply -f ClusterServingRuntime.yaml
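
Optionally, confirm the runtime was registered before moving on (the name matches the manifest above):

kubectl get clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3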

2. Create InferenceService

Define InferenceService.yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
  name: llama-3-1-8b-instruct-1xgpu
  namespace: default
spec:
  predictor:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    nodeSelector:
      nvidia.com/gpu: "true"
    imagePullSecrets:
      - name: edb-cred
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3

Deploy the InferenceService:

kubectl apply -f InferenceService.yaml
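
The first startup can take several minutes while the container image is pulled and the model weights are downloaded. One way to watch for readiness, assuming the names used in this guide:

# Wait until READY shows True
kubectl get inferenceservice llama-3-1-8b-instruct-1xgpu --namespace=default -w

# If it stays unready, inspect events and conditions
kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu --namespace=default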

3. Verify deployed models

List active InferenceServices:

kubectl get InferenceService \
-o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
--namespace=default

Output:
NAME                           MODEL                              URL                                                                         RUNTIME                                  GPUs
llama-3-1-8b-instruct-1xgpu    nvidia-nim-llama-3.1-8b-instruct   http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local                 nvidia-nim-llama-3.1-8b-instruct-1.3.3   1
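
To smoke-test the endpoint, send a request to the URL shown above. NIM LLM containers expose an OpenAI-compatible API, so a chat completion call is a reasonable check. The sketch below assumes it runs from a pod inside the cluster (the URL is cluster-internal) and that the served model name matches the NIM image; confirm the exact name with the /v1/models endpoint first.

# List the model name served by this NIM container
curl -s http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local/v1/models

# Send a small chat completion request (the model name below is an assumption; use the one returned above)
curl -s http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'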

4. Retrieve runtime details

ClusterServingRuntimes are cluster-scoped, so no namespace flag is needed. Check the image, port, and resource limits for each runtime:

kubectl get ClusterServingRuntimes \
-o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory

Next steps


