Create an InferenceService for NVIDIA NIM Container
This How-To explains how to deploy an NVIDIA NIM container as an InferenceService using KServe on Kubernetes.
You will define and apply an InferenceService.yaml
manifest that references an existing ClusterServingRuntime and runs a GPU-accelerated model instance.
This is a core step in making your model available for inference through a network-accessible endpoint.
Goal
Deploy an InferenceService for an NVIDIA NIM container using KServe.
Estimated time
10–15 minutes.
What you will accomplish
- Define an InferenceService manifest.
- Deploy the InferenceService to your Kubernetes cluster.
- Make the model available via a network endpoint.
What this unlocks
Once deployed, your model can be consumed by applications (such as Gen AI Builder agents) via standard inference APIs. It will also appear in your KServe monitoring and can be scaled or updated dynamically.
Prerequisites
- Kubernetes cluster with KServe installed and configured.
- ClusterServingRuntime configured for your NIM container. See Configure ClusterServingRuntime.
- GPU-enabled node pool provisioned and labeled so the manifest's nodeSelector and tolerations can target it.
- NVIDIA NGC API key stored as a Kubernetes Secret, if your NIM image or model download requires it (a setup sketch follows this list).
- The desired NIM container image available in your registry.
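If some of these are not yet in place, the commands below are a minimal setup sketch. The Secret name edb-cred and the node label nvidia.com/gpu=true are taken from the manifest in the next step; the registry server, node name, and API key are placeholders, and nvcr.io with the $oauthtoken username only applies when pulling directly from NVIDIA NGC.

# Store registry credentials as an image pull Secret.
# The name edb-cred matches the imagePullSecrets entry in the manifest below.
kubectl create secret docker-registry edb-cred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="<your-ngc-api-key>" \
  --namespace=default

# Label the GPU nodes so the manifest's nodeSelector can schedule onto them.
kubectl label nodes <gpu-node-name> nvidia.com/gpu=true

# Confirm the ClusterServingRuntime referenced in the manifest exists.
kubectl get clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3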
Steps
1. Create InferenceService manifest
Create a file named InferenceService.yaml
with the following content:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
  name: llama-3-1-8b-instruct-1xgpu-g5
  namespace: default
spec:
  predictor:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    nodeSelector:
      nvidia.com/gpu: "true"
    imagePullSecrets:
      - name: edb-cred
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
Key points:
- name → provide a unique name for your InferenceService.
- tolerations and nodeSelector → must match your GPU node pool configuration.
- runtime and modelFormat → must match the values used in your ClusterServingRuntime (a matching excerpt follows this list).
- GPU resources → specify the number of GPUs to allocate.
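For reference, this trimmed sketch shows how those values line up with the ClusterServingRuntime. It is not a complete runtime definition (see Configure ClusterServingRuntime for that), and the container image is a placeholder.

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  # Referenced by spec.predictor.model.runtime in the InferenceService.
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  supportedModelFormats:
    # Referenced by spec.predictor.model.modelFormat.name in the InferenceService.
    - name: nvidia-nim-llama-3.1-8b-instruct
      autoSelect: true
  containers:
    - name: kserve-container
      image: <your-registry>/nim/meta/llama-3.1-8b-instruct:1.3.3  # placeholder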
2. Apply the InferenceService
Run:
kubectl apply -f InferenceService.yaml
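kubectl apply returns immediately, but the first rollout can take several minutes while the NIM image and model weights are pulled. If you want to block until the service reports Ready, the following is one option, using the example name from the manifest above and an arbitrary timeout:

kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu-g5 \
  --namespace=default --timeout=20m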
3. Verify the deployment
Check the InferenceService status:
kubectl get InferenceService --namespace=default \
  -o custom-columns='NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\.com/gpu'
You should see:
- The name of your InferenceService.
- The model format being served.
- The endpoint URL.
- The ClusterServingRuntime used.
- The number of GPUs allocated.
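If the URL column stays empty or the service never becomes ready, two useful first checks are the predictor pods and the InferenceService events. A minimal troubleshooting sketch, assuming the example name and namespace used above:

# Inspect the predictor pods created for this InferenceService.
kubectl get pods -n default \
  -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu-g5

# Review status conditions and events (image pulls, GPU scheduling, probes).
kubectl describe inferenceservice llama-3-1-8b-instruct-1xgpu-g5 -n default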
What’s next
Now that your model is deployed:
- Your application (or Gen AI Builder agent) can send inference requests to the provided URL (a sample request follows this list).
- You can monitor model performance and resource usage.
- You can update the GPU allocation or other runtime parameters if needed.
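As an example of the first point, NIM LLM containers expose an OpenAI-compatible API, so a request such as the one below exercises the endpoint once it is ready. The URL placeholder, the model identifier (which varies by NIM image and release), and the payload are illustrative only; use the URL reported by the verification command in step 3.

curl -s http://<inference-service-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'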
Related topics
- Configure ClusterServingRuntime
- Deploy an NVIDIA NIM container with KServe (placeholder; a future page showing the full deployment flow from scratch)
- AI Factory Concepts
- AI Factory Terminology
Next steps
Explore more in the AI Factory learning guide.