Create an InferenceService for NVIDIA NIM Container

This How-To explains how to deploy an NVIDIA NIM container as an InferenceService using KServe on Kubernetes.

You will define and apply an InferenceService.yaml manifest that references an existing ClusterServingRuntime and runs a GPU-accelerated model instance.

This is a core step in making your model available for inference through a network-accessible endpoint.

Goal

Deploy an InferenceService for an NVIDIA NIM container using KServe.

Estimated time

10–15 minutes.

What you will accomplish

  • Define an InferenceService manifest.
  • Deploy the InferenceService to your Kubernetes cluster.
  • Make the model available via a network endpoint.

What this unlocks

Once deployed, your model can be consumed by applications (such as Gen AI Builder agents) via standard inference APIs. It also surfaces in your KServe monitoring through the Prometheus scraping annotations in the manifest, and it can be scaled or updated in place.

Prerequisites

  • Kubernetes cluster with KServe installed and configured.
  • ClusterServingRuntime configured for your NIM container. See Configure ClusterServingRuntime.
  • GPU-enabled node pool provisioned and labeled appropriately.
  • NVIDIA NGC API key stored as a Kubernetes secret, if required (example commands for creating the secrets follow this list).
  • The desired NIM container image available in your registry.
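
If you still need to create these secrets, the commands below are a minimal sketch. They assume the image pull secret name used in the manifest in this guide (edb-cred), the NVIDIA registry nvcr.io with its standard $oauthtoken username, and an illustrative name (nvidia-nim-secrets) for the NGC API key secret; adjust all of these to your environment and to what your ClusterServingRuntime expects.

# Image pull secret referenced by imagePullSecrets in the manifest below.
kubectl create secret docker-registry edb-cred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<ngc-api-key> \
  --namespace=default

# NGC API key secret (name is illustrative; match your ClusterServingRuntime).
kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=<ngc-api-key> \
  --namespace=default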

Steps

1. Create InferenceService manifest

Create a file named InferenceService.yaml with the following content:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
  name: llama-3-1-8b-instruct-1xgpu-g5
  namespace: default
spec:
  predictor:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    nodeSelector:
      nvidia.com/gpu: "true"
    imagePullSecrets:
      - name: edb-cred
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3

Key points:

  • name → provide a unique name for your InferenceService.
  • tolerations and nodeSelector → must match your GPU node pool configuration.
  • runtime and modelFormat → must match the values used in your ClusterServingRuntime (see the reference fragment after this list).
  • GPU resources → specify the number of GPUs to allocate.
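
For reference, the fields that must agree are the runtime's metadata.name and its supportedModelFormats entry. The fragment below is an illustrative sketch only, using the names from the manifest above and a hypothetical image reference; the authoritative definition is the one you created in Configure ClusterServingRuntime.

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3      # matches spec.predictor.model.runtime
spec:
  supportedModelFormats:
    - name: nvidia-nim-llama-3.1-8b-instruct        # matches spec.predictor.model.modelFormat.name
      autoSelect: true
  containers:
    - name: kserve-container
      image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3   # illustrative image reference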

2. Apply the InferenceService

Run:

kubectl apply -f InferenceService.yaml
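
The NIM container can take several minutes to pull and to download model weights on first start. If you want to block until the service reports ready, a kubectl wait call along these lines should work, using the resource name from the example manifest (adjust the name, namespace, and timeout to your deployment):

kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu-g5 \
  --namespace=default --timeout=20m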

3. Verify the deployment

Check the InferenceService status:

kubectl get InferenceService -o custom-columns='NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\.com/gpu' --namespace=default

You should see:

  • The name of your InferenceService.
  • The model format being served.
  • The endpoint URL.
  • The ClusterServingRuntime used.
  • The number of GPUs allocated.
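
If the URL column stays empty or the service never becomes ready, inspecting the predictor pod usually reveals the cause (image pull failures, missing secrets, or no schedulable GPU node). The label selector below is the one KServe typically applies to predictor pods; verify it against your cluster.

kubectl get pods -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu-g5 --namespace=default
kubectl describe pod <predictor-pod-name> --namespace=default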

What’s next

Now that your model is deployed:

  • Your application (or Gen AI Builder agent) can send inference requests to the provided URL (see the example request after this list).
  • You can monitor model performance and resource usage.
  • You can update the GPU allocation or other runtime parameters if needed.
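
As an example of the first bullet, a request along the following lines should work once the service is ready. It assumes the NIM container exposes the OpenAI-compatible chat completions API and that the model identifier matches what the container reports at /v1/models; substitute the URL returned in the verification step.

# <inference-url> is the URL column from the verification step.
curl -s <inference-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'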

Next steps

Explore more in the AI Factory learning guide.
