GPUs in Model Serving

GPU acceleration is essential for running modern deep learning models in production. Many Large Language Models (LLMs), embedding models, and vision models require GPUs to deliver acceptable inference performance.

Model Serving in AI Factory relies on GPU-enabled Kubernetes nodes for hosting KServe InferenceServices that run these models.

Why GPUs matter for Model Serving

  • Many NVIDIA NIM containers are designed to run on GPUs and provide optimized inference serving.
  • Model Serving in AI Factory supports GPU scheduling and resource control through Kubernetes.
  • GPU nodes enable serving models that would otherwise be too slow or expensive to run on CPUs.
  • GPU-based serving supports AIDB Knowledge Bases and GenAI Builder assistants at scale.

GPU usage in Hybrid Manager

Hybrid Manager (HCP) manages the Kubernetes infrastructure where Model Serving runs. In this context:

  • GPU-enabled node groups (AWS EKS) or node pools (GCP GKE, RHOS) must be provisioned.
  • These nodes must be labeled and tainted so that KServe model pods schedule onto them correctly (see the sketch after this list).
  • The NVIDIA Kubernetes device plugin must be installed to expose GPU resources to Kubernetes.
  • Kubernetes secrets must be created to store the NVIDIA API keys required by NIM models.
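
For example, labeling and tainting a GPU node and confirming that the device plugin is advertising GPUs might look like the following minimal sketch. The node name, label key, and taint key shown here (nvidia.com/gpu) are assumptions; use the conventions defined for your cluster.

```shell
# Label a GPU node so that KServe model pods can select it (label key is an example).
kubectl label node <gpu-node-name> nvidia.com/gpu=true

# Taint the node so that only pods tolerating the taint (such as GPU model pods) are scheduled on it.
kubectl taint node <gpu-node-name> nvidia.com/gpu=true:NoSchedule

# Once the NVIDIA device plugin is running, the node should advertise nvidia.com/gpu capacity.
kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"
```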

Actions you can take

To enable GPU-based Model Serving:

  1. Provision GPU node groups or node pools in your HCP Kubernetes cluster.
  2. Label and taint the GPU nodes correctly.
  3. Deploy the NVIDIA device plugin DaemonSet.
  4. Create a Kubernetes secret with your NVIDIA API key (example commands for steps 3 and 4 are sketched after this list).
  5. Deploy ClusterServingRuntime and InferenceService manifests targeting GPU nodes (see the example manifest below).
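
As an illustration, steps 3 and 4 might be carried out with commands like the following. The device plugin manifest path, namespace, secret name, and key name are placeholders, not authoritative values; refer to the NVIDIA k8s-device-plugin and NIM documentation for the exact manifests and secret format your models expect.

```shell
# Step 3: deploy the NVIDIA device plugin DaemonSet
# (replace <device-plugin-manifest.yaml> with the manifest from NVIDIA's k8s-device-plugin project).
kubectl apply -f <device-plugin-manifest.yaml>

# Step 4: store your NVIDIA API key in a Kubernetes secret
# (namespace, secret name, and key name are placeholders).
kubectl create secret generic nvidia-api-key \
  --namespace <model-serving-namespace> \
  --from-literal=NGC_API_KEY=<your-nvidia-api-key>
```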
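
For step 5, a minimal InferenceService that targets GPU nodes could look like the sketch below. The name, runtime, model format, label and taint keys, and GPU count are illustrative assumptions; match them to the ClusterServingRuntime and NIM model you actually deploy.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-nim-model               # placeholder name
spec:
  predictor:
    # Schedule onto labeled GPU nodes and tolerate the GPU taint (keys are examples).
    nodeSelector:
      nvidia.com/gpu: "true"
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    model:
      runtime: example-nim-runtime      # placeholder ClusterServingRuntime name
      modelFormat:
        name: nvidia-nim                # placeholder model format
      resources:
        limits:
          nvidia.com/gpu: "1"           # request one GPU for the model container
```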
