GPUs in Model Serving
GPU acceleration is essential for running modern deep learning models in production. Many Large Language Models (LLMs), embedding models, and vision models require GPUs to deliver acceptable inference performance.
Model Serving in AI Factory relies on GPU-enabled Kubernetes nodes for hosting KServe InferenceServices that run these models.
Why GPUs matter for Model Serving
- Many NVIDIA NIM containers are designed to run on GPUs and provide optimized inference serving.
- Model Serving in AI Factory supports GPU scheduling and resource control through standard Kubernetes resource requests and limits (see the sketch after this list).
- GPU nodes enable serving models that would otherwise be too slow or expensive to run on CPUs.
- GPU-based serving supports AIDB Knowledge Bases and GenAI Builder assistants at scale.
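To make the scheduling point concrete, the fragment below is a minimal, hypothetical container spec showing how a model pod asks Kubernetes for a GPU. The image name is a placeholder, not an AI Factory default, and the `nvidia.com/gpu` resource only becomes schedulable once the NVIDIA device plugin is installed (see the next section).

```yaml
# Hypothetical container spec fragment (not an AI Factory default):
# the pod requests one GPU, so Kubernetes schedules it only onto a node
# whose device plugin advertises an allocatable nvidia.com/gpu.
containers:
  - name: kserve-container
    image: registry.example.com/llm-server:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        nvidia.com/gpu: "1"   # GPU count exposed by the NVIDIA device plugin
```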
GPU usage in Hybrid Manager
Hybrid Manager (HCP) manages the Kubernetes infrastructure where Model Serving runs. In this context:
- GPU-enabled node groups (AWS EKS) or node pools (GCP GKE, RHOS) must be provisioned.
- These nodes must be labeled and tainted so that KServe model pods schedule onto them while unrelated workloads stay off.
- The NVIDIA Kubernetes device plugin must be installed to expose GPU resources to Kubernetes.
- Kubernetes secrets must be created to store the NVIDIA API keys that NIM models require (illustrative examples follow this list).
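The exact label keys, taint keys, and secret names depend on your HCP environment and the How-To guide you follow; the manifests below are only a sketch with hypothetical names. The first document shows what a labeled and tainted GPU node can look like (in practice the labels and taints are usually applied through the node group or pool configuration, or with kubectl), and the second stores an NVIDIA API key in a Secret that GPU-backed NIM pods can reference.

```yaml
# Sketch only: label, taint, secret, and key names below are hypothetical,
# not AI Factory defaults.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-example
  labels:
    nvidia.com/gpu.present: "true"      # label that GPU workloads select on
spec:
  taints:
    - key: nvidia.com/gpu               # taint keeps non-GPU workloads off the node
      value: "true"
      effect: NoSchedule
---
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets              # hypothetical secret name
type: Opaque
stringData:
  NGC_API_KEY: "<your NVIDIA API key>"  # placeholder value; key name is an assumption
```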
Actions you can take
To enable GPU-based Model Serving:
- Provision GPU node groups in your HCP Kubernetes cluster.
- Label and taint GPU nodes correctly.
- Deploy the NVIDIA device plugin DaemonSet.
- Create a Kubernetes secret with your NVIDIA API key.
- Deploy ClusterServingRuntime and InferenceService manifests that target GPU nodes (a sketch follows this list).
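As a rough illustration of the last step, the manifest below sketches an InferenceService that targets the labeled and tainted GPU nodes and reads the API key from the secret shown earlier. Every name in it (runtime, model format, labels, secret, env var) is a hypothetical placeholder, not an AI Factory default; substitute the values from your ClusterServingRuntime and HCP setup.

```yaml
# Sketch only: all names below are placeholders; substitute values from
# your HCP environment and ClusterServingRuntime.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
spec:
  predictor:
    nodeSelector:
      nvidia.com/gpu.present: "true"    # schedule onto labeled GPU nodes
    tolerations:
      - key: nvidia.com/gpu             # tolerate the GPU node taint
        operator: Exists
        effect: NoSchedule
    model:
      runtime: nvidia-nim-example       # hypothetical ClusterServingRuntime name
      modelFormat:
        name: nvidia-nim                # hypothetical model format
      env:
        - name: NGC_API_KEY             # hypothetical env var read from the secret
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      resources:
        limits:
          nvidia.com/gpu: "1"           # one GPU per replica
```

Applying a manifest like this with `kubectl apply -f` creates the predictor; once the pod tolerates the taint and a node can satisfy the GPU limit, KServe brings up the model endpoint.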
Related concepts
- Model Serving overview
- KServe in AI Factory concepts
- AI Factory Learning Paths
- Model Serving How-To Guides
Next steps
- Follow the How-To Guide: Setup GPU resources in HCP.