Update GPU Resources for an InferenceService
This How-To explains how to adjust the number of GPUs allocated to an existing InferenceService deployed with KServe, so you can scale a model deployment without redeploying it from scratch.
Goal
Change the GPU resource allocation (number of GPUs) for a deployed InferenceService.
Estimated time
5 minutes.
What you will accomplish
- Edit an existing InferenceService resource.
- Apply updated GPU resource limits and requests.
- Trigger KServe to redeploy the model container with new GPU settings.
What this unlocks
Enables dynamic scaling of GPU resources to optimize performance and cost:
- Increase GPUs for faster response time and higher throughput.
- Reduce GPUs when demand is lower to save resources.
Prerequisites
- An InferenceService already deployed using KServe. See Create InferenceService for NVIDIA NIM Container.
- GPU-enabled node pool available in your Kubernetes cluster (see the check after this list).
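Before scaling, you can optionally confirm how many GPUs your nodes actually expose. This is a quick sketch; it assumes the NVIDIA device plugin is installed and advertises the nvidia.com/gpu resource:

kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\\.com/gpu

Nodes without GPUs show <none> in the GPUS column.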
Steps
1. Edit the InferenceService
Run:
kubectl edit InferenceService <your-inferenceservice-name>
Example:
kubectl edit InferenceService llama-3-1-8b-instruct-1xgpu-g5
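If you are unsure of the exact service name, you can list the deployed InferenceServices first (this assumes the default namespace; adjust as needed):

kubectl get InferenceService --namespace=default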
2. Update GPU resource settings
Locate this section:
spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
Change both limits and requests to the desired number of GPUs. Example to scale to 4 GPUs:
spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          nvidia.com/gpu: "4"
Save and close the editor.
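Alternatively, you can make the same change non-interactively with kubectl patch. This is an equivalent sketch, assuming the default namespace; substitute your service name and desired GPU count:

kubectl patch InferenceService <your-inferenceservice-name> --namespace=default --type=merge -p '{"spec":{"predictor":{"model":{"resources":{"limits":{"nvidia.com/gpu":"4"},"requests":{"nvidia.com/gpu":"4"}}}}}}'

The merge patch type is used here because strategic merge patches are not supported on custom resources such as InferenceService.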
3. Verify the updated GPU allocation
Run:
kubectl get InferenceService --namespace=default \
  -o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu
Confirm that the GPUs column reflects your updated setting.
After you save the change, KServe automatically rolls out the InferenceService with the new configuration; the predictor pods are recreated with the updated GPU allocation.
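To follow the rollout until the new pods are running, you can watch the predictor pods. KServe typically labels these pods with serving.kserve.io/inferenceservice=<name>; if your deployment uses a different label, adjust the selector accordingly:

kubectl get pods -l serving.kserve.io/inferenceservice=<your-inferenceservice-name> --namespace=default -w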
Related topics
- Create InferenceService for NVIDIA NIM Container
- Configure ClusterServingRuntime
- Deploy NVIDIA NIM container with KServe (placeholder)
- AI Factory Concepts
- AI Factory Terminology
Next steps
Explore more in the Analytics & AI Factory learning guide.