Update GPU Resources for an InferenceService

This how-to explains how to adjust the number of GPUs allocated to an existing InferenceService deployed with KServe, so you can rescale a model deployment without redeploying it from scratch.

Goal

Change the GPU resource allocation (number of GPUs) for a deployed InferenceService.

Estimated time

5 minutes.

What you will accomplish

  • Edit an existing InferenceService resource.
  • Apply updated GPU resource limits and requests.
  • Trigger KServe to redeploy the model container with new GPU settings.

What this unlocks

Enables dynamic scaling of GPU resources to optimize performance and cost:

  • Increase GPUs for faster response time and higher throughput.
  • Reduce GPUs when demand is lower to save resources.

Prerequisites

  • A Kubernetes cluster with KServe installed and an existing InferenceService you can modify.
  • kubectl configured with permission to edit InferenceService resources in the target namespace.
  • Enough unallocated GPUs on the cluster's nodes to satisfy the new request.

Steps

1. Edit the InferenceService

Run:

kubectl edit InferenceService <your-inferenceservice-name>

Example:

kubectl edit InferenceService llama-3-1-8b-instruct-1xgpu-g5
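
If you are not sure of the exact service name, list the InferenceServices in the namespace first (this sketch assumes the default namespace). KServe also registers the short name isvc, so kubectl get isvc works as well:

# List InferenceServices to find the exact name
kubectl get inferenceservice --namespace=default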

2. Update GPU resource settings

Locate this section:

spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"

Change both limits and requests to the desired number of GPUs. Kubernetes requires GPU requests and limits to be equal, since extended resources such as nvidia.com/gpu cannot be overcommitted. For example, to scale to 4 GPUs:

spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          nvidia.com/gpu: "4"

Save and close the editor.
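
If you prefer a scriptable alternative to the interactive editor, the same change can be applied with kubectl patch. The sketch below is a minimal example; the service name and namespace are carried over from the earlier example, so substitute your own:

# Apply the same GPU change non-interactively with a JSON merge patch
kubectl patch inferenceservice llama-3-1-8b-instruct-1xgpu-g5 --namespace=default \
  --type=merge \
  --patch '{"spec":{"predictor":{"model":{"resources":{"limits":{"nvidia.com/gpu":"4"},"requests":{"nvidia.com/gpu":"4"}}}}}}'

Note that if no node has enough free GPUs to satisfy the new request, the replacement pod will remain Pending until capacity becomes available.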

3. Verify the updated GPU allocation

Run:

kubectl get InferenceService --namespace=default \
  -o custom-columns='NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\.com/gpu'

Confirm that the GPUs column reflects your updated setting.

KServe automatically redeploys the InferenceService with the new configuration, rolling out replacement pods and retiring the old ones once the new pods are ready.
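
To follow the rollout as it happens, you can watch the pods that belong to the service. This assumes KServe's standard serving.kserve.io/inferenceservice pod label and the service name from the earlier example:

# Watch the service's pods until the replacement reaches Running
kubectl get pods --namespace=default \
  -l serving.kserve.io/inferenceservice=llama-3-1-8b-instruct-1xgpu-g5 \
  --watch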

Next steps

Explore more in the Analytics & AI Factory learning guide.

