How Model Serving Deployment Works

AI Factory makes it easy to deploy your AI models as scalable, production-ready inference services. The platform uses KServe as the model serving engine, operating within the Hybrid Manager (HCP) Kubernetes infrastructure.

This page explains the general flow of model deployment and links to key how-to guides for hands-on instructions.


Deployment flow overview

  • You deploy models by creating KServe InferenceServices in your HCP project (see the sketch after this list).
  • AI Factory provides GPU-enabled Kubernetes infrastructure to run these services.
  • You can deploy supported NVIDIA NIM containers or other compatible models.
  • The Model Library helps you discover and manage model images.
  • Applications access model endpoints over HTTP or gRPC APIs.
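
This flow can also be scripted. The sketch below is a minimal example, assuming the Kubernetes Python client, a hypothetical project namespace (my-hcp-project), and a placeholder NIM image reference; in practice, use the image reference from the Model Library and the namespace of your HCP project.

```python
# Sketch: create a KServe InferenceService with the Kubernetes Python client.
# The namespace, service name, and image are placeholders, not real values.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-model", "namespace": "my-hcp-project"},
    "spec": {
        "predictor": {
            "containers": [
                {
                    "name": "kserve-container",
                    "image": "nvcr.io/nim/example-model:latest",  # hypothetical NIM image
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            ]
        }
    },
}

# InferenceService is a namespaced custom resource managed by KServe.
api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-hcp-project",
    plural="inferenceservices",
    body=inference_service,
)
```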

Deployment components

KServe InferenceService

Each model is deployed via a Kubernetes-native InferenceService object:

  • Manages the lifecycle of the model server pods.
  • Handles scaling, health checks, and routing.
  • Exposes a network endpoint for model consumption (example request below).
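
Once the InferenceService is ready, applications call its endpoint directly. The exact request format depends on the serving runtime; for example, NVIDIA NIM containers typically expose an OpenAI-compatible API, while standard KServe runtimes use the v1 or v2 inference protocols. The sketch below uses a v2 (Open Inference Protocol) style request with a hypothetical endpoint URL, model name, and payload.

```python
# Sketch: call a deployed model endpoint over HTTP.
# URL, model name, and payload shape are placeholders; adapt them to your runtime.
import requests

BASE_URL = "https://example-model.my-hcp-project.example.com"  # hypothetical endpoint

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": [[0.1, 0.2, 0.3]],
        }
    ]
}

response = requests.post(
    f"{BASE_URL}/v2/models/example-model/infer", json=payload, timeout=30
)
response.raise_for_status()
print(response.json())
```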

ClusterServingRuntime

Advanced users can also configure ClusterServingRuntime resources to customize runtime environments for their models.
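
As a rough illustration, a ClusterServingRuntime is registered like any other cluster-scoped custom resource. The sketch below uses the Kubernetes Python client with placeholder runtime, image, and model-format names; consult the KServe documentation for the full schema.

```python
# Sketch: register a custom ClusterServingRuntime (a cluster-scoped KServe resource).
# Runtime name, image, and model format are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ClusterServingRuntime",
    "metadata": {"name": "example-custom-runtime"},
    "spec": {
        "supportedModelFormats": [{"name": "example-format", "autoSelect": True}],
        "containers": [
            {
                "name": "kserve-container",
                "image": "registry.example.com/custom-runtime:latest",  # hypothetical image
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        ],
    },
}

api.create_cluster_custom_object(
    group="serving.kserve.io",
    version="v1alpha1",
    plural="clusterservingruntimes",
    body=serving_runtime,
)
```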


Where to start

If you're ready to deploy models, follow these guides:


Hybrid Manager integration

Model Serving runs on Hybrid Manager (HCP) Kubernetes clusters. For more on Hybrid Manager and GPU setup:


Best practices

  • Use the Model Library to select supported models.
  • Verify that your cluster has sufficient GPU resources.
  • Monitor deployed models to ensure performance and availability (a basic readiness check is sketched after this list).
  • Use ClusterServingRuntime where advanced customization is needed.
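
For a quick availability check, the InferenceService status conditions can be read back through the Kubernetes API. A minimal sketch, again with placeholder namespace and service names:

```python
# Sketch: check whether a deployed InferenceService reports Ready.
# Namespace and service name are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-hcp-project",
    plural="inferenceservices",
    name="example-model",
)

# An InferenceService is healthy when its "Ready" condition is "True".
for condition in isvc.get("status", {}).get("conditions", []):
    print(condition.get("type"), condition.get("status"))
```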

Next steps


By following this deployment flow, you can run AI models in production with observability and scaling, directly integrated with the broader AI Factory and Hybrid Manager ecosystem.

