How Model Serving Deployment Works

AI Factory makes it easy to deploy your AI models as scalable, production-ready inference services. The platform uses KServe as the model serving engine, operating within the Hybrid Manager (HCP) Kubernetes infrastructure.

This page explains the general flow of model deployment and links to key how-to guides for hands-on instructions.


Deployment flow overview

  • You deploy models by creating KServe InferenceServices in your HCP project (see the sketch after this list).
  • AI Factory provides GPU-enabled Kubernetes infrastructure to run these services.
  • You can deploy supported NVIDIA NIM containers or other compatible models.
  • The Model Library helps you discover and manage model images.
  • Applications access model endpoints over HTTP or gRPC APIs.
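
This flow can also be scripted. The sketch below is a minimal example, assuming the Kubernetes Python client, a hypothetical project namespace (my-hcp-project), and a placeholder NIM image reference; in practice, use the image reference from the Model Library and the namespace of your HCP project.

```python
# Sketch: create a KServe InferenceService with the Kubernetes Python client.
# The namespace, service name, and image are placeholders, not real values.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-model", "namespace": "my-hcp-project"},
    "spec": {
        "predictor": {
            "containers": [
                {
                    "name": "kserve-container",
                    "image": "nvcr.io/nim/example-model:latest",  # hypothetical NIM image
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            ]
        }
    },
}

# InferenceService is a namespaced custom resource managed by KServe.
api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-hcp-project",
    plural="inferenceservices",
    body=inference_service,
)
```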

Deployment components

KServe InferenceService

Each model is deployed via a Kubernetes-native InferenceService object:

  • Manages the lifecycle of the model server pods.
  • Handles scaling, health checks, and routing.
  • Exposes a network endpoint for model consumption (example request below).
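
Once the InferenceService is ready, applications call its endpoint directly. The exact request format depends on the serving runtime; for example, NVIDIA NIM containers typically expose an OpenAI-compatible API, while standard KServe runtimes use the v1 or v2 inference protocols. The sketch below uses a v2 (Open Inference Protocol) style request with a hypothetical endpoint URL, model name, and payload.

```python
# Sketch: call a deployed model endpoint over HTTP.
# URL, model name, and payload shape are placeholders; adapt them to your runtime.
import requests

BASE_URL = "https://example-model.my-hcp-project.example.com"  # hypothetical endpoint

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": [[0.1, 0.2, 0.3]],
        }
    ]
}

response = requests.post(
    f"{BASE_URL}/v2/models/example-model/infer", json=payload, timeout=30
)
response.raise_for_status()
print(response.json())
```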

ClusterServingRuntime

Advanced users can also configure ClusterServingRuntime resources to customize runtime environments for their models.
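
As a rough illustration, a ClusterServingRuntime is registered like any other cluster-scoped custom resource. The sketch below uses the Kubernetes Python client with placeholder runtime, image, and model-format names; consult the KServe documentation for the full schema.

```python
# Sketch: register a custom ClusterServingRuntime (a cluster-scoped KServe resource).
# Runtime name, image, and model format are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ClusterServingRuntime",
    "metadata": {"name": "example-custom-runtime"},
    "spec": {
        "supportedModelFormats": [{"name": "example-format", "autoSelect": True}],
        "containers": [
            {
                "name": "kserve-container",
                "image": "registry.example.com/custom-runtime:latest",  # hypothetical image
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        ],
    },
}

api.create_cluster_custom_object(
    group="serving.kserve.io",
    version="v1alpha1",
    plural="clusterservingruntimes",
    body=serving_runtime,
)
```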


Where to start

If you're ready to deploy models, follow these guides:


Hybrid Manager integration

Model Serving runs on Hybrid Manager (HCP) Kubernetes clusters. For more on Hybrid Manager and GPU setup:


Best practices

  • Use the Model Library to select supported models.
  • Verify that your cluster has sufficient GPU resources.
  • Monitor deployed models to ensure performance and availability (a basic readiness check is sketched after this list).
  • Use ClusterServingRuntime where advanced customization is needed.
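
For a quick availability check, the InferenceService status conditions can be read back through the Kubernetes API. A minimal sketch, again with placeholder namespace and service names:

```python
# Sketch: check whether a deployed InferenceService reports Ready.
# Namespace and service name are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-hcp-project",
    plural="inferenceservices",
    name="example-model",
)

# An InferenceService is healthy when its "Ready" condition is "True".
for condition in isvc.get("status", {}).get("conditions", []):
    print(condition.get("type"), condition.get("status"))
```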

Next steps


By following this deployment flow, you can run AI models in production with observability and scaling, directly integrated with the broader AI Factory and Hybrid Manager ecosystem.

