# Model Serving Explained
Model Serving in AI Factory lets you deploy AI models as scalable, production-grade inference services running on your Kubernetes infrastructure.
It provides a Kubernetes-native architecture based on KServe, so your models can serve predictions and embeddings over network-accessible APIs.
AI Factory Model Serving is optimized to support enterprise-class AI workloads with:
- GPU-accelerated infrastructure
- Flexible scaling
- Integrated observability
- Sovereign AI alignment — models run under your governance
- Seamless integration with Gen AI Builder, Knowledge Bases, and other AI Factory pipelines
## Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics
- Understanding of KServe and InferenceService
- Awareness of the Model Library → Model Serving workflow in AI Factory
- Understanding of Sovereign AI principles — models running under your governance
## How Model Serving works

### Core stack
| Layer | Purpose |
|---|---|
| AI Factory | Provides infrastructure and Model Serving APIs |
| Hybrid Manager Kubernetes cluster | Hosts model serving workloads |
| KServe | Manages the model serving lifecycle and APIs |
| InferenceService | Deployed model resource |
| Model Library | Manages model image versions |
| GPU nodes | Run high-performance model serving pods |
| User applications | Call model endpoints via REST/gRPC |
### Key components
- InferenceService — Kubernetes CRD representing a deployed model (a minimal example follows this list).
- ServingRuntime / ClusterServingRuntime — define reusable runtime configurations that InferenceServices reference.
- Model containers — AI Factory 1.2 focuses on NVIDIA NIM containers.
- Observability — integrated Prometheus-compatible metrics and Kubernetes logging.
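To make the InferenceService concept concrete, here is a minimal sketch of the kind of manifest KServe accepts. The name, namespace, model format, and runtime reference below are illustrative placeholders, not AI Factory defaults; in practice the Model Library → Model Serving flow generates this resource for you.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b-instruct        # hypothetical model name
  namespace: models               # hypothetical namespace
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: nvidia-nim          # hypothetical format; must match a ServingRuntime
      runtime: nvidia-nim-runtime # hypothetical ClusterServingRuntime name
      resources:
        limits:
          nvidia.com/gpu: "1"     # schedule onto a GPU node
```

KServe reconciles this resource into serving pods, routes REST/gRPC traffic to them, and reports readiness through the resource's status.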
## Supported models
AI Factory Model Serving currently supports NVIDIA NIM containers for:
| Model Type | Example Usage |
|---|---|
| Text completion | LLM agents, assistants |
| Text embeddings | Knowledge Bases, RAG |
| Text reranking | RAG pipelines |
| Image embeddings | Multi-modal search |
| Image OCR | Document extraction |
See: Supported Models
## Deployment architecture
Applications → Model Endpoints (REST/gRPC) → KServe → GPU-enabled Kubernetes → Model Containers
- Each model is isolated in its own InferenceService.
- KServe manages:
  - Model lifecycle (start, stop, update)
  - Scaling, including scale-to-zero (see the sketch after this list)
  - Endpoint routing (REST/gRPC)
- GPU resources are provisioned and scheduled through Hybrid Manager integration.
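As one example of the scaling behavior described above, scale-to-zero in KServe is typically controlled through the predictor's replica bounds. This fragment is a sketch of the relevant fields, not AI Factory's default configuration:

```yaml
spec:
  predictor:
    minReplicas: 0   # allow KServe to scale the model to zero when idle
    maxReplicas: 4   # cap GPU consumption under load
```

With `minReplicas: 0`, the first request after an idle period incurs a cold start while the model container is scheduled and loaded, which matters for large GPU-backed models.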
## Patterns of use

### Gen AI Builder
- LLM endpoints power Assistants and Agents.
- Embedding models support hybrid RAG pipelines.
### Knowledge Bases
- Embedding models serve vectorization needs.
- Retrieval and reranking models power semantic search pipelines.
### Custom applications
- Business applications can consume InferenceService endpoints for:
  - Real-time predictions
  - Image analysis
  - Text processing
## Best practices
- Deploy models via the Model Library → Model Serving flow to ensure governance.
- Use ClusterServingRuntime for reusable runtime configurations (a sketch follows this list).
- Monitor GPU utilization and model latency closely.
- Test scale-to-zero behavior before relying on it in production.
- Ensure Model Library tags are versioned and documented.
- Regularly audit deployed InferenceServices as part of Sovereign AI governance.
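As a sketch of the reusable runtime configuration recommended above, a ClusterServingRuntime captures container and model-format settings once so that many InferenceServices can reference it. All names and the image reference below are hypothetical:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-runtime                # hypothetical; referenced by InferenceServices
spec:
  supportedModelFormats:
    - name: nvidia-nim                    # hypothetical format label
      version: "1"
  containers:
    - name: kserve-container
      image: nvcr.io/nim/example/model:1.0  # hypothetical NIM image reference
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: "1"
```

Defining the runtime at cluster scope keeps model deployments small and consistent: each InferenceService only names the runtime and its model format rather than repeating container details.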
## Summary
Model Serving in AI Factory provides a robust, scalable architecture for serving production AI models:
- Kubernetes-native serving with KServe
- GPU acceleration and optimized serving runtimes
- Integrated observability and governance
- Tight integration with AI Factory components: Gen AI Builder, Knowledge Bases, custom AI pipelines
Model Serving helps you implement Sovereign AI — with your models, on your infrastructure, under your control.
## Next steps
- Deploy your first InferenceService
- Verify deployed models
- Deploy NVIDIA NIM containers
- Explore Model Library
Model Serving gives you a powerful foundation for building intelligent applications and data products — securely, scalably, and under your governance — as part of EDB Postgres® AI.