Model Serving Explained

Model Serving in AI Factory allows you to deploy AI models as scalable, production-grade inference services — running on your Kubernetes infrastructure.

It provides a Kubernetes-native architecture based on KServe, giving your models the ability to serve predictions and embeddings over network-accessible APIs.

AI Factory Model Serving is optimized to support enterprise-class AI workloads with:

  • GPU-accelerated infrastructure
  • Flexible scaling
  • Integrated observability
  • Sovereign AI alignment — models run under your governance
  • Seamless integration with Gen AI Builder, Knowledge Bases, and other AI Factory pipelines

Before you start

Prerequisites for understanding Model Serving:

  • Familiarity with Kubernetes basics
  • Understanding of KServe and InferenceService
  • Awareness of the Model Library → Model Serving workflow in AI Factory
  • Understanding of Sovereign AI principles — models running under your governance

Suggested reading:


How Model Serving works

Core stack

Layer | Purpose
AI Factory | Provides infrastructure and Model Serving APIs
Hybrid Manager Kubernetes Cluster | Hosts model-serving workloads
KServe | Manages model serving lifecycle and APIs
InferenceService | Deployed model resource
Model Library | Manages model image versions
GPU Nodes | Run high-performance model serving pods
User Applications | Call model endpoints via REST/gRPC

Key components

  • InferenceService — Kubernetes CRD representing a deployed model (see the sketch after this list).
  • ServingRuntime / ClusterServingRuntime — Define reusable runtime configurations.
  • Model containers — Currently focused on NVIDIA NIM containers in AI Factory 1.2.
  • Observability — Integrated Prometheus-compatible metrics, Kubernetes logging.
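For illustration, the following is a minimal sketch of creating an InferenceService from code using the official Kubernetes Python client. The name, namespace, model format, and runtime below are hypothetical placeholders, not AI Factory defaults; in practice you would typically deploy through the Model Library → Model Serving flow or apply an equivalent YAML manifest.

```python
# Minimal sketch: creating a KServe InferenceService with the Kubernetes
# Python client. The name, namespace, modelFormat, and runtime are
# hypothetical placeholders -- adapt them to your AI Factory environment.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "demo-llm", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "demo-format"},  # placeholder model format
                "runtime": "demo-serving-runtime",       # placeholder (Cluster)ServingRuntime
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```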

Supported models

AI Factory Model Serving currently supports NVIDIA NIM containers for:

Model Type | Example Usage
Text Completion | LLM agents, Assistants
Text Embeddings | Knowledge Bases, RAG
Text Reranking | RAG pipelines
Image Embeddings | Multi-modal search
Image OCR | Document extraction

See: Supported Models


Deployment architecture

Applications → Model Endpoints (REST/gRPC) → KServe → GPU-enabled Kubernetes → Model Containers

  • Each model is isolated in its own InferenceService.
  • KServe manages:
      • Model lifecycle (start, stop, update)
      • Scaling (including scale-to-zero)
      • Endpoint routing (REST/gRPC) — see the status sketch after this list
  • GPU resources are provisioned and scheduled via Hybrid Manager integration.
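As a minimal sketch, the snippet below reads the readiness and endpoint URL that KServe publishes in the InferenceService status. The resource name and namespace are placeholders.

```python
# Minimal sketch: reading the endpoint URL and readiness of an InferenceService.
# "demo-llm" and "models" are placeholder values.
from kubernetes import client, config

config.load_kube_config()

isvc = client.CustomObjectsApi().get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    name="demo-llm",
)

status = isvc.get("status", {})
print("Endpoint URL:", status.get("url"))
for condition in status.get("conditions", []):
    if condition.get("type") == "Ready":
        print("Ready:", condition.get("status"))
```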

Patterns of use

Gen AI Builder

  • LLM endpoints power Assistants and Agents.
  • Embedding models support hybrid RAG pipelines.

Knowledge Bases

  • Embedding models serve vectorization needs (see the embedding call sketch below).
  • Retrieval and reranking models power semantic search pipelines.
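As a sketch of the vectorization call, the example below requests embeddings over REST. It assumes the deployed embedding container exposes an OpenAI-compatible API, which is typical for NVIDIA NIM embedding containers; the endpoint URL and model name are placeholders, and some embedding models require additional parameters such as an input type.

```python
# Minimal sketch: requesting embeddings from a deployed text-embedding model.
# Assumes an OpenAI-compatible API; the URL and model name are placeholders.
# Some embedding models also require extra parameters (for example, input type).
import requests

ENDPOINT = "https://demo-embedder.models.example.com"  # placeholder InferenceService URL

response = requests.post(
    f"{ENDPOINT}/v1/embeddings",
    json={
        "model": "demo-embedder",
        "input": ["Model serving turns trained models into network services."],
    },
    timeout=60,
)
response.raise_for_status()
vector = response.json()["data"][0]["embedding"]
print(f"Embedding dimension: {len(vector)}")
```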

Custom applications

  • Business applications can consume InferenceService endpoints (see the client sketch below) for:
      • Real-time predictions
      • Image analysis
      • Text processing
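As a sketch of what a consuming application might look like, the example below sends a chat completion request over REST. It assumes the deployed model container exposes an OpenAI-compatible API, which is typical for NVIDIA NIM LLM containers; the endpoint URL, model name, and authentication handling are placeholders for your environment.

```python
# Minimal sketch: calling a deployed text-completion model over REST.
# Assumes an OpenAI-compatible API; the URL, model name, and any
# authentication are placeholders.
import requests

ENDPOINT = "https://demo-llm.models.example.com"  # placeholder InferenceService URL

response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": "demo-llm",
        "messages": [{"role": "user", "content": "Summarize what model serving is."}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```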

Best practices

  • Deploy models via the Model Library → Model Serving flow to ensure governance.
  • Use ClusterServingRuntime for reusable runtime configs.
  • Monitor GPU utilization and model latency closely.
  • Test scale-to-zero configurations for readiness in production.
  • Ensure Model Library tags are versioned and documented.
  • Regularly audit deployed InferenceServices as part of Sovereign AI governance (see the audit sketch below).
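One way to support such audits is sketched below: listing InferenceServices across all namespaces with the Kubernetes Python client and printing their endpoint URLs. Fields beyond name, namespace, and URL depend on how each model was deployed, so treat this as a starting point rather than a complete audit.

```python
# Minimal sketch: listing deployed InferenceServices cluster-wide as part of a
# governance audit. Inspect each item's spec for model image and version
# details specific to your environment.
from kubernetes import client, config

config.load_kube_config()

isvcs = client.CustomObjectsApi().list_cluster_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    plural="inferenceservices",
)

for item in isvcs.get("items", []):
    meta = item["metadata"]
    status = item.get("status", {})
    print(f"{meta['namespace']}/{meta['name']}: url={status.get('url')}")
```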

Summary

Model Serving in AI Factory provides a robust, scalable architecture for serving production AI models:

  • Kubernetes-native serving with KServe
  • GPU acceleration and optimized serving runtimes
  • Integrated observability and governance
  • Tight integration with AI Factory components: Gen AI Builder, Knowledge Bases, custom AI pipelines

Model Serving helps you implement Sovereign AI — with your models, on your infrastructure, under your control.


Next steps


Model Serving gives you a powerful foundation for building intelligent applications and data products — securely, scalably, and under your governance — as part of EDB Postgres® AI.


