# Model Serving Concepts
Model Serving is a core capability of EDB Postgres® AI (EDB PG AI), enabling scalable, flexible, and high-performance serving of AI/ML models on Kubernetes.
It powers:
- Gen AI applications
- Intelligent retrieval systems
- Advanced data pipelines
Model Serving is implemented using KServe, an open-source Kubernetes-native engine for standardized model inference. AI Factory integrates KServe with Hybrid Manager to provide enterprise-grade lifecycle management, security, and observability.
Key to Sovereign AI: Models run on your Kubernetes clusters, under your control, with full observability and governance.
## Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics (pods, services, deployments)
- Understanding of InferenceService as a Kubernetes CRD
- Awareness of Model Library and how models are registered and deployed
- Understanding of Sovereign AI principles in EDB PG AI
## Why it matters
Model Serving enables your AI models and Postgres data to work together seamlessly — securely and scalably — under Sovereign AI principles:
- Deploy open-source or commercial models to your Kubernetes cluster.
- Serve Gen AI models for Assistants, Knowledge Bases, and RAG pipelines.
- Support multi-modal retrieval (text, image, hybrid search).
- Optimize performance with GPU acceleration and server-side batching.
- Maintain full observability, auditing, and governance over model usage.
See also: Hybrid Manager Model Serving integration in production environments.
## Core concepts

### InferenceService (via KServe)
At the core of Model Serving is the InferenceService — a Kubernetes-native resource that represents a deployed model.
It defines the end-to-end serving pipeline (a minimal example follows the list):
- Predictor — Runs the model server and handles inference.
- Transformer (optional) — Applies pre-processing or post-processing.
- Explainer (optional) — Provides model explainability outputs.
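A minimal manifest is sketched below; the name, model format, and storage location are placeholders rather than a definitive recipe, and the full schema is documented in the KServe docs:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                    # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                    # any supported format, e.g. pytorch, onnx
      storageUri: s3://models/example/v1 # hypothetical model location
```

Applying a manifest like this (or deploying through the Model Serving UI/CLI flow described below) is what creates the serving pods and endpoint.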
### Predictor

The Predictor defines (see the sketch after this list):
- Model format — PyTorch, TensorFlow, ONNX, Triton, etc.
- Model location — S3-compatible storage, OCI registry, PVC.
- Resources — CPU, memory, GPU.
- Autoscaling — Policies for elastic scaling, including scale-to-zero.
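As a sketch of how these fields combine, the predictor below requests one GPU and allows scale-to-zero; the values and storage path are illustrative, and scale-to-zero assumes a Knative-backed deployment:

```yaml
spec:
  predictor:
    minReplicas: 0                       # scale to zero when idle (Knative)
    maxReplicas: 4                       # autoscaling upper bound
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/example/v2 # hypothetical model location
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          nvidia.com/gpu: "1"            # one NVIDIA GPU
```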
### ServingRuntime / ClusterServingRuntime

Reusable runtime definitions for serving (an example follows below):
- ServingRuntime — Namespace-scoped.
- ClusterServingRuntime — Cluster-wide reusable runtimes.
Benefits:
- Tailor runtime settings to model type and hardware.
- Standardize runtime configurations across teams and projects.
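As an illustration, a namespace-scoped runtime for ONNX models might look like the sketch below; the image and arguments are assumptions rather than a supported configuration, and changing the kind to ClusterServingRuntime makes it reusable cluster-wide:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-onnx-runtime             # hypothetical name
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
      autoSelect: true                   # auto-match ONNX models to this runtime
  containers:
    - name: kserve-container
      image: registry.example.com/tritonserver:x.y # hypothetical image
      args:
        - tritonserver
        - --model-repository=/mnt/models
```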
## How it works

### Lifecycle flow

1. Register the model image in the Model Library.
2. Deploy the model via the Model Serving UI or CLI, which creates an InferenceService.
3. AI Factory and KServe provision Kubernetes resources (pods, services).
4. The model is loaded into the runtime container.
5. A Kubernetes service endpoint is exposed.
6. Clients send HTTP/gRPC inference requests (see the example request below).
7. Requests may pass through Transformers and Explainers.
8. The inference response is returned.
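For step 6, under KServe's V1 protocol a client POSTs a JSON body to the model's `:predict` route (for example `/v1/models/example-model:predict`). The body below is shown as YAML for readability; on the wire it is JSON, and the feature values are purely illustrative:

```yaml
# V1 protocol request body (JSON on the wire), shown as YAML.
instances:
  - [6.8, 2.8, 4.8, 1.4]                 # illustrative feature vector
  - [6.0, 3.4, 4.5, 1.6]
```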
## Key features
- Multi-framework support — PyTorch, TensorFlow, ONNX, XGBoost, Triton, and more.
- GPU acceleration — Native NVIDIA GPU support.
- Autoscaling — Including scale-to-zero via Knative.
- Observability — Prometheus metrics, Kubernetes logging, and Hybrid Manager dashboards.
- Batching — Server-side batching for improved throughput (see the sketch after this list).
- Explainability — Support for model explainability tooling.
- Security and Sovereign AI — Models run on your Kubernetes clusters under your control.
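Server-side batching, for example, is configured per component. A sketch assuming KServe's built-in batcher, with illustrative values:

```yaml
spec:
  predictor:
    batcher:
      maxBatchSize: 32                   # max requests merged into one batch
      maxLatency: 500                    # ms to wait before flushing a partial batch
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://models/example/v2 # hypothetical model location
```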
## Patterns of use

### Gen AI Builder
- Serve LLMs and multi-modal models powering Assistants and Agents.
### Knowledge Bases

- Serve embedding and retrieval models used in:
  - Knowledge Base indexing
  - RAG pipelines
### Custom applications

- Expose InferenceService endpoints to:
  - Business applications
  - Microservices
  - ETL/ELT pipelines
### Hybrid + Sovereign AI alignment

- All models run on your infrastructure via the Hybrid Manager KServe layer.
- You control:
  - Which models are deployed
  - Resource allocations
  - Deployment topology
  - Observability and auditing
## Best practices
- Always deploy models through the Model Library → Model Serving flow to ensure governance.
- Monitor resource consumption — especially GPU utilization.
- Test scale-to-zero policies carefully before production use.
- Use ServingRuntime and ClusterServingRuntime templates for consistency.
- Tag and document production models clearly in the Model Library (a labeling sketch follows this list).
- Audit InferenceService deployments regularly — critical for Sovereign AI.
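For the tagging practice above, ordinary Kubernetes labels and annotations on the InferenceService work well; the keys below are illustrative conventions you might adopt, not required fields:

```yaml
metadata:
  name: example-model
  labels:
    environment: production              # illustrative convention
    team: data-platform                  # hypothetical owning team
  annotations:
    example.com/model-card: "https://wiki.example.com/models/example-model" # hypothetical link
```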
## In AI Factory

Model Serving powers multiple components in EDB PG AI:

| Component | How it uses Model Serving |
|---|---|
| Gen AI Builder | Runs LLMs and specialized models for Assistants |
| Knowledge Bases | Serves embedding and retrieval models for RAG |
| Custom AI apps | Exposes InferenceService endpoints for business use cases |
AI Factory manages Model Serving through:
- Integrated Model Library for image management.
- GPU resource management and scheduling.
- Centralized observability and logging.
- Seamless Hybrid Manager integration for lifecycle control.
## Related topics
- Model Library Explained
- Deploy AI Models
- Deploy an InferenceService
- Verify Model Deployments
- KServe Official Docs
- KServe GitHub Repository
## Next steps
- Explore available models in your Model Library.
- Deploy your first model using Model Serving.
- Monitor deployed models through Hybrid Manager observability dashboards.
- Explore advanced Model Serving capabilities:
  - Multi-modal pipelines
  - Transformers and Explainers
  - Scale-to-zero policies
Model Serving gives you a powerful foundation for building intelligent, governed AI applications — securely and scalably — as part of your Sovereign AI strategy with EDB PG AI.