Model Serving Concepts

Model Serving is a core capability of EDB Postgres® AI (EDB PG AI), enabling scalable, flexible, and high-performance serving of AI/ML models on Kubernetes.

It powers:

  • Gen AI applications
  • Intelligent retrieval systems
  • Advanced data pipelines

Model Serving is implemented using KServe, an open-source Kubernetes-native engine for standardized model inferencing. AI Factory integrates KServe with Hybrid Manager to provide enterprise-grade lifecycle management, security, and observability.

Key to Sovereign AI: Models run on your Kubernetes clusters, under your control, with full observability and governance.


Before you start

Prerequisites for understanding Model Serving:

  • Familiarity with Kubernetes basics (pods, services, deployments)
  • Understanding of InferenceService as a Kubernetes CRD
  • Awareness of Model Library and how models are registered and deployed
  • Understanding of Sovereign AI principles in EDB PG AI


Why it matters

Model Serving enables your AI models and Postgres data to work together seamlessly — securely and scalably — under Sovereign AI principles:

  • Deploy open-source or commercial models to your Kubernetes cluster.
  • Serve Gen AI models for Assistants, Knowledge Bases, and RAG pipelines.
  • Support multi-modal retrieval (text, image, hybrid search).
  • Optimize performance with GPU acceleration and server-side batching.
  • Maintain full observability, auditing, and governance over model usage.

See also: Hybrid Manager Model Serving integration in production environments.


Core concepts

InferenceService (via KServe)

At the core of Model Serving is the InferenceService — a Kubernetes-native resource that represents a deployed model.

It defines the end-to-end serving pipeline:

  • Predictor — Runs the model server and handles inference.
  • Transformer (optional) — Applies pre-processing or post-processing.
  • Explainer (optional) — Provides model explainability outputs.

Predictor

The Predictor defines:

  • Model format — PyTorch, TensorFlow, ONNX, and other supported formats (served by runtimes such as Triton or TorchServe).
  • Model location — S3-compatible storage, OCI registry, PVC.
  • Resources — CPU, memory, GPU.
  • Autoscaling — Policies for elastic scaling, including scale-to-zero.
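
To make these pieces concrete, the following is a minimal sketch of creating an InferenceService with the Kubernetes Python client. The model name, namespace, storage URI, and resource sizes are illustrative assumptions, not values from this page; in EDB PG AI the Model Serving UI or CLI creates an equivalent object for you.

```python
# Minimal sketch: create an InferenceService through the Kubernetes API.
# Assumes the KServe CRDs are installed and kubeconfig access to the cluster.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "demo-classifier", "namespace": "models"},  # illustrative
    "spec": {
        # Optional "transformer" and "explainer" sections would sit alongside
        # "predictor" in this spec.
        "predictor": {
            "minReplicas": 0,  # allow scale-to-zero when idle
            "model": {
                "modelFormat": {"name": "pytorch"},                  # model format
                "storageUri": "s3://example-bucket/models/demo/",    # model location
                "resources": {
                    "requests": {"cpu": "1", "memory": "4Gi"},
                    "limits": {"nvidia.com/gpu": "1"},               # GPU acceleration
                },
            },
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```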

ServingRuntime / ClusterServingRuntime

Reusable runtime definitions for serving:

  • ServingRuntime — Namespace-scoped.
  • ClusterServingRuntime — Cluster-wide reusable runtimes.

Benefits:

  • Tailor runtime settings to model type and hardware.
  • Standardize runtime configurations across teams and projects.
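
As a rough illustration, a ClusterServingRuntime pairs supported model formats with a serving container. This sketch assumes the KServe v1alpha1 runtime CRDs; the runtime name, image, and resource values are placeholders, not values from this page.

```python
# Sketch of a cluster-wide runtime definition (dict form of the custom resource).
cluster_serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ClusterServingRuntime",
    "metadata": {"name": "example-pytorch-runtime"},       # illustrative name
    "spec": {
        # Formats this runtime can serve; autoSelect lets KServe pick it
        # automatically for matching InferenceServices.
        "supportedModelFormats": [{"name": "pytorch", "autoSelect": True}],
        "containers": [
            {
                "name": "kserve-container",
                "image": "example.registry.local/torchserve:latest",  # placeholder
                "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
            }
        ],
    },
}
```

An InferenceService can then reference this runtime (or rely on automatic selection by model format), which is how teams standardize serving settings across projects.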

How it works

Lifecycle flow

  1. Register model image in Model Library.
  2. Deploy model via Model Serving UI or CLI → creates InferenceService.
  3. AI Factory + KServe provision Kubernetes resources (pods, services).
  4. Model is loaded into runtime container.
  5. Kubernetes service endpoint is exposed.
  6. Clients send HTTP/gRPC inference requests.
  7. Requests may pass through Transformers and Explainers.
  8. Inference response is returned.
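
For step 6, a client call can be as simple as an HTTP POST against the model's predict endpoint. This sketch uses KServe's V1 (predict) protocol; the hostname, model name, and payload shape are illustrative and depend on the model you deployed.

```python
# Sketch of an HTTP inference request against a deployed InferenceService.
import requests

# Hypothetical endpoint for an InferenceService named "demo-classifier".
url = "http://demo-classifier.models.example.com/v1/models/demo-classifier:predict"
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}   # input rows expected by the model

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
print(response.json())   # e.g. {"predictions": [...]}
```

gRPC clients follow the same lifecycle; only the transport and protocol differ.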

Key features

  • Multi-framework support — PyTorch, TensorFlow, ONNX, XGBoost, and more, including serving runtimes such as NVIDIA Triton.
  • GPU acceleration — Native NVIDIA GPU support.
  • Autoscaling — Including scale-to-zero via Knative.
  • Observability — Prometheus metrics, Kubernetes logging, and Hybrid Manager dashboards.
  • Batching — Server-side batching for improved throughput.
  • Explainability — Support for model explainability tooling.
  • Security and Sovereign AI — Models run on your Kubernetes clusters under your control.
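
Several of these features are enabled with small additions to the predictor spec. As a hedged example, KServe's server-side batcher is configured on the predictor; the field values below are illustrative, not recommendations.

```python
# Sketch of a predictor fragment with scale-to-zero and request batching enabled.
predictor_with_batching = {
    "minReplicas": 0,             # scale-to-zero when idle (Knative-backed)
    "batcher": {
        "maxBatchSize": 32,       # group up to 32 requests into one model call
        "maxLatency": 500,        # or flush after ~500 ms, whichever comes first
    },
    "model": {
        "modelFormat": {"name": "pytorch"},
        "storageUri": "s3://example-bucket/models/demo/",   # placeholder location
    },
}
```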

Patterns of use

Gen AI Builder

  • Serve LLMs and multi-modal models powering Assistants and Agents.

Knowledge Bases

  • Serve embedding and retrieval models used in:
      • Knowledge Base indexing
      • RAG pipelines
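
For Knowledge Base indexing, the served model is typically an embedding model. The sketch below assumes an InferenceService named "embedder" that speaks the Open Inference (V2) protocol; the hostname, input name, and texts are illustrative.

```python
# Sketch: request embeddings from a deployed embedding model (V2 protocol).
import requests

url = "http://embedder.models.example.com/v2/models/embedder/infer"
payload = {
    "inputs": [
        {
            "name": "text",                       # input tensor name (model-specific)
            "shape": [2],
            "datatype": "BYTES",
            "data": ["postgres vector search", "retrieval augmented generation"],
        }
    ]
}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
outputs = response.json()["outputs"][0]           # embedding vectors for the texts
print(outputs["shape"], len(outputs["data"]))
```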

Custom applications

  • Expose InferenceService endpoints to:
      • Business applications
      • Microservices
      • ETL/ELT pipelines

Hybrid + Sovereign AI alignment

  • All models run on your infrastructure via the Hybrid Manager KServe layer.
  • You control:
      • Which models are deployed
      • Resource allocations
      • Deployment topology
      • Observability and auditing

Best practices

  • Always deploy models through the Model Library → Model Serving flow to ensure governance.
  • Monitor resource consumption — especially GPU utilization.
  • Test scale-to-zero policies carefully before production use.
  • Use ServingRuntime and ClusterServingRuntime templates for consistency.
  • Tag and document production models clearly in Model Library.
  • Audit InferenceService deployments regularly — critical for Sovereign AI.
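
To support the auditing practice above, you can enumerate every InferenceService in the cluster with the Kubernetes API. A minimal sketch, assuming cluster-wide read access; the fields printed are standard metadata and status values.

```python
# Sketch: list all InferenceServices for a periodic governance audit.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

isvcs = api.list_cluster_custom_object(
    group="serving.kserve.io", version="v1beta1", plural="inferenceservices"
)
for isvc in isvcs.get("items", []):
    meta = isvc["metadata"]
    status = isvc.get("status", {})
    print(
        meta["namespace"],
        meta["name"],
        meta.get("creationTimestamp"),
        status.get("url", "<not ready>"),
    )
```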

In AI Factory

Model Serving powers multiple components in EDB PG AI:

Component          How it uses Model Serving
Gen AI Builder     Runs LLMs and specialized models for Assistants
Knowledge Bases    Serves embedding and retrieval models for RAG
Custom AI apps     Exposes InferenceService endpoints for business use cases

AI Factory manages Model Serving through:

  • Integrated Model Library for image management.
  • GPU resource management and scheduling.
  • Centralized observability and logging.
  • Seamless Hybrid Manager integration for lifecycle control.


Next steps

  • Explore available models in your Model Library.
  • Deploy your first model using Model Serving.
  • Monitor deployed models through Hybrid Manager observability dashboards.
  • Explore advanced Model Serving capabilities:
      • Multi-modal pipelines
      • Transformers and Explainers
      • Scale-to-zero policies

Model Serving gives you a powerful foundation for building intelligent, governed AI applications — securely and scalably — as part of your Sovereign AI strategy with EDB PG AI.


