Model Management on Hybrid Manager v1.4.0 (LTS)

Purpose and Benefits

Model management within Hybrid Manager provides centralized governance and deployment capabilities for AI models running on your Kubernetes infrastructure. This system enables organizations to maintain complete control over their AI capabilities while leveraging enterprise-grade Model Serving infrastructure.

The integration addresses critical requirements for organizations deploying AI at scale: model governance through approved registries, scalable inference serving with GPU acceleration, and unified management through Hybrid Manager's control plane. By running models within your controlled infrastructure, you maintain data sovereignty while accessing state-of-the-art AI capabilities.

Core Concepts

Model Library

The Model Library serves as your centralized governance system for AI model images. Operating within Hybrid Manager's Asset Library infrastructure, it provides a curated view of validated models ready for production deployment.

The library implements multi-stage governance:

  • Automated synchronization from trusted container registries
  • Security scanning and vulnerability assessment
  • Approval workflows based on organizational policies
  • Metadata management for versioning and documentation

Models in the library power all Agent Factory capabilities including Langflow flows, Pipeline Designer knowledge bases, and custom inference applications. Only models validated through the library's governance framework can reach production environments.

Model Serving

Model Serving transforms approved models into scalable inference endpoints using KServe within your Kubernetes clusters. This infrastructure provides production-grade model deployment with automatic scaling, health management, and resource optimization.

Key serving capabilities include:

  • InferenceService resources that define deployed model endpoints
  • ServingRuntime configurations optimized for different model frameworks
  • GPU allocation and scheduling for high-performance inference — see GPU recommendations
  • Internal and external endpoint access with authentication

Management Interface

Hybrid Manager provides unified management through its web console, abstracting Kubernetes complexity while maintaining full configurability. The interface enables:

  • Visual workflows for model deployment from library to serving
  • Resource allocation and scaling configuration
  • Monitoring dashboards for inference metrics and GPU utilization
  • Access control and endpoint management

Implementation Workflow

Model Registration

Organizations begin by configuring repository connections to trusted model sources. The Model Library synchronizes with external registries based on defined rules, automatically discovering and validating new model versions.

External Registry → Repository Rules → Security Scanning → Model Library

Repository rules determine which models enter your environment, implementing organizational policies at the point of ingestion. This automated approach reduces manual overhead while maintaining governance standards.

Model Deployment

Validated models deploy through guided workflows that configure serving infrastructure:

  1. Model Selection: Browse available models in the library with metadata including version, performance characteristics, and resource requirements
  2. Runtime Configuration: Select or create ServingRuntimes optimized for the model framework (vLLM, TensorRT-LLM, custom)
  3. Resource Allocation: Define GPU, memory, and CPU requirements based on expected workload
  4. Endpoint Configuration: Set up internal cluster access or external API endpoints with authentication (see Access KServe endpoints)

The system creates InferenceService resources that KServe manages, handling pod scheduling, health monitoring, and traffic routing automatically.

Operational Management

Deployed models operate under continuous monitoring with automatic scaling based on demand. Hybrid Manager provides visibility through:

  • Real-time inference metrics including latency and throughput
  • GPU utilization tracking for resource optimization
  • Error rates and health status for proactive maintenance
  • Cost analysis based on resource consumption

Using deployed models

Once a model cluster is running, there are three ways to consume it.

From applications

Applications call deployed model endpoints directly via KServe InferenceServices. Each endpoint exposes a standard OpenAI-compatible REST API for chat completions, embeddings, or other inference tasks.

From AIDB (SQL patterns)

AIDB lets you call models from SQL, making them available directly inside Postgres — useful for embedding pipelines or enabling in-database inference.

Hybrid Manager specifics

When models are deployed through Hybrid Manager:

  • Service URLs. Each model is exposed as an internal KServe endpoint within your HM project. The URL is visible in the Model Library or the Model Serving details page.
  • Authentication. Endpoints are protected by the platform. Applications running inside the same project can reach them directly. For external access, configure authentication using the HM ingress and project-scoped credentials.
  • Observability. Requests and logs flow into HM observability, giving you usage metrics, latency, and error tracking.