Overview
This guide provides deployment procedures for Agent Factory components through the Hybrid Manager (HM) web interface. The deployment process covers GPU infrastructure setup, NVIDIA NIM model deployment, and Gen AI application configuration.
Prerequisites
Infrastructure Requirements
Before deploying Agent Factory components:
- Kubernetes cluster with GPU nodes configured
- NVIDIA GPU operator installed
- Access to NVIDIA NGC registry or private registry
- Object storage for model profiles (air-gapped deployments)
GPU Node Configuration
Verify GPU nodes meet NIM requirements — see GPU recommendations.
Step 1: Configure Registry Authentication
Internet-Connected Deployments
For clusters with internet access, configure NVIDIA NGC authentication:
Obtain NGC API key from NVIDIA NGC Portal
Create authentication secrets via HM UI:
Navigate to Project Settings > Secrets.
Create
nvidia-nim-secretswith NGC_API_KEY.Create
ngc-credDocker registry secret.
Alternatively, use kubectl:
NGC_API_KEY=<your-ngc-api-key> # Create runtime secret kubectl -n default create secret generic nvidia-nim-secrets \ --from-literal=NGC_API_KEY=${NGC_API_KEY} kubectl -n default annotate secret nvidia-nim-secrets \ replicator.v1.mittwald.de/replicate-to='m-.*' # Create image pull secret kubectl -n default create secret docker-registry ngc-cred \ --docker-server=nvcr.io \ --docker-username='$oauthtoken' \ --docker-password=${NGC_API_KEY} kubectl -n default annotate secret ngc-cred \ replicator.v1.mittwald.de/replicate-to='m-.*'
Air-Gapped Deployments
For environments without internet access:
- Mirror NIM Images: Copy required images to private registry using skopeo.
- Update Model URLs: Configure HM to reference private registry locations.
- Cache Profiles: Download and store model profiles in object storage.
- Configure Storage Path: Reference cached profiles during model deployment.
See Air-Gapped Configuration for detailed procedures.
Step 2: Access Model Library
Navigate to the Model Library interface:
- Log into Hybrid Manager console.
- Select your project from the project selector.
- Navigate to Agent Factory → Model Library.
The Model Library displays available NVIDIA NIM models:
- llama-3.3-nemotron-super-49b-v1: Advanced reasoning and chat capabilities (128K context)
- llama-3.2-nemoretriever-300m-embed-v1: High-quality text embeddings
- llama-3.2-nv-rerankqa-1b-v2: Multilingual query-document reranking
- paddleocr: Ultra-lightweight OCR system
- nvclip: Multimodal embeddings for image and text
Step 3: Deploy Model Server Cluster
Select Model for Deployment
- Click Deploy Model in Model Library.
- Select target model from available options.
- Review model requirements and documentation links.
Configuring Deployment Parameters
Configure the model server cluster settings:
Instance Configuration
- Server Instances: Default 1 (increase for high availability)
- Minimum Instances: 1 (0 for scale-to-zero when supported)
- Maximum Instances: Based on load requirements
Resource Allocation
- Memory: Configure based on model requirements
- Text completion models: 64-128 GB
- Embedding models: 32-64 GB
- CPU: Default values typically sufficient
- GPU: Match documented requirements
- llama-3.3-nemotron: 4 GPUs
- Other models: 1 GPU typical
Scaling Configuration
- Concurrent Request Threshold: Configure autoscaling trigger
- Scale to Zero: Enable when supported (may be unavailable in initial release)
Deploy Model
- Review configuration summary.
- Click Deploy to initiate deployment.
- Monitor deployment status in Model Server Clusters view.
Deployment creates:
- KServe InferenceService resource.
- Predictor pods with configured resources.
- Service endpoints for model access.
Step 4: Monitor Deployment Status
View Model Server Clusters
Navigate to Agent Factory → Model Server Clusters to view:
- Cluster display name
- Load balancer ingress URI
- Deployment status (Active/Healthy, Pending, Failed)
- Model details and tags
- Resource utilization metrics
Access Cluster Details
Click on a cluster name to view:
- Detailed configuration parameters
- Real-time health metrics
- Grafana dashboard integration
- Inference latency charts
- System health indicators
Step 5: Configure API Access
Generate API Tokens
For external access to deployed models:
- Navigate to Agent Factory → API Token Management
- Click Create Token
- Provide token reference name
- Store generated token securely
API tokens enable:
- RAG application authentication
- External service integration
- Programmatic model access
Access Endpoints
Models expose OpenAI-compatible endpoints:
Internal Access (within cluster):
http://<service-name>.<namespace>.svc.cluster.local/v1/chat/completions
External Access (with ingress):
https://<ingress-url>/v1/chat/completions
Step 6: Build Gen AI Applications
Gen AI application building in Hybrid Manager is done through Langflow, accessible via Launchpad in the HM console. Langflow provides a visual flow editor with EDB components that connect directly to your HM-managed models, knowledge bases, and Postgres clusters.
For a step-by-step walkthrough of building your first Gen AI application, see Langflow quickstart.
For a full reference of EDB components, flow deployment, and sharing flows between instances, see Langflow.
Step 7: Update and Manage Models
Edit Model Server Cluster
To modify running clusters:
- Navigate to Model Server Clusters
- Select cluster to edit
- Adjust parameters:
- Instance counts
- Resource allocations
- Scaling thresholds
- Apply changes (triggers rolling update)
Rolling Updates
Updates maintain availability through:
- Connection draining before restart
- Health verification between instances
- Automatic rollback on failure
High availability and isolation for Langflow flows
High availability, project-level isolation, stable routing, and access control for a Langflow flow are properties of a deployed flow, not of the Langflow flow builder. The flow builder is a single shared instance: flows you run there are not isolated per project and you can't scale them independently. To run a flow as an isolated, highly available, independently scalable service with its own endpoints, publish and deploy it. See Flow deployment, which covers replicas, endpoints, access levels, and credential handling for deployed flows.
For the model server clusters that Langflow flows depend on, adjust the number of running instances in Estate → InferenceServices → select the inference service → Edit → Inference Service Instances.
Troubleshooting
Common Deployment Issues
Model Fails to Start
- Verify GPU availability matches requirements
- Check registry authentication secrets
- Review pod events and logs
High Inference Latency
- Adjust batch size parameters
- Increase GPU allocation
- Scale replicas for load distribution
API Token Issues
- Verify token hasn't expired
- Check network policies
- Confirm ingress configuration
For detailed diagnostics, see Troubleshooting Guide.
Additional Resources
Reference Documentation
NVIDIA Model Documentation