Known issues in the Innovation Release

These are the known issues and limitations currently identified in the Hybrid Manager (HM) Innovation Release. Where available, workarounds can help you mitigate their impact. These issues are actively tracked and will be resolved in a future release.

Multi-DC

Multi-DC configuration loss after upgrade

Description: Configurations applied by the multi-data center (DC) setup scripts (for cross-cluster communication) don't persist after an HM platform upgrade or operator reconciliation.

Workaround: After every HM upgrade or component reversion, run the multi-DC setup scripts again to reapply the necessary configurations.

Location dropdown list is empty for multi-DC setups

Description: In multi-DC environments, the API call to retrieve available locations fails with a gRPC message size error (429/4.3MB limit exceeded). This is due to the large amount of image set information included in the API response, resulting in an empty location list in the console.

Workaround: This advanced workaround requires cluster administrator privileges to limit the amount of image set information being returned by the API. It involves modifying the image discovery tag rules in the upm-image-library and upm-beacon ConfigMaps, followed by restarting the related pods.

Workaround details

The workaround modifies the regular expressions (tag rules) used by the image library and HM agent components to temporarily limit the number of image tags being indexed. This reduces the API response size, allowing the locations to load.

  1. Find the upm-image-library ConfigMap:

    kubectl get configmaps -n upm-image-library | grep upm-image-library
    # Example Output: upm-image-library-ttkt29fmf7 1 5d3h
  2. Edit the ConfigMap and modify the tags rule under each image discovery rule (edb-postgres-advanced, edb-postgres-extended, postgresql). Replace the existing regex with the limiting regex:

    # Snippet of the YAML you will edit in the ConfigMap
    "imageDiscovery": {
      "rules": {
        "(^|.*/)edb-postgres-advanced$": {
          "readme": "EDB postgres advanced server",
          "tags": [
            "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2})    (?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
          ]
        },
        # ... repeat for edb-postgres-extended and postgresql ...
      }
    }
    Note

    If you're running a multi-DC setup, perform this step on the primary HM cluster.

  3. Restart the Image Library pod:

    kubectl rollout restart deployment upm-image-library -n upm-image-library
  4. Get the upm-beacon ConfigMap to modify the HM agent configuration:

    kubectl get configmaps -n upm-beacon beacon-agent-k8s-config
  5. Edit the ConfigMap (beacon-agent-k8s-config) and modify the tag_regex rule under each postgres_repositories entry (edb-postgres-advanced, edb-postgres-extended, postgresql).

    # Snippet of the YAML you will edit in the ConfigMap
    postgres_repositories:
      - name: edb-postgres-advanced
        description: EDB postgres advanced server
        tag_regex: "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2})(?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
      # ... repeat for edb-postgres-extended and postgresql ...
  6. Restart the HM agent pod:

    kubectl rollout restart deployment -n upm-beacon upm-beacon-agent-k8s

After completing these steps, the reduced image data size allows the location API call to succeed and the locations to appear correctly in the HM console.

Dedicated object storage for project isolation

Description: In the current iteration, project boundaries aren't strictly applied, and authorized users on one project may have visibility of the data and databases of other projects. For this reason, granular project access isn't yet available in HM.

Workaround: Create dedicated object storage for new projects and enable project isolation.

Workaround details: If you need to isolate project resources, you can configure dedicated object storage for each project.


Core platform and resources

upm-beacon-agent memory limits are insufficient in complex environments

Description: In environments with many databases and backups, the default 1GB memory allocation for the upm-beacon-agent pod is insufficient, which can lead to frequent OOMKill or crashloop issues. This resource limit currently isn't configurable via the standard Helm values or HybridControlPlane CR.

Workaround: Manually patch the Kubernetes deployment to increase the memory resource limits for the upm-beacon-agent pod.
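
For example, a patch like the following raises the limit to 2Gi. The deployment name matches the one used elsewhere on this page, but the container index, field path, and target value are assumptions to adjust for your environment. Operator reconciliation may revert manual patches after an upgrade, in which case you need to reapply them:

kubectl -n upm-beacon patch deployment upm-beacon-agent-k8s --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'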

HM console doesn't handle custom CA issuer

Description: The HM console currently doesn't support custom cluster certificate authority (CA) issuers. If a custom CA issuer is configured, the public certificate downloaded from the console doesn't work correctly with the beacon-agent, which can lead to connectivity and reporting errors.

Workaround: An administrator can verify the status of the custom CA. If the following command returns no output, the beacon-agent deployment can proceed without manually providing a root CA certificate, avoiding the issue:

kubectl -n upm-beacon get secret beacon-gw-cert -o jsonpath='{.data.ca\.crt}' | base64 -d

Database cluster engine

Incorrect database name displayed for EDB Postgres Distributed (PGD) clusters

Description: The HM console's Connect tab and connection string incorrectly show the default database name for PGD clusters as edb_admin. PGD clusters must connect to the bdrdb database.

Workaround: For PGD cluster connection information, use one of the following reliable sources: the .pgpass blob, the .pg_service.conf file, or the full connection string from the cluster details page.

PGD-X cluster creation stuck in the "PGD - Reconcile application user" phase

Description: PGD-X cluster creation, particularly when involving a witness-only region, may stall due to either:

  • Global RAFT leadership being unexpectedly held by the witness-only node
  • A subgroup's enable_routing being disabled

Workaround for RAFT Leadership issue: Manually trigger the transfer of the global RAFT lead to a node in a data group. Connect to the PGD cluster's bdrdb and execute:

SELECT bdr.raft_leadership_transfer(node_name := '<target node>', wait_for_completion := true, node_group_name := 'world');

Workaround for enable_routing issue: Manually enable routing for the subgroup. Connect to bdrdb and execute:

SELECT bdr.alter_node_group_option('<subgroup name>','enable_routing','true');

Failure to create 3-node PGD cluster when max_connections is non-default

Description: Creating a 3-data-node PGD cluster fails if the configuration parameter max_connections is set to a non-default value during initial cluster provisioning.

Workaround: Create the PGD 3-data-node cluster using the default max_connections value. Update the value after the cluster is successfully provisioned.

PGD database settings aren't duplicated when creating or duplicating a second data group

Description: When creating or duplicating a second data group in a PGD cluster, Postgres settings (like max_connections, max_worker_processes, and so on) aren't copied from the first data group. This can lead to inconsistent settings and cluster health issues, because the replica group's settings can't be lower than those of the primary group.

Workaround: Before provisioning, manually edit the configuration for the second PGD group to ensure the database settings are identical to the first data group.

AHA witness node resources are over-provisioned

Description: For advanced high-availability (AHA) clusters with witness nodes, the witness node incorrectly inherits the CPU, memory, and disk configuration of the larger data nodes, leading to unnecessary resource over-provisioning.

Workaround: Manually update the pgdgroup YAML configuration to specify and configure the minimal resources needed by the witness node.
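
As a minimal sketch (the namespace and group name are placeholders, and the exact field path for per-node resources depends on your pgdgroup schema version):

kubectl -n <cluster-namespace> edit pgdgroup <witness-group-name>
# In the editor, locate the witness node's resource section and reduce its
# CPU, memory, and storage requests/limits to the minimal values you need.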

HA clusters use verify-ca instead of verify-full for streaming replication certificate authentication

Description: Replica clusters use the less strict verify-ca setting for streaming replication authentication instead of the recommended, most secure verify-full. This is currently necessary because the underlying CloudNativePG (CNP) clusters don't support IP subject alternative names (IP SANs), which are required for verify-full in certain environments (like GKE load balancers).

Workaround: None. A fix depends on the underlying CNP component supporting IP SANs.

Second node is too slow to join large HA clusters

Description: For large clusters, the pg_basebackup process used by a second node (standby) to join an HA cluster is too slow. This can cause the standby node to fail to join, which prevents scaling a single node to HA. It also causes issues when restoring a cluster directly into an HA configuration.

Workaround: Avoid the usual approach of loading data into a single node and then scaling to HA. Instead, load data directly into an HA cluster from the start. There's no workaround for restoring a large cluster into an HA configuration.

EDB Postgres Distributed (PGD) cluster with 2 data groups and 1 witness group not healthy

Description: PGD clusters provisioned with the topology of two data groups and one witness group may fail to reach a healthy state upon creation. This failure is caused by an underlying conflict between the bdr extension (used for replication) and the edb_wait_states extension. The combination of these two extensions in this particular topology prevents the cluster from initializing successfully.

Backup and recovery

Replica cluster creation fails when using volume snapshot recovery across regions

Description: Creating a replica cluster in a second location that's in a different region fails with an InvalidSnapshot.NotFound error because volume snapshot recovery doesn't support cross-region restoration.

Workaround: Manually trigger a Barman backup from the primary cluster first. Then use that Barman backup (instead of the volume snapshot) to provision the cross-region replica cluster.
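
As a sketch, an on-demand backup can be requested with a Backup resource against the primary cluster, assuming it's managed by CNP and you have direct kubectl access (the resource name, namespace, and cluster name are placeholders):

kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-barman-backup
  namespace: <cluster-namespace>
spec:
  method: barmanObjectStore
  cluster:
    name: <primary-cluster-name>
EOF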

WAL archiving is slow due to default parallel configuration

Description: The default setting for wal.maxParallel is too restrictive, which slows down WAL archiving during heavy data loads. This can cause a backlog of ready-to-archive WAL files, potentially leading to disk-full conditions. This parameter isn't yet configurable on the HM console.

Workaround: Manually edit the objectstores.barmancloud.cnpg.io Kubernetes resource for the specific backup object store and increase the wal.maxParallel value (for example, to 20) to accelerate archiving.
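
For example (the namespace and resource name are placeholders, and the exact location of wal.maxParallel in the spec may vary by version):

kubectl get objectstores.barmancloud.cnpg.io -A
kubectl -n <namespace> edit objectstores.barmancloud.cnpg.io <objectstore-name>
# In the editor, raise the parallel WAL setting, for example:
#   wal:
#     maxParallel: 20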

transporter-db disaster recovery (DR) process may fail due to WAL gaps

Description: The DR process for the internal transporter-db service may fail when restoring from the latest available backup. This occurs in low-activity scenarios where a backup was completed, but no subsequent write-ahead log (WAL) file was archived immediately following that backup. This gap prevents the restore process from successfully completing a reliable point-in-time recovery.

Workaround: To ensure a successful restore, select an older backup to restore that has at least one archived WAL file immediately following it. This makes the needed transactional logs available for the recovery process.

AI Factory and model management

Failure to deploy nim-nvidia-nvclip model with profile cache

Description: Model creation for the nim-nvidia-nvclip model fails in the AI Factory when the profile cache is used during the deployment process.

Workaround: An administrator must manually download the necessary model profile from the NVIDIA registry to a local machine, upload the profile files directly to HM's object storage path, and then deploy the model by patching the Kubernetes InferenceService YAML with an environment variable that forces it to use the pre-cached files instead of attempting a failed network download.

Workaround details
  1. Log in to the NVIDIA Container Registry (nvcr.io) using your NGC API key:

    docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
  2. Pull the Docker image to your local machine:

    docker pull nvcr.io/nim/nvidia/nvclip:latest
  3. Prepare a local directory for the downloaded profiles:

    mkdir -p ./model-cache
    chmod -R a+w ./model-cache
  4. Select the profile for your target GPU.

    For example, A100 GPU profile: 9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6

  5. Run the container to download the profile. The container is run in CPU-only mode (NIM_CPU_ONLY=1) to prevent GPU-specific initialization issues on the download machine.

    export NIM_MANIFEST_PROFILE=9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6
    export NIM_CPU_ONLY=1
    
    docker run -v ./model-cache:/opt/nim/.cache -u $(id -u) -e NGC_API_KEY -e NIM_CPU_ONLY -e NIM_MANIFEST_PROFILE --rm nvcr.io/nim/nvidia/nvclip:latest

    This container doesn't exit. You must manually stop the run (Ctrl+C) after you see the line Health method called in the logs, which confirms the profile download is complete.

  6. Upload the profiles from your local machine to the object storage bucket used by your HM deployment:

    gcloud storage cp -r ./model-cache gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
    Note

    Adjust the gs:// path to match your deployment's configured object storage location.

  7. Create the model nim-nvidia-nvclip using the HM console, setting the Model Profiles Path field to the location from the previous step (for example, /model-cache/nim-nvidia-nvclip). The deployment initially fails or becomes stuck.

  8. Export the InferenceService YAML from the HM Kubernetes cluster.

  9. Add the necessary environment variable, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, to the env section of the spec.predictor.model block in the exported YAML. This flag tells the NIM container to use the locally available cache (the files you uploaded) and ignore the network download failure.

    # --- Snippet of the modified InferenceService YAML ---
    spec:
      predictor:
        minReplicas: 1
        model:
          modelFormat:
            name: nim-nvidia-nvclip
          name: ""
          env:
          - name: NIM_IGNORE_MODEL_DOWNLOAD_FAIL  # <-- ADD THIS LINE
            value: "1"                         # <-- ADD THIS LINE
          resources:
            # ... resource requests/limits ...
          runtime: nim-nvidia-nvclip
          storageUri: gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
    # ---------------------------------------------------
  10. Apply the modified YAML using kubectl to force the deployment to use the pre-downloaded profiles:

    kubectl apply -f <modified-inference-service-file.yaml> -n <model-cluster-namespace>

The pods now start successfully, using the model profiles you manually uploaded to object storage.


AI Model cluster deployment stalls if object storage path for model profiles is empty

Description: Creating an AI Model cluster and specifying an object storage path in the Model Profiles Path field causes the deployment to stall at the pending stage. This issue occurs if the specified path contains no content (that is, the model profile doesn't yet exist).

Workaround: Ensure that the object storage path specified in the Model Profiles Path field contains a correct, valid profile before initiating the model cluster deployment.
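
For example, you can list the path before creating the cluster to confirm it contains the profile files (the bucket and path mirror the example earlier on this page; substitute your own):

gcloud storage ls gs://<your-bucket>/model-cache/nim-nvidia-nvclip/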

Incorrect model name causes 404 error when calling LLM remotely

Description: When calling a deployed NVIDIA NIM model via the API endpoint, the model name displayed on the model card may not be the correct name required by the API. This results in a 404 Not Found error.

Workaround: To find the exact model name required for the API call (for example, nvidia/llama-3.3-nemotron-super-49b-v1), query the /v1/models API endpoint first.
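
For example (the endpoint host and authentication header are placeholders for your deployment):

curl -s https://<model-endpoint>/v1/models -H "Authorization: Bearer <api-key>"
# Use the model "id" value returned in the response for subsequent API calls.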

Model configuration settings reset after Innovation Release upgrade

Description: After upgrading HM from the 2025.11 to the 2025.12 Innovation Release, the existing model configuration settings are unintentionally reset to empty. This results in a model_not_found error, preventing access to AI services and causing issues like knowledge bases (KBs) failing to display in the HM console.

Workaround: In the HM console, manually reenter or set your required model configurations to restore functionality.

Error page appears when editing knowledge base credentials

Description: When editing a knowledge base (KB) that was created from a pipeline in the HM console, skipping the username field and immediately navigating to the password field triggers a client-side JavaScript error (Cannot read properties of null), which results in an Unexpected Application Error! page.

Workaround: To prevent the error page from appearing, fill in the Username field immediately after selecting Edit KB and before attempting to enter the password.

Missing aidb_users role on PGD witness node prevents AI functionality

Description: The aidb_users role (necessary for AI Factory functionality) and its related extension aren't being successfully replicated to the witness node of a PGD cluster during cluster initialization. This issue is specific to PGD clusters that use a single witness node (as opposed to a witness group) and results in the AI Factory encountering errors due to the missing required role.

Workaround: To manually install the necessary role and allow the AI Factory to function, execute the following SQL commands directly on the PGD witness node:

BEGIN;
SET LOCAL bdr.ddl_replication = off;
SET LOCAL bdr.commit_scope = 'local';
CREATE USER aidb_users;
COMMIT;

GenAI Builder structures and tools execution failure

Description: GenAI Builder's structures and tools capabilities don't function correctly unless specific environment variables are configured during their creation. If these variables are missing or incorrect, execution fails with the following error:

[Errno -2] (Name or service not known)

Workaround: When creating a structure in GenAI Builder, set the following environment variables in the Create Structure panel:

  • GT_CLOUD_BASE_URL: Must be set to the full project path: https://<PORTAL URL>/apps/genai-builder/projects/<PROJECT ID>
  • REQUESTS_CA_BUNDLE: Must be set to the following certificate path: /etc/ssl/certs/hcp.crt
(Screenshot: the environment variables set in the Create Structure panel.)


Pipeline Designer-created knowledge bases not visible to GenAI Builder

Description: In HM 2025.12, knowledge bases (KBs) created with Pipeline Designer (PD) are assigned to the visual_pipeline_user role. This is done to enforce strict isolation, ensuring HM users can't access SQL-created KBs (and vice versa) by default. However, this isolation prevents GenAI Builder from querying these KBs out of the box.

Workaround: Explicitly share PD-created KBs with your specific PostgreSQL user account to make them queryable in GenAI Builder.

Workaround details

Example scenario

Alice connects to HM as alice@acme.org and to PostgreSQL as the database user alice. She must share the PD KBs with the alice database user to enable GenAI Builder agents to query them using her credentials.

  1. Prerequisite: Ensure user existence

    Ensure the target PostgreSQL user (alice in this example) exists and is assigned the aidb_users role. If the user doesn't exist, create it as follows (assuming AIDB was installed per the EDB documentation):

    CREATE USER alice WITH PASSWORD '********';
    GRANT CONNECT ON DATABASE <some_db> TO alice;
    GRANT CREATE ON SCHEMA <some_schema> TO alice;
    GRANT aidb_users TO alice;
  2. Grant role access (the workaround)

    To allow the user to view and query PD-created knowledge bases, grant them membership in the visual_pipeline_user role:

    GRANT visual_pipeline_user TO alice;
  3. Configure

    Update the GenAI Builder agent configuration to use the alice credentials (username and password) you set.


Analytics and tiered tables

Updating tiered, partitioned tables fails with PGAA error

Description: When attempting to execute an UPDATE statement on a large, tiered, partitioned table, the operation fails with the message ERROR: system columns are not supported by PGAA scan. This issue occurs even when the target partition uses a standard heap access method (that is, it isn't an actively tiered Iceberg table), which indicates a conflict in how the Analytics Accelerator (PGAA) processes the partitioned table structure during a modification query.

HM console and observability

Tags for active model clusters aren't displayed on the model details screen

Description: In the table of active model clusters that use a specific model on the model details screen, the Tags field is empty.

Workaround: View the model cluster tags on the dedicated Model Cluster Details page.

Chat model cluster metrics are missing from the Grafana model overview dashboard

Description: Metrics for deployed chat model clusters aren't displayed on the Grafana model overview dashboard, impacting observability for these specific AI components.

User-created Grafana dashboards don't persist after platform redeployment/upgrade

Description: Dashboards created by users directly in the Grafana application aren't stored in persistent storage. They disappear when the Grafana pods are updated, redeployed, or restarted, for example, during an EKS auto-update or an HM upgrade.

Workaround: Back up any custom dashboards externally by exporting the dashboard as JSON. After an upgrade, manually import them back into Grafana.
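
As an alternative sketch, dashboards can be exported and reimported through the Grafana HTTP API (the host, token, and dashboard UID are placeholders):

# Export a dashboard as JSON
curl -s -H "Authorization: Bearer <grafana-token>" \
  https://<grafana-host>/api/dashboards/uid/<dashboard-uid> > dashboard.json
# Reimport after the upgrade (the payload must wrap the saved JSON as
# {"dashboard": <saved dashboard>, "overwrite": true})
curl -s -X POST -H "Authorization: Bearer <grafana-token>" -H "Content-Type: application/json" \
  -d @import-payload.json https://<grafana-host>/api/dashboards/db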

Migrations

Known issues pertaining to the HM Migration Portal, data migration workflows, and schema ingestion workflows are maintained on a dedicated page in the Migrating databases documentation. See Known issues, limitations, and notes for a complete list.