Troubleshooting PGAIHM
This guide provides instructions for troubleshooting the Hybrid Manager (PGAIHM).
Logfiles
The following log files can be used when troubleshooting the PGAIHM:
| Log file name | Category | Location | Explanation |
|---|---|---|---|
Error codes
| Error Code | Category | Error Message | Explanation | Potential Solution |
|---|---|---|---|---|
| na | Installation | `[upm-istio] - failed to install components: [upm-istio]` | An error from bootstrap: the LoadBalancer service is pending. | Fix the load balancer IAM role first, then delete the LB pods to trigger recreation. |
| na | Installation | `upm-beaco-ff-base` install failed | The `app-db-1` PVC stays pending because no default storage class is set. | Run `kubectl edit sc gp2` and mark it as the default storage class. |
| na | Installation | `upm-thanos` installation failed | | Check for pods in a crash loop and verify the `thanos-objstore-secret` secret. |
| na | Installation | Failed to pull image | | Enable auto-assign public IPv4 addresses on the related subnet and recreate the nodes. |
| na | Installation | Loki pods crash | The Loki pods crashed and the installation failed. | Check the IPv4 CIDR of your VPC. |
| na | Portal Login | No healthy upstream | "No healthy upstream" is returned after login. | Correct the `staticPasswords` configuration and re-run the deployment. |
| 401 | Portal Login | 401 Access Denied | A 401 error is returned when logging into the portal after installation. | Re-run `helm upgrade` with the proper `staticPasswords` values. |
| 500 | Portal Login | HTTP Error 500 | The ingress gateway can't accept incoming connections. | Fix the `aes_key` in the `WasmPlugin` configuration and restart the ingress gateway. |
| na | Database Provisioning | Postgres cluster provisioning stuck at X% complete | | |
| na | Database Provisioning | CPU and memory range is much smaller than the node group | | |
| na | Database Update | Disk scale up | The new value is shown in the cluster YAML but not applied to the PVC/PV. | Edit the storage class to allow volume expansion. |
Errors in detail
[upm-istio] - failed to install components: [upm-istio]
Detailed Error:
```
5:03:42AM: ---- waiting on 1 changes [4/5 done] ----
5:03:42AM: ongoing: reconcile service/istio-ingressgateway (v1) namespace: istio-system
5:03:42AM:  ^ Load balancer ingress is empty
{"level":"error","msg":"failed to install component: failed to execute kapp command, error: Timed out waiting after 15m0s for resources: [service/istio-ingressgateway (v1) namespace: istio-system], message: "}
{"level":"error","msg":"Failed to install components: [upm-istio]"}
{"level":"error","msg":"Installation failed, error: failed to install components: [upm-istio]"}
```

```bash
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    <pending>     80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h3m
istiod                 ClusterIP      172.20.146.223   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP       3h3m
```

```bash
k describe svc istio-ingressgateway -n istio-system
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedBuildModel  53m   service  Failed build model due to operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: 07cc71c0-3113-4b4e-be7f-3de3b475c49b, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```
Resolution:
Fix the load balancer IAM role, then delete the AWS load balancer controller pods:
```bash
k delete pod aws-load-balancer-controller-57ccd8bc77-f8lrk -n kube-system
k delete pod aws-load-balancer-controller-57ccd8bc77-j58c2 -n kube-system
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                                      PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    k8s-istiosys-istioing-38d88046aa-37454e78c9648556.elb.us-east-1.amazonaws.com   80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h8m
istiod                 ClusterIP      172.20.146.223   <none>                                                                           15010/TCP,15012/TCP,443/TCP,15014/TCP       3h8m
```
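Before deleting the controller pods, it can help to confirm which IAM role the AWS Load Balancer Controller is actually using. The following is a minimal check, assuming the controller runs in `kube-system` with the usual service account name and an IRSA (`eks.amazonaws.com/role-arn`) annotation:

```bash
# Show the IAM role annotation on the AWS Load Balancer Controller service
# account; this role must be allowed to call sts:AssumeRoleWithWebIdentity
# for the cluster's OIDC provider.
kubectl describe sa aws-load-balancer-controller -n kube-system | grep role-arn
```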
upm-beaco-ff-base install failed
Detailed Error:
```bash
NAME                    READY   STATUS    RESTARTS   AGE
app-db-1-initdb-k4nbl   0/1     Pending   0          35m
```

```bash
kc get pvc -n upm-beaco-ff-base
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
app-db-1   Pending                                                     <unset>                 35m
```

```bash
kc describe pvc app-db-1 -n upm-beaco-ff-base
Name:          app-db-1
Namespace:     upm-beaco-ff-base
StorageClass:
Status:        Pending
Volume:
…
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  64s (x142 over 36m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
```
Resolution:

Run `kubectl edit sc gp2` and add the following annotation: `storageclass.kubernetes.io/is-default-class: "true"`

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```
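If you prefer not to open an editor, the same annotation can be applied with a one-line patch (a sketch using the standard Kubernetes default-class annotation; verify that the storage class name matches your cluster):

```bash
# Mark gp2 as the default storage class, then confirm it's flagged "(default)"
kubectl patch storageclass gp2 \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
kubectl get storageclass
```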
upm-thanos installation failed
Detailed Error:
Resolution:
- Check for any pods in a crash loop.
- Check the logs to look for any permission-related errors.
```bash
kubectl get pod -n monitoring
```
Check whether the secret is correct:

```bash
kubectl get secret -n monitoring
```

The secret should match what you defined in `edb-object-storage`:

```bash
kubectl get secret -n monitoring thanos-objstore-secret -o yaml
```

```yaml
apiVersion: v1
data:
  objstore.yml: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg==
```

```bash
echo dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg== | base64 -d
```
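As a shortcut, you can decode the object-store configuration in one step and compare the bucket, region, endpoint, and prefix with what you defined in `edb-object-storage` (a minimal sketch using the secret and key names shown above):

```bash
# Extract and decode objstore.yml directly from the secret
kubectl get secret -n monitoring thanos-objstore-secret \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d
```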
To correct the secret:
```bash
kubectl delete secret thanos-objstore-secret -n monitoring
kubectl delete pod storage-location-operator-controller-manager-xxxxxx -n storage-location-operator
```
Re-run `helm upgrade` and ensure that the `thanos-objstore-secret` secret is correct:

```bash
kubectl delete pod <pod in crashloop> -n monitoring
```
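If you're unsure which pods are still crash looping after the upgrade, a simple filter such as the following can help (a sketch; it just hides healthy pods):

```bash
# List pods in the monitoring namespace that aren't Running or Completed
kubectl get pods -n monitoring | grep -Ev 'Running|Completed'
```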
Failed to pull image
Detailed Error:
```bash
NAME                                             READY   STATUS                  RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg   0/1     Init:AssetPullBackOff   0          2m49s

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  53s               default-scheduler  Successfully assigned edbpgai-bootstrap/edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg to i-001f23a316e13e905
  Warning  Failed     22s               kubelet            Failed to pull image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to resolve reference "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to do request: Head "https://docker.enterprisedb.com/v2/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks/manifests/v1.0.6-appl": dial tcp 18.67.76.32:443: i/o timeout
  Warning  Failed     22s               kubelet            Error: ErrAssetPull
  Normal   BackOff    22s               kubelet            Back-off pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
  Warning  Failed     22s               kubelet            Error: AssetPullBackOff
  Normal   Pulling    7s (x2 over 52s)  kubelet            Pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
```
Resolution:
In your AWS console, edit your related subnet and enable Auto-assign public IPv4 address.
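The same change can be made with the AWS CLI if you prefer; the subnet ID below is a placeholder for the subnet used by your node group:

```bash
# Enable auto-assignment of public IPv4 addresses on the subnet
# (subnet-0123456789abcdef0 is a hypothetical ID; replace it with yours)
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0123456789abcdef0 \
  --map-public-ip-on-launch
```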
Delete nodes to let EKS pick up the change:
```bash
k get node
NAME                  STATUS   ROLES    AGE   VERSION
i-001f23a316e13e905   Ready    <none>   19h   v1.31.4-eks-0f56d01
i-012fd44b2f575a514   Ready    <none>   20h   v1.31.4-eks-0f56d01
i-076d6a447c6551a21   Ready    <none>   20h   v1.31.4-eks-0f56d01
```

```bash
k delete node i-001f23a316e13e905 i-012fd44b2f575a514 i-076d6a447c6551a21
node "i-001f23a316e13e905" deleted
node "i-012fd44b2f575a514" deleted
node "i-076d6a447c6551a21" deleted
```

```bash
kgp -n edbpgai-bootstrap
NAME                                             READY   STATUS    RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-5ee7l7-xzfqt   1/1     Running   0          118s
```
Loki pods crash
Detailed Error:
```bash
k get pods -n logging
NAME                         READY   STATUS             RESTARTS         AGE
loki-backend-0               1/2     CrashLoopBackOff   16 (2m1s ago)    59m
loki-read-5b8555fb64-9srmq   0/1     CrashLoopBackOff   16 (2m40s ago)   59m
loki-read-5b8555fb64-fw6ls   0/1     CrashLoopBackOff   16 (2m7s ago)    59m
loki-read-5b8555fb64-nw2vq   0/1     CrashLoopBackOff   16 (2m17s ago)   59m
loki-write-0                 0/1     CrashLoopBackOff   16 (2m37s ago)   59m
loki-write-1                 0/1     CrashLoopBackOff   16 (2m5s ago)    59m
loki-write-2                 0/1     CrashLoopBackOff   16 (2m16s ago)   59m
```
```bash
k logs loki-write-0 -n logging
level=info ts=2025-02-26T09:32:35.173785968Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2025-02-26T09:32:35.173821129Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2025-02-26T09:32:35.174421237Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2025-02-26T09:32:35.175391511Z caller=memberlist_client.go:435 msg="Using memberlist cluster label and node name" cluster_label= node=loki-write-0-a0cfbc5d
level=info ts=2025-02-26T09:32:35.17599149Z caller=shipper.go:160 index-store=tsdb-2020-05-15 msg="starting index shipper in WO mode"
level=info ts=2025-02-26T09:32:35.176072171Z caller=table_manager.go:136 index-store=tsdb-2020-05-15 msg="uploading tables"
level=info ts=2025-02-26T09:32:35.176269663Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2025-02-26T09:32:35.176305024Z caller=manager.go:86 index-store=tsdb-2020-05-15 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=0 indices=0 failures=0
level=info ts=2025-02-26T09:32:35.176325994Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2025-02-26T09:32:35.180393332Z caller=module_service.go:82 msg=starting module=server
level=info ts=2025-02-26T09:32:35.180477643Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=error ts=2025-02-26T09:32:35.180526964Z caller=loki.go:524 msg="module failed" module=memberlist-kv error="starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180549544Z caller=loki.go:524 msg="module failed" module=ring error="failed to start ring, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180582375Z caller=loki.go:524 msg="module failed" module=store error="failed to start store, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180588655Z caller=loki.go:524 msg="module failed" module=ingester error="failed to start ingester, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180593435Z caller=loki.go:524 msg="module failed" module=distributor error="failed to start distributor, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180597415Z caller=loki.go:524 msg="module failed" module=analytics error="failed to start analytics, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=info ts=2025-02-26T09:32:35.180687877Z caller=modules.go:1832 msg="server stopped"
level=info ts=2025-02-26T09:32:35.180702097Z caller=module_service.go:120 msg="module stopped" module=server
level=info ts=2025-02-26T09:32:35.180708417Z caller=loki.go:508 msg="Loki stopped" running_time=85.855531ms
failed services
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:566
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695
level=error ts=2025-02-26T09:32:35.180772078Z caller=log.go:216 msg="error running loki" err="failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:566\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"
```
Resolution:
This is a known Loki issue (https://github.com/grafana/loki/issues/8634) that occurs when the VPC CIDR is not in the 10.* range.

- Check the IPv4 CIDR in your VPC. For example, if it's set to 11.1.0.0/20, it should be 10.1.0.0/20.
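To check the CIDR without opening the console, a quick AWS CLI query like the following works; the VPC ID is a placeholder:

```bash
# Print the IPv4 CIDR blocks of the VPC hosting the EKS cluster
# (vpc-0123456789abcdef0 is a hypothetical ID; replace it with yours)
aws ec2 describe-vpcs \
  --vpc-ids vpc-0123456789abcdef0 \
  --query 'Vpcs[].CidrBlock' \
  --output text
```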
No healthy upstream
Detailed Error:
The pod in `upm-dex` is in a crash loop. Check the pod logs to view the error details. In this case, `pgai.portal.authentication.staticPasswords` is not configured correctly, so update the configuration and re-run the deployment.
```bash
upm-dex       upm-dex-66b96b5857-pdcg2      1/2     CrashLoopBackOff   206 (3m34s ago)   17h

kubectl logs upm-dex-66b96b5857-pdcg2 -n upm-dex
error parse config file /tmp/dex.config.yaml-565347154: error unmarshaling JSON: malformed bcrypt hash: crypto/bcrypt: hashedSecret too short to be a bcrypted password
```
Resolution:
Add the four `pgai.portal.authentication.staticPasswords` parameters to the upgrade command:
```bash
helm upgrade \
  -n edbpgai-bootstrap \
  --install \
  --set parameters.global.portal_domain_name="${PORTAL_DOMAIN_NAME}" \
  --set parameters.transporter-rw-service.domain_name="${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}" \
  --set parameters.transporter-dp-agent.rw_service_url="https://${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}/transporter" \
  --set parameters.upm-beacon.server_host="${BEACON_SERVICE_DOMAIN_NAME}" \
  --set parameters.upm-beaco-ff-base.cookie_aeskey="${AES_256_KEY}" \
  --set system="eks" \
  --set remoteContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
  --set internalContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
  --set bootstrapAsset="${REGISTRY_PACKAGE_URL}/edbpgai-bootstrap/bootstrap-eks:${EDBPGAI_BOOTSTRAP_IMAGE_VERSION}" \
  --set pgai.portal.authentication.staticPasswords[0].email="owner@mycompany.com" \
  --set pgai.portal.authentication.staticPasswords[0].hash='$2y$10$STTzuGJa3PydqHvlzi2z6OgDU1G/JLTqiuGblH.RemIutWxkztN5m' \
  --set pgai.portal.authentication.staticPasswords[0].username="owner@mycompany.com" \
  --set pgai.portal.authentication.staticPasswords[0].userID="c5998173-a605-449a-a9a5-4a9c33e26df7" \
  --version "${EDBPGAI_BOOTSTRAP_HELM_CHART_VERSION}" \
  edbpgai-bootstrap enterprisedb-edbpgai/edbpgai-bootstrap
```
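The `hash` value must be a bcrypt hash of the portal password. One common way to generate it is with `htpasswd` from `apache2-utils` (a sketch; the plaintext password here is a placeholder):

```bash
# Produce a bcrypt hash (cost 10) suitable for staticPasswords[0].hash;
# tr strips the leading colon and trailing newline from htpasswd's output.
htpasswd -bnBC 10 "" 'MyPortalPassword' | tr -d ':\n'
```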
401 Access Denied
Detailed Error:
Resolution:

Check the `config.yaml` file. The `staticPasswords` section is not set properly in this case. Re-run `helm upgrade` with the proper values.
```bash
k get secret upm-dex -n upm-dex -o yaml
```

```yaml
apiVersion: v1
data:
  config.yaml: aXNzdWVyOiBodHRwczovL3BvcnRhbC5iYXN1cHBvcnQub3JnL2F1dGgKc3RvcmFnZToKICB0eXBlOiBwb3N0Z3JlcwogIGNvbmZpZzoKICAgIGhvc3Q6IGFwcC1kYi1ydy51cG0tYmVhY28tZmYtYmFzZS5zdmMuY2x1c3Rlci5sb2NhbAogICAgcG9ydDogNTQzMgogICAgZGF0YWJhc2U6IHVwbQogICAgdXNlcjogdXBtCiAgICBwYXNzd29yZDogdXBtCiAgICBzc2w6CiAgICAgIG1vZGU6IHJlcXVpcmUKd2ViOgogIGh0dHA6IDAuMC4wLjA6NTU1Ngpmcm9udGVuZDoKICBpc3N1ZXI6IEVEQgogIGxvZ29VUk…
```

```bash
echo "xxx" | base64 -d
staticPasswords:
- email: owner@mycompany.com
  hash: XXX
  userID: c5998173-a605-449a-a9a5-4a9c33e26df7
  username: owner@mycompany.com
```
As an alternative, use this one-line command:

```bash
k get secret -n upm-dex upm-dex -ojsonpath='{.data.config\.yaml}' | base64 -d
```
HTTP Error 500
Detailed Error:
The ingress gateway can't accept incoming connections. If you're accessing the web page from a public network, the connection must use port 443, since HTTPS is the only allowed protocol.
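A quick external check can confirm whether the gateway answers on port 443 at all (a sketch; `<portal-domain>` is a placeholder for your portal hostname):

```bash
# Request only the response headers over HTTPS; -k skips certificate
# verification in case a self-signed certificate is in use.
curl -kI https://<portal-domain>/
```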
Resolution:
Check the status of the `ingressgateway`. Our `ingressgateway` is an `istio-ingressgateway`.
```bash
k get all -n istio-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          14m
pod/istiod-7f7bbcdcbb-fzggv                 1/1     Running   0          3h26m
pod/istiod-7f7bbcdcbb-mxjpz                 1/1     Running   0          3h26m

NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)                                     AGE
service/istio-ingressgateway   LoadBalancer   172.20.92.132    k8s-istiosys-istioing-987da680df-8452f790b4602f2c.elb.us-east-1.amazonaws.com   80:30950/TCP,443:32604/TCP,9443:31744/TCP   3h26m
service/istiod                 ClusterIP      172.20.112.158   <none>                                                                          15010/TCP,15012/TCP,443/TCP,15014/TCP       3h26m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/istio-ingressgateway   3/3     3            3           3h26m
deployment.apps/istiod                 2/2     2            2           3h26m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/istio-ingressgateway-68fcd46fff   0         0         0       84m
replicaset.apps/istio-ingressgateway-7956f6d57c   3         3         3       14m
replicaset.apps/istio-ingressgateway-866d9bd74b   0         0         0       3h26m
replicaset.apps/istiod-7f7bbcdcbb                 2         2         2       3h26m
```
Check the gateway configuration for errors. The gateway rule is already set to listen on port 443:
```bash
k get gateway/upm-portal -n istio-system -oyaml
```

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
…
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - portal.basupport.org
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: true
  - hosts:
    - portal.basupport.org
    port:
      name: https-443
      number: 443
      protocol: HTTPS
…
```
Then check the deployment logs:
```
2025-02-20T03:42:32.203112Z  warning  envoy main external/envoy/source/server/server.cc:835  Usage of the deprecated runtime key overload.global_downstream_max_connections, consider switching to `envoy.resource_monitors.downstream_connections` instead.This runtime key will be removed in future.  thread=15
2025-02-20T03:42:32.203389Z  warning  envoy main external/envoy/source/server/server.cc:928  There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.  thread=15
2025-02-20T03:42:32.211729Z  info  xdsproxy  connected to delta upstream XDS server: istiod.istio-system.svc:15012  id=1
2025-02-20T03:42:32.268369Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-1
2025-02-20T03:42:32.268437Z  info  cache  returned workload certificate from cache  ttl=23h59m59.731565472s
2025-02-20T03:42:32.268890Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.279500Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-2
2025-02-20T03:42:32.279654Z  info  cache  returned workload certificate from cache  ttl=23h59m59.72034871s
2025-02-20T03:42:32.279711Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.280287Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-3
2025-02-20T03:42:32.280372Z  info  cache  returned workload trust anchor from cache  ttl=23h59m59.719629366s
2025-02-20T03:42:32.280503Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:1.1kB resource:ROOTCA
2025-02-20T03:42:32.320657Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-oidc from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323034Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-error-transformer from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323073Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-authz-checker from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:35.074684Z  warning  envoy wasm external/envoy/source/extensions/common/wasm/context.cc:1198  wasm log: error parsing plugin configuration: Error("aes_key: Invalid Length, got 3 bytes, expected 32", line: 1, column: 12)  thread=15
2025-02-20T03:42:35.074712Z  error  envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:110  Wasm VM failed Failed to configure base Wasm plugin  thread=15
2025-02-20T03:42:35.075923Z  critical  envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:474  Plugin configured to fail closed failed to load  thread=15
2025-02-20T03:42:35.076559Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/delta_subscription_state.cc:269  delta config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076572Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076577Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076584Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.189419Z  info  Readiness succeeded in 3.054667763s
2025-02-20T03:42:35.189690Z  info  Envoy proxy is ready
```
The error here was because the wasm plugin was not bootstrapped successfully, so the `ingressgateway` isn't really ready:
```bash
k get pod -listio=ingressgateway -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          72m
```
However, in the pod description you can see that the readiness check actually failed:
```bash
k describe pod/istio-ingressgateway-7956f6d57c-jd8jm -n istio-system
Warning  Unhealthy  20m (x2 over 20m)  kubelet  Readiness probe failed: Get "http://10.0.30.134:15021/healthz/ready": dial tcp 10.0.30.134:15021: connect: connection refused
```
This issue is caused by a misconfiguration of the wasm plugin: the `aes_key` is not set correctly. Fix the configuration and redeploy the `ingressgateway`:
```bash
k edit WasmPlugin/upm-oidc
```
Then set the following `aes_key` to the default value (or a custom-generated value in a real production environment):
```yaml
pluginConfig:
  aes_key: rzkutHl8NJNztPMEJYykZouHslNiA7xmIXH+58ISUVo=
```
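To generate a custom key instead of using the default, any 32-byte random value encoded in base64 works, for example:

```bash
# Generate a random 32-byte key, base64-encoded, for use as aes_key
openssl rand -base64 32
```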
Then restart the deployment:
```bash
k rollout restart deployment/istio-ingressgateway -n istio-system
```
Postgres Cluster Provisioning Stuck at X% Complete
Detailed Error:
A single-node cluster creation is stuck at 87% completion.
Resolution:
The CPU and Memory range is much smaller than the nodegroup
Detailed Error:
The node group is m5.4xlarge (16 vCPU and 64 GB RAM) but has no nodes. At cluster creation, the ranges shown in the console are CPU (0-3.92 cores) and Memory (0-14.3 Gi).
Resolution:
This is caused by a node pool of size 0. If the node pool has at least one node, the range is correct. However, once the problem occurs, adding a node doesn't resolve it; the console still shows the same values after the node is created.
Pending: https://enterprisedb.atlassian.net/browse/UPM-45883
Disk scale up
Detailed Error:
After scaling the disk in the console, the new value is shown in the cluster YAML but isn't applied to the PVC/PV.
```bash
kubectl get cluster p-2hgee2782y -o yaml
```

```yaml
storage:
  resizeInUseVolumes: true
  size: 5Gi
```
```bash
k get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   2Gi   RWO   Delete   Bound   p-2hgee2782y/p-2hgee2782y-1
```
Resolution:
The CNP operator logs contain the error:
```bash
kubectl logs postgresql-operator-controller-manager-7cc67597df-4ggdf -n postgresql-operator-system
```

```json
{
  "level": "error",
  "ts": "2025-01-17T10:02:35.739693483Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "postgresql.k8s.enterprisedb.io",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "p-2hgee2782y",
    "namespace": "p-2hgee2782y"
  },
  "namespace": "p-2hgee2782y",
  "name": "p-2hgee2782y",
  "reconcileID": "2de2ade7-2bf7-45f4-a2af-3f135a4f7ad9",
  "error": "persistentvolumeclaims \"p-2hgee2782y-1\" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"
}
```
Check and edit the `storageclass` to set `allowVolumeExpansion: true`:

```bash
kubectl edit storageclass gp2
k get storageclass gp2 -o yaml
```

```yaml
allowVolumeExpansion: true
```
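Alternatively, the flag can be set without opening an editor (a one-line sketch against the gp2 storage class):

```bash
# Allow PVCs provisioned by gp2 to be expanded
kubectl patch storageclass gp2 -p '{"allowVolumeExpansion": true}'
```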
Delete the CNP operator pod to apply the new change:
```bash
kubectl delete pod postgresql-operator-controller-manager-7cc67597df-tgl7b -n postgresql-operator-system
```
```bash
kubectl get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   5Gi   RWO   Delete   Bound   p-2hgee2782y/p-2hgee2782y-1
```
stats-collector-db disk issues
In most cases, if you are having disk issues with `stats-collector-db`, you get one of the following:

- An alert in Alert Manager saying your disk is close to capacity, if your cluster is still running because the disk is not yet full.
- An alert in Alert Manager saying a pod is in a crash loop, if your cluster crashed due to a full disk. In that case, you also won't be able to get relevant disk stats in your cluster's Monitoring tab.
You can review the two scenarios and the available workarounds below.
Scenario 1: High disk usage while the cluster is still running
If the cluster is operating normally, but you notice high disk usage, you have two options.
Solution A: Reduce data retention period
Reducing the data retention period is a quick way to lower disk usage by storing less data. The default retention period is 168 hours (7 days). Using this solution, you will reduce the retention period to 96 hours (4 days).
Reducing the retention period frees up approximately 42% more available disk. Perform the following steps to reduce the retention period:
1. Edit the `ConfigMap` using the following command from a terminal with `kubectl` access to the platform (if you're unsure of the exact ConfigMap name, see the command after this procedure):

   ```bash
   kubectl -n upm-api-stats-collector edit cm upm-api-stats-collector-configmap-<your-config-map-string>
   ```

2. In the `ConfigMap`, find the `retention-period` parameter and change its value from `168h` to `96h`.

3. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
After completing these steps, you should notice DB usage staying within the stable range and the alert in Alert Manager should disappear.
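If you're unsure of the exact ConfigMap name referenced in step 1, listing the ConfigMaps in the namespace shows it:

```bash
# Find the upm-api-stats-collector ConfigMap name (the suffix varies per install)
kubectl get configmap -n upm-api-stats-collector
```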
Solution B: Increase disk volume size
If you need to keep 7 days of data, you must increase the disk's storage capacity to solve the issue.
1. Edit the cluster configuration:

   ```bash
   kubectl edit cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Locate the `stats-collector-db` specification and increase the value of `.spec.storage.size` to the preferred size (see the sketch after this procedure for a non-interactive alternative).

3. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
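As a non-interactive alternative to steps 1 and 2, the storage size can be patched directly; the target size below (100Gi) is only an example:

```bash
# Patch .spec.storage.size on the stats-collector-db cluster resource
# (100Gi is a hypothetical target size; choose one that fits your retention needs)
kubectl patch cluster stats-collector-db -n upm-api-stats-collector \
  --type merge -p '{"spec": {"storage": {"size": "100Gi"}}}'
```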
Scenario 2: Disk is full and cluster is crashing
If the disk is full, it can cause the cluster to get stuck in a crash loop, for example:
```bash
upm-api-stats-collector   stats-collector-db-1   0/1   CrashLoopBackOff   197 (4m34s ago)   4d7h
```
Choose one of the following options to mitigate the issue:
Solution A: Recreate the database
If you don't need to keep the historical stats data, follow these steps:

1. Delete the `stats-collector-db` cluster:

   ```bash
   kubectl delete cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Delete the `stats-collector-db` backup folder from your S3 bucket (see the CLI example after this procedure): `S3/edb-internal-backups/<your-S3-string>/databases/transporter-db/`

3. Run `helm upgrade` to recreate an empty database with the default settings:

   ```bash
   helm upgrade -n edbpgai-bootstrap \
     --install -f ./values.yaml \
     --version "${EDB_PLATFORM_VERSION}" \
     edbpgai-bootstrap "${HCP_HELM_REPO_NAME}/edbpgai-bootstrap"
   ```

4. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
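For step 2, the backup folder can also be removed with the AWS CLI, assuming the first path segment after `S3/` is the bucket name; adjust the placeholders to match your environment:

```bash
# Recursively delete the backup prefix shown above from the S3 bucket
aws s3 rm "s3://edb-internal-backups/<your-S3-string>/databases/transporter-db/" --recursive
```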
Solution B: Increase disk volume size
If you need to preserve the existing data, you must increase the disk's volume size.
1. Edit the cluster configuration:

   ```bash
   kubectl edit cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Locate the `stats-collector-db` specification and increase the value of `.spec.storage.size`.

3. Delete the pod in the crash loop (see the example after this procedure).

4. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
Once restarted, the database has a larger volume, which resolves the crash loop.
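For step 3, deleting the crash-looping pod from the earlier example looks like this (the pod name comes from the output shown at the start of Scenario 2):

```bash
# Delete the crashed pod so it is recreated against the resized volume
kubectl delete pod stats-collector-db-1 -n upm-api-stats-collector
```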
Warning

Increasing the volume size above the default blocks patch version upgrades for clusters running version `1.2.x`.
How to re-enable cluster upgrades
If you have previously increased the disk volume size but now need to perform a minor version upgrade on HM v1.2.x, then you must first resize the database back to the default size (50GB).
Note
Please contact the Support team for assistance with the database resizing procedure.