Troubleshooting PGAIHM

This guide provides instructions for troubleshooting the Hybrid Manager (PGAIHM). In the command examples that follow, k and kc are shell aliases for kubectl.

Log files

The following log files can be used when troubleshooting the PGAIHM:

| Log file name | Category | Location | Explanation |
| --- | --- | --- | --- |

Error codes

| Error Code | Category | Error Message | Explanation | Potential Solution |
| --- | --- | --- | --- | --- |
| n/a | Installation | [upm-istio] - failed to install components: [upm-istio] | An error from bootstrap: the LoadBalancer service is pending. | Fix the load balancer IAM role first, then delete the load balancer controller pods to trigger recreation. |
| n/a | Installation | upm-beaco-ff-base install failed | | Run kubectl edit sc gp2 to set the default storage class annotation. |
| n/a | Installation | upm-thanos installation failed | | |
| n/a | Installation | Failed to pull image | | |
| n/a | Installation | Loki pods crash | The Loki pods crashed and the installation failed. | |
| n/a | Portal Login | No healthy upstream | "No healthy upstream" is returned after login. | |
| 401 | Portal Login | 401 Access Denied | A 401 error is returned when logging into the portal after installation. | |
| 500 | Portal Login | HTTP Error 500 | The ingress gateway can't accept incoming connections. | |
| n/a | Database Provisioning | Postgres Cluster Provisioning Stuck at X% Complete | | |
| n/a | Database Provisioning | CPU and Memory range is much smaller than the nodegroup | | |
| n/a | Database Update | Disk scale up | The new value is shown in the cluster YAML but not applied to the PVC/PV. | Edit the storage class. |

Errors in detail

[upm-istio] - failed to install components: [upm-istio]

Detailed Error:

```
5:03:42AM: ---- waiting on 1 changes [4/5 done] ----
5:03:42AM: ongoing: reconcile service/istio-ingressgateway (v1) namespace: istio-system
5:03:42AM:  ^ Load balancer ingress is empty
{"level":"error","msg":"failed to install component: failed to execute kapp command, error: Timed out waiting after 15m0s for resources: [service/istio-ingressgateway (v1) namespace: istio-system], message: "}
{"level":"error","msg":"Failed to install components: [upm-istio]"}
{"level":"error","msg":"Installation failed, error: failed to install components: [upm-istio]"}
```

```bash
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    <pending>     80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h3m
istiod                 ClusterIP      172.20.146.223   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP       3h3m
k describe svc istio-ingressgateway -n istio-system
Events:
  Type     Reason            Age    From     Message
  ----     ------            ----   ----     -------
  Warning  FailedBuildModel  53m    service  Failed build model due to operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: 07cc71c0-3113-4b4e-be7f-3de3b475c49b, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

Resolution:

Fix the load balancer IAM role, then delete the AWS Load Balancer Controller pods to trigger recreation of the service:

```bash
k delete pod aws-load-balancer-controller-57ccd8bc77-f8lrk -n kube-system
k delete pod aws-load-balancer-controller-57ccd8bc77-j58c2 -n kube-system
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    k8s-istiosys-istioing-38d88046aa-37454e78c9648556.elb.us-east-1.amazonaws.com   80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h8m
istiod                 ClusterIP      172.20.146.223   <none>                                                                          15010/TCP,15012/TCP,443/TCP,15014/TCP       3h8m
```
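To confirm which IAM role the controller assumes before fixing it, check the service account annotation. A minimal check, assuming the controller uses IRSA with the default service account name:

```bash
# Print the IAM role ARN annotated on the controller's service account (IRSA).
k get sa aws-load-balancer-controller -n kube-system \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```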

upm-beaco-ff-base install failed

Detailed Error:

```bash
kc get pod -n upm-beaco-ff-base
NAME                    READY   STATUS    RESTARTS   AGE
app-db-1-initdb-k4nbl   0/1     Pending   0          35m
kc get pvc -n upm-beaco-ff-base
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
app-db-1   Pending                                                     <unset>                 35m
kc describe pvc app-db-1 -n upm-beaco-ff-base
Name:          app-db-1
Namespace:     upm-beaco-ff-base
StorageClass:
Status:        Pending
Volume:
…….
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  64s (x142 over 36m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
```

Resolution:

Run kubectl edit sc gp2 and add the following annotation to make gp2 the default storage class:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```
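Alternatively, the same annotation can be applied without opening an editor; a sketch using kubectl patch:

```bash
# Mark gp2 as the default storage class.
kubectl patch storageclass gp2 \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```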

upm-thanos installation failed

Detailed Error:

Resolution:

  1. Check for any pods in a crash loop:

```bash
kubectl get pod -n monitoring
```

  2. Check the logs of any crash-looping pod for permission-related errors, as shown below.
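For step 2, the previous (crashed) container's logs usually contain the failure. A sketch, with the pod name as a placeholder:

```bash
# Show logs from the last terminated container instance of a crash-looping pod.
kubectl logs <crashlooping-pod> -n monitoring --previous
```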

Check whether the secret is correct:

```bash
kubectl get secret -n monitoring
```

The secret should match what you defined in edb-object-storage:

```bash
kubectl get secret -n monitoring thanos-objstore-secret -o yaml
apiVersion: v1
data:
  objstore.yml: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg==
```

To decode the value:

```bash
echo dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg== | base64 -d
```
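As a shortcut, the value can be extracted and decoded in one pipeline, in the same style as the dex example later in this guide:

```bash
# Extract and decode the object storage configuration in one step.
kubectl get secret -n monitoring thanos-objstore-secret \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d
```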

To correct the secret, delete it along with the storage-location-operator pod so that both are recreated:

```bash
kubectl delete secret thanos-objstore-secret -n monitoring
kubectl delete pod storage-location-operator-controller-manager-xxxxxx -n storage-location-operator
```

Re-run helm upgrade, ensure the recreated thanos-objstore-secret is correct, and delete any pod still in a crash loop:

```bash
kubectl delete pod <pod in crashloop> -n monitoring
```
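After the operator recreates the secret, re-run the decode pipeline above to confirm the bucket, region, and endpoint match your object storage definition, then watch the pods recover:

```bash
# Watch the monitoring namespace until the Thanos components reach Running.
kubectl get pod -n monitoring -w
```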

Failed to pull image

Detailed Error:

```
NAME                                             READY   STATUS                  RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg   0/1     Init:ImagePullBackOff   0          2m49s

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  53s               default-scheduler  Successfully assigned edbpgai-bootstrap/edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg to i-001f23a316e13e905
  Warning  Failed     22s               kubelet            Failed to pull image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to resolve reference "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to do request: Head "https://docker.enterprisedb.com/v2/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks/manifests/v1.0.6-appl": dial tcp 18.67.76.32:443: i/o timeout
  Warning  Failed     22s               kubelet            Error: ErrImagePull
  Normal   BackOff    22s               kubelet            Back-off pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
  Warning  Failed     22s               kubelet            Error: ImagePullBackOff
  Normal   Pulling    7s (x2 over 52s)  kubelet            Pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
```

Resolution:

  1. In the AWS console, edit the related subnet and enable Auto-assign public IPv4 address (a CLI alternative is sketched below).

  2. Delete the nodes so that EKS recreates them and picks up the change:

```bash
k get node
NAME                  STATUS   ROLES    AGE   VERSION
i-001f23a316e13e905   Ready    <none>   19h   v1.31.4-eks-0f56d01
i-012fd44b2f575a514   Ready    <none>   20h   v1.31.4-eks-0f56d01
i-076d6a447c6551a21   Ready    <none>   20h   v1.31.4-eks-0f56d01
k delete node i-001f23a316e13e905 i-012fd44b2f575a514 i-076d6a447c6551a21
node "i-001f23a316e13e905" deleted
node "i-012fd44b2f575a514" deleted
node "i-076d6a447c6551a21" deleted
k get pod -n edbpgai-bootstrap
NAME                                             READY   STATUS    RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-5ee7l7-xzfqt   1/1     Running   0          118s
```
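For step 1, the same subnet change can also be made from the AWS CLI; a sketch, with the subnet ID as a placeholder:

```bash
# Enable auto-assign public IPv4 on the subnet used by the node group.
aws ec2 modify-subnet-attribute --subnet-id subnet-0123456789abcdef0 --map-public-ip-on-launch
```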

Loki pods crash

Detailed Error:

```bash
k get pods -n logging
NAME                         READY   STATUS             RESTARTS         AGE
loki-backend-0               1/2     CrashLoopBackOff   16 (2m1s ago)    59m
loki-read-5b8555fb64-9srmq   0/1     CrashLoopBackOff   16 (2m40s ago)   59m
loki-read-5b8555fb64-fw6ls   0/1     CrashLoopBackOff   16 (2m7s ago)    59m
loki-read-5b8555fb64-nw2vq   0/1     CrashLoopBackOff   16 (2m17s ago)   59m
loki-write-0                 0/1     CrashLoopBackOff   16 (2m37s ago)   59m
loki-write-1                 0/1     CrashLoopBackOff   16 (2m5s ago)    59m
loki-write-2                 0/1     CrashLoopBackOff   16 (2m16s ago)   59m
k logs loki-write-0 -n logging
level=info ts=2025-02-26T09:32:35.173785968Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2025-02-26T09:32:35.173821129Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2025-02-26T09:32:35.174421237Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2025-02-26T09:32:35.175391511Z caller=memberlist_client.go:435 msg="Using memberlist cluster label and node name" cluster_label= node=loki-write-0-a0cfbc5d
level=info ts=2025-02-26T09:32:35.17599149Z caller=shipper.go:160 index-store=tsdb-2020-05-15 msg="starting index shipper in WO mode"
level=info ts=2025-02-26T09:32:35.176072171Z caller=table_manager.go:136 index-store=tsdb-2020-05-15 msg="uploading tables"
level=info ts=2025-02-26T09:32:35.176269663Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2025-02-26T09:32:35.176305024Z caller=manager.go:86 index-store=tsdb-2020-05-15 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=0 indices=0 failures=0
level=info ts=2025-02-26T09:32:35.176325994Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2025-02-26T09:32:35.180393332Z caller=module_service.go:82 msg=starting module=server
level=info ts=2025-02-26T09:32:35.180477643Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=error ts=2025-02-26T09:32:35.180526964Z caller=loki.go:524 msg="module failed" module=memberlist-kv error="starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180549544Z caller=loki.go:524 msg="module failed" module=ring error="failed to start ring, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180582375Z caller=loki.go:524 msg="module failed" module=store error="failed to start store, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180588655Z caller=loki.go:524 msg="module failed" module=ingester error="failed to start ingester, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180593435Z caller=loki.go:524 msg="module failed" module=distributor error="failed to start distributor, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180597415Z caller=loki.go:524 msg="module failed" module=analytics error="failed to start analytics, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=info ts=2025-02-26T09:32:35.180687877Z caller=modules.go:1832 msg="server stopped"
level=info ts=2025-02-26T09:32:35.180702097Z caller=module_service.go:120 msg="module stopped" module=server
level=info ts=2025-02-26T09:32:35.180708417Z caller=loki.go:508 msg="Loki stopped" running_time=85.855531ms
failed services
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:566
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695
level=error ts=2025-02-26T09:32:35.180772078Z caller=log.go:216 msg="error running loki" err="failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:566\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"
```

Resolution:

This is a known Loki issue (https://github.com/grafana/loki/issues/8634): memberlist can't find a private IP address to advertise when the VPC CIDR is outside the 10.* range.

  1. Check the IPv4 CIDR of your VPC. For example, if it's set to 11.1.0.0/20, it should be 10.1.0.0/20.
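The VPC CIDR can also be checked from the CLI; a sketch using the AWS CLI, with the VPC ID as a placeholder:

```bash
# Show the CIDR block of the VPC hosting the cluster.
aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 --query 'Vpcs[].CidrBlock' --output text
```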

No healthy upstream

Detailed Error:

The upm-dex pod is in a crash loop. Check the pod logs to view the error details. In this case, pgai.portal.authentication.staticPasswords isn't configured correctly, so update the configuration and re-run the deployment.

```bash
upm-dex                      upm-dex-66b96b5857-pdcg2                                        1/2     CrashLoopBackOff   206 (3m34s ago)   17h
kubectl logs upm-dex-66b96b5857-pdcg2 -n upm-dex
error parse config file /tmp/dex.config.yaml-565347154: error unmarshaling JSON: malformed bcrypt hash: crypto/bcrypt: hashedSecret too short to be a bcrypted password
```

Resolution:

Add the four staticPasswords parameters to the helm upgrade command:

```bash
helm upgrade \
    -n edbpgai-bootstrap \
    --install \
    --set parameters.global.portal_domain_name="${PORTAL_DOMAIN_NAME}" \
    --set parameters.transporter-rw-service.domain_name="${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}" \
    --set parameters.transporter-dp-agent.rw_service_url="https://${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}/transporter" \
    --set parameters.upm-beacon.server_host="${BEACON_SERVICE_DOMAIN_NAME}" \
    --set parameters.upm-beaco-ff-base.cookie_aeskey="${AES_256_KEY}" \
    --set system="eks" \
    --set remoteContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
    --set internalContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
    --set bootstrapAsset="${REGISTRY_PACKAGE_URL}/edbpgai-bootstrap/bootstrap-eks:${EDBPGAI_BOOTSTRAP_IMAGE_VERSION}" \
    --set pgai.portal.authentication.staticPasswords[0].email="owner@mycompany.com" \
    --set pgai.portal.authentication.staticPasswords[0].hash='$2y$10$STTzuGJa3PydqHvlzi2z6OgDU1G/JLTqiuGblH.RemIutWxkztN5m' \
    --set pgai.portal.authentication.staticPasswords[0].username="owner@mycompany.com" \
    --set pgai.portal.authentication.staticPasswords[0].userID="c5998173-a605-449a-a9a5-4a9c33e26df7" \
    --version "${EDBPGAI_BOOTSTRAP_HELM_CHART_VERSION}" \
    edbpgai-bootstrap enterprisedb-edbpgai/edbpgai-bootstrap
```
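The hash value must be a valid bcrypt hash, which is what the dex error above complains about. One common way to generate one, assuming htpasswd from apache2-utils is available (the password is a placeholder):

```bash
# Generate a bcrypt hash (cost 10) for the portal owner's password.
htpasswd -bnBC 10 "" 'your-password' | tr -d ':\n'
```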

401 Access Denied

Detailed Error:

Resolution:

Check the config.yaml file. In this case, the staticPasswords section isn't set properly. Re-run helm upgrade with the proper values.

```bash
k get secret upm-dex -n upm-dex -o yaml
apiVersion: v1
data:
  config.yaml: aXNzdWVyOiBodHRwczovL3BvcnRhbC5iYXN1cHBvcnQub3JnL2F1dGgKc3RvcmFnZToKICB0eXBlOiBwb3N0Z3JlcwogIGNvbmZpZzoKICAgIGhvc3Q6IGFwcC1kYi1ydy51cG0tYmVhY28tZmYtYmFzZS5zdmMuY2x1c3Rlci5sb2NhbAogICAgcG9ydDogNTQzMgogICAgZGF0YWJhc2U6IHVwbQogICAgdXNlcjogdXBtCiAgICBwYXNzd29yZDogdXBtCiAgICBzc2w6CiAgICAgIG1vZGU6IHJlcXVpcmUKd2ViOgogIGh0dHA6IDAuMC4wLjA6NTU1Ngpmcm9udGVuZDoKICBpc3N1ZXI6IEVEQgogIGxvZ29VUk…
```

Decode the config.yaml value (xxx stands for the base64 string above):

```bash
echo "xxx" | base64 -d
staticPasswords:
  - email: owner@mycompany.com
    hash: XXX
    userID: c5998173-a605-449a-a9a5-4a9c33e26df7
    username: owner@mycompany.com
```

As an alternative, use this one-line command:

```bash
k get secret -n upm-dex upm-dex -ojsonpath='{.data.config\.yaml}' | base64 -d
```

HTTP Error 500

Detailed Error:

The ingress gateway can't accept incoming connections. If you're accessing the web page from a public network, you must connect through port 443, since HTTPS is the only allowed protocol.

Resolution:

Check the status of the ingress gateway. In this deployment, the ingress gateway is istio-ingressgateway.

```bash
k get all -n istio-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          14m
pod/istiod-7f7bbcdcbb-fzggv                 1/1     Running   0          3h26m
pod/istiod-7f7bbcdcbb-mxjpz                 1/1     Running   0          3h26m

NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)                                     AGE
service/istio-ingressgateway   LoadBalancer   172.20.92.132    k8s-istiosys-istioing-987da680df-8452f790b4602f2c.elb.us-east-1.amazonaws.com   80:30950/TCP,443:32604/TCP,9443:31744/TCP   3h26m
service/istiod                 ClusterIP      172.20.112.158   <none>                                                                          15010/TCP,15012/TCP,443/TCP,15014/TCP       3h26m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/istio-ingressgateway   3/3     3            3           3h26m
deployment.apps/istiod                 2/2     2            2           3h26m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/istio-ingressgateway-68fcd46fff   0         0         0       84m
replicaset.apps/istio-ingressgateway-7956f6d57c   3         3         3       14m
replicaset.apps/istio-ingressgateway-866d9bd74b   0         0         0       3h26m
replicaset.apps/istiod-7f7bbcdcbb                 2         2         2       3h26m
```

Check the gateway configuration for configuration errors. The gateway rule is already set to listen on port 443:

```bash
k get gateway/upm-portal -n istio-system -oyaml
apiVersion: networking.istio.io/v1
kind: Gateway
….
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - portal.basupport.org
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: true
  - hosts:
    - portal.basupport.org
    port:
      name: https-443
      number: 443
      protocol: HTTPS
  …
```

Then check the deployment logs:

```
2025-02-20T03:42:32.203112Z     warning envoy main external/envoy/source/server/server.cc:835   Usage of the deprecated runtime key overload.global_downstream_max_connections, consider switching to `envoy.resource_monitors.downstream_connections` instead.This runtime key will be removed in future.      thread=15
2025-02-20T03:42:32.203389Z     warning envoy main external/envoy/source/server/server.cc:928   There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.   thread=15
2025-02-20T03:42:32.211729Z     info    xdsproxy        connected to delta upstream XDS server: istiod.istio-system.svc:15012   id=1
2025-02-20T03:42:32.268369Z     info    ads     ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-1
2025-02-20T03:42:32.268437Z     info    cache   returned workload certificate from cache        ttl=23h59m59.731565472s
2025-02-20T03:42:32.268890Z     info    ads     SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.279500Z     info    ads     ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-2
2025-02-20T03:42:32.279654Z     info    cache   returned workload certificate from cache        ttl=23h59m59.72034871s
2025-02-20T03:42:32.279711Z     info    ads     SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.280287Z     info    ads     ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-3
2025-02-20T03:42:32.280372Z     info    cache   returned workload trust anchor from cache       ttl=23h59m59.719629366s
2025-02-20T03:42:32.280503Z     info    ads     SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:1.1kB resource:ROOTCA
2025-02-20T03:42:32.320657Z     info    wasm    fetching image staging_pgai-platform/upm-beaco-filters/upm-oidc from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323034Z     info    wasm    fetching image staging_pgai-platform/upm-beaco-filters/upm-error-transformer from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323073Z     info    wasm    fetching image staging_pgai-platform/upm-beaco-filters/upm-authz-checker from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:35.074684Z     warning envoy wasm external/envoy/source/extensions/common/wasm/context.cc:1198 wasm log: error parsing plugin configuration: Error("aes_key: Invalid Length, got 3 bytes, expected 32", line: 1, column: 12)   thread=15
2025-02-20T03:42:35.074712Z     error   envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:110     Wasm VM failed Failed to configure base Wasm plugin     thread=15
2025-02-20T03:42:35.075923Z     critical        envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:474     Plugin configured to fail closed failed to load thread=15
2025-02-20T03:42:35.076559Z     warning envoy config external/envoy/source/extensions/config_subscription/grpc/delta_subscription_state.cc:269  delta config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc        thread=15
2025-02-20T03:42:35.076572Z     warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138    gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc thread=15
2025-02-20T03:42:35.076577Z     warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138    gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc thread=15
2025-02-20T03:42:35.076584Z     warning envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138    gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc thread=15
2025-02-20T03:42:35.189419Z     info    Readiness succeeded in 3.054667763s
2025-02-20T03:42:35.189690Z     info    Envoy proxy is ready
```

The wasm plugin wasn't bootstrapped successfully, so the ingress gateway isn't actually ready even though the pods show as Running:

```bash
k get pod -listio=ingressgateway -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          72m
```

However, describing a pod shows that the readiness probe actually failed:

```bash
k describe pod/istio-ingressgateway-7956f6d57c-jd8jm -n istio-system
 Warning  Unhealthy  20m (x2 over 20m)  kubelet            Readiness probe failed: Get "http://10.0.30.134:15021/healthz/ready": dial tcp 10.0.30.134:15021: connect: connection refused
```

This issue is caused by a misconfiguration of the wasm plugin: the aes_key isn't set correctly. Fix the configuration and redeploy the ingress gateway:

```bash
k edit WasmPlugin/upm-oidc
```

Then set aes_key to the default value (or to a custom-generated value in a real production environment):

```yaml
pluginConfig:
  aes_key: rzkutHl8NJNztPMEJYykZouHslNiA7xmIXH+58ISUVo=
```
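The error above shows the key must decode to 32 bytes. A fresh key of the right length can be generated, for example, with openssl:

```bash
# Generate a random 32-byte key, base64-encoded, suitable for aes_key.
openssl rand -base64 32
```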

Then restart the deployment:

```bash
k rollout restart deployment/istio-ingressgateway -n istio-system
```

Postgres Cluster Provisioning Stuck at X% Complete

Detailed Error:

A single-node cluster creation is stuck at 87% completion.

Resolution:


The CPU and Memory range is much smaller than the nodegroup

Detailed Error:

The node group uses m5.4xlarge instances (16 vCPU and 64 GB RAM) but has no nodes. At cluster creation, the ranges shown in the UI are CPU (0-3.92 cores) and Memory (0-14.3 Gi).

Resolution:

This is caused by a zero-size node pool. If the node pool has nodes, the ranges are correct. However, once the problem occurs, creating a node doesn't resolve it: the UI still shows the same values after the node is created.

Pending: https://enterprisedb.atlassian.net/browse/UPM-45883


Disk scale up

Detailed Error:

After scaling up the disk in the UI, the new value is shown in the cluster YAML but isn't applied to the PVC/PV.

```bash
kubectl get cluster p-2hgee2782y -o yaml
  storage:
    resizeInUseVolumes: true
    size: 5Gi
k get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   2Gi    RWO            Delete           Bound    p-2hgee2782y/p-2hgee2782y-1
```

Resolution:

The CNP operator logs contain the error:

```bash
kubectl logs postgresql-operator-controller-manager-7cc67597df-4ggdf -n postgresql-operator-system
{
  "level": "error",
  "ts": "2025-01-17T10:02:35.739693483Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "postgresql.k8s.enterprisedb.io",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "p-2hgee2782y",
    "namespace": "p-2hgee2782y"
  },
  "namespace": "p-2hgee2782y",
  "name": "p-2hgee2782y",
  "reconcileID": "2de2ade7-2bf7-45f4-a2af-3f135a4f7ad9",
  "error": "persistentvolumeclaims \"p-2hgee2782y-1\" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"
}
```

Check the storage class and edit it to set allowVolumeExpansion: true:

```bash
kubectl edit storageclass gp2
k get storageclass gp2 -o yaml
allowVolumeExpansion: true
```
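As with the default-class annotation earlier, this can also be done non-interactively; a kubectl patch sketch:

```bash
# Allow volume expansion on the gp2 storage class.
kubectl patch storageclass gp2 -p '{"allowVolumeExpansion": true}'
```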

Delete the CNP operator pod to apply the change:

```bash
kubectl delete pod postgresql-operator-controller-manager-7cc67597df-tgl7b -n postgresql-operator-system
kubectl get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   5Gi        RWO            Delete           Bound    p-2hgee2782y/p-2hgee2782y-1
```
