Troubleshooting PGAIHM
This guide provides instructions for troubleshooting the Hybrid Manager (PGAIHM).
Logfiles
The following log files can be used when troubleshooting the PGAIHM:
| Log file name | Category | Location | Explanation |
|---|---|---|---|
Error codes
| Error Code | Category | Error Message | Explanation | Potential Solution |
|---|---|---|---|---|
| na | Installation | `[upm-istio] - failed to install components: [upm-istio]` | An error from bootstrap: the LoadBalancer service is pending. | Fix the load balancer IAM role first, then delete the LB pods to trigger recreation. |
| na | Installation | `upm-beaco-ff-base` install failed | The `app-db-1` PVC stays pending because no default storage class is set. | Run `kubectl edit sc gp2` and mark it as the default storage class. |
| na | Installation | `upm-thanos` installation failed | | Check for pods in a crash loop and verify the `thanos-objstore-secret` secret. |
| na | Installation | Failed to pull image | | Enable auto-assign public IPv4 addresses on the related subnet and recreate the nodes. |
| na | Installation | Loki pods crash | The Loki pods crashed and the installation failed. | Check the IPv4 CIDR of your VPC. |
| na | Portal Login | No healthy upstream | "No healthy upstream" is returned after login. | Correct the `staticPasswords` configuration and re-run the deployment. |
| 401 | Portal Login | 401 Access Denied | A 401 error is returned when logging into the portal after installation. | Re-run `helm upgrade` with the proper `staticPasswords` values. |
| 500 | Portal Login | HTTP Error 500 | The ingress gateway can't accept incoming connections. | Fix the `aes_key` in the `WasmPlugin` configuration and restart the ingress gateway. |
| na | Database Provisioning | Postgres cluster provisioning stuck at X% complete | | |
| na | Database Provisioning | CPU and memory range is much smaller than the node group | | |
| na | Database Update | Disk scale up | The new value is shown in the cluster YAML but not applied to the PVC/PV. | Edit the storage class to allow volume expansion. |
Errors in detail
[upm-istio] - failed to install components: [upm-istio]
Detailed Error:
```
5:03:42AM: ---- waiting on 1 changes [4/5 done] ----
5:03:42AM: ongoing: reconcile service/istio-ingressgateway (v1) namespace: istio-system
5:03:42AM:  ^ Load balancer ingress is empty
{"level":"error","msg":"failed to install component: failed to execute kapp command, error: Timed out waiting after 15m0s for resources: [service/istio-ingressgateway (v1) namespace: istio-system], message: "}
{"level":"error","msg":"Failed to install components: [upm-istio]"}
{"level":"error","msg":"Installation failed, error: failed to install components: [upm-istio]"}
```

```bash
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    <pending>     80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h3m
istiod                 ClusterIP      172.20.146.223   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP       3h3m
```

```bash
k describe svc istio-ingressgateway -n istio-system
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedBuildModel  53m   service  Failed build model due to operation error Elastic Load Balancing v2: DescribeLoadBalancers, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: 07cc71c0-3113-4b4e-be7f-3de3b475c49b, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```
Resolution:
Fix the load balancer IAM role, then delete the AWS load balancer controller pods:
```bash
k delete pod aws-load-balancer-controller-57ccd8bc77-f8lrk -n kube-system
k delete pod aws-load-balancer-controller-57ccd8bc77-j58c2 -n kube-system
k get svc -n istio-system
NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP                                                                      PORT(S)                                     AGE
istio-ingressgateway   LoadBalancer   172.20.184.80    k8s-istiosys-istioing-38d88046aa-37454e78c9648556.elb.us-east-1.amazonaws.com   80:30270/TCP,443:31858/TCP,9443:32692/TCP   3h8m
istiod                 ClusterIP      172.20.146.223   <none>                                                                           15010/TCP,15012/TCP,443/TCP,15014/TCP       3h8m
```
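Before deleting the controller pods, it can help to confirm which IAM role the AWS Load Balancer Controller is actually using. The following is a minimal check, assuming the controller runs in `kube-system` with the usual service account name and an IRSA (`eks.amazonaws.com/role-arn`) annotation:

```bash
# Show the IAM role annotation on the AWS Load Balancer Controller service
# account; this role must be allowed to call sts:AssumeRoleWithWebIdentity
# for the cluster's OIDC provider.
kubectl describe sa aws-load-balancer-controller -n kube-system | grep role-arn
```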
upm-beaco-ff-base install failed
Detailed Error:
```bash
NAME                    READY   STATUS    RESTARTS   AGE
app-db-1-initdb-k4nbl   0/1     Pending   0          35m
```

```bash
kc get pvc -n upm-beaco-ff-base
NAME       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
app-db-1   Pending                                                     <unset>                 35m
```

```bash
kc describe pvc app-db-1 -n upm-beaco-ff-base
Name:          app-db-1
Namespace:     upm-beaco-ff-base
StorageClass:
Status:        Pending
Volume:
…
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  64s (x142 over 36m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
```
Resolution:

Run `kubectl edit sc gp2` and add the following annotation: `storageclass.kubernetes.io/is-default-class: "true"`

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```
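If you prefer not to open an editor, the same annotation can be applied with a one-line patch (a sketch using the standard Kubernetes default-class annotation; verify that the storage class name matches your cluster):

```bash
# Mark gp2 as the default storage class, then confirm it's flagged "(default)"
kubectl patch storageclass gp2 \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
kubectl get storageclass
```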
upm-thanos installation failed
Detailed Error:
Resolution:
- Check for any pods in a crash loop.
- Check the logs to look for any permission-related errors.
```bash
kubectl get pod -n monitoring
```
Check whether the secret is correct:

```bash
kubectl get secret -n monitoring
```

The secret should match what you defined in `edb-object-storage`:

```bash
kubectl get secret -n monitoring thanos-objstore-secret -o yaml
```

```yaml
apiVersion: v1
data:
  objstore.yml: dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg==
```

```bash
echo dHlwZTogUzMKY29uZmlnOgogIGJ1Y2tldDogYXBwbGlhbmNlLXNoYW50ZXN0LWVrcy1jbHVzdGVyLWVkYi1wb3N0Z3JlcwogIHJlZ2lvbjogYXAtc291dGgtMQogIGluc2VjdXJlOiBmYWxzZQogIGVuZHBvaW50OiBzMy5hcC1zb3V0aC0xLmFtYXpvbmF3cy5jb20KcHJlZml4OiAiZWRiLW1ldHJpY3MiCg== | base64 -d
```
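As a shortcut, you can decode the object-store configuration in one step and compare the bucket, region, endpoint, and prefix with what you defined in `edb-object-storage` (a minimal sketch using the secret and key names shown above):

```bash
# Extract and decode objstore.yml directly from the secret
kubectl get secret -n monitoring thanos-objstore-secret \
  -o jsonpath='{.data.objstore\.yml}' | base64 -d
```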
To correct the secret:
```bash
kubectl delete secret thanos-objstore-secret -n monitoring
kubectl delete pod storage-location-operator-controller-manager-xxxxxx -n storage-location-operator
```
Re-run `helm upgrade` and ensure that the `thanos-objstore-secret` secret is correct:

```bash
kubectl delete pod <pod in crashloop> -n monitoring
```
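If you're unsure which pods are still crash looping after the upgrade, a simple filter such as the following can help (a sketch; it just hides healthy pods):

```bash
# List pods in the monitoring namespace that aren't Running or Completed
kubectl get pods -n monitoring | grep -Ev 'Running|Completed'
```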
Failed to pull image
Detailed Error:
```bash
NAME                                             READY   STATUS                  RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg   0/1     Init:AssetPullBackOff   0          2m49s

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  53s               default-scheduler  Successfully assigned edbpgai-bootstrap/edbpgai-bootstrap-job-v1.0.6-appl-d12cvm-q2pbg to i-001f23a316e13e905
  Warning  Failed     22s               kubelet            Failed to pull image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to resolve reference "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl": failed to do request: Head "https://docker.enterprisedb.com/v2/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks/manifests/v1.0.6-appl": dial tcp 18.67.76.32:443: i/o timeout
  Warning  Failed     22s               kubelet            Error: ErrAssetPull
  Normal   BackOff    22s               kubelet            Back-off pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
  Warning  Failed     22s               kubelet            Error: AssetPullBackOff
  Normal   Pulling    7s (x2 over 52s)  kubelet            Pulling image "docker.enterprisedb.com/staging_pgai-platform/edbpgai-bootstrap/bootstrap-eks:v1.0.6-appl"
```
Resolution:
In your AWS console, edit your related subnet and enable Auto-assign public IPv4 address.
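The same change can be made with the AWS CLI if you prefer; the subnet ID below is a placeholder for the subnet used by your node group:

```bash
# Enable auto-assignment of public IPv4 addresses on the subnet
# (subnet-0123456789abcdef0 is a hypothetical ID; replace it with yours)
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0123456789abcdef0 \
  --map-public-ip-on-launch
```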
Delete nodes to let EKS pick up the change:
```bash
k get node
NAME                  STATUS   ROLES    AGE   VERSION
i-001f23a316e13e905   Ready    <none>   19h   v1.31.4-eks-0f56d01
i-012fd44b2f575a514   Ready    <none>   20h   v1.31.4-eks-0f56d01
i-076d6a447c6551a21   Ready    <none>   20h   v1.31.4-eks-0f56d01
```

```bash
k delete node i-001f23a316e13e905 i-012fd44b2f575a514 i-076d6a447c6551a21
node "i-001f23a316e13e905" deleted
node "i-012fd44b2f575a514" deleted
node "i-076d6a447c6551a21" deleted
```

```bash
kgp -n edbpgai-bootstrap
NAME                                             READY   STATUS    RESTARTS   AGE
edbpgai-bootstrap-job-v1.0.6-appl-5ee7l7-xzfqt   1/1     Running   0          118s
```
Loki pods crash
Detailed Error:
```bash
k get pods -n logging
NAME                         READY   STATUS             RESTARTS         AGE
loki-backend-0               1/2     CrashLoopBackOff   16 (2m1s ago)    59m
loki-read-5b8555fb64-9srmq   0/1     CrashLoopBackOff   16 (2m40s ago)   59m
loki-read-5b8555fb64-fw6ls   0/1     CrashLoopBackOff   16 (2m7s ago)    59m
loki-read-5b8555fb64-nw2vq   0/1     CrashLoopBackOff   16 (2m17s ago)   59m
loki-write-0                 0/1     CrashLoopBackOff   16 (2m37s ago)   59m
loki-write-1                 0/1     CrashLoopBackOff   16 (2m5s ago)    59m
loki-write-2                 0/1     CrashLoopBackOff   16 (2m16s ago)   59m
```
```bash
k logs loki-write-0 -n logging
level=info ts=2025-02-26T09:32:35.173785968Z caller=main.go:126 msg="Starting Loki" version="(version=release-3.1.x-89fe788, branch=release-3.1.x, revision=89fe788d)"
level=info ts=2025-02-26T09:32:35.173821129Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2025-02-26T09:32:35.174421237Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2025-02-26T09:32:35.175391511Z caller=memberlist_client.go:435 msg="Using memberlist cluster label and node name" cluster_label= node=loki-write-0-a0cfbc5d
level=info ts=2025-02-26T09:32:35.17599149Z caller=shipper.go:160 index-store=tsdb-2020-05-15 msg="starting index shipper in WO mode"
level=info ts=2025-02-26T09:32:35.176072171Z caller=table_manager.go:136 index-store=tsdb-2020-05-15 msg="uploading tables"
level=info ts=2025-02-26T09:32:35.176269663Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2025-02-26T09:32:35.176305024Z caller=manager.go:86 index-store=tsdb-2020-05-15 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=0 indices=0 failures=0
level=info ts=2025-02-26T09:32:35.176325994Z caller=head_manager.go:308 index-store=tsdb-2020-05-15 component=tsdb-head-manager msg="loaded wals by period" groups=1
level=info ts=2025-02-26T09:32:35.180393332Z caller=module_service.go:82 msg=starting module=server
level=info ts=2025-02-26T09:32:35.180477643Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=error ts=2025-02-26T09:32:35.180526964Z caller=loki.go:524 msg="module failed" module=memberlist-kv error="starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180549544Z caller=loki.go:524 msg="module failed" module=ring error="failed to start ring, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180582375Z caller=loki.go:524 msg="module failed" module=store error="failed to start store, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=error ts=2025-02-26T09:32:35.180588655Z caller=loki.go:524 msg="module failed" module=ingester error="failed to start ingester, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180593435Z caller=loki.go:524 msg="module failed" module=distributor error="failed to start distributor, because it depends on module analytics, which has failed: context canceled"
level=error ts=2025-02-26T09:32:35.180597415Z caller=loki.go:524 msg="module failed" module=analytics error="failed to start analytics, because it depends on module memberlist-kv, which has failed: invalid service state: Failed, expected: Running, failure: starting module memberlist-kv: invalid service state: Failed, expected: Running, failure: service memberlist_kv failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided"
level=info ts=2025-02-26T09:32:35.180687877Z caller=modules.go:1832 msg="server stopped"
level=info ts=2025-02-26T09:32:35.180702097Z caller=module_service.go:120 msg="module stopped" module=server
level=info ts=2025-02-26T09:32:35.180708417Z caller=loki.go:508 msg="Loki stopped" running_time=85.855531ms
failed services
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:566
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695
level=error ts=2025-02-26T09:32:35.180772078Z caller=log.go:216 msg="error running loki" err="failed services\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:566\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"
```
Resolution:
This is a known Loki issue (https://github.com/grafana/loki/issues/8634) that occurs when the VPC CIDR is not in the 10.* range.

- Check the IPv4 CIDR in your VPC. For example, if it's set to 11.1.0.0/20, it should be 10.1.0.0/20.
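To check the CIDR without opening the console, a quick AWS CLI query like the following works; the VPC ID is a placeholder:

```bash
# Print the IPv4 CIDR blocks of the VPC hosting the EKS cluster
# (vpc-0123456789abcdef0 is a hypothetical ID; replace it with yours)
aws ec2 describe-vpcs \
  --vpc-ids vpc-0123456789abcdef0 \
  --query 'Vpcs[].CidrBlock' \
  --output text
```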
No healthy upstream
Detailed Error:
The pod in `upm-dex` is in a crash loop. Check the pod logs to view the error details. In this case, `pgai.portal.authentication.staticPasswords` is not configured correctly, so update the configuration and re-run the deployment.
```bash
upm-dex       upm-dex-66b96b5857-pdcg2      1/2     CrashLoopBackOff   206 (3m34s ago)   17h

kubectl logs upm-dex-66b96b5857-pdcg2 -n upm-dex
error parse config file /tmp/dex.config.yaml-565347154: error unmarshaling JSON: malformed bcrypt hash: crypto/bcrypt: hashedSecret too short to be a bcrypted password
```
Resolution:
Add the four `pgai.portal.authentication.staticPasswords` parameters to the upgrade command:
```bash
helm upgrade \
  -n edbpgai-bootstrap \
  --install \
  --set parameters.global.portal_domain_name="${PORTAL_DOMAIN_NAME}" \
  --set parameters.transporter-rw-service.domain_name="${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}" \
  --set parameters.transporter-dp-agent.rw_service_url="https://${TRANSPORTER_RW_SERVICE_DOMAIN_NAME}/transporter" \
  --set parameters.upm-beacon.server_host="${BEACON_SERVICE_DOMAIN_NAME}" \
  --set parameters.upm-beaco-ff-base.cookie_aeskey="${AES_256_KEY}" \
  --set system="eks" \
  --set remoteContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
  --set internalContainerRegistryURL="${REGISTRY_PACKAGE_URL}" \
  --set bootstrapAsset="${REGISTRY_PACKAGE_URL}/edbpgai-bootstrap/bootstrap-eks:${EDBPGAI_BOOTSTRAP_IMAGE_VERSION}" \
  --set pgai.portal.authentication.staticPasswords[0].email="owner@mycompany.com" \
  --set pgai.portal.authentication.staticPasswords[0].hash='$2y$10$STTzuGJa3PydqHvlzi2z6OgDU1G/JLTqiuGblH.RemIutWxkztN5m' \
  --set pgai.portal.authentication.staticPasswords[0].username="owner@mycompany.com" \
  --set pgai.portal.authentication.staticPasswords[0].userID="c5998173-a605-449a-a9a5-4a9c33e26df7" \
  --version "${EDBPGAI_BOOTSTRAP_HELM_CHART_VERSION}" \
  edbpgai-bootstrap enterprisedb-edbpgai/edbpgai-bootstrap
```
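The `hash` value must be a bcrypt hash of the portal password. One common way to generate it is with `htpasswd` from `apache2-utils` (a sketch; the plaintext password here is a placeholder):

```bash
# Produce a bcrypt hash (cost 10) suitable for staticPasswords[0].hash;
# tr strips the leading colon and trailing newline from htpasswd's output.
htpasswd -bnBC 10 "" 'MyPortalPassword' | tr -d ':\n'
```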
401 Access Denied
Detailed Error:
Resolution:

Check the `config.yaml` file. The `staticPasswords` section is not set properly in this case. Re-run `helm upgrade` with the proper values.
```bash
k get secret upm-dex -n upm-dex -o yaml
```

```yaml
apiVersion: v1
data:
  config.yaml: aXNzdWVyOiBodHRwczovL3BvcnRhbC5iYXN1cHBvcnQub3JnL2F1dGgKc3RvcmFnZToKICB0eXBlOiBwb3N0Z3JlcwogIGNvbmZpZzoKICAgIGhvc3Q6IGFwcC1kYi1ydy51cG0tYmVhY28tZmYtYmFzZS5zdmMuY2x1c3Rlci5sb2NhbAogICAgcG9ydDogNTQzMgogICAgZGF0YWJhc2U6IHVwbQogICAgdXNlcjogdXBtCiAgICBwYXNzd29yZDogdXBtCiAgICBzc2w6CiAgICAgIG1vZGU6IHJlcXVpcmUKd2ViOgogIGh0dHA6IDAuMC4wLjA6NTU1Ngpmcm9udGVuZDoKICBpc3N1ZXI6IEVEQgogIGxvZ29VUk…
```

```bash
echo "xxx" | base64 -d
staticPasswords:
- email: owner@mycompany.com
  hash: XXX
  userID: c5998173-a605-449a-a9a5-4a9c33e26df7
  username: owner@mycompany.com
```
As an alternative, use this one-line command:

```bash
k get secret -n upm-dex upm-dex -ojsonpath='{.data.config\.yaml}' | base64 -d
```
HTTP Error 500
Detailed Error:
The ingress gateway can't accept incoming connections. If you're accessing the web page from a public network, the connection must use port 443, since HTTPS is the only allowed protocol.
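A quick external check can confirm whether the gateway answers on port 443 at all (a sketch; `<portal-domain>` is a placeholder for your portal hostname):

```bash
# Request only the response headers over HTTPS; -k skips certificate
# verification in case a self-signed certificate is in use.
curl -kI https://<portal-domain>/
```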
Resolution:
Check the status of the `ingressgateway`. Our `ingressgateway` is an `istio-ingressgateway`.
```bash
k get all -n istio-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          14m
pod/istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          14m
pod/istiod-7f7bbcdcbb-fzggv                 1/1     Running   0          3h26m
pod/istiod-7f7bbcdcbb-mxjpz                 1/1     Running   0          3h26m

NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP                                                                     PORT(S)                                     AGE
service/istio-ingressgateway   LoadBalancer   172.20.92.132    k8s-istiosys-istioing-987da680df-8452f790b4602f2c.elb.us-east-1.amazonaws.com   80:30950/TCP,443:32604/TCP,9443:31744/TCP   3h26m
service/istiod                 ClusterIP      172.20.112.158   <none>                                                                          15010/TCP,15012/TCP,443/TCP,15014/TCP       3h26m

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/istio-ingressgateway   3/3     3            3           3h26m
deployment.apps/istiod                 2/2     2            2           3h26m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/istio-ingressgateway-68fcd46fff   0         0         0       84m
replicaset.apps/istio-ingressgateway-7956f6d57c   3         3         3       14m
replicaset.apps/istio-ingressgateway-866d9bd74b   0         0         0       3h26m
replicaset.apps/istiod-7f7bbcdcbb                 2         2         2       3h26m
```
Check the gateway configuration for errors. The gateway rule is already set to listen on port 443:
```bash
k get gateway/upm-portal -n istio-system -oyaml
```

```yaml
apiVersion: networking.istio.io/v1
kind: Gateway
…
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - portal.basupport.org
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: true
  - hosts:
    - portal.basupport.org
    port:
      name: https-443
      number: 443
      protocol: HTTPS
…
```
Then check the deployment logs:
```
2025-02-20T03:42:32.203112Z  warning  envoy main external/envoy/source/server/server.cc:835  Usage of the deprecated runtime key overload.global_downstream_max_connections, consider switching to `envoy.resource_monitors.downstream_connections` instead.This runtime key will be removed in future.  thread=15
2025-02-20T03:42:32.203389Z  warning  envoy main external/envoy/source/server/server.cc:928  There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.  thread=15
2025-02-20T03:42:32.211729Z  info  xdsproxy  connected to delta upstream XDS server: istiod.istio-system.svc:15012  id=1
2025-02-20T03:42:32.268369Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-1
2025-02-20T03:42:32.268437Z  info  cache  returned workload certificate from cache  ttl=23h59m59.731565472s
2025-02-20T03:42:32.268890Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.279500Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-2
2025-02-20T03:42:32.279654Z  info  cache  returned workload certificate from cache  ttl=23h59m59.72034871s
2025-02-20T03:42:32.279711Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:4.0kB resource:default
2025-02-20T03:42:32.280287Z  info  ads  ADS: new connection for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system-3
2025-02-20T03:42:32.280372Z  info  cache  returned workload trust anchor from cache  ttl=23h59m59.719629366s
2025-02-20T03:42:32.280503Z  info  ads  SDS: PUSH request for node:istio-ingressgateway-68fcd46fff-g4nth.istio-system resources:1 size:1.1kB resource:ROOTCA
2025-02-20T03:42:32.320657Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-oidc from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323034Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-error-transformer from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:32.323073Z  info  wasm  fetching image staging_pgai-platform/upm-beaco-filters/upm-authz-checker from registry docker.enterprisedb.com with tag v1.1.44
2025-02-20T03:42:35.074684Z  warning  envoy wasm external/envoy/source/extensions/common/wasm/context.cc:1198  wasm log: error parsing plugin configuration: Error("aes_key: Invalid Length, got 3 bytes, expected 32", line: 1, column: 12)  thread=15
2025-02-20T03:42:35.074712Z  error  envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:110  Wasm VM failed Failed to configure base Wasm plugin  thread=15
2025-02-20T03:42:35.075923Z  critical  envoy wasm external/envoy/source/extensions/common/wasm/wasm.cc:474  Plugin configured to fail closed failed to load  thread=15
2025-02-20T03:42:35.076559Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/delta_subscription_state.cc:269  delta config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076572Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076577Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.076584Z  warning  envoy config external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138  gRPC config for type.googleapis.com/envoy.config.core.v3.TypedExtensionConfig rejected: Unable to create Wasm HTTP filter istio-system.upm-oidc  thread=15
2025-02-20T03:42:35.189419Z  info  Readiness succeeded in 3.054667763s
2025-02-20T03:42:35.189690Z  info  Envoy proxy is ready
```
The error here was because the wasm plugin was not bootstrapped successfully, so the `ingressgateway` isn't really ready:
```bash
k get pod -listio=ingressgateway -n istio-system
NAME                                    READY   STATUS    RESTARTS   AGE
istio-ingressgateway-7956f6d57c-jd8jm   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-jmmv8   1/1     Running   0          72m
istio-ingressgateway-7956f6d57c-r65tk   1/1     Running   0          72m
```
However, in the pod description you can see that the readiness check actually failed:
```bash
k describe pod/istio-ingressgateway-7956f6d57c-jd8jm -n istio-system
Warning  Unhealthy  20m (x2 over 20m)  kubelet  Readiness probe failed: Get "http://10.0.30.134:15021/healthz/ready": dial tcp 10.0.30.134:15021: connect: connection refused
```
This issue is caused by a misconfiguration of the wasm plugin: the `aes_key` is not set correctly. Fix the configuration and redeploy the `ingressgateway`:
```bash
k edit WasmPlugin/upm-oidc
```
Then set the following `aes_key` to the default value (or a custom-generated value in a real production environment):
```yaml
pluginConfig:
  aes_key: rzkutHl8NJNztPMEJYykZouHslNiA7xmIXH+58ISUVo=
```
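To generate a custom key instead of using the default, any 32-byte random value encoded in base64 works, for example:

```bash
# Generate a random 32-byte key, base64-encoded, for use as aes_key
openssl rand -base64 32
```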
Then restart the deployment:
```bash
k rollout restart deployment/istio-ingressgateway -n istio-system
```
Postgres Cluster Provisioning Stuck at X% Complete
Detailed Error:
A single-node cluster creation is stuck at 87% completion.
Resolution:
The CPU and Memory range is much smaller than the nodegroup
Detailed Error:
The node group is m5.4xlarge (16 vCPU and 64 GB RAM) but has no nodes. At cluster creation, the ranges shown in the console are CPU (0-3.92 cores) and Memory (0-14.3 Gi).
Resolution:
This is caused by a node pool of size 0. If the node pool has at least one node, the range is correct. However, once the problem occurs, adding a node doesn't resolve it; the console still shows the same values after the node is created.
Pending: https://enterprisedb.atlassian.net/browse/UPM-45883
Disk scale up
Detailed Error:
After scaling the disk in the console, the new value is shown in the cluster YAML but isn't applied to the PVC/PV.
```bash
kubectl get cluster p-2hgee2782y -o yaml
```

```yaml
storage:
  resizeInUseVolumes: true
  size: 5Gi
```
```bash
k get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   2Gi   RWO   Delete   Bound   p-2hgee2782y/p-2hgee2782y-1
```
Resolution:
The CNP operator logs contain the error:
```bash
kubectl logs postgresql-operator-controller-manager-7cc67597df-4ggdf -n postgresql-operator-system
```

```json
{
  "level": "error",
  "ts": "2025-01-17T10:02:35.739693483Z",
  "msg": "Reconciler error",
  "controller": "cluster",
  "controllerGroup": "postgresql.k8s.enterprisedb.io",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "p-2hgee2782y",
    "namespace": "p-2hgee2782y"
  },
  "namespace": "p-2hgee2782y",
  "name": "p-2hgee2782y",
  "reconcileID": "2de2ade7-2bf7-45f4-a2af-3f135a4f7ad9",
  "error": "persistentvolumeclaims \"p-2hgee2782y-1\" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\tcloud-native-postgres/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"
}
```
Check and edit the `storageclass` to set `allowVolumeExpansion: true`:

```bash
kubectl edit storageclass gp2
k get storageclass gp2 -o yaml
```

```yaml
allowVolumeExpansion: true
```
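Alternatively, the flag can be set without opening an editor (a one-line sketch against the gp2 storage class):

```bash
# Allow PVCs provisioned by gp2 to be expanded
kubectl patch storageclass gp2 -p '{"allowVolumeExpansion": true}'
```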
Delete the CNP operator pod to apply the new change:
```bash
kubectl delete pod postgresql-operator-controller-manager-7cc67597df-tgl7b -n postgresql-operator-system
```
```bash
kubectl get pv
pvc-4a35b3ba-507e-4d4c-9017-360326e57b9d   5Gi   RWO   Delete   Bound   p-2hgee2782y/p-2hgee2782y-1
```
stats-collector-db disk issues
In most cases, if you are having disk issues with `stats-collector-db`, you get one of the following:

- An alert in Alert Manager saying your disk is close to capacity, if your cluster is still running because the disk is not yet full.
- An alert in Alert Manager saying a pod is in a crash loop, if your cluster crashed due to a full disk. In that case, you also won't be able to get relevant disk stats in your cluster's Monitoring tab.
You can review the two scenarios and the available workarounds below.
Scenario 1: High disk usage while the cluster is still running
If the cluster is operating normally, but you notice high disk usage, you have two options.
Solution A: Reduce data retention period
Reducing the data retention period is a quick way to lower disk usage by storing less data. The default retention period is 168 hours (7 days). Using this solution, you will reduce the retention period to 96 hours (4 days).
Reducing the retention period frees up approximately 42% more available disk. Perform the following steps to reduce the retention period:
1. Edit the `ConfigMap` using the following command from a terminal with `kubectl` access to the platform (if you're unsure of the exact ConfigMap name, see the command after this procedure):

   ```bash
   kubectl -n upm-api-stats-collector edit cm upm-api-stats-collector-configmap-<your-config-map-string>
   ```

2. In the `ConfigMap`, find the `retention-period` parameter and change its value from `168h` to `96h`.

3. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
After completing these steps, you should notice DB usage staying within the stable range and the alert in Alert Manager should disappear.
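If you're unsure of the exact ConfigMap name referenced in step 1, listing the ConfigMaps in the namespace shows it:

```bash
# Find the upm-api-stats-collector ConfigMap name (the suffix varies per install)
kubectl get configmap -n upm-api-stats-collector
```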
Solution B: Increase disk volume size
If you need to keep 7 days of data, you must increase the disk's storage capacity to solve the issue.
1. Edit the cluster configuration:

   ```bash
   kubectl edit cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Locate the `stats-collector-db` specification and increase the value of `.spec.storage.size` to the preferred size (see the sketch after this procedure for a non-interactive alternative).

3. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
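As a non-interactive alternative to steps 1 and 2, the storage size can be patched directly; the target size below (100Gi) is only an example:

```bash
# Patch .spec.storage.size on the stats-collector-db cluster resource
# (100Gi is a hypothetical target size; choose one that fits your retention needs)
kubectl patch cluster stats-collector-db -n upm-api-stats-collector \
  --type merge -p '{"spec": {"storage": {"size": "100Gi"}}}'
```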
Scenario 2: Disk is full and cluster is crashing
If the disk is full, it can cause the cluster to get stuck in a crash loop, for example:
```bash
upm-api-stats-collector   stats-collector-db-1   0/1   CrashLoopBackOff   197 (4m34s ago)   4d7h
```
Choose one of the following options to mitigate the issue:
Solution A: Recreate the database
If you don't need to keep the historical stats data, follow these steps:

1. Delete the `stats-collector-db` cluster:

   ```bash
   kubectl delete cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Delete the `stats-collector-db` backup folder from your S3 bucket (see the CLI example after this procedure): `S3/edb-internal-backups/<your-S3-string>/databases/transporter-db/`

3. Run `helm upgrade` to recreate an empty database with the default settings:

   ```bash
   helm upgrade -n edbpgai-bootstrap \
     --install -f ./values.yaml \
     --version "${EDB_PLATFORM_VERSION}" \
     edbpgai-bootstrap "${HCP_HELM_REPO_NAME}/edbpgai-bootstrap"
   ```

4. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
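For step 2, the backup folder can also be removed with the AWS CLI, assuming the first path segment after `S3/` is the bucket name; adjust the placeholders to match your environment:

```bash
# Recursively delete the backup prefix shown above from the S3 bucket
aws s3 rm "s3://edb-internal-backups/<your-S3-string>/databases/transporter-db/" --recursive
```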
Solution B: Increase disk volume size
If you need to preserve the existing data, you must increase the disk's volume size.
1. Edit the cluster configuration:

   ```bash
   kubectl edit cluster stats-collector-db -n upm-api-stats-collector
   ```

2. Locate the `stats-collector-db` specification and increase the value of `.spec.storage.size`.

3. Delete the pod in the crash loop (see the example after this procedure).

4. Restart `upm-api-stats-collector`:

   ```bash
   kubectl rollout restart deploy/upm-api-stats-collector -n upm-api-stats-collector
   ```
Once restarted, the database has a larger volume, which resolves the crash loop.
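For step 3, deleting the crash-looping pod from the earlier example looks like this (the pod name comes from the output shown at the start of Scenario 2):

```bash
# Delete the crashed pod so it is recreated against the resized volume
kubectl delete pod stats-collector-db-1 -n upm-api-stats-collector
```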
Warning

Increasing the volume size above the default blocks patch version upgrades for clusters running version `1.2.x`.
How to re-enable cluster upgrades
If you have previously increased the disk volume size but now need to perform a minor version upgrade on HM v1.2.x, then you must first resize the database back to the default size (50GB).
Note
Please contact the Support team for assistance with the database resizing procedure.