Troubleshooting in Grafana

The no healthy upstream warning in Grafana dashboards indicates that a service can't connect to its intended backend. The issue can be transient or long-lasting, and it affects the usability of your monitoring and debugging tools. Understanding the common causes helps you diagnose and resolve these issues.

Potential causes

Pod eviction or movement

Dynamic node provisioning and pod scheduling tools like Karpenter can cause pods to be moved between nodes. During this process, a pod might temporarily become unavailable, leading to no healthy upstream errors until it's rescheduled and becomes ready on a new node.

Resource overscheduling and out-of-memory (OOM) errors

Services that handle complex or large queries, such as Loki, can consume significant memory. These tools generally don't constrain queries: rather than identifying and flagging queries that are unlikely to succeed with the available memory allocation, they attempt to execute a query that might return terabytes of logs or gigabytes of metrics and run out of memory in the process.

If cluster nodes are oversubscribed, or if pods aren't configured with appropriate resource limits, Kubernetes can kill those pods due to memory pressure on the host. Even if a pod recovers quickly, the brief unavailability during the OOM event can trigger no healthy upstream warnings.

Transient network issues or connection termination

Intermittent network problems, connection resets, or abrupt connection terminations before headers are received can also appear as no healthy upstream.

Service unavailability or misconfiguration

While less common for transient issues, a no healthy upstream message can also indicate that the backend service is unhealthy, not running, or misconfigured, preventing Istio from routing traffic to it successfully.
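
To rule out this case, you can check whether the backend Service still has ready endpoints; if it has none, Istio has nothing healthy to route to. The following is a minimal sketch using the official Kubernetes Python client; the namespace (monitoring) and Service name (loki-gateway) are placeholders for your own deployment.

```python
# Minimal sketch: check whether a backend Service has ready endpoints.
# Requires the official Kubernetes Python client (pip install kubernetes).
# The namespace and Service name below are placeholders; adjust them to
# match your own deployment.
from kubernetes import client, config

NAMESPACE = "monitoring"       # placeholder namespace
SERVICE_NAME = "loki-gateway"  # placeholder Service backing the dashboard

def main():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    endpoints = v1.read_namespaced_endpoints(SERVICE_NAME, NAMESPACE)
    ready, not_ready = 0, 0
    for subset in endpoints.subsets or []:
        ready += len(subset.addresses or [])
        not_ready += len(subset.not_ready_addresses or [])

    print(f"{SERVICE_NAME}: {ready} ready, {not_ready} not ready")
    if ready == 0:
        print("No ready endpoints: Istio has no healthy upstream to route to.")

if __name__ == "__main__":
    main()
```

If the Service reports zero ready endpoints, check the backing Deployment or StatefulSet and its readiness probes before investigating the network path.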

Troubleshooting steps

  1. Optimize queries: Address the most common and direct cause of OOM errors for these backends. Because Loki, Thanos, and Prometheus don't guardrail queries, the first thing to check when these backends start failing is whether overly broad or complex queries are causing them to run out of memory.

    Try narrowing your queries by reducing their time window, selecting a smaller set of logs or metrics, or using more specific filtering (stream selectors, line matching, JSON queries, avoiding regular expressions), as shown in the query sketch after these steps. See Query best practices for more information.

  2. Monitor resource utilization: Review CPU and memory utilization for the affected monitoring pods (Loki, Thanos, Prometheus) and their underlying nodes. This helps you identify whether resource overscheduling or a lack of capacity is the root cause.

    For example, check the resource usage of the Loki workloads and verify the configured resource commitments, such as the pods' resource requests and limits; the second sketch after these steps shows one way to inspect them.

    If resource utilization looks high and OOM errors persist, consider adjusting the pods' resource requests and limits or the nodes' capacity. If the node has enough memory but a pod's limit is too low, increase the pod's memory limit. If the pods together request more memory than the node can provide, increase the node's memory capacity.

  3. Check Kubernetes events and logs: Review Kubernetes events for pod evictions, scheduling failures, or node changes, and check pod logs for health problems and specific OOM events. Together they help you understand which pods and processes are failing; the last sketch after these steps lists the relevant events.

  4. Verify network health: Verify the health of the underlying network infrastructure. This is a more general check and is less likely than OOM issues to be the primary cause of no healthy upstream errors.
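
For step 1, the sketch below issues a deliberately narrow query against Loki's query_range HTTP API: a short time window, a specific stream selector, a plain line filter instead of a broad regular expression, and a result limit. The gateway URL, labels, and filter string are assumptions; substitute values from your own environment.

```python
# Sketch: issue a narrow LogQL query against Loki's query_range HTTP API.
# The gateway URL, labels, and filter below are placeholders for your setup.
import time
import requests

LOKI_URL = "http://loki-gateway.monitoring.svc:3100"  # placeholder address

def query_recent_errors(minutes: int = 15) -> dict:
    end_ns = int(time.time() * 1e9)            # Loki expects nanosecond epochs
    start_ns = end_ns - minutes * 60 * 10**9   # short window keeps the query cheap

    params = {
        # Specific stream selector plus a plain line filter, no broad regex.
        "query": '{namespace="checkout", app="payments"} |= "error"',
        "start": start_ns,
        "end": end_ns,
        "limit": 500,            # cap the number of returned lines
        "direction": "backward",
    }
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    result = query_recent_errors()
    streams = result["data"]["result"]
    print(f"Matched {len(streams)} streams in the last 15 minutes")
```

Widening the time window or dropping labels from the selector multiplies the amount of data Loki has to scan, which is what typically pushes the query path toward OOM errors.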
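
For step 2, this sketch lists the monitoring pods, prints each container's configured memory requests and limits, and flags containers whose last termination reason was OOMKilled. It uses the official Kubernetes Python client; the namespace and label selector are placeholders for illustration.

```python
# Sketch: compare configured memory requests/limits against OOMKilled history
# for monitoring pods. Namespace and label selector are placeholders.
from kubernetes import client, config

NAMESPACE = "monitoring"                        # placeholder namespace
LABEL_SELECTOR = "app.kubernetes.io/name=loki"  # placeholder selector

def main():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)

    for pod in pods.items:
        for container in pod.spec.containers:
            resources = container.resources or client.V1ResourceRequirements()
            mem_request = (resources.requests or {}).get("memory", "unset")
            mem_limit = (resources.limits or {}).get("memory", "unset")
            print(f"{pod.metadata.name}/{container.name}: "
                  f"request={mem_request} limit={mem_limit}")

        # Flag containers that were recently OOM-killed.
        for status in pod.status.container_statuses or []:
            terminated = status.last_state.terminated if status.last_state else None
            if terminated and terminated.reason == "OOMKilled":
                print(f"  {status.name} was OOMKilled "
                      f"(restarts={status.restart_count}, at {terminated.finished_at})")

if __name__ == "__main__":
    main()
```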
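
For step 3, this sketch lists recent Kubernetes events in the monitoring namespace and filters for reasons that commonly accompany no healthy upstream symptoms, such as evictions, OOM kills, and scheduling failures. The namespace and the reason set are assumptions to adjust for your cluster.

```python
# Sketch: surface Kubernetes events that commonly precede "no healthy upstream"
# symptoms (evictions, OOM kills, failed scheduling). The namespace and reason
# set are placeholders to adjust for your cluster.
from kubernetes import client, config

NAMESPACE = "monitoring"  # placeholder namespace
INTERESTING_REASONS = {"Evicted", "OOMKilling", "FailedScheduling", "Killing", "BackOff"}

def main():
    config.load_kube_config()
    v1 = client.CoreV1Api()

    events = v1.list_namespaced_event(NAMESPACE)
    for event in sorted(events.items,
                        key=lambda e: e.last_timestamp or e.metadata.creation_timestamp):
        if event.reason in INTERESTING_REASONS:
            obj = event.involved_object
            print(f"{event.last_timestamp} {event.type:8} {event.reason:18} "
                  f"{obj.kind}/{obj.name}: {event.message}")

if __name__ == "__main__":
    main()
```

Note that node-level events such as OOMKilling are often recorded against the node rather than in the pod's namespace, so you might also list events cluster-wide (for example with list_event_for_all_namespaces) or check the node's kubelet logs.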

