Hybrid Manager Architecture

High-level diagram of Hybrid Manager's architecture

The outermost containers of the diagram illustrate a Kubernetes cluster running either on a Cloud Service Provider's infrastructure, such as EKS on AWS or GKE on GCP, or on on-premises infrastructure. "On-premises" in this context includes deployments on physical servers (bare metal), virtual machines, or private clouds hosted within an organization’s infrastructure. Working inward through the diagram, there are three main logical groupings to note:

  1. The Kubernetes control plane: The Kubernetes control plane consists of the core Kubernetes components, such as kube-apiserver, etcd, kube-scheduler, kube-controller-manager, and, on cloud deployments, cloud-controller-manager. It is implemented by two or three Kubernetes control plane nodes.

  2. The Control Compute Grouping: The Control Compute Grouping of HM is a logical grouping of 70+ pods running on two or three Kubernetes worker nodes that implement HM's core management components, such as Grafana, Loki, Thanos, and Prometheus for observability, Cert Manager for certificate management, Istio for networking, a trust manager for securing software releases, and many others. The 70+ pods are distributed across the worker nodes by the kube-scheduler, so the actual placement is not as simple as pictured in the diagram. If needed, more worker nodes can be added to support more database-as-a-service resources.

  3. The Data Compute Grouping: The Data Compute Grouping of HM is a logical grouping of Postgres clusters organized by namespaces and running on a number of pods distributed across at least three Kubernetes worker nodes, again according to where the kube-scheduler places them, which may not be as simple as pictured. The number of worker nodes can be increased beyond three as more databases are added using HM, but this may require either adding the worker nodes manually or configuring auto-scaling alongside your infrastructure to do it automatically. An example of inspecting these nodes and namespaces with kubectl follows this list.
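
As a rough sketch of how these groupings appear in a running cluster, the nodes and the namespaces that organize them can be inspected with kubectl. No HM-specific resources are assumed here; these are standard Kubernetes commands.

```
# List the control plane and worker nodes in the cluster
kubectl get nodes -o wide

# List the namespaces, including those that organize the Postgres clusters
# in the Data Compute Grouping
kubectl get namespaces
```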

Both the Control Compute Grouping's worker nodes and the Data Compute Grouping's worker nodes are backed up nightly, either to snapshots (on-premises using a SAN, or using EBS with EKS) or to an S3-compatible object store using Barman (for the cloud).
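
As an illustrative check rather than an HM-specific procedure, clusters that use CSI snapshot-based backups typically expose them as VolumeSnapshot resources, assuming the snapshot.storage.k8s.io CRDs are installed:

```
# List volume snapshots across all namespaces
# (requires the snapshot.storage.k8s.io CRDs)
kubectl get volumesnapshots --all-namespaces
```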

Scalability

To support scalability, HM integrates with Kubernetes' auto-scaling capabilities. This means that on platforms like AWS, worker nodes in the Data Compute Grouping can be automatically added to the cluster as needed using AWS's managed Auto Scaling groups. However, HM's Kubernetes auto-scaling capability can be disabled to control costs. It's also important to note that enabling auto-scaling for on-premises deployments requires additional infrastructure configuration, since there is no out-of-the-box auto-scaling feature like AWS's Auto Scaling groups for on-premises deployments. The distribution and resource usage of the pods and worker nodes that implement the database clusters can be monitored through the included Grafana dashboards or with Kubernetes kubectl commands, as shown below.
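
For example, a minimal way to check how pods are distributed and what they consume, assuming the metrics-server is installed (required for kubectl top) and using an illustrative namespace name:

```
# Show which worker node each database pod runs on (namespace name is illustrative)
kubectl get pods -n my-database-namespace -o wide

# Show CPU and memory usage per worker node (requires metrics-server)
kubectl top nodes

# Show CPU and memory usage per pod in the namespace
kubectl top pods -n my-database-namespace
```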

Backup architecture

HM provides robust backup options to ensure data integrity and recoverability. By default, each HM worker node in the Data Compute Grouping performs nightly backups, stored using Barman (for the cloud) or using snapshot backups.
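
The exact backup resources depend on the Postgres operator running underneath HM. As a hypothetical example, if that operator exposes Backup and ScheduledBackup custom resources, recent and scheduled backups for a database namespace could be listed with:

```
# Hypothetical check: list scheduled and completed backups in a namespace,
# assuming the underlying operator provides these custom resources
kubectl get scheduledbackups,backups -n my-database-namespace
```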

Snapshot backups are available for faster recovery times, but require a compatible CSI driver in the storage layer (for both cloud-based snapshots and local disk snapshots). Given the constraints of Logical Volume Manager (LVM) setups, snapshots are optimal for local recovery but may face limitations in multi-site or global recovery scenarios due to the lack of shared access across sites.
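
One way to confirm whether the storage layer can support snapshot backups is to check for registered CSI drivers and VolumeSnapshotClass objects; both are standard Kubernetes APIs, though the volume snapshot CRDs must be installed separately on some platforms:

```
# List the CSI drivers registered in the cluster
kubectl get csidrivers

# List the available snapshot classes (requires the volume snapshot CRDs)
kubectl get volumesnapshotclasses
```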

For more robust, multi-availability-zone recovery, Barman Cloud backups can be stored in object stores such as AWS S3, which natively supports multi-zone replication. In contrast, local object stores might require additional configuration to ensure backups are replicated to remote sites. Stretch clusters mitigate these limitations by enabling data replication and redundancy across geographically dispersed locations.
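
As an illustrative check with the AWS CLI, where the bucket name and prefix are assumptions rather than HM defaults, the presence of Barman backup objects and the bucket's replication configuration can be verified from outside the cluster:

```
# List backup objects under an assumed bucket and prefix
aws s3 ls s3://my-hm-backups/ --recursive --summarize

# Show the bucket's replication configuration, if one is defined
aws s3api get-bucket-replication --bucket my-hm-backups
```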

High availability and resilience

HM leverages Kubernetes' proven technologies to ensure high availability through automated failover mechanisms. These mechanisms function within stretch clusters, across availability zones, or between zones in the same region. The architecture supports faster backups and recovery for large datasets via snapshot backups, though multi-site recovery capabilities depend on the underlying storage infrastructure. For customers with advanced storage configurations, such as SANs (Storage Area Networks) spanning multiple racks or data centers, HM can leverage that infrastructure to enhance resilience and ensure rapid failover. However, these configurations may not be common among all users.
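
For example, to see how worker nodes and database pods are spread across availability zones, the standard topology.kubernetes.io/zone node label can be inspected (the namespace name is illustrative):

```
# Show the availability zone label for each node
kubectl get nodes -L topology.kubernetes.io/zone

# Show which node, and therefore which zone, each database pod landed on
kubectl get pods -n my-database-namespace -o wide
```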

