PGD data tiering Innovation Release
Integrate your Advanced High Availability or Distributed High Availability (PGD) cluster with the Analytics Accelerator lakehouse to manage your data lifecycle by moving Postgres data from expensive transactional storage to cost-effective object storage while maintaining transparent query access.
This architecture implements two distinct storage tiers optimized for different access patterns:
- "Hot" Tier stores current operational data requiring sub-millisecond latency and full transactional capabilities in PostgreSQL storage.
- "Cold" Tier holds historical data in columnar formats optimized for analytical queries, such as Apache Iceberg® or Delta Lake, in object storage.
Why use PGD data tiering
Unify data management and analytics by separating active data from historical data, optimizing both cost and speed.
Cost & storage efficiency
- Reduce your transactional database size by moving historical data to a low-cost data lake.
- Reduce storage costs and improve operational efficiency: smaller databases back up faster, require less storage, and restore more quickly during disaster recovery.
- Track storage statistics and size reduction, quantify cost savings, and monitor the partition lifecycle.
Performance & query access
- Optimize queries through partition pruning (instantly discarding irrelevant partitions) and parallel processing across both storage tiers.
- Run queries on the fastest available engine utilizing standard PostgreSQL processing for recent "hot" data, and the Analytics Accelerator's vectorized engine for compressed, columnar "cold" data in object storage.
- Improve performance on active data by removing historical data, resulting in smaller, more efficient indexes, faster background operations, and improved cache hit rates.
- Merge results transparently, presenting a unified dataset to applications.
Data integrity
- Drive efficient lifecycle management and ensure consistent visibility across the PGD cluster with comprehensive metadata that tracks partition locations, offload timestamps, and lifecycle states.
- Improve cluster reliability and maintain data integrity with automatic retries and rollbacks for failed offload operations.
Use cases
- Regulatory compliance: Maintain lean, high-performance transactional systems by offloading aged data. Ensure regulatory compliance through complete, long-term data retention and maintenance of comprehensive audit trails.
- Time-series analytics (IoT platforms): Optimize data management for sensor and event streams. Ensure operational dashboards access recent, "hot" data while historical analytics utilize aggregated, cost-effective "cold" data.
- Financial services: Cost-efficiently manage massive transaction histories that must be retained for years due to regulatory requirements but are rarely queried.
- Retail & seasonal business cycles: Manage order histories by setting adaptive tiering policies, ensuring order data from peak seasons remains "hot" longer to support customer service and extended return periods.
How to configure PGD data tiering
Depending on your requirements, there are three distinct methods for managing the data lifecycle and offloading data from your PGD cluster to object storage for analytics.
Automated tiered tables
This method enables tiered storage within a table by automatically creating new partitions and offloading old data based on a specified threshold. It leverages PGD AutoPartition to perform automatic analytics offload to object storage.
The PGD Task Manager automatically monitors the age of the table's partitions. When a partition becomes older than the defined offload period, it bulk-copies the entire partition to the object store, and then truncates it locally.
The main benefit of this method is zero-touch data lifecycle management that optimizes storage costs.
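A minimal sketch of what enabling a tiered table could look like, assuming the PGD AutoPartition interface (`bdr.autopartition()`) accepts an offload threshold such as an `analytics_offload_period` argument; the exact parameter names are an assumption, so check your release's reference documentation:

```sql
-- Hypothetical sketch: a range-partitioned table managed by PGD AutoPartition,
-- with partitions older than 90 days offloaded to the "cold" analytics tier and
-- dropped entirely after two years. The analytics_offload_period parameter name
-- is an assumption; verify it against your PGD release's reference.
CREATE TABLE sensor_readings (
    reading_ts timestamptz NOT NULL,
    device_id  bigint      NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (reading_ts);

SELECT bdr.autopartition(
    relation                 := 'sensor_readings',
    partition_increment      := '1 day',
    data_retention_period    := '2 years',
    analytics_offload_period := '90 days'   -- assumed parameter for the offload threshold
);
```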
Analytics replication
Use this method to continuously synchronize an entire table in near real-time with an analytical copy in object storage, regardless of the age of the data.
The system uses continuous logical replication that captures every DML operation on the table and streams it to the external object storage destination.
This method converts your origin heap table to an HTAP type. An HTAP table is a heap table mapped to an analytics table in the destination object storage, to which PGD replicates any changes.
You can also configure replication for partitions of a tiered table that are newer than the offload threshold.
The main benefit is a near real-time analytical replica of an entire transactional table, available for immediate analytics.
Disable replication to switch the table back to heap and only use transactional storage.
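A hedged sketch of what this could look like, assuming the data group exposes an `analytics_storage_location` option through `bdr.alter_node_group_option()` and the table-level switch is a storage option along the lines of `pgd.replicate_to_analytics`; treat both names as assumptions and confirm them in the PGD reference:

```sql
-- Hypothetical sketch: point the data group at an object-storage location, then
-- switch one table to HTAP so every DML change is also streamed to its analytics
-- copy. Option and parameter names below are assumptions, not confirmed API.
SELECT bdr.alter_node_group_option(
    node_group_name := 'dc1_subgroup',
    config_key      := 'analytics_storage_location',
    config_value    := 'my_lakehouse'
);

ALTER TABLE orders SET (pgd.replicate_to_analytics = true);   -- heap -> HTAP (assumed option name)

-- To switch back to a plain heap table and keep only transactional storage:
ALTER TABLE orders RESET (pgd.replicate_to_analytics);
```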
Table offload
This method offers a one-time, surgical way to clear local storage, leaving only the copy in object storage. You can perform this operation only on tables that are already set up for replication (HTAP tables).
The system converts a replicated HTAP table into a PGAA table. A PGAA table is a pure, "cold" analytics table in object storage, available in Delta or Iceberg format. This process truncates the local copy, effectively eliminating local storage overhead.
The main benefit is a rapid release of local disk space for tables that are no longer actively transactional.
Restore the table to HTAP to re-create the local copy and resume replication.
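The calls below are purely illustrative names for the offload and restore operations, not confirmed API; they show the intended round trip for an HTAP table:

```sql
-- Hypothetical sketch: archive an already-replicated (HTAP) table so only the
-- object-storage copy remains, then bring it back when transactional access is
-- needed again. Function names are illustrative assumptions, not confirmed API.
SELECT bdr.offload_table('public.orders_2021');   -- HTAP -> PGAA: local copy is truncated
-- Queries against orders_2021 now read the Delta/Iceberg data in object storage.

SELECT bdr.restore_table('public.orders_2021');   -- PGAA -> HTAP: repopulate local storage, resume replication
```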
| Method | Purpose | Use case | Origin table | Destination table |
|---|---|---|---|---|
| Tiered tables | Automated lifecycle management | High-volume time-series data where historical retention is critical and automated. | Heap | Heap/HTAP ("hot" partitions); PGAA ("cold" partitions) |
| Replication | Continuous synchronization | Replicating a transactional table in near real-time for general analytics. | Heap | HTAP |
| Offload | Selective archiving | One-time, surgical removal and archival of a specific table. | HTAP | PGAA |
Other operations
Trigger compaction
Perform compaction to improve the performance and storage efficiency of your external analytics tables residing in object storage.
Compaction optimizes performance by consolidating the many small Parquet files created during offload and replication into fewer, larger, more compressed files, similar to an OPTIMIZE command.
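For example, after a large offload or a period of heavy replication, you might trigger compaction on a specific table; the function name here is an assumption, so check the Analytics Accelerator reference for the actual entry point:

```sql
-- Hypothetical sketch: request compaction of the small Parquet files behind an
-- offloaded table. pgaa.trigger_compaction() is an assumed name, not confirmed API.
SELECT pgaa.trigger_compaction('public', 'orders_2021');
```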
Metadata management
There are two primary ways to manage the metadata of your object store tables (a sketch follows this list):
- No-catalog offload: The metadata is stored alongside the data files and managed directly by the object storage file system. PGD uses the PGFS extension to connect directly to the object storage.
- Catalog-managed offload: This method decouples the data from the metadata. The metadata is stored and managed by a centralized service, such as Iceberg REST, creating a universal registry that allows Postgres and other tools (like Spark or Trino) to query the same data lake tables. When you offload data, the system writes the Parquet files to object storage but registers the table's metadata in the external catalog. You can also read tables managed by an external catalog by attaching them to your data group.
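A hedged sketch of the two approaches, assuming a PGFS call such as `pgfs.create_storage_location()` for the no-catalog case and PGAA calls along the lines of `pgaa.add_catalog()` / `pgaa.attach_catalog()` for the catalog-managed case; all names, option keys, and credential fields are assumptions to verify against the PGFS and Analytics Accelerator references:

```sql
-- 1) No-catalog offload (hypothetical sketch): register an object-storage location
--    directly through PGFS; table metadata then lives alongside the data files.
SELECT pgfs.create_storage_location(
    name        := 'my_lakehouse',
    url         := 's3://analytics-bucket/pgd-tiering',
    options     := '{}',
    credentials := '{"access_key_id": "...", "secret_access_key": "..."}'
);

-- 2) Catalog-managed offload (hypothetical sketch): register an external Iceberg
--    REST catalog so Postgres, Spark, Trino, and others share one table registry.
SELECT pgaa.add_catalog(
    name    := 'corp_iceberg',
    type    := 'iceberg-rest',
    options := '{"url": "https://catalog.example.com", "warehouse": "analytics"}'
);
SELECT pgaa.attach_catalog('corp_iceberg');   -- make the catalog's tables readable from this data group
```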
Operational considerations
Ensure cost efficiency and high availability by proactively managing your systems.
Management & planning
- Monitor partition growth rates, offload success rates, and query patterns across tiers. Use dashboards to track storage utilization trends and validate cost savings (see the query sketch after this list).
- Perform capacity planning for both the "hot" tier (local ingestion rates) and the "cold" tier (predictable growth in object storage).
- Schedule your workload for offload and restore operations to run during maintenance windows to minimize impact on production.
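For instance, local "hot" tier consumption per partition can be watched with stock PostgreSQL catalog functions, with no PGD-specific views assumed (the table name is a placeholder):

```sql
-- Track local ("hot") storage per leaf partition of a tiered table using standard
-- PostgreSQL functions. Partitions already offloaded and truncated should report
-- little or no local size.
SELECT relid AS partition,
       pg_size_pretty(pg_total_relation_size(relid)) AS local_size
FROM   pg_partition_tree('sensor_readings')
WHERE  isleaf
ORDER  BY relid::text;
```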
Maintenance & resilience
- Routinely maintain efficiency through regular jobs such as partition merging in "cold" storage and orphan file cleanup.
- Account for the distributed nature of the data in your backup strategy, setting different frequencies for the "hot" tier and relying on the durability guarantees of object storage for the "cold" tier.
- Ensure all offload operations are reversible through restore operations, allowing temporary transactional access to historical data when needed.
Related topics
- Configure PGD data tiering ► Learn how to configure your PGD cluster to manage your data lifecycle by moving Postgres data to object storage.