Apache Iceberg
Apache Iceberg is a high-performance, open table format designed for data lakes built on object storage. It provides scalable metadata management, schema evolution, and ACID transactions.
EDB enables Iceberg support within the Analytics Accelerator, allowing Postgres to query and manage large data lake tables efficiently.
For implementation in Hybrid Manager (HM), see Working with Apache Iceberg in EDB HM.
What is Apache Iceberg
Iceberg is a table format designed for data lakes. It defines how data files (Parquet, ORC, Avro) are organized and tracked, adding many database-like capabilities to object storage.
Key goals
- Reliability: Ensures data consistency with concurrent operations
- Performance: Enables fast queries through partition pruning and metadata optimizations
- Scalability: Designed to manage petabyte-scale tables
- Openness: Apache Software Foundation project with broad ecosystem support
Related concepts: Generic concepts - open table formats
Key features and benefits
- Schema evolution: Safely add, drop, or modify columns without rewriting table data
- Hidden partitioning: Partitions data transparently, enabling efficient partition pruning
- Time travel and versioning: Supports querying historical table versions
- ACID transactions: Provides transactional guarantees using optimistic concurrency
- Multiple file format support: Manages Parquet, ORC, and Avro files
- Efficient metadata management: Uses a metadata tree structure for fast query planning and data skipping
- Catalog integration: Supports REST-based catalogs (Project Nessie, AWS Glue), Hive Metastore, and others
Why Apache Iceberg matters for EDB analytics
Iceberg support enables the Analytics Accelerator to:
- Unify operational and analytical data: Query Iceberg tables alongside traditional Postgres data
- Leverage cost-effective object storage: Store large historical datasets efficiently
- Improve performance: Use Iceberg metadata and partition pruning with vectorized engines
- Support tiered storage: PGD can offload data to Iceberg as part of a tiered table strategy
- Enable interoperability: Share data with tools like Spark, Presto, Trino, Flink, and Dremio
Related concepts: Analytics Accelerator concepts - EDB Postgres Lakehouse
How EDB solutions implement Iceberg support
PGAA extensions
The Postgres Analytical Appliance (PGAA) components within Lakehouse nodes and PGD nodes can read and interact with Iceberg metadata and data files.
Catalog connectivity
PGAA connects to Iceberg catalogs (such as Lakekeeper, AWS Glue, or REST catalogs) using:
SELECT pgaa.add_catalog(...); SELECT pgaa.attach_catalog(...);
This enables discovery of tables, reading schemas, and querying data.
Querying Iceberg tables Users can define external tables in Postgres that reference Iceberg tables:
CREATE TABLE my_table USING PGAA WITH ( pgaa.format = 'iceberg', pgaa.managed_by = 'my_catalog', pgaa.catalog_namespace = 'my_db', pgaa.catalog_table = 'my_table' );
Once defined, the table is fully queryable using Postgres SQL.
Writing to Iceberg tables
PGD can offload data to Iceberg format in object storage:
New Iceberg tables can be created
Existing Iceberg tables can be appended to
Offload operations typically follow tiered table strategies
Related concepts: Tiered Tables
Related topics
Could this page be better? Report a problem or suggest an addition!