Apache Iceberg

Apache Iceberg is a high-performance, open table format designed for data lakes built on object storage. It provides scalable metadata management, schema evolution, and ACID transactions.

EDB enables Iceberg support within the Analytics Accelerator, allowing Postgres to query and manage large data lake tables efficiently.

For implementation in Hybrid Manager (HM), see Working with Apache Iceberg in EDB HM.

What is Apache Iceberg

Iceberg defines how the data files that make up a table (Parquet, ORC, or Avro) are organized and tracked. This adds database-like capabilities, such as transactions and schema changes, to object storage.

Key goals

Reliability: Keeps data consistent when multiple readers and writers operate concurrently
Performance: Enables fast queries through partition pruning and metadata optimizations
Scalability: Designed to manage petabyte-scale tables
Openness: Apache Software Foundation project with broad ecosystem support

Related concepts: Generic concepts - open table formats

Key features and benefits

Schema evolution: Safely add, drop, or modify columns without rewriting table data
Hidden partitioning: Partitions data transparently, enabling efficient partition pruning
Time travel and versioning: Supports querying historical table versions
ACID transactions: Provides transactional guarantees using optimistic concurrency
Multiple file format support: Manages Parquet, ORC, and Avro files
Efficient metadata management: Uses a metadata tree structure for fast query planning and data skipping
Catalog integration: Works with REST catalogs, Project Nessie, AWS Glue, Hive Metastore, and others
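
To make a few of these features concrete, the following sketch uses Spark SQL with the Iceberg runtime (Spark 3.3 or later for the time travel clauses). It is not PGAA syntax. The catalog `demo`, the namespace `db`, the table `events`, its columns, the timestamp, and the snapshot ID are all hypothetical. The statements are independent illustrations rather than a script to run top to bottom, and the time travel queries only succeed against a table that already has committed snapshots.

```sql
-- Assumption: Spark SQL with the Iceberg runtime; `demo` is a hypothetical Iceberg catalog.

-- Hidden partitioning: partition by day, derived from event_ts.
-- Queries filter on event_ts directly; no separate partition column is needed.
CREATE TABLE demo.db.events (
  event_id BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Schema evolution: a metadata-only change; existing data files are not rewritten.
ALTER TABLE demo.db.events ADD COLUMN region STRING;

-- Time travel by timestamp (must be later than the table's first snapshot).
SELECT count(*) FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Time travel by snapshot ID (placeholder value, not a real snapshot).
SELECT count(*) FROM demo.db.events VERSION AS OF 1234567890123456789;
```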

Why Apache Iceberg matters for EDB analytics

Iceberg support enables the Analytics Accelerator to:

Unify operational and analytical data: Query Iceberg tables alongside traditional Postgres data
Leverage cost-effective object storage: Store large historical datasets efficiently
Improve performance: Use Iceberg metadata and partition pruning with vectorized engines
Support tiered storage: EDB Postgres Distributed (PGD) can offload data to Iceberg as part of a tiered table strategy
Enable interoperability: Share data with tools like Spark, Presto, Trino, Flink, and Dremio

How EDB solutions implement Iceberg support

PGAA extensions

The Analytics Accelerator (PGAA) components within Lakehouse nodes and PGD nodes can read and interact with Iceberg metadata and data files.

Catalog connectivity

PGAA connects to Iceberg catalogs (such as Lakekeeper, AWS Glue, or REST catalogs) using the following functions:

SELECT pgaa.add_catalog(...);
SELECT pgaa.attach_catalog(...);

With a catalog connected, Postgres can discover its tables, read their schemas, and query their data.

Querying Iceberg tables

Users can define external tables in Postgres that reference Iceberg tables:

CREATE TABLE my_table USING PGAA WITH (
  pgaa.format = 'iceberg',
  pgaa.managed_by = 'my_catalog',
  pgaa.catalog_namespace = 'my_db',
  pgaa.catalog_table = 'my_table'
);

Once defined, the table is fully queryable using Postgres SQL.
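
As a minimal sketch, standard Postgres queries can then run against `my_table` from the definition above. The local table `customers` and the column `customer_id` are hypothetical and only illustrate joining Iceberg-backed data with regular Postgres data.

```sql
-- Count rows in the Iceberg-backed table defined above.
SELECT count(*) FROM my_table;

-- Preview a few rows.
SELECT * FROM my_table LIMIT 10;

-- Hypothetical join with a regular Postgres table.
-- Assumes my_table and customers both have a customer_id column.
SELECT c.customer_id, count(*) AS row_count
FROM my_table AS t
JOIN customers AS c ON c.customer_id = t.customer_id
GROUP BY c.customer_id
ORDER BY row_count DESC
LIMIT 10;
```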

Writing to Iceberg tables

PGD can offload data to Iceberg format in object storage:

New Iceberg tables can be created
Existing Iceberg tables can be appended to
Offload operations typically follow tiered table strategies

Related concepts: Tiered Tables
