Performance tuning for Delta Lake queries
This guide provides tips for tuning and optimizing the performance of Delta Lake queries executed from a Hybrid Manager (HM) Lakehouse cluster.
Lakehouse clusters use a vectorized query engine (via PGAA) to efficiently query Delta Lake tables stored in object storage.
For background on Delta Lake integration, see Delta Lake in Hybrid Manager.
Goals
After completing this guide, you will:
✅ Understand common tuning patterns for Delta Lake queries from Lakehouse clusters
✅ Be able to write queries that take advantage of partition pruning and data skipping
✅ Know how to monitor query performance and adjust Lakehouse setup if needed
✅ Unlock faster, more cost-efficient analytics over Delta Lake data from Postgres tools
When to use this
You should apply these practices when:
- You are querying existing Delta Lake tables from an HM Lakehouse cluster
- You want to improve query speed and resource efficiency
- You are preparing Lakehouse clusters for BI tool integration (Tableau, Power BI, etc.)
- You are troubleshooting long-running Delta Lake queries
General best practices
Use partitioning effectively
- Delta Lake supports partitioning; design your tables with partition columns that match the filters your queries use most often.
- Queries that filter on partition columns can prune unnecessary Parquet files for faster scans.
Example:
SELECT order_id, sale_amount FROM public.sales_delta_table WHERE sale_year = 2024 AND sale_month = 5;
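To verify that pruning is actually happening, inspect the plan with standard Postgres EXPLAIN. This is a minimal sketch; the plan nodes and file-level statistics surfaced for PGAA scans vary by Lakehouse cluster version:

-- Run the query and report the actual plan; a pruned scan should read
-- far fewer files and rows than an unfiltered scan of the same table.
EXPLAIN (ANALYZE, VERBOSE)
SELECT order_id, sale_amount
FROM public.sales_delta_table
WHERE sale_year = 2024 AND sale_month = 5;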
Leverage data skipping
- Delta Lake supports data skipping through statistics stored in the Delta transaction log.
- If your Delta Lake table is written by Spark or another tool that collects column statistics, PGAA may use those statistics to skip Parquet files whose value ranges cannot match a query's filters.
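For example, a range predicate on a numeric column lets the engine skip any file whose recorded min/max range cannot contain a match. A minimal sketch, assuming the writer collected statistics for sale_amount:

-- Files whose recorded sale_amount maximum is below 1000 can be
-- skipped entirely, without reading their data pages.
SELECT order_id, sale_amount
FROM public.sales_delta_table
WHERE sale_amount >= 1000;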
Use LIMIT and targeted projections
- For exploratory queries, always use LIMIT and select only the columns you need, to minimize I/O.
Example:
SELECT order_id, sale_amount FROM public.sales_delta_table LIMIT 100;
Filter early and specifically
Push filters into your WHERE clause to enable maximum pruning.
Good:
SELECT * FROM public.sales_delta_table WHERE sale_date >= '2023-01-01' AND product_category = 'Electronics';
Avoid:
SELECT * FROM public.sales_delta_table;
(full table scan)
Validate PGFS location proximity
- Ensure that your Lakehouse cluster is provisioned in a cloud region close to your object storage to minimize data transfer latency.
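For reference, storage locations are registered through the PGFS extension. The call below is illustrative only; check the PGFS configuration guide for the exact function signature and option keys in your HM release:

-- Illustrative sketch: register a bucket that lives in the same cloud
-- region as the Lakehouse cluster (names here are placeholders).
SELECT pgfs.create_storage_location(
    'sales_data',               -- location name referenced by Delta Lake tables
    's3://my-analytics-bucket'  -- bucket in the same region as the cluster
);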
Monitor query performance
- Use HM monitoring dashboards to observe query times and resource usage on Lakehouse clusters.
- Identify and tune queries with long runtimes or large I/O.
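In addition to the dashboards, standard Postgres instrumentation works from inside the cluster. A minimal sketch, assuming the pg_stat_statements extension is enabled (column names as of PostgreSQL 13):

-- Rank statements against the Delta table by total execution time.
SELECT query,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms
FROM pg_stat_statements
WHERE query ILIKE '%sales_delta_table%'
ORDER BY total_exec_time DESC
LIMIT 10;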
Known limitations
- PGAA currently supports read-only access to Delta Lake tables.
- Schema evolution is supported: PGAA reads the current schema from the Delta transaction log.
- Because access is read-only, write operations such as MERGE, DELETE, and UPDATE are not available through PGAA; use Spark or another supported ETL tool for Delta table modifications.
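Write statements against a Delta table therefore fail at the Postgres level. For example (the exact error text depends on your PGAA version):

-- Rejected: PGAA exposes Delta Lake tables as read-only.
DELETE FROM public.sales_delta_table WHERE sale_year = 2023;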
Next steps
Now that you have optimized your Delta Lake queries:
- You can Query Delta Lake tables with better performance.
- You can Configure PGFS for Delta Lake to ensure fast, reliable storage access.
- You can Integrate Lakehouse queries with BI tools to unlock new analytics use cases.
For architecture overview, see Analytics in Hybrid Manager.