Performance tuning for Delta Lake queries
This guide provides tips for tuning and optimizing the performance of Delta Lake queries executed from a Hybrid Manager (HM) Lakehouse cluster.
Lakehouse clusters use a vectorized query engine (via PGAA) to efficiently query Delta Lake tables stored in object storage.
For background on Delta Lake integration, see Delta Lake in Hybrid Manager.
Goals
After completing this guide, you will:
✅ Understand common tuning patterns for Delta Lake queries from Lakehouse clusters
✅ Be able to write queries that take advantage of partition pruning and data skipping
✅ Know how to monitor query performance and adjust Lakehouse setup if needed
✅ Unlock faster, more cost-efficient analytics over Delta Lake data from Postgres tools
When to use this
You should apply these practices when:
- You are querying existing Delta Lake tables from an HM Lakehouse cluster
- You want to improve query speed and resource efficiency
- You are preparing Lakehouse clusters for BI tool integration (Tableau, Power BI, etc.)
- You are troubleshooting long-running Delta Lake queries
General best practices
Use partitioning effectively
- Delta Lake supports partitioning; design your tables with partition columns that match the filters your queries use most often.
- Queries that filter on partition columns can prune unnecessary Parquet files for faster scans.
Example:
SELECT order_id, sale_amount FROM public.sales_delta_table WHERE sale_year = 2024 AND sale_month = 5;
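To verify that pruning is actually happening, inspect the plan with standard Postgres EXPLAIN. This is a minimal sketch; the plan nodes and file-level statistics surfaced for PGAA scans vary by Lakehouse cluster version:

-- Run the query and report the actual plan; a pruned scan should read
-- far fewer files and rows than an unfiltered scan of the same table.
EXPLAIN (ANALYZE, VERBOSE)
SELECT order_id, sale_amount
FROM public.sales_delta_table
WHERE sale_year = 2024 AND sale_month = 5;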
Leverage data skipping
- Delta Lake supports data skipping through statistics stored in the Delta transaction log.
- If your Delta Lake table is written by Spark or another tool that collects column statistics, PGAA may use those statistics to skip Parquet files whose value ranges cannot match a query's filters.
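For example, a range predicate on a numeric column lets the engine skip any file whose recorded min/max range cannot contain a match. A minimal sketch, assuming the writer collected statistics for sale_amount:

-- Files whose recorded sale_amount maximum is below 1000 can be
-- skipped entirely, without reading their data pages.
SELECT order_id, sale_amount
FROM public.sales_delta_table
WHERE sale_amount >= 1000;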
Use LIMIT and targeted projections
- For exploratory queries, always use LIMIT and select only the columns you need, to minimize I/O.
Example:
SELECT order_id, sale_amount FROM public.sales_delta_table LIMIT 100;
Filter early and specifically
Push filters into your WHERE clause to enable maximum pruning.
Good:
SELECT * FROM public.sales_delta_table WHERE sale_date >= '2023-01-01' AND product_category = 'Electronics';
Avoid:
SELECT * FROM public.sales_delta_table;
(full table scan)
Validate PGFS location proximity
- Ensure that your Lakehouse cluster is provisioned in a cloud region close to your object storage to minimize data transfer latency.
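For reference, storage locations are registered through the PGFS extension. The call below is illustrative only; check the PGFS configuration guide for the exact function signature and option keys in your HM release:

-- Illustrative sketch: register a bucket that lives in the same cloud
-- region as the Lakehouse cluster (names here are placeholders).
SELECT pgfs.create_storage_location(
    'sales_data',               -- location name referenced by Delta Lake tables
    's3://my-analytics-bucket'  -- bucket in the same region as the cluster
);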
Monitor query performance
- Use HM monitoring dashboards to observe query times and resource usage on Lakehouse clusters.
- Identify and tune queries with long runtimes or large I/O.
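In addition to the dashboards, standard Postgres instrumentation works from inside the cluster. A minimal sketch, assuming the pg_stat_statements extension is enabled (column names as of PostgreSQL 13):

-- Rank statements against the Delta table by total execution time.
SELECT query,
       calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms
FROM pg_stat_statements
WHERE query ILIKE '%sales_delta_table%'
ORDER BY total_exec_time DESC
LIMIT 10;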
Known limitations
- PGAA currently supports read-only access to Delta Lake tables.
- Schema evolution is supported: PGAA reads the current schema from the Delta transaction log.
- Because access is read-only, write operations such as MERGE, DELETE, and UPDATE are not available through PGAA; use Spark or another supported ETL tool for Delta table modifications.
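Write statements against a Delta table therefore fail at the Postgres level. For example (the exact error text depends on your PGAA version):

-- Rejected: PGAA exposes Delta Lake tables as read-only.
DELETE FROM public.sales_delta_table WHERE sale_year = 2023;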
Next steps
Now that you have optimized your Delta Lake queries:
- You can Query Delta Lake tables with better performance.
- You can Configure PGFS for Delta Lake to ensure fast, reliable storage access.
- You can Integrate Lakehouse queries with BI tools to unlock new analytics use cases.
For architecture overview, see Analytics in Hybrid Manager.