Analytics Accelerator quickstart guide v1.6
In this quickstart guide, you will:
- Install the PGAA extension on a standalone Community Postgres instance.
- Create a storage location pointing to sample benchmark datasets in object storage.
- Run a complex analytical query and verify that the Seafowl engine is working.
- Analyze the query plan.
Prerequisites
- An environment running a Postgres instance. See Compatibility for a full list of supported platforms.
- An EDB access token.
Downloading and installing the package on your Postgres instance
Download and install the package from the EDB repository.

On Debian or Ubuntu:

```shell
export EDB_SUBSCRIPTION_TOKEN=<your-token>
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/standard/setup.deb.sh" | sudo -E bash
sudo apt-get install -y postgresql-<version>-pgaa
```

On RHEL and derivatives:

```shell
export EDB_SUBSCRIPTION_TOKEN=<your-token>
curl -1sSLf "https://downloads.enterprisedb.com/$EDB_SUBSCRIPTION_TOKEN/standard/setup.rpm.sh" | sudo -E bash
sudo dnf install -y postgresql-<version>-pgaa
```

Where `<version>` is your Postgres major version.
Edit your postgresql.conf file to add pgaa to your shared_preload_libraries, and enable automatic management and startup of the Seafowl query engine via the Postgres background worker:
```
shared_preload_libraries = 'pgaa'
pgaa.autostart_seafowl = on
```
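If you prefer not to edit postgresql.conf by hand, the same change can be sketched with ALTER SYSTEM, which writes to postgresql.auto.conf. This assumes you run it as a superuser; a restart is still required either way:

```sql
-- Written to postgresql.auto.conf; takes effect only after a restart.
ALTER SYSTEM SET shared_preload_libraries = 'pgaa';
ALTER SYSTEM SET pgaa.autostart_seafowl = 'on';
```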
After saving the changes, restart your Postgres service. Once restarted, log in to your Postgres instance and create the PGAA extension:
```sql
CREATE EXTENSION IF NOT EXISTS pgaa CASCADE;
```
Using CASCADE ensures that any required dependencies (like PGFS) are also created. Use \dx to verify that the extensions have been installed and check their versions.
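As an alternative to \dx, you can check from SQL by querying the pg_extension catalog directly (the exact versions reported will depend on the packages you installed):

```sql
-- Confirm both extensions are present and note their versions.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pgaa', 'pgfs');
```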
Configuring a storage location
Create a storage location pointing to our public S3 bucket containing TPC-H benchmark data:
```sql
SELECT pgfs.create_storage_location(
    'quickstart-sample-data',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"skip_signature": "true", "region": "us-east-1"}'
);
```
Verify that the storage location was created:
```sql
SELECT pgfs.list_storage_locations();
```
Creating analytical tables
Create the following analytical tables. These tables map directly to Delta Lake files in our sample benchmark datasets.
```sql
CREATE TABLE supplier () USING PGAA WITH
    (pgaa.storage_location = 'quickstart-sample-data', pgaa.path = 'tpch_sf_1/supplier', pgaa.format = 'delta');

CREATE TABLE lineitem () USING PGAA WITH
    (pgaa.storage_location = 'quickstart-sample-data', pgaa.path = 'tpch_sf_1/lineitem', pgaa.format = 'delta');

CREATE TABLE orders () USING PGAA WITH
    (pgaa.storage_location = 'quickstart-sample-data', pgaa.path = 'tpch_sf_1/orders', pgaa.format = 'delta');

CREATE TABLE nation () USING PGAA WITH
    (pgaa.storage_location = 'quickstart-sample-data', pgaa.path = 'tpch_sf_1/nation', pgaa.format = 'delta');
```
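As a quick sanity check that the mappings resolve, you can scan the smallest table. In TPC-H the nation table always contains 25 rows, so the count is cheap and its expected value is known:

```sql
-- Should return 25 if the Delta files are reachable.
SELECT COUNT(*) FROM nation;
```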
Running an analytical query
This query identifies suppliers in Saudi Arabia who caused delays in multi-supplier orders. It is a heavy analytical task involving large joins and correlated subqueries.
```sql
SELECT s_name, COUNT(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey
  )
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate
  )
  AND s_nationkey = n_nationkey
  AND n_name = 'SAUDI ARABIA'
GROUP BY s_name
ORDER BY numwait DESC, s_name
LIMIT 100;
```
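If you want to measure elapsed time while experimenting, you can turn on psql's built-in timing before running the query; actual numbers will vary with your network path to the S3 bucket:

```
-- psql meta-command: report elapsed time for each statement
\timing on
```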
Analyzing the query plan
To see how PGAA accelerates this query, prepend EXPLAIN to the statement above and inspect the output:
```
                                   QUERY PLAN
---------------------------------------------------------------------------------
 SeafowlDirectScan: Logical Plan
 Limit: skip=0, fetch=100
   Sort: numwait DESC NULLS FIRST, supplier.s_name ASC NULLS LAST
     Projection: supplier.s_name, count(Int64(1)) AS count(*) AS numwait
       Aggregate: groupBy=[[supplier.s_name]], aggr=[[count(Int64(1))]]
         Filter: supplier.s_suppkey = l1.l_suppkey AND orders.o_orderkey = l1.l_orderkey AND orders.o_orderstatus = Utf8("F") AND l1.l_receiptdate > l1.l_commitdate AND EXISTS (<subquery>) AND NOT EXISTS (<subquery>) AND supplier.s_nationkey = nation.n_nationkey AND nation.n_name = Utf8("SAUDI ARABIA")
           Subquery:
             Projection: l2.l_orderkey, l2.l_partkey, l2.l_suppkey, l2.l_linenumber, l2.l_quantity, l2.l_extendedprice, l2.l_discount, l2.l_tax, l2.l_returnflag, l2.l_linestatus, l2.l_shipdate, l2.l_commitdate, l2.l_receiptdate, l2.l_shipinstruct, l2.l_shipmode, l2.l_comment
               Filter: l2.l_orderkey = outer_ref(l1.l_orderkey) AND l2.l_suppkey != outer_ref(l1.l_suppkey)
                 SubqueryAlias: l2
                   TableScan: lineitem
           Subquery:
             Projection: l3.l_orderkey, l3.l_partkey, l3.l_suppkey, l3.l_linenumber, l3.l_quantity, l3.l_extendedprice, l3.l_discount, l3.l_tax, l3.l_returnflag, l3.l_linestatus, l3.l_shipdate, l3.l_commitdate, l3.l_receiptdate, l3.l_shipinstruct, l3.l_shipmode, l3.l_comment
               Filter: l3.l_orderkey = outer_ref(l1.l_orderkey) AND l3.l_suppkey != outer_ref(l1.l_suppkey) AND l3.l_receiptdate > l3.l_commitdate
                 SubqueryAlias: l3
                   TableScan: lineitem
           Cross Join:
             Cross Join:
               Cross Join:
                 TableScan: supplier
                 SubqueryAlias: l1
                   TableScan: lineitem
               TableScan: orders
             TableScan: nation
(24 rows)
```
The presence of SeafowlDirectScan at the top of the plan indicates that PGAA has taken full control of the query execution, bypassing the standard Postgres executor to run the query directly against the data lake.
Instead of pulling billions of rows into Postgres to process them, the entire logical plan—including joins, filters, and aggregations—is pushed down into the optimized analytical engine.
- Subquery pushdown: Notice the `EXISTS` and `NOT EXISTS` subqueries on the `lineitem` table. These are not being executed one-by-one by Postgres. Instead, they are part of the global logical plan being processed in parallel against the Parquet files.
- Vectorized joins: The `Cross Join` nodes combined with the `Filter` higher up indicate that the engine is performing hash joins or nested loop joins at the storage layer.
- Predicate pushdown: Filters like `n_name = Utf8("SAUDI ARABIA")` and `o_orderstatus = Utf8("F")` are applied during the initial scan. This means only the relevant data is read from the lake, significantly reducing I/O.
- Final aggregation and sort: The `Aggregate` and the `Sort` happen at the end of the pipeline, before the final 100 rows are handed back to the Postgres client.
Conclusion
You have successfully installed PGAA, configured it to reach cloud-based object storage, and verified that complex SQL logic—including subqueries and joins—is being natively offloaded.
By leveraging Seafowl DirectScan, your Postgres instance can now act as a gateway to petabyte-scale data lakes, providing analytical performance that scales independently of your transactional compute.