Analytics Accelerator generic concepts
This hub explains key industry concepts, architectures, and technologies that form the foundation of modern data analytics platforms.
Understanding these concepts helps you see how EDB’s Analytics Accelerator builds on proven patterns. To learn how EDB implements these ideas, visit the Analytics Accelerator concepts hub.
Architectures
Data warehouses
What: Centralized repositories optimized for structured, SQL-based analytics. They use schema-on-write and are designed for fast analytical queries.
Why: Business intelligence teams need trusted, consistent data for reporting and historical analysis.
How: Data is loaded through ETL processes into relational tables with strong typing and constraints.
Why use it: When you need high performance for known, consistent reporting needs and strong governance over data quality.
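As a minimal sketch of schema-on-write, the snippet below uses DuckDB as a stand-in for a warehouse: column types and constraints are declared before any data is loaded, so the transform step of ETL must produce rows that satisfy them. Table and column names are illustrative.

```python
import duckdb

con = duckdb.connect()  # in-memory database as a warehouse stand-in

# Schema-on-write: types and constraints are fixed before loading.
con.sql("""
    CREATE TABLE sales (
        order_id INTEGER NOT NULL,
        region   VARCHAR NOT NULL,
        amount   DECIMAL(10, 2) CHECK (amount >= 0)
    )
""")

# A typical ETL load step: only rows that satisfy the schema get in.
con.sql("INSERT INTO sales VALUES (1, 'EMEA', 120.50), (2, 'APAC', 80.00)")

# Fast, trusted reporting query over strongly typed data.
print(con.sql("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```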
Learn how this influences Analytics Accelerator concepts.
Data lakes
What: Flexible repositories for raw data (structured, semi-structured, unstructured). Schema-on-read.
Why: Data lakes let you store all your data at low cost, in its native format.
How: Data is written to object storage (e.g., S3, GCS) as files, often in Parquet or ORC format.
Why use it: When you need to capture all available data — even before knowing the full set of use cases — and run flexible processing later.
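A minimal sketch of landing raw data in a lake with PyArrow. The local file stands in for an object-storage destination such as `s3://my-bucket/events/` (bucket name hypothetical).

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Raw events captured as-is, including semi-structured payloads.
events = pa.table({
    "event_id": [1, 2, 3],
    "payload": ['{"a": 1}', '{"b": 2}', '{"c": 3}'],  # JSON kept in native form
})

# Land the data as a Parquet file; in a real lake the destination would be
# object storage (e.g., an S3 path plus a filesystem argument).
pq.write_table(events, "events.parquet")

# Schema-on-read: the schema is discovered when the file is read back.
print(pq.read_table("events.parquet").schema)
```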
Learn how this influences Analytics Accelerator concepts.
Data lakehouse
What: A modern architecture that combines the flexibility of data lakes with the performance and reliability of data warehouses.
Why: Organizations want one system for both data science and BI, without moving large amounts of data.
How: Data is stored in the lake (object storage) using open table formats that add structure, transactions, and metadata.
Why use it: When you want scalable, low-cost storage but also need ACID guarantees and high-performance queries.
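A minimal sketch of the pattern using the `deltalake` package (Python bindings for delta-rs): plain Parquet files on storage, with a transaction log layered on top to provide atomic commits and versioned reads. Paths are illustrative.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Writes Parquet data files plus a transaction log; each write is an atomic commit.
write_deltalake("/tmp/lakehouse/demo", df)

table = DeltaTable("/tmp/lakehouse/demo")
print(table.version())    # 0 after the first commit
print(table.to_pandas())  # warehouse-style reads over lake storage
```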
Further reading:
- Lakehouse architecture paper (Armbrust et al.)
- Apache Iceberg
- Delta Lake
- Apache Hudi
Learn how this influences Analytics Accelerator concepts.
Key technologies
Columnar storage formats
What: Data formats that store columns together rather than rows.
Why: Most analytical queries only access a subset of columns. Reading columns directly improves performance and reduces I/O.
How: Popular formats are Apache Parquet and Apache ORC, which support compression and column pruning.
Why use it: When you run large, read-heavy analytical queries and want maximum scan speed.
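A small sketch of column pruning with PyArrow: only the requested columns are read from the Parquet file, so I/O scales with the columns you touch rather than the full row width. File and column names are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A wide table; analytical queries rarely need every column.
wide = pa.table({
    "user_id": [1, 2, 3],
    "country": ["DE", "FR", "US"],
    "bio": ["...", "...", "..."],  # large column most queries skip
})
pq.write_table(wide, "users.parquet")

# Column pruning: only 'country' is read from disk.
countries = pq.read_table("users.parquet", columns=["country"])
print(countries)
```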
See how this is used in Analytics Accelerator concepts.
Vectorized query engines
What: Query engines that process data in columnar batches (vectors) instead of row by row.
Why: Columnar batches match CPU cache behavior and SIMD instructions, making analytics much faster.
How: Engines like Apache DataFusion and Velox implement vectorized execution.
Why use it: When your analytical workload is CPU-bound and performance matters.
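The toy comparison below illustrates the principle rather than any particular engine's internals: summing a column value by value pays a per-row interpreter overhead, while the batched (vectorized) NumPy call runs a tight, cache- and SIMD-friendly loop over the whole column.

```python
import time
import numpy as np

values = np.random.rand(5_000_000)

# Row-at-a-time: one interpreter round-trip per value.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v
row_time = time.perf_counter() - start

# Vectorized: the whole column is processed as one batch.
start = time.perf_counter()
total = values.sum()
batch_time = time.perf_counter() - start

print(f"row-at-a-time: {row_time:.3f}s, vectorized: {batch_time:.5f}s")
```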
See how this is used in Analytics Accelerator concepts.
Separation of storage and compute
What: Architectures where compute (query engines) and storage (object storage) are decoupled.
Why: This allows you to scale them independently, optimizing cost and performance.
How: Systems like the Analytics Accelerator use object storage (e.g., S3) for data and stateless compute nodes for queries.
Why use it: When your compute needs are bursty or unpredictable, and you want to avoid over-provisioning.
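A sketch of the decoupled model with PyArrow: the data lives in object storage (the bucket below is hypothetical), and this process is just one interchangeable compute node scanning it. More nodes can be added, or all of them shut down, without touching the data.

```python
import pyarrow.dataset as ds

# Storage: Parquet files in object storage (hypothetical bucket and prefix).
# Compute: this process, scaled out or torn down independently of the data.
lake = ds.dataset("s3://example-analytics-bucket/events/", format="parquet")

# Any number of compute processes can scan the same files without copying them.
print(lake.head(10))
```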
See how this is used in Analytics Accelerator concepts.
Open table formats
What: Formats that bring database features (ACID transactions, schema evolution) to data lakes.
Why: Without an open table format, lake data is just a set of loose files; a table format supplies the structure and governance that production workloads need.
How:
- Apache Iceberg uses metadata layers and manifest files
- Delta Lake uses a transaction log (delta-rs is a Rust-native implementation; see the sketch below)
- Apache Hudi focuses on stream-first updates
Why use it: When you want strong data management for lake-based data.
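To make the transaction-log idea concrete, the sketch below (again using the `deltalake` package, with an illustrative path) writes two commits to a Delta table and then lists the JSON entries the log records for them.

```python
import os
import pandas as pd
from deltalake import write_deltalake

path = "/tmp/otf-demo"
write_deltalake(path, pd.DataFrame({"id": [1]}))                 # commit 0
write_deltalake(path, pd.DataFrame({"id": [2]}), mode="append")  # commit 1

# Each atomic commit is recorded as a JSON entry in the transaction log.
print(sorted(os.listdir(os.path.join(path, "_delta_log"))))
# e.g. ['00000000000000000000.json', '00000000000000000001.json']
```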
See how this is used in Analytics Accelerator concepts.
Core concepts
OLAP / OLTP
What:
- OLAP: Analytical queries (aggregations, trends) on large datasets
- OLTP: High-volume, small transactions (updates, inserts)
Why: Different workloads have different performance profiles.
How: OLAP is optimized with columnar storage and parallel execution. OLTP uses row-based storage and transactional guarantees.
Why use it: To choose the right architecture for each type of workload.
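A compact illustration of the two query shapes, using DuckDB purely as a sandbox (a production OLTP workload would normally run on a row-oriented system such as Postgres). Table and column names are illustrative.

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE orders (id INTEGER, region VARCHAR, amount DOUBLE)")
con.sql("INSERT INTO orders VALUES (1, 'EMEA', 10.0), (2, 'APAC', 20.0)")

# OLTP shape: touch one row, by key, transactionally.
con.sql("UPDATE orders SET amount = 12.5 WHERE id = 1")

# OLAP shape: scan and aggregate many rows across a few columns.
print(con.sql("SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"))
```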
See how this is used in Analytics Accelerator concepts.
ETL / ELT
What:
- ETL: Extract, Transform, Load
- ELT: Extract, Load, Transform (in the lake or warehouse)
Why: ELT takes advantage of modern lakehouses’ ability to process raw data at scale.
How: Data is staged in object storage, then transformed using query engines or pipelines.
Why use it: To simplify data ingestion and make pipelines more flexible.
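A minimal ELT sketch with DuckDB: the raw file is loaded as-is (extract and load), and the transformation happens afterwards, inside the engine. File and table names are illustrative.

```python
import duckdb

# Extract + Load: land the raw file without reshaping it first.
with open("raw_orders.csv", "w") as f:
    f.write("id,region,amount\n1,EMEA,10.5\n2,APAC,20.0\n")

con = duckdb.connect()
con.sql("CREATE TABLE raw_orders AS SELECT * FROM read_csv_auto('raw_orders.csv')")

# Transform: shape the data with SQL after it has been loaded.
con.sql("""
    CREATE TABLE orders_by_region AS
    SELECT region, SUM(amount) AS total
    FROM raw_orders
    GROUP BY region
""")
print(con.sql("SELECT * FROM orders_by_region"))
```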
See how this is used in Analytics Accelerator concepts.
Data tiering
What: Classifying data as hot, warm, or cold based on access patterns.
Why: Not all data needs fast storage — moving cold data to cheaper storage saves cost.
How: Data is physically separated across storage types (fast SSDs, object storage).
Why use it: To optimize storage cost vs. performance.
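A deliberately simple tiering sketch using only the standard library: files untouched for 90 days count as cold and are moved to a cheaper tier (here just another directory; in practice, object storage). The threshold and directory names are illustrative.

```python
import os
import shutil
import time

HOT_DIR = "hot"               # fast local storage
COLD_DIR = "cold"             # stand-in for cheap object storage
COLD_AFTER = 90 * 24 * 3600   # 90 days, in seconds

os.makedirs(HOT_DIR, exist_ok=True)
os.makedirs(COLD_DIR, exist_ok=True)
now = time.time()

for name in os.listdir(HOT_DIR):
    path = os.path.join(HOT_DIR, name)
    # Classify by last modification time and demote cold files.
    if now - os.path.getmtime(path) > COLD_AFTER:
        shutil.move(path, os.path.join(COLD_DIR, name))
```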
See how this is used in Analytics Accelerator concepts.
Semantic search
What: Search that understands meaning, not just keywords.
Why: Users expect smarter search experiences.
How: Data is represented as vector embeddings and compared using similarity search.
Why use it: To support search applications in AI and modern apps.
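A toy sketch of the similarity-search step with NumPy: documents and the query are represented as hand-made embedding vectors, and the closest document by cosine similarity wins. A real system would get these vectors from an embedding model.

```python
import numpy as np

# Toy embeddings; in practice an embedding model produces these.
docs = {
    "cheap flights to Rome": np.array([0.9, 0.1, 0.0]),
    "pasta recipes":         np.array([0.1, 0.9, 0.2]),
}
query = np.array([0.8, 0.2, 0.1])  # e.g. "low-cost airfare to Italy"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by semantic closeness, not keyword overlap.
for text, vec in sorted(docs.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {text}")
```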
See how this is used in Analytics Accelerator concepts.
Vector embeddings
What: Numerical representations of data (text, images, audio) in high-dimensional space.
Why: Enable similarity comparisons that go beyond exact matches.
How: ML models generate the embeddings; libraries and services such as FAISS and Pinecone index and search them.
Why use it: For AI/ML search and retrieval tasks.
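A minimal FAISS sketch, with random vectors standing in for model-generated embeddings: index a set of vectors, then retrieve the nearest neighbors of a query vector.

```python
import numpy as np
import faiss

dim = 64
rng = np.random.default_rng(0)

# Random stand-ins for embeddings produced by an ML model.
corpus = rng.random((1000, dim), dtype=np.float32)
query = rng.random((1, dim), dtype=np.float32)

index = faiss.IndexFlatL2(dim)  # exact L2 nearest-neighbor index
index.add(corpus)

distances, ids = index.search(query, 5)
print(ids[0])  # positions of the 5 most similar vectors
```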
See how this is used in Analytics Accelerator concepts.
Retrieval augmented generation (RAG)
What: AI pattern where an LLM retrieves documents from an external source before answering.
Why: Makes LLM responses more accurate and grounded in your data.
How: Vector search retrieves relevant embeddings; LLM combines them with the prompt.
Why use it: To improve factual accuracy in AI-driven apps.
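A schematic RAG loop. Both `embed()` and `generate()` below are hypothetical placeholders for a real embedding model and LLM call; only the retrieve-augment-generate structure is the point.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding model; returns a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.random(64)

def generate(prompt: str) -> str:
    """Hypothetical LLM call; a real system would invoke a model API here."""
    return f"[answer grounded in a prompt of {len(prompt)} chars]"

documents = ["Q3 revenue grew 12% year over year.",
             "The cafeteria menu changes on Mondays."]
doc_vectors = [embed(d) for d in documents]

question = "How did revenue develop in Q3?"
q = embed(question)

# Retrieve: pick the most similar document by cosine similarity.
scores = [float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q))) for v in doc_vectors]
context = documents[int(np.argmax(scores))]

# Augment + generate: ground the LLM in the retrieved context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(generate(prompt))
```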
See how this is used in Analytics Accelerator concepts.
Additional concepts
Multi-engine interoperability
What: Architecture where multiple query engines can access shared data in the lake.
Why: Organizations want flexibility — different tools for different use cases (SQL engines, data science notebooks, streaming engines).
How: Open table formats (Iceberg, Delta Lake) enable multi-engine access (a local sketch follows this list):
- Spark writes data → Trino, Postgres Lakehouse, and other engines can query it.
- PGD offloads data → Lakehouse clusters and external engines can analyze it.
When to use: When you want to build flexible data architectures without vendor lock-in.
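A small local illustration of the pattern: one engine (PyArrow) writes the shared data, and a second engine (DuckDB) queries the same files in place. With an open table format, the same idea extends to Spark, Trino, and Postgres-based engines.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Engine 1 (PyArrow) writes shared data to the lake location.
pq.write_table(pa.table({"id": [1, 2], "v": [10, 20]}), "shared.parquet")

# Engine 2 (DuckDB) queries the very same file, with no copy or export step.
print(duckdb.sql("SELECT SUM(v) FROM 'shared.parquet'"))
```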
Data freshness and latency
What: Ability to query data as soon as it lands in the lake, with minimal delay.
Why: Business users increasingly expect near-real-time access to new data.
How: Lakehouse architectures support this (a sketch follows this list):
- ELT pipelines can continuously update lake data.
- Open table formats support atomic commits and transaction visibility.
- Vectorized Lakehouse engines can query updated data immediately.
When to use: Analytics on operational or streaming data, without requiring full warehouse ETL cycles.
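A sketch of commit-level freshness with the `deltalake` package (illustrative path): as soon as an append commits, a reader sees the new table version and the new rows.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/fresh-demo"
write_deltalake(path, pd.DataFrame({"id": [1]}))

# New data lands as an atomic commit...
write_deltalake(path, pd.DataFrame({"id": [2]}), mode="append")

# ...and is immediately visible to any reader of the latest version.
table = DeltaTable(path)
print(table.version(), len(table.to_pandas()))  # version 1, 2 rows
```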