Data Lake Explained

The Data Lake in Gen AI Builder is the foundational object storage layer that supports the entire AI Factory content pipeline. It provides persistent, scalable storage for:

  • Uploaded files from Data Sources
  • Indexed data and embeddings used in Libraries and Knowledge Bases
  • Griptape Structures and Tools
  • Temporary artifacts used in AI workflows

In short: The Data Lake is where your AI system’s knowledge and operational data live.


What is the Data Lake?

The Data Lake is a required object storage backend that powers the AI Factory pipeline. It stores and manages:

  • Files ingested from Data Sources
  • Processed and transformed content staged in Libraries
  • Indexed embeddings used in Knowledge Bases
  • Griptape Structures and Tools required for Assistants and Agents
  • Temporary artifacts created during AI workflows

Without a configured Data Lake:

  • Data ingestion would fail
  • Libraries and Knowledge Bases could not be built
  • Retrieval-Augmented Generation (RAG) pipelines would not function
  • Structures and Tools would be unavailable to AI Agents and Assistants

The Data Lake is a core dependency of AI Factory.


Why is the Data Lake required?

Griptape-powered services rely on object storage to function:

  • Data Sources store and retrieve files here
  • Libraries use the Data Lake to store processed content
  • Knowledge Bases are built from content staged in the Data Lake
  • Structures and Tools required by AI Agents are stored here

The Data Lake underpins the entire AI Factory content pipeline:

Data Sources → Data Lake → Libraries → Knowledge Bases → Retrievers → Assistants → AI Applications

How the Data Lake fits into the content pipeline

  • Data Sources → Files are ingested into the Data Lake
  • Libraries → Processed content is staged and indexed via the Data Lake
  • Knowledge Bases → Built from the indexed content in the Data Lake
  • Retrievers and Assistants → Serve content through retrieval pipelines backed by the Data Lake

The Data Lake provides durable, scalable storage across this entire flow.
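
Because every stage of this flow reads and writes through the same bucket, it can be worth verifying that the bucket is reachable before wiring up Data Sources. The following is a minimal sketch using Python and boto3; the bucket name and endpoint URL are placeholder assumptions, not values defined by Gen AI Builder.

    import boto3
    from botocore.exceptions import ClientError

    # Placeholder values -- substitute your deployment's bucket and endpoint.
    BUCKET = "genai-builder-data-lake"      # hypothetical bucket name
    ENDPOINT = "https://minio.example.com"  # omit endpoint_url for AWS S3

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)

    try:
        # head_bucket confirms the bucket exists and credentials can reach it
        s3.head_bucket(Bucket=BUCKET)
        print(f"Data Lake bucket '{BUCKET}' is reachable")
    except ClientError as err:
        print(f"Data Lake check failed: {err.response['Error']['Code']}")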


When to configure the Data Lake

Configure the Data Lake when:

  • Deploying a new Gen AI Builder instance
  • Connecting an external object storage backend (S3, GCS, or MinIO)
  • Preparing to add Data Sources or create Knowledge Bases
  • Changing storage providers or buckets

Important: The Data Lake must be configured before using Libraries or Knowledge Bases.
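
As a concrete starting point, the sketch below provisions a dedicated bucket with boto3. It assumes credentials are supplied through the environment; the bucket name and MinIO endpoint are illustrative only.

    import boto3

    # Credentials are read from the environment (AWS_ACCESS_KEY_ID, etc.).
    # endpoint_url targets an S3-compatible backend such as MinIO;
    # drop it when using AWS S3 directly.
    s3 = boto3.client("s3", endpoint_url="https://minio.example.com")

    # One dedicated bucket per Gen AI Builder deployment (best practice).
    # On AWS, regions other than us-east-1 also require:
    #   CreateBucketConfiguration={"LocationConstraint": "<region>"}
    s3.create_bucket(Bucket="genai-builder-data-lake")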


Patterns of use

  • Dedicated bucket → Best practice is to provision a dedicated object storage bucket per Gen AI Builder deployment
  • Isolated permissions → Object storage credentials should be scoped to this bucket only
  • CORS configured → Cross-Origin Resource Sharing (CORS) must allow UI interaction with the Data Lake (see the sketch after this list)
  • S3-compatible → Supports AWS S3, GCS with S3 interoperability, or any compatible object storage service
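
To illustrate the CORS point above, the sketch below applies a rule that allows the portal origin to call the bucket directly. The origin and bucket name are assumptions; use the URL your Gen AI Builder UI is actually served from.

    import boto3

    # Hypothetical portal origin and bucket name -- adjust for your deployment.
    cors_config = {
        "CORSRules": [
            {
                "AllowedOrigins": ["https://builder.example.com"],  # portal URL
                "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,
            }
        ]
    }

    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket="genai-builder-data-lake",
        CORSConfiguration=cors_config,
    )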

Best practices

  • Use a dedicated bucket per Gen AI Builder deployment
  • Follow the principle of least privilege for storage credentials (see the policy sketch after this list)
  • Configure CORS correctly to enable portal-based interactions
  • Regularly review bucket permissions and audit access
  • Monitor storage utilization and cost
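
To make the least-privilege point concrete, here is a sketch of an S3-style policy scoped to a single bucket. The bucket name is a placeholder; on GCS or MinIO, express the same restriction through that provider's own policy mechanism.

    import json

    # Placeholder bucket name -- grants object read/write/delete plus listing,
    # and nothing else, on one Data Lake bucket only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::genai-builder-data-lake/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::genai-builder-data-lake",
            },
        ],
    }
    print(json.dumps(policy, indent=2))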

Governance and Sovereign AI

The Data Lake plays a key role in Sovereign AI:

  • You control where your data resides — your object storage, in your infrastructure
  • All AI Factory pipelines rely on content staged in the Data Lake — ensuring data sovereignty
  • All artifacts are fully observable and governed through AI Factory and Hybrid Manager

By following best practices for bucket isolation and permissions, you ensure that your AI content pipelines remain secure, auditable, and aligned with Sovereign AI principles.



