Data Lake Explained

The Data Lake in Gen AI Builder is the foundational object storage layer that supports the entire AI Factory content pipeline. It provides persistent, scalable storage for:

  • Uploaded files from Data Sources
  • Indexed data and embeddings used in Libraries and Knowledge Bases
  • Griptape Structures and Tools
  • Temporary artifacts used in AI workflows

In short: The Data Lake is where your AI system’s knowledge and operational data live.


What is the Data Lake?

The Data Lake is a required object storage backend that powers the AI Factory pipeline. It stores and manages:

  • Files ingested from Data Sources
  • Processed and transformed content staged in Libraries
  • Indexed embeddings used in Knowledge Bases
  • Griptape Structures and Tools required for Assistants and Agents
  • Temporary artifacts created during AI workflows

Without a configured Data Lake:

  • Data ingestion would fail
  • Libraries and Knowledge Bases could not be built
  • Retrieval-Augmented Generation (RAG) pipelines would not function
  • Structures and Tools would be unavailable to AI Agents and Assistants

The Data Lake is a core dependency of AI Factory.


Why is the Data Lake required?

Griptape-powered services rely on object storage to function:

  • Data Sources store and retrieve files here
  • Libraries use the Data Lake to store processed content
  • Knowledge Bases are built from content staged in the Data Lake
  • Structures and Tools required by AI Agents are stored here

The Data Lake underpins the entire AI Factory content pipeline:

Data Sources → Data Lake → Libraries → Knowledge Bases → Retrievers → Assistants → AI Applications

How the Data Lake fits into the content pipeline

  • Data Sources → Files are ingested into the Data Lake
  • Libraries → Processed content is staged and indexed via the Data Lake
  • Knowledge Bases → Built from the indexed content in the Data Lake
  • Retrievers and Assistants → Serve content through retrieval pipelines backed by the Data Lake

The Data Lake provides durable, scalable storage across this entire flow.
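
Because every stage of this flow reads and writes through the same bucket, it can be worth verifying that the bucket is reachable before wiring up Data Sources. The following is a minimal sketch using Python and boto3; the bucket name and endpoint URL are placeholder assumptions, not values defined by Gen AI Builder.

    import boto3
    from botocore.exceptions import ClientError

    # Placeholder values -- substitute your deployment's bucket and endpoint.
    BUCKET = "genai-builder-data-lake"      # hypothetical bucket name
    ENDPOINT = "https://minio.example.com"  # omit endpoint_url for AWS S3

    s3 = boto3.client("s3", endpoint_url=ENDPOINT)

    try:
        # head_bucket confirms the bucket exists and credentials can reach it
        s3.head_bucket(Bucket=BUCKET)
        print(f"Data Lake bucket '{BUCKET}' is reachable")
    except ClientError as err:
        print(f"Data Lake check failed: {err.response['Error']['Code']}")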


When to configure the Data Lake

Configure the Data Lake when:

  • Deploying a new Gen AI Builder instance
  • Connecting an external object storage backend (S3, GCS, or MinIO)
  • Preparing to add Data Sources or create Knowledge Bases
  • Changing storage providers or buckets

Important: The Data Lake must be configured before using Libraries or Knowledge Bases.
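
As a concrete starting point, the sketch below provisions a dedicated bucket with boto3. It assumes credentials are supplied through the environment; the bucket name and MinIO endpoint are illustrative only.

    import boto3

    # Credentials are read from the environment (AWS_ACCESS_KEY_ID, etc.).
    # endpoint_url targets an S3-compatible backend such as MinIO;
    # drop it when using AWS S3 directly.
    s3 = boto3.client("s3", endpoint_url="https://minio.example.com")

    # One dedicated bucket per Gen AI Builder deployment (best practice).
    # On AWS, regions other than us-east-1 also require:
    #   CreateBucketConfiguration={"LocationConstraint": "<region>"}
    s3.create_bucket(Bucket="genai-builder-data-lake")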


Patterns of use

  • Dedicated bucket → Best practice is to provision a dedicated object storage bucket per Gen AI Builder deployment
  • Isolated permissions → Object storage credentials should be scoped to this bucket only
  • CORS configured → Cross-Origin Resource Sharing (CORS) must allow UI interaction with the Data Lake (see the sketch after this list)
  • S3-compatible → Supports AWS S3, GCS with S3 interoperability, or any compatible object storage service
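
To illustrate the CORS point above, the sketch below applies a rule that allows the portal origin to call the bucket directly. The origin and bucket name are assumptions; use the URL your Gen AI Builder UI is actually served from.

    import boto3

    # Hypothetical portal origin and bucket name -- adjust for your deployment.
    cors_config = {
        "CORSRules": [
            {
                "AllowedOrigins": ["https://builder.example.com"],  # portal URL
                "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,
            }
        ]
    }

    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket="genai-builder-data-lake",
        CORSConfiguration=cors_config,
    )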

Best practices

  • Use a dedicated bucket per Gen AI Builder deployment
  • Follow the principle of least privilege for storage credentials (see the policy sketch after this list)
  • Configure CORS correctly to enable portal-based interactions
  • Regularly review bucket permissions and audit access
  • Monitor storage utilization and cost
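
To make the least-privilege point concrete, here is a sketch of an S3-style policy scoped to a single bucket. The bucket name is a placeholder; on GCS or MinIO, express the same restriction through that provider's own policy mechanism.

    import json

    # Placeholder bucket name -- grants object read/write/delete plus listing,
    # and nothing else, on one Data Lake bucket only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": "arn:aws:s3:::genai-builder-data-lake/*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": "arn:aws:s3:::genai-builder-data-lake",
            },
        ],
    }
    print(json.dumps(policy, indent=2))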

Governance and Sovereign AI

The Data Lake plays a key role in Sovereign AI:

  • You control where your data resides — your object storage, in your infrastructure
  • All AI Factory pipelines rely on content staged in the Data Lake — ensuring data sovereignty
  • All artifacts are fully observable and governed through AI Factory and Hybrid Manager

By following best practices for bucket isolation and permissions, you ensure that your AI content pipelines remain secure, auditable, and aligned with Sovereign AI principles.



