Configure Data Sources in Gen AI Builder
Who is this for
Platform users setting up Gen AI Builder in Hybrid Manager AI Factory, preparing to ingest content into AI Factory Pipelines and Knowledge Bases.
This includes platform administrators, content owners, AI builders, and teams responsible for providing data to AI Assistants and Retrieval-Augmented Generation (RAG) pipelines.
What you will accomplish
You will configure Data Sources — the ingestion endpoints that bring content into the AI Factory content pipeline.
You will:
- Configure connectors for Data Sources (web, S3, GCS, Confluence, and more)
- Test connectivity and ingest content
- Validate that your content appears in Pipelines and Knowledge Bases
Why configure Data Sources
Data Sources are the entry point for content in AI Factory:
- They feed content into the Data Lake → Libraries → Knowledge Bases → AI Applications pipeline.
- Without Data Sources, Pipelines and Knowledge Bases have no content to operate on.
- Data Sources allow you to manage content refresh, scope, and metadata — all key for Sovereign AI.
Data Sources can include:
- Documents (PDFs, Word, HTML)
- Web pages
- Object storage (S3, GCS)
- Internal content systems (Confluence, shared drives)
- Custom systems (via Structures or API integrations)
For background, see the Related topics section at the end of this page.
Complexity and time to complete
- Complexity: Low to moderate (depends on Data Source type)
- Estimated time: 5–30 minutes per Data Source
Prerequisites
- The Data Lake must be configured — see Configure the Data Lake.
- For cloud object storage:
- Access to S3 / GCS / Azure storage
- Appropriate API credentials
- For Confluence:
- Confluence URL
- User account with API token and permissions
- For web pages:
- Public or intranet-accessible URLs
- For custom Data Sources:
- Griptape Structure packaged and ready to deploy
How to configure Data Sources
1. Navigate to Data Sources
- In Gen AI Builder, go to the Data Sources tab.
- Click Add Data Source.
2. Select your Data Source type
Supported types include:
- Web Page
- S3
- GCS
- Confluence
- Data Lake
- Custom (via Structure)
3. Configure connection settings
For S3
- Bucket name
- Path prefix (optional)
- Endpoint URL (for private or S3-compatible object stores)
- Access Key ID
- Secret Access Key
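If you want to sanity-check the credentials before entering them in the UI, a minimal boto3 sketch can confirm the bucket is reachable. The bucket name, prefix, and endpoint below are placeholders; substitute your own values (and omit `endpoint_url` for AWS S3):

```python
import boto3

# Placeholder credentials and endpoint -- substitute your own.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    endpoint_url="https://s3.example.internal",  # omit for AWS S3
)

# Raises an error if the bucket doesn't exist or the credentials can't reach it.
s3.head_bucket(Bucket="my-content-bucket")

# List a few objects under the prefix you plan to ingest.
resp = s3.list_objects_v2(Bucket="my-content-bucket", Prefix="docs/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```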
For GCS
- Bucket name
- Path prefix (optional)
- Project ID
- HMAC keys (Access Key ID / Secret Access Key)
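Because the connector authenticates with HMAC keys, you can verify them with any S3-compatible client pointed at the GCS interoperability endpoint. A minimal sketch, with placeholder bucket and key values:

```python
import boto3

# GCS HMAC keys work with S3-compatible clients via the
# interoperability endpoint. Values below are placeholders.
gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="YOUR_HMAC_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_HMAC_SECRET",
)

resp = gcs.list_objects_v2(Bucket="my-gcs-bucket", Prefix="docs/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```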
For Confluence
- Base URL
- API Token
- User email
- Space(s) to index
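To confirm the base URL, user email, and API token before saving, you can list the spaces the account can see. This sketch assumes Confluence Cloud (the REST path differs slightly on Data Center); all values are placeholders:

```python
import requests

# Placeholder base URL, email, and token -- substitute your own.
base_url = "https://your-domain.atlassian.net/wiki"
resp = requests.get(
    f"{base_url}/rest/api/space",
    auth=("user@example.com", "YOUR_API_TOKEN"),  # basic auth: email + API token
    params={"limit": 5},
)
resp.raise_for_status()
for space in resp.json().get("results", []):
    print(space["key"], "-", space["name"])
```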
For Web Page
- Base URL
- Optional URL filters (regex patterns)
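Regex filters are easy to get subtly wrong, so it can help to dry-run your patterns against a few sample URLs first. The exact matching semantics depend on the connector; this sketch only checks that the patterns themselves behave as you expect:

```python
import re

# Hypothetical include patterns and sample URLs -- substitute your own.
include_patterns = [r"^https://docs\.example\.com/guide/.*"]
sample_urls = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/blog/news",
]

for url in sample_urls:
    matched = any(re.match(p, url) for p in include_patterns)
    print(f"{'INGEST ' if matched else 'SKIP   '}{url}")
```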
For Custom Structure
- Upload Structure zip or link to GitHub repo
- Configure parameters as required by your Structure
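As a rough illustration, a custom Structure is a small Griptape program with an entry point that receives its configured parameters at run time. The argument-passing convention below is an assumption for illustration only; follow the Structures documentation for the exact contract Gen AI Builder expects:

```python
# structure.py -- hypothetical entry point for a custom ingestion Structure.
# The sys.argv convention here is an assumption; see the Structures docs
# for the contract Gen AI Builder actually uses.
import sys

from griptape.structures import Agent


def main() -> None:
    # Parameters configured in the Data Source UI arrive at run time.
    source_uri = sys.argv[1] if len(sys.argv) > 1 else "https://example.com/feed"

    agent = Agent()
    agent.run(f"Fetch and summarize the content at {source_uri}")


if __name__ == "__main__":
    main()
```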
4. Test connection
- Click Test Connection to validate credentials and connectivity.
- Resolve any errors before proceeding.
5. Set sync options
- Full sync — Initial full load of Data Source content.
- Incremental sync — Ongoing refresh of updated content.
6. Launch sync
- Click Sync Now to trigger ingestion.
- Monitor sync progress.
- Verify that content appears in the Data Lake.
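If your deployment exposes the object store backing the Data Lake, you can spot-check newly synced objects directly. This assumes an S3-compatible bucket and read credentials, which may not apply to every environment; the bucket name and endpoint are placeholders:

```python
import boto3

# Assumes the Data Lake is backed by an S3-compatible bucket you can read,
# using the default credential chain. Bucket and endpoint are placeholders.
lake = boto3.client("s3", endpoint_url="https://s3.example.internal")

resp = lake.list_objects_v2(Bucket="ai-factory-data-lake", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Key"])
```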
7. Monitor Data Source status
- View sync history and logs.
- Monitor for errors or partial loads.
- Adjust sync frequency as needed.
Best practices
- Always validate Data Source scope — be intentional about which content is ingested.
- Review content metadata and tags — useful for Hybrid Knowledge Bases.
- Monitor initial sync results — large Data Sources may take time.
- For sensitive content:
- Ensure appropriate Data Source permissions.
- Configure fine-grained scoping and sync filters.
- Use custom Structures for complex or proprietary content ingestion.
Related topics
- Configure the Data Lake
- Data Lake explained
- AI Factory Concepts
- Structures explained
- Configure Knowledge Bases
- Hybrid Manager: Using Gen AI Builder
By configuring Data Sources, you enable AI Factory to ingest, process, and index your content — powering Knowledge Bases, Assistants, and AI pipelines across your EDB PG AI and Hybrid Manager environment.