Configure Data Sources in Gen AI Builder

Who is this for

Platform users setting up Gen AI Builder in Hybrid Manager AI Factory, preparing to ingest content into AI Factory Pipelines and Knowledge Bases.

This includes platform administrators, content owners, AI builders, and teams responsible for providing data to AI Assistants and Retrieval-Augmented Generation (RAG) pipelines.

What you will accomplish

You will configure Data Sources — the ingestion endpoints that bring content into the AI Factory content pipeline.

You will:

  • Configure connectors for Data Sources (web, S3, GCS, Confluence, and more)
  • Test connectivity and ingest content
  • Validate that your content appears in Pipelines and Knowledge Bases

Why configure Data Sources

Data Sources are the entry point for content in AI Factory:

  • They feed content into the Data Lake → Libraries → Knowledge Bases → AI Applications pipeline.
  • Without Data Sources, Pipelines and Knowledge Bases have no content to operate on.
  • Data Sources allow you to manage content refresh, scope, and metadata — all key for Sovereign AI.

Data Sources can include:

  • Documents (PDFs, Word, HTML)
  • Web pages
  • Object storage (S3, GCS)
  • Internal content systems (Confluence, shared drives)
  • Custom systems (via Structures or API integrations)

Complexity and time to complete

  • Complexity: Low to moderate (depends on Data Source type)
  • Estimated time: 5–30 minutes per Data Source

Prerequisites

  • The Data Lake must be configured — see Configure the Data Lake.
  • For cloud object storage:
      • Access to S3 / GCS / Azure storage
      • Appropriate API credentials
  • For Confluence:
      • Confluence URL
      • User account with API token and permissions
  • For web pages:
      • Public or intranet-accessible URLs
  • For custom Data Sources:
      • Griptape Structure packaged and ready to deploy
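
Before you configure anything in the console, it can help to sanity-check storage credentials from a machine with network access to the store. A minimal sketch using boto3, with a hypothetical bucket name and placeholder keys:

```python
import boto3
from botocore.exceptions import ClientError

# Placeholder values: replace with your own credentials and bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

try:
    s3.head_bucket(Bucket="my-content-bucket")  # hypothetical bucket name
    print("Credentials OK: bucket is reachable")
except ClientError as err:
    print(f"Check credentials or permissions: {err}")
```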

How to configure Data Sources

1. Navigate to Data Sources

  1. In Gen AI Builder, go to the Data Sources tab.
  2. Click Add Data Source.

2. Select your Data Source type

Supported types include:

  • Web Page
  • S3
  • GCS
  • Confluence
  • Data Lake
  • Custom (via Structure)

3. Configure connection settings

For S3

  • Bucket name
  • Path prefix (optional)
  • Endpoint URL (for private or S3-compatible object stores)
  • Access Key ID
  • Secret Access Key
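
These fields map one-to-one onto a standard S3 client configuration. As a rough illustration (not the product's internals), here is how the same values look in boto3; all names and the endpoint are placeholders, and endpoint_url is only needed for private or S3-compatible stores:

```python
import boto3

# Each argument corresponds to a field in the S3 Data Source form.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # Access Key ID
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # Secret Access Key
    endpoint_url="https://s3.internal.example.com",  # Endpoint URL (optional)
)

# Bucket name and path prefix together scope which objects are ingested.
resp = s3.list_objects_v2(Bucket="my-content-bucket", Prefix="docs/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```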

For GCS

  • Bucket name
  • Path prefix (optional)
  • Project ID
  • HMAC keys (Access Key ID / Secret Access Key)
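
GCS HMAC keys follow the S3 interoperability convention, so you can verify them with any S3-compatible client pointed at storage.googleapis.com before entering them in the form. A minimal sketch with placeholder values:

```python
import boto3

# GCS HMAC keys work with S3-compatible clients through the
# interoperability endpoint. All values are placeholders.
gcs = boto3.client(
    "s3",
    aws_access_key_id="YOUR_HMAC_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_HMAC_SECRET",
    endpoint_url="https://storage.googleapis.com",
)

resp = gcs.list_objects_v2(Bucket="my-content-bucket", Prefix="docs/")
print([o["Key"] for o in resp.get("Contents", [])])
```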

For Confluence

  • Base URL
  • API Token
  • User email
  • Space(s) to index
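
You can confirm the token and permissions outside the console with a direct REST call. A minimal sketch against the Confluence Cloud API (self-hosted instances authenticate differently); the base URL, email, and token are placeholders:

```python
import requests

# Placeholders: your Confluence Cloud base URL, account email, and API token.
BASE_URL = "https://your-domain.atlassian.net/wiki"
EMAIL = "user@example.com"
API_TOKEN = "YOUR_API_TOKEN"

# Confluence Cloud accepts basic auth with email + API token.
resp = requests.get(f"{BASE_URL}/rest/api/space", auth=(EMAIL, API_TOKEN), timeout=10)
resp.raise_for_status()
for space in resp.json()["results"]:
    print(space["key"], space["name"])
```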

For Web Page

  • Base URL
  • Optional URL filters (regex patterns)
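
The exact filter semantics are product-specific, but the general idea is standard: a discovered URL is kept only if it matches an include pattern and no exclude pattern. An illustrative sketch with hypothetical patterns:

```python
import re

# Hypothetical patterns: keep documentation pages, skip binary downloads.
INCLUDE = [re.compile(r"^https://docs\.example\.com/")]
EXCLUDE = [re.compile(r"\.(pdf|zip)$")]

def keep(url: str) -> bool:
    """Keep a URL only if it matches an include and no exclude pattern."""
    return any(p.search(url) for p in INCLUDE) and not any(
        p.search(url) for p in EXCLUDE
    )

print(keep("https://docs.example.com/guide.html"))  # True
print(keep("https://docs.example.com/guide.pdf"))   # False
```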

For Custom Structure

  • Upload a Structure ZIP file or link to a GitHub repository
  • Configure parameters as required by your Structure (see the hypothetical skeleton below)
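
The entry-point contract for a custom Structure is defined by how you package and configure it, so the following is only a hypothetical skeleton: a small Python program that accepts a configured parameter, pulls content from a proprietary system, and emits one JSON document per line.

```python
import argparse
import json

def fetch_documents(api_url: str) -> list[dict]:
    """Stub: replace with calls into your proprietary content system."""
    return [{"id": "doc-1", "text": "example content", "source": api_url}]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-url", required=True)  # hypothetical configured parameter
    args = parser.parse_args()

    # Emit one JSON document per line for downstream ingestion.
    for doc in fetch_documents(args.api_url):
        print(json.dumps(doc))
```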

4. Test connection

  • Click Test Connection to validate credentials and connectivity.
  • Resolve any errors before proceeding.

5. Set sync options

  • Full sync — an initial, complete load of the Data Source's content.
  • Incremental sync — an ongoing refresh that picks up only new and updated content (illustrated in the sketch below).
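
Conceptually, an incremental sync re-ingests only items that changed since the last successful run. This is an illustrative sketch of that idea for object-store listings, not the product's implementation:

```python
from datetime import datetime, timezone

# Hypothetical timestamp of the last successful sync run.
last_sync = datetime(2025, 1, 1, tzinfo=timezone.utc)

def select_for_sync(objects: list[dict], incremental: bool) -> list[dict]:
    """Full sync takes everything; incremental takes only newer objects."""
    if not incremental:
        return objects
    return [o for o in objects if o["LastModified"] > last_sync]

objects = [
    {"Key": "docs/old.pdf", "LastModified": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"Key": "docs/new.pdf", "LastModified": datetime(2025, 3, 1, tzinfo=timezone.utc)},
]
print([o["Key"] for o in select_for_sync(objects, incremental=True)])  # ['docs/new.pdf']
```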

6. Launch sync

  • Click Sync Now to trigger ingestion.
  • Monitor sync progress.
  • Verify that content appears in the Data Lake.

7. Monitor Data Source status

  • View sync history and logs.
  • Monitor for errors or partial loads.
  • Adjust sync frequency as needed.

Best practices

  • Always validate Data Source scope — be intentional about which content is ingested.
  • Review content metadata and tags — useful for Hybrid Knowledge Bases.
  • Monitor initial sync results — large Data Sources may take time.
  • For sensitive content:
      • Ensure appropriate Data Source permissions.
      • Configure fine-grained scoping and sync filters.
  • Use custom Structures for complex or proprietary content ingestion.


By configuring Data Sources, you enable AI Factory to ingest, process, and index your content — powering Knowledge Bases, Assistants, and AI pipelines across your EDB PG AI and Hybrid Manager environment.

