Configure Data Sources in Gen AI Builder
Who is this for
Platform users setting up Gen AI Builder in Hybrid Manager AI Factory, preparing to ingest content into AI Factory Pipelines and Knowledge Bases.
This includes platform administrators, content owners, AI builders, and teams responsible for providing data to AI Assistants and Retrieval-Augmented Generation (RAG) pipelines.
What you will accomplish
You will configure Data Sources — the ingestion endpoints that bring content into the AI Factory content pipeline.
You will:
- Configure connectors for Data Sources (web, S3, GCS, Confluence, and more)
- Test connectivity and ingest content
- Validate that your content appears in Pipelines and Knowledge Bases
Why configure Data Sources
Data Sources are the entry point for content in AI Factory:
- They feed content into the Data Lake → Libraries → Knowledge Bases → AI Applications pipeline.
- Without Data Sources, Pipelines and Knowledge Bases have no content to operate on.
- Data Sources allow you to manage content refresh, scope, and metadata — all key for Sovereign AI.
Data Sources can include:
- Documents (PDFs, Word, HTML)
- Web pages
- Object storage (S3, GCS)
- Internal content systems (Confluence, shared drives)
- Custom systems (via Structures or API integrations)
For background, see the Related topics section at the end of this page.
Complexity and time to complete
- Complexity: Low to moderate (depends on Data Source type)
- Estimated time: 5–30 minutes per Data Source
Prerequisites
- The Data Lake must be configured — see Configure the Data Lake.
- For cloud object storage:
- Access to S3 / GCS / Azure storage
- Appropriate API credentials
- For Confluence:
- Confluence URL
- User account with API token and permissions
- For web pages:
- Public or intranet-accessible URLs
- For custom Data Sources:
- Griptape Structure packaged and ready to deploy
How to configure Data Sources
1. Navigate to Data Sources
- In Gen AI Builder, go to the Data Sources tab.
- Click Add Data Source.
2. Select your Data Source type
Supported types include:
- Web Page
- S3
- GCS
- Confluence
- Data Lake
- Custom (via Structure)
3. Configure connection settings
For S3
- Bucket name
- Path prefix (optional)
- Endpoint URL (for private or S3-compatible object stores)
- Access Key ID
- Secret Access Key
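If you want to sanity-check the credentials before entering them in the UI, a minimal boto3 sketch can confirm the bucket is reachable. The bucket name, prefix, and endpoint below are placeholders; substitute your own values (and omit `endpoint_url` for AWS S3):

```python
import boto3

# Placeholder credentials and endpoint -- substitute your own.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    endpoint_url="https://s3.example.internal",  # omit for AWS S3
)

# Raises an error if the bucket doesn't exist or the credentials can't reach it.
s3.head_bucket(Bucket="my-content-bucket")

# List a few objects under the prefix you plan to ingest.
resp = s3.list_objects_v2(Bucket="my-content-bucket", Prefix="docs/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```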
For GCS
- Bucket name
- Path prefix (optional)
- Project ID
- HMAC keys (Access Key ID / Secret Access Key)
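Because the connector authenticates with HMAC keys, you can verify them with any S3-compatible client pointed at the GCS interoperability endpoint. A minimal sketch, with placeholder bucket and key values:

```python
import boto3

# GCS HMAC keys work with S3-compatible clients via the
# interoperability endpoint. Values below are placeholders.
gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="YOUR_HMAC_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_HMAC_SECRET",
)

resp = gcs.list_objects_v2(Bucket="my-gcs-bucket", Prefix="docs/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```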
For Confluence
- Base URL
- API Token
- User email
- Space(s) to index
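To confirm the base URL, user email, and API token before saving, you can list the spaces the account can see. This sketch assumes Confluence Cloud (the REST path differs slightly on Data Center); all values are placeholders:

```python
import requests

# Placeholder base URL, email, and token -- substitute your own.
base_url = "https://your-domain.atlassian.net/wiki"
resp = requests.get(
    f"{base_url}/rest/api/space",
    auth=("user@example.com", "YOUR_API_TOKEN"),  # basic auth: email + API token
    params={"limit": 5},
)
resp.raise_for_status()
for space in resp.json().get("results", []):
    print(space["key"], "-", space["name"])
```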
For Web Page
- Base URL
- Optional URL filters (regex patterns)
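Regex filters are easy to get subtly wrong, so it can help to dry-run your patterns against a few sample URLs first. The exact matching semantics depend on the connector; this sketch only checks that the patterns themselves behave as you expect:

```python
import re

# Hypothetical include patterns and sample URLs -- substitute your own.
include_patterns = [r"^https://docs\.example\.com/guide/.*"]
sample_urls = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/blog/news",
]

for url in sample_urls:
    matched = any(re.match(p, url) for p in include_patterns)
    print(f"{'INGEST ' if matched else 'SKIP   '}{url}")
```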
For Custom Structure
- Upload Structure zip or link to GitHub repo
- Configure parameters as required by your Structure
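As a rough illustration, a custom Structure is a small Griptape program with an entry point that receives its configured parameters at run time. The argument-passing convention below is an assumption for illustration only; follow the Structures documentation for the exact contract Gen AI Builder expects:

```python
# structure.py -- hypothetical entry point for a custom ingestion Structure.
# The sys.argv convention here is an assumption; see the Structures docs
# for the contract Gen AI Builder actually uses.
import sys

from griptape.structures import Agent


def main() -> None:
    # Parameters configured in the Data Source UI arrive at run time.
    source_uri = sys.argv[1] if len(sys.argv) > 1 else "https://example.com/feed"

    agent = Agent()
    agent.run(f"Fetch and summarize the content at {source_uri}")


if __name__ == "__main__":
    main()
```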
4. Test connection
- Click Test Connection to validate credentials and connectivity.
- Resolve any errors before proceeding.
5. Set sync options
- Full sync — Initial full load of Data Source content.
- Incremental sync — Ongoing refresh of updated content.
6. Launch sync
- Click Sync Now to trigger ingestion.
- Monitor sync progress.
- Verify that content appears in the Data Lake.
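If your deployment exposes the object store backing the Data Lake, you can spot-check newly synced objects directly. This assumes an S3-compatible bucket and read credentials, which may not apply to every environment; the bucket name and endpoint are placeholders:

```python
import boto3

# Assumes the Data Lake is backed by an S3-compatible bucket you can read,
# using the default credential chain. Bucket and endpoint are placeholders.
lake = boto3.client("s3", endpoint_url="https://s3.example.internal")

resp = lake.list_objects_v2(Bucket="ai-factory-data-lake", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["LastModified"], obj["Key"])
```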
7. Monitor Data Source status
- View sync history and logs.
- Monitor for errors or partial loads.
- Adjust sync frequency as needed.
Best practices
- Always validate Data Source scope — be intentional about which content is ingested.
- Review content metadata and tags — useful for Hybrid Knowledge Bases.
- Monitor initial sync results — large Data Sources may take time.
- For sensitive content:
- Ensure appropriate Data Source permissions.
- Configure fine-grained scoping and sync filters.
- Use custom Structures for complex or proprietary content ingestion.
Related topics
- Configure the Data Lake
- Data Lake explained
- AI Factory Concepts
- Structures explained
- Configure Knowledge Bases
- Hybrid Manager: Using Gen AI Builder
By configuring Data Sources, you enable AI Factory to ingest, process, and index your content — powering Knowledge Bases, Assistants, and AI pipelines across your EDB PG AI and Hybrid Manager environment.