Configure an Amazon S3 data source

Who is this for

Platform users who want to ingest files stored in Amazon S3 buckets into Gen AI Builder for use in Knowledge Bases and Retrieval-Augmented Generation (RAG) workflows. Typical users include developers, data engineers, and cloud architects managing enterprise data pipelines.

What you will accomplish

You will configure an Amazon S3 data source in Gen AI Builder to ingest files from one or more S3 URIs. The ingested content will be available for transformation and indexing into Libraries and Knowledge Bases.

Why use an Amazon S3 data source

S3 is a common storage platform for enterprise content such as PDFs, CSVs, and documents.
Directly connecting to S3 allows you to leverage existing cloud storage without duplicating data.
Precise targeting of folders or files ensures high efficiency and control over the ingested content.

For background on how this content powers downstream AI use cases, see:

Complexity and time to complete

Complexity: Moderate. You need to securely configure AWS credentials and understand your S3 URI structure.
Estimated time: 10–15 minutes if IAM permissions are already configured.

Key considerations

Data relevance and quality

Ingest trusted, curated files that contribute meaningfully to your AI application.
Well-structured, clearly written documents lead to better downstream AI results.

Scope of ingestion

Use targeted S3 URIs to ingest only the content you need.
Avoid specifying root-level prefixes that could ingest unnecessary or sensitive content.

Credentials security

Use IAM credentials with least privilege—grant read-only access only to the specific S3 resources Gen AI Builder needs.
Do not use over-permissioned root or admin keys.

Cost awareness

Be mindful of S3 data transfer and retrieval costs, especially for large ingestion jobs or frequent refresh schedules.

How to configure an Amazon S3 data source

In the interface where you manage data sources, select + Add New Data Source.
Choose Amazon S3 from the available data source types.
Configure the following fields:

Name: Provide a clear and unique name. Example: Marketing Assets S3.
Description (optional): Add descriptive context for future reference.

Connect to S3:

AWS Access Key: Enter your AWS Access Key ID.
AWS Secret Key: Enter your AWS Secret Access Key.
URIs: Provide one or more S3 URIs for objects or prefixes.
Example: s3://bucket-name/key_name
Select Add URI to add more URIs.

(Optional) Configure advanced options:

Scheduled refresh: Enable this option to automatically refresh the data on a schedule.
Provide a cron expression. Example: 0 2 * * * (daily at 2 AM).
Transform your data: Enable this option to apply a PG.AI Structure to transform the data during ingestion.
Select an existing Structure from the list.

Select Create to add the Amazon S3 data source.

Supported file types

PDF
CSV
Markdown
Most text-based file types

For more about how content is indexed and retrieved from these files, see Embeddings explained.

Managing and refreshing the data source

Once created, the Amazon S3 data source can be viewed and managed through the Data Sources interface.

Actions available:

Edit: Modify the data source configuration.
Refresh: Manually trigger a data ingestion job.
Delete: Remove the data source.

You can also review the data job history to monitor ingestion performance and troubleshoot issues.

Troubleshooting

Access denied

Verify that your IAM user or role has the necessary s3:GetObject and (if applicable) s3:ListBucket permissions.
Review your bucket policies and S3 URIs for correctness.
Check the data job history for specific permission-related errors.

Bucket or key not found

Double-check S3 URIs for correctness.
Confirm that the specified keys or prefixes exist and are accessible to your IAM user or role.

Example scenario

You want to index quarterly financial reports stored in an S3 folder.

Example configuration:

Name: Financial Reports S3 Archive
AWS Access Key: AKIAxxxxxxxxxxxxxxxx
AWS Secret Key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
URI: s3://my-company-financials/quarterly-reports/pdf/
Scheduled refresh: 0 2 * * * (daily at 2 AM)
Transform your data: Apply a Structure to summarize PDFs and extract key financial figures.

← Prev

Configure a Google Drive data source

↑ Up

Gen AI How-To Guides

Configure a Web Page data source

Could this page be better? Report a problem or suggest an addition!