Configure a Data Lake data source

Who this is for

Platform users who want to ingest file-based content from the platform’s managed Data Lake into Gen AI Builder for use in Knowledge Bases and Retrieval-Augmented Generation (RAG) workflows. This is typically used by developers, data engineers, and business owners building AI applications that require internal structured or unstructured documents.

What you will accomplish

You will configure a Data Lake data source in Gen AI Builder to ingest files from a selected Data Lake bucket and (optionally) specific asset paths. The content will then be available for indexing into Libraries and Knowledge Bases.

Why use a Data Lake data source

  • Many enterprise documents (PDFs, CSVs, Markdown, text files) already reside in internal Data Lake storage.
  • The Data Lake source provides a reliable, cost-effective way to bring this content into AI workflows.
  • Unlike public web content, Data Lake files are typically higher-trust, internal documents.
  • The Data Lake source allows for precise targeting (specific folders or files), making it efficient and scalable.

Complexity and time to complete

  • Complexity: Low to moderate. You need familiarity with your Data Lake structure (bucket names and paths).
  • Estimated time: 5–10 minutes if paths are known and bucket access is already configured.

Key considerations

Data relevance and quality

  • Ingest only trusted, high-quality files that are useful for your AI applications.
  • Well-structured files (consistent formatting, well-written text) lead to better AI results.

Scope of ingestion

  • Be specific about which files or folders you ingest. Use targeted paths instead of entire buckets where possible.
  • Ingesting too broad a scope (e.g., entire raw data buckets) can lead to large processing costs and irrelevant content.

Permissions and access

  • You must have appropriate permissions to the selected Data Lake bucket and paths.
  • If no specific asset paths are provided, all files in the selected bucket will be ingested.

How to configure a Data Lake data source

  1. In the interface where you manage data sources, select + Add New Data Source.
  2. Choose Data Lake from the available data source types.
  3. Configure the following fields:
     • Name: Provide a clear and unique name. Example: Q1 Product Specs - Data Lake.
     • Description (optional): Add descriptive context for future reference.
  4. Connect to Data Lake:
     • Bucket: Select the Data Lake bucket from the dropdown list.
     • Asset Paths: Enter one or more specific paths to files or folders (example: manuals/pdf/current-version/), and select Add Asset Path to add more paths. If no paths are provided, the system ingests all files in the bucket.
  5. (Optional) Configure advanced options:
     • Scheduled refresh: Enable this option to refresh the data automatically on a schedule, and provide a cron expression such as 0 2 * * * (daily at 2 AM). See the sketch after these steps for a quick way to validate an expression.
     • Transform your data: Enable this option to apply a PG.AI Structure to transform the data during ingestion, then select an existing Structure from the list.
  6. Select Create to add the Data Lake data source.
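
Cron expressions are easy to get wrong, so it helps to preview the schedule before saving. Below is a minimal sketch using the third-party croniter Python package to print the next few run times for an expression; it runs outside the platform and only checks the schedule string itself.

```python
from datetime import datetime

from croniter import croniter  # pip install croniter

expr = "0 2 * * *"  # fields: minute hour day-of-month month day-of-week

# croniter raises an error (a ValueError subclass) for malformed expressions.
schedule = croniter(expr, datetime(2025, 1, 1))

# Preview the next three scheduled refresh times.
for _ in range(3):
    print(schedule.get_next(datetime))
```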

Supported file types

  • PDF
  • CSV
  • Markdown
  • Most text-based file types
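
If you stage files locally before uploading them to the Data Lake, a quick extension filter can show which ones fall into these categories. This is a minimal sketch; the extension set below is an assumption mapped from the list above, not an authoritative list of what the platform accepts.

```python
from pathlib import Path

# Assumed extensions for the supported types listed above (PDF, CSV,
# Markdown, and common text formats); adjust for your own content.
SUPPORTED = {".pdf", ".csv", ".md", ".markdown", ".txt"}

def ingestible_files(root: str) -> list[Path]:
    """Return files under root whose extension looks ingestible."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and p.suffix.lower() in SUPPORTED]

for path in ingestible_files("manuals/pdf/current-version"):
    print(path)
```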

For more about how content is indexed and retrieved from these files, see Embeddings explained.

Managing and refreshing the data source

Once created, the Data Lake data source can be viewed and managed through the Data Sources interface.

Actions available:

  • Edit: Modify the data source configuration.
  • Refresh: Manually trigger a data ingestion job.
  • Delete: Remove the data source.

You can also review the data job history to monitor ingestion performance and troubleshoot issues.

Troubleshooting

Files not found

  • Verify that the bucket and asset paths are correct.
  • Check the data job history for errors.

Access issues

  • Ensure your user or system has the necessary permissions to access the selected bucket and paths.
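
If your Data Lake buckets are reachable through an S3-compatible API (an assumption; check how your platform exposes Data Lake storage), a short boto3 check can tell a permissions failure apart from a mistyped bucket or an empty path. The bucket and prefix here are taken from the example scenario below.

```python
import boto3
from botocore.exceptions import ClientError

# Assumes S3-compatible access to the Data Lake; credentials and, if needed,
# an endpoint_url depend on your environment.
s3 = boto3.client("s3")

bucket = "technical-documentation-prd"   # bucket from the example scenario
prefix = "manuals/pdf/current-version/"  # asset path to verify

try:
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=10)
except ClientError as err:
    # AccessDenied points to a permissions problem; NoSuchBucket to a typo.
    print(f"Cannot list {bucket}/{prefix}: {err}")
else:
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    if not keys:
        print("Path is reachable but contains no files - check the asset path.")
    else:
        print(f"Found {len(keys)} file(s), e.g. {keys[0]}")
```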

Example scenario

You want to index a set of PDF product manuals stored in your Data Lake.

Example configuration:

  • Name: Product Manuals Archive
  • Bucket: technical-documentation-prd
  • Asset Path: manuals/pdf/current-version/
  • Scheduled refresh: 0 2 * * * (daily at 2 AM)
  • Transform your data: Apply a Structure to extract specific sections from PDFs.
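
If you track configurations in notes or code review, recording the settings in a structured form can help. The snippet below is purely illustrative: the keys mirror the UI labels above and are not a documented API or import format.

```python
# Hypothetical record of the example configuration; keys mirror the UI labels
# and are NOT a documented API or import format.
data_source = {
    "name": "Product Manuals Archive",
    "type": "Data Lake",
    "bucket": "technical-documentation-prd",
    "asset_paths": ["manuals/pdf/current-version/"],
    "scheduled_refresh": "0 2 * * *",  # daily at 2 AM
    "transform": "Structure that extracts specific sections from PDFs",
}
```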
