Configure a Web Page data source

Who is this for

Platform users who want to bring web content into Gen AI Builder for use in Knowledge Bases and Retrieval-Augmented Generation (RAG) workflows. Typical users include developers, AI architects, and business owners working on contextual AI applications.

What you will accomplish

You will configure a Web Page data source in Gen AI Builder to ingest selected web pages. The content will then be available for transformation and indexing into downstream Libraries and Knowledge Bases.

Why use a Web Page data source

  • Web pages often contain valuable, curated, and public-facing information that complements internal data sources.
  • Adding relevant web content can enhance AI answers, RAG performance, and semantic search by expanding your Knowledge Base.
  • Targeted ingestion of specific web pages helps ensure the system uses only high-quality and relevant external content.

Key considerations

Data relevance and quality

  • Ingest web pages that contain useful, high-quality, and relevant content.
  • Pages with structured and clearly written HTML are the most effective.
  • Avoid dynamic pages that require JavaScript for key content.

Scope of ingestion

  • Target specific pages.
  • Avoid adding entire websites or root-level URLs unless necessary, as this can result in large volumes of irrelevant data and higher processing costs.
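A quick pre-flight check can catch URLs that are incomplete or that point at a site root before you add them. The following is a minimal sketch using only the Python standard library. The function name `check_url` and its warning messages are hypothetical and are not part of Gen AI Builder.

```python
from urllib.parse import urlsplit


def check_url(url: str) -> list[str]:
    """Return warnings for a candidate URL (an empty list means no warnings)."""
    warnings = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        # A full URL needs both a scheme and a host name.
        warnings.append("not a full http(s) URL")
    elif parts.path in ("", "/"):
        # A site root is rarely a targeted page.
        warnings.append("root-level URL; consider a more specific page")
    return warnings


for candidate in (
    "https://www.enterprisedb.com/blog/category/genai",
    "https://www.enterprisedb.com/",
    "www.enterprisedb.com/blog",
):
    print(candidate, "->", check_url(candidate))
```

The check is advisory only. A root-level URL can still be the right choice when that page itself holds the content you need.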

Compliance and terms of service

  • Always respect the website's robots.txt file and terms of service.
  • Ensure you are authorized to ingest the content for your intended use.
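One way to check a page against a site's robots.txt rules before adding it is Python's standard `urllib.robotparser` module. The sketch below parses rules supplied as text so it runs offline. The sample rules, the `is_allowed` helper, and the `*` user agent are illustrative assumptions; this page does not state which user agent Gen AI Builder sends.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/blog/post"))
print(is_allowed(sample_rules, "https://example.com/private/report"))
```

To check a live site instead, `RobotFileParser` can fetch the file itself through `set_url()` followed by `read()`. A passing check does not replace reading the site's terms of service.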

How to configure a Web Page data source

  1. In the interface where you manage data sources, select + Add New Data Source.
  2. Choose Web Page from the available data source types.
  3. Configure the following fields:
     • Name: Provide a clear and unique name. Example: EDB Main Site FAQs.
     • Description (optional): Add descriptive context for future reference.
     • URLs: Enter one or more full URLs of the web pages to ingest.
       • Example: https://www.enterprisedb.com/
       • Select Add URL to provide additional URLs.
  4. (Optional) Configure advanced options:
     • Scheduled refresh: Enable this option to automatically refresh the data on a schedule.
       • Provide a cron expression. Example: 0 2 * * * (daily at 2 AM).
     • Transform your data: Enable this option to apply a PG.AI Structure that transforms the data during ingestion.
       • Select an existing Structure from the list.
  5. Select Create to add the Web Page data source.
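The cron expression used for a scheduled refresh can be easier to review when its fields are labeled. The sketch below assumes the conventional five-field cron layout (minute, hour, day of month, month, day of week); this page does not specify which cron dialect Gen AI Builder accepts, and the `describe_cron` helper is hypothetical.

```python
# Field order of a conventional five-field cron expression (an assumption here).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Split a five-field cron expression into labeled fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


# The example from the steps above: minute 0 of hour 2, every day.
print(describe_cron("0 2 * * *"))
```

The helper only checks the field count. It does not validate the value ranges inside each field.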

Supported content

  • Text content from HTML pages.
  • Limited support for client-side rendered content or heavily dynamic pages.

Managing and refreshing the data source

After you create the Web Page data source, you can view and manage it through the Data Sources interface.

Actions available:

  • Edit: Modify the data source configuration.
  • Refresh: Manually trigger a data ingestion job.
  • Delete: Remove the data source.

You can also review the data job history to monitor ingestion performance and troubleshoot issues.

Troubleshooting

URL not accessible

  • Check the URL for typos.
  • Verify the page is public and not protected by login or access controls.

No or incomplete content

  • The page may rely on JavaScript for key content, which is not fully supported.
  • Anti-scraping protections may block ingestion.
  • Review the data job history. If 0 bytes were ingested, dynamic rendering may be the cause.
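To judge whether a page depends on JavaScript for its content, you can look at how much text its raw HTML contains before any script runs. The sketch below uses Python's standard `html.parser` to collect text outside `<script>` and `<style>` elements. It only approximates what a static ingestion would see; the extraction logic Gen AI Builder actually uses is not described here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the text present in the HTML without running any scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative pages: one server-rendered, one that fills itself in with JavaScript.
static_page = "<html><body><h1>FAQ</h1><p>Answer text.</p></body></html>"
js_page = '<html><body><div id="root"></div><script>render("app")</script></body></html>'

print(len(static_text(static_page)), "characters of static text")
print(len(static_text(js_page)), "characters of static text")
```

For a live page, you could fetch the HTML with `urllib.request.urlopen` and pass the decoded body to `static_text`. A result that is empty or very short suggests the page builds its content client-side.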

Example scenario

You want to enrich your Knowledge Base with relevant articles from the public EDB website.

Example configuration:

  • Name: EDB Blog - Latest GenAI Articles
  • URLs: https://www.enterprisedb.com/blog/category/genai
  • Scheduled refresh: 0 2 * * * (daily at 2 AM)
  • Transform your data: Use a Structure to strip unwanted boilerplate such as author bios.
