Web Scraper

Overview

Web Scraper Drivers scrape text from the web. WebLoader relies on them to fetch page content. All Web Scraper Drivers implement the following method:

  • scrape_url() scrapes text from a website and returns a TextArtifact. The format of the scraped text is determined by the Driver.
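
The single-method contract can be pictured with a small, self-contained sketch. Everything in it is hypothetical and exists only for illustration: TextArtifactSketch, WebScraperDriverSketch, and CannedScraperDriver are not Griptape classes, and the toy driver serves canned text instead of making network requests.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TextArtifactSketch:
    """Hypothetical stand-in for a text artifact: it only wraps the scraped text."""

    value: str


class WebScraperDriverSketch(ABC):
    """Hypothetical base class mirroring the one-method contract described above."""

    @abstractmethod
    def scrape_url(self, url: str) -> TextArtifactSketch:
        """Return the text of the page at `url` in a driver-specific format."""


class CannedScraperDriver(WebScraperDriverSketch):
    """Toy driver that looks pages up in a dict instead of touching the network."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages

    def scrape_url(self, url: str) -> TextArtifactSketch:
        if url not in self.pages:
            raise ValueError(f"no canned page for {url}")
        return TextArtifactSketch(self.pages[url])


driver = CannedScraperDriver({"https://example.com": "# Example Domain"})
artifact = driver.scrape_url("https://example.com")
print(artifact.value)
```

Because callers depend only on scrape_url(), one driver can be swapped for another without changing the calling code; only the format of the returned text differs.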

Web Scraper Drivers

Proxy

The ProxyWebScraperDriver uses the requests library with a provided set of proxies to scrape web pages. Paid web scraping services such as ZenRows or ScraperAPI let you use their API through a set of proxies passed to requests.get().

Example using ProxyWebScraperDriver directly:

import os

from griptape.drivers.web_scraper.proxy import ProxyWebScraperDriver

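# ZenRows options, joined into a single string below.
query_params = [
    "markdown_response=true",
    "js_render=false",
    "premium_proxy=false",
]
# The API key sits in the username position of the proxy URL and the options in the password position.
proxy_url = f"http://{os.environ['ZENROWS_API_KEY']}:{'&'.join(query_params)}@proxy.zenrows.com:8001"

driver = ProxyWebScraperDriver(
    # Route both plain and TLS traffic through the same proxy.
    proxies={
        "http": proxy_url,
        "https": proxy_url,
    },
    # Extra request options; in requests, verify=False disables TLS certificate verification.
    params={"verify": False},
)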

driver.scrape_url("https://griptape.ai")
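
The proxy URL in this example packs credentials and options into the userinfo part of the URL. The sketch below is not part of Griptape: build_proxy_url is a hypothetical helper and PLACEHOLDER_KEY is a dummy value. It assembles the same shape of URL and then splits it with the standard library to show which piece lands where.

```python
from urllib.parse import urlsplit


def build_proxy_url(api_key: str, options: dict, host: str = "proxy.zenrows.com", port: int = 8001) -> str:
    """Hypothetical helper: put the key in the username slot and the options in the password slot."""
    option_string = "&".join(f"{name}={value}" for name, value in options.items())
    return f"http://{api_key}:{option_string}@{host}:{port}"


# PLACEHOLDER_KEY is a dummy value; a real key would come from the environment.
proxy_url = build_proxy_url(
    "PLACEHOLDER_KEY",
    {"markdown_response": "true", "js_render": "false"},
)

# The same proxy handles both schemes, matching the dict passed to the driver above.
proxies = {"http": proxy_url, "https": proxy_url}

# Split the URL to inspect its userinfo, host, and port components.
parts = urlsplit(proxy_url)
print(parts.username)
print(parts.password)
print(parts.hostname, parts.port)
```

Keeping the URL construction in one function makes it easier to change options per request without repeating the string formatting.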

Markdownify

Info

This driver requires the drivers-web-scraper-markdownify extra and the playwright browsers to be installed.

To install the playwright browsers, run playwright install in your terminal. If you are using uv, run uv run playwright install instead. The playwright command should already be installed as a dependency of the drivers-web-scraper-markdownify extra. For more details about playwright, see the playwright docs.

Note that if you skip installing the playwright browsers, you will see the following error when you run your code:

playwright._impl._errors.Error: Executable doesn't exist at ...
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝

The MarkdownifyWebScraperDriver outputs the scraped text in Markdown format. It uses playwright to render web pages, including dynamically loaded content, and a combination of beautifulsoup4 and markdownify to convert the rendered page to Markdown. It makes a best effort to keep the result concise yet readable by both humans and LLMs.
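
To make the HTML-to-Markdown step concrete, here is a minimal sketch that uses only Python's standard html.parser. It is a simplified stand-in for illustration, not the beautifulsoup4 and markdownify pipeline the driver relies on; the class name TinyMarkdownConverter and the sample HTML are hypothetical. It drops script, style, and nav content and keeps headings, paragraphs, and links.

```python
import re
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical converter handling only h1-h3, p, and a tags."""

    SKIPPED_TAGS = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []        # Markdown fragments in document order
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.link_href = None  # href of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in {"h1", "h2", "h3"}:
            # Start a new block with one '#' per heading level.
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.link_href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a" and not self.skip_depth:
            self.parts.append(f"]({self.link_href or ''})")
            self.link_href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs so only block markers contain newlines.
            self.parts.append(re.sub(r"\s+", " ", data))

    def to_markdown(self):
        blocks = (block.strip() for block in "".join(self.parts).split("\n\n"))
        return "\n\n".join(block for block in blocks if block)


sample_html = (
    "<html><head><style>p { color: red; }</style></head><body>"
    "<nav><a href='/home'>Home</a></nav>"
    "<h1>Title</h1>\n"
    "<p>Read the <a href='https://example.com'>docs</a> now.</p>"
    "<script>console.log('hi');</script></body></html>"
)

converter = TinyMarkdownConverter()
converter.feed(sample_html)
converter.close()
markdown = converter.to_markdown()
print(markdown)
```

Dropping navigation and script content before conversion is one way to keep the output concise; a real converter has to handle many more elements, such as lists, tables, and images.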

Example using MarkdownifyWebScraperDriver directly:

from griptape.drivers.web_scraper.markdownify import MarkdownifyWebScraperDriver

driver = MarkdownifyWebScraperDriver()

driver.scrape_url("https://griptape.ai")

Example of using MarkdownifyWebScraperDriver with an agent:

Trafilatura

Info

This driver requires the drivers-web-scraper-trafilatura extra.

The TrafilaturaWebScraperDriver scrapes text from a webpage using the Trafilatura library.

Example of using TrafilaturaWebScraperDriver directly:

from griptape.drivers.web_scraper.trafilatura import TrafilaturaWebScraperDriver

driver = TrafilaturaWebScraperDriver()

driver.scrape_url("https://griptape.ai")
