Web Scraper
Overview
Web Scraper Drivers can be used to scrape text from the web. They are used by WebLoader to provide its functionality. All Web Scraper Drivers implement the following methods:
scrape_url() scrapes text from a website and returns a TextArtifact. The format of the scraped text is determined by the Driver.
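The contract described above can be sketched without any network access. The classes below are hypothetical stand-ins written for illustration only, not Griptape's own base class or TextArtifact; the sketch assumes nothing beyond a scrape_url() method that takes a URL and returns an object holding the scraped text.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TextArtifactStandIn:
    """Hypothetical stand-in for a text artifact: it only carries the text."""

    value: str


class WebScraperDriverStandIn(ABC):
    """Hypothetical stand-in for the shared driver interface."""

    @abstractmethod
    def scrape_url(self, url: str) -> TextArtifactStandIn:
        """Scrape the page at `url` and return its text."""


class CannedScraperDriver(WebScraperDriverStandIn):
    """Returns canned text per URL, which is handy for offline tests."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages

    def scrape_url(self, url: str) -> TextArtifactStandIn:
        # The output format is up to the driver; this one returns text as-is.
        return TextArtifactStandIn(value=self.pages[url])


driver = CannedScraperDriver({"https://example.com": "# Example\nHello"})
artifact = driver.scrape_url("https://example.com")
print(artifact.value)
```

Because every driver exposes the same method, code that consumes scraped text does not need to know which driver produced it.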
Web Scraper Drivers
Proxy
The ProxyWebScraperDriver uses the requests library with a provided set of proxies to do web scraping. Paid web scraping services like ZenRows or ScraperAPI offer a way to use their API via a set of proxies passed to requests.get().
Example using ProxyWebScraperDriver directly:
import os

from griptape.drivers.web_scraper.proxy import ProxyWebScraperDriver

# Options are sent to the proxy service in the password part of the proxy URL.
query_params = [
    "markdown_response=true",
    "js_render=false",
    "premium_proxy=false",
]
proxy_url = f"http://{os.environ['ZENROWS_API_KEY']}:{'&'.join(query_params)}@proxy.zenrows.com:8001"

driver = ProxyWebScraperDriver(
    proxies={
        "http": proxy_url,
        "https": proxy_url,
    },
    # verify=False turns off TLS certificate verification in requests.
    params={"verify": False},
)

driver.scrape_url("https://griptape.ai")
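The proxy URL in the example above packs the API key into the username position and the joined options into the password position. The sketch below isolates that string construction so it can be checked without a real key. The helper names build_zenrows_proxy_url and proxies_from_env are hypothetical, introduced here for illustration; the host, port, and option names are taken from the example above.

```python
import os


def build_zenrows_proxy_url(api_key: str, query_params: list) -> str:
    """Build the proxy URL: API key as username, '&'-joined options as password."""
    options = "&".join(query_params)
    return f"http://{api_key}:{options}@proxy.zenrows.com:8001"


def proxies_from_env(env_var: str = "ZENROWS_API_KEY") -> dict:
    """Return a proxies mapping, failing early with a clear message if the key is unset."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before scraping.")
    proxy_url = build_zenrows_proxy_url(
        api_key,
        ["markdown_response=true", "js_render=false", "premium_proxy=false"],
    )
    # The same proxy URL is used for both schemes, as in the example above.
    return {"http": proxy_url, "https": proxy_url}


# A dummy key shows the resulting shape without exposing a real credential.
print(build_zenrows_proxy_url("DUMMY_KEY", ["markdown_response=true", "js_render=false"]))
```

Reading the key with os.environ.get() and raising a descriptive error is a design choice for this sketch; the original example indexes os.environ directly, which raises a KeyError when the variable is missing.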
Markdownify
Info
This driver requires the drivers-web-scraper-markdownify extra and the playwright browsers to be installed.
To install the playwright browsers, run playwright install in your terminal. If you are using uv, run uv run playwright install instead. The playwright command should already be installed as a dependency of the drivers-web-scraper-markdownify extra. For more details about playwright, see the playwright docs.
Note that if you skip installing the playwright browsers, you will see the following error when you run your code:
playwright._impl._errors.Error: Executable doesn't exist at ...
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝
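If this error is likely in your environment (for example in a fresh CI image), one option is to catch it and re-raise it with a shorter hint. The sketch below is an assumption-laden illustration: it matches on the message text shown above rather than on a Playwright exception class, and the helper names is_missing_browser_error and scrape_with_hint are hypothetical.

```python
# Substrings taken from the error message shown above.
MISSING_BROWSER_MARKERS = ("Executable doesn't exist", "playwright install")


def is_missing_browser_error(exc: BaseException) -> bool:
    """Return True if the exception text looks like the missing-browsers error."""
    message = str(exc)
    return any(marker in message for marker in MISSING_BROWSER_MARKERS)


def scrape_with_hint(scrape, url: str):
    """Call scrape(url); turn the missing-browsers error into a short, actionable one."""
    try:
        return scrape(url)
    except Exception as exc:
        if is_missing_browser_error(exc):
            raise RuntimeError(
                "Playwright browsers appear to be missing. "
                "Run `playwright install` (or `uv run playwright install`)."
            ) from exc
        # Anything else is unrelated and propagates unchanged.
        raise
```

Here scrape is any callable that takes a URL, such as a driver's scrape_url method. Matching on message text is brittle if the wording changes, so treat this as a convenience rather than a guarantee.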
The MarkdownifyWebScraperDriver outputs the scraped text in markdown format. It uses playwright to render web pages, including dynamically loaded content, and a combination of beautifulsoup4 and markdownify to convert the rendered page to markdown. The conversion is best effort and aims for output that is concise yet readable by both humans and LLMs.
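To give a feel for what an HTML-to-markdown step produces, here is a toy converter built only on the standard library. It is not the driver's implementation and does not use playwright, beautifulsoup4, or markdownify; it handles only a few tags (h1 to h3, p, a) and drops script and style content, which is roughly the kind of cleanup a markdown-oriented scraper performs. The names TinyMarkdownConverter and html_to_markdown are hypothetical.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Toy HTML-to-markdown converter for illustration only."""

    SKIPPED_TAGS = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._href = None  # href of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            # "h2" becomes a line starting with "## ".
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")
        elif tag == "p":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self.parts.append(f"]({self._href or ''})")
            self._href = None
        elif tag in self.HEADINGS or tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside skipped tags never reaches the output.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small HTML snippet to markdown, dropping blank lines."""
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    lines = [line.strip() for line in "".join(converter.parts).splitlines()]
    return "\n".join(line for line in lines if line)


sample = '<h1>Title</h1><script>track()</script><p>See <a href="/docs">Docs</a>.</p>'
print(html_to_markdown(sample))
```

Real pages need far more care (nested lists, tables, images, whitespace), which is why the driver relies on dedicated libraries for this step.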
Example using MarkdownifyWebScraperDriver directly:
from griptape.drivers.web_scraper.markdownify import MarkdownifyWebScraperDriver

driver = MarkdownifyWebScraperDriver()

driver.scrape_url("https://griptape.ai")
Example of using MarkdownifyWebScraperDriver with an agent:
from griptape.drivers.web_scraper.markdownify import MarkdownifyWebScraperDriver
from griptape.loaders import WebLoader
from griptape.structures import Agent
from griptape.tools import WebScraperTool

agent = Agent(
    tools=[
        WebScraperTool(
            web_loader=WebLoader(web_scraper_driver=MarkdownifyWebScraperDriver(timeout=1000)),
            off_prompt=False,
        ),
    ],
)

agent.run("List all email addresses on griptape.ai in a flat numbered markdown list.")
[02/27/25 20:24:25] INFO PromptTask c8efa9f397014ff480eccaeaa5791978 Input: List all email addresses on griptape.ai in a flat numbered markdown list. [02/27/25 20:24:39] INFO Subtask 0c9408d854f74c5b9482600a3225234e Actions: [ { "tag": "call_zcOOIWYhARbzGdl1LmcOMFZV", "name": "WebScraperTool", "path": "get_content", "input": { "values": { "url": "https://griptape.ai" } } } ] [02/27/25 20:24:42] INFO Subtask 0c9408d854f74c5b9482600a3225234e Response: We value your privacy This website or its third-party tools process personal data. You can opt out of the sale of your personal information by clicking on the “Do Not Sell or Share My Personal Information” link. Do Not Sell or Share My Personal Information Opt-out Preferences We use third-party cookies that help us analyze how you use this website, store your preferences, and provide the content and advertisements that are relevant to you. However, you can opt out of these cookies by checking "Do Not Sell or Share My Personal Information" and clicking the "Save My Preferences" button. Once you opt out, you can opt in again at any time by unchecking "Do Not Sell or Share My Personal Information" and clicking the "Save My Preferences" button. 
Do Not Sell or Share My Personal Information Cancel Save My Preferences Powered by * Products [AI Framework](/ai-framework)[AI Cloud](/cloud)[AI Cloud Pricing](/pricing-griptape-cloud) * [Docs](https://docs.griptape.ai/stable/) * How to [Samples](/sample-applications)[Learn](https://le arn.griptape.ai/latest/) * Resources [Blog](/blog)[Community](https://discord.gg/gript ape)[GitHub](https://github.com/griptape-ai/griptap e)[ComfyUI Nodes](/griptape-comfyui-nodes) * [About](/team) * [Request Demo](https://calendly.com/griptape-ai/product-over view-demo?month=2024-12) * [Log In](https://cloud.griptape.ai/) * [Start FREE](https://cloud.griptape.ai/account?signup=true ) Build Production Ready AI Agents ================================ Everything You Need to Build Reliable AI Agents Quickly and Securely, Using Your Data.Built for the Enterprise. Deploy Anywhere. --------------------------------------------------- --------------------------------------------------- -------------------------- [Start Building](https://cloud.griptape.ai/account?signup= true)[GitHub](https://github.com/griptape-ai/gripta pe) Build, Deploy, and Scale End-to-End Solutions, from LLM-Powered Data Prep and Retrieval to AI Agents, Pipelines, and Workflows. =================================================== =================================================== ========================= ### Griptape gives developers everything they need, from the open source AI framework ([Griptape AI Framework](/ai-framework)) to the execution runtime ([Griptape AI Cloud](/cloud)). ### Build &Secure 1. Build your business logic using predictable, programmable python - don’t gamble on prompting. 2. Turn any developer into an AI developer. 3. Off-Prompt™ gives you better security, performance, and lower costs. [Learn More](/ai-framework) ### Deploy &Scale 1. Deploy and run the ETL, RAG, and structures you developed. 2. Simple API abstractions. 3. Skip the infrastructure management. 4. 
Scale seamlessly so you can grow with your workload requirements. [Learn More](/cloud) ### Manage &Monitor 1. Monitor directly in Griptape Cloud or integrate with any third-party service. 2. Measure performance, reliability, and spending across the organization 3. Enforce policy for each user, structure, task, and query [Learn More](/cloud) Sample Applications Built on Griptape ===================================== [#### Transform Data with Find and Replace](https://github.com/griptape-ai/griptape-sa mple-structures/tree/main/griptape_find_replace_tra nsform) [#### Event Handler for LLM-Powered Slack Applications](https://github.com/griptape-ai/gripta pe-sample-structures/tree/main/griptape_slack_handl er) [#### Keep Private Data ‘Off Prompt’ with TaskMemory](https://github.com/griptape-ai/griptape -sample-structures/tree/main/griptape_off_prompt) [More Apps](/sample-applications) 🎢 Griptape AI Framework ======================= ### Griptape provides clean and clear abstractions for building Gen AI Agents, Systems of Agents, Pipelines, Workflows, and RAG implementations without having to spend weeks learning Gen AI nor need to ever learn Prompt Engineering. ### Build Build ETL pipelines to prep your data for secure LLM access. ### Compose Compose retrieval patterns that give fast, accurate, detailed information. ### Write Write agents, pipelines, and workflows (i.e. structures) to integrate your business logic. [Learn More](/ai-framework) 🌩️ Griptape AI Cloud ==================== ### Skip the infrastructure management. We’ll host and operate everything for you, from the data processing pipeline to the retrieval-ready database to the serverless application runtime. Simple to complex, one layer of the stack or the whole enchilada, we’ve got you covered. ### Automated Data Prep(ETL) Connect any data source and extract. Prep/transform it (extract, clean, chunk, embed, add metadata). Load it into a vector database index. 
### Retrieval as a Service(RAG) Generate answers, summaries, and details from your own data. Use ready-made retrieval patterns, customize them to fit your use case, or compose your own from scratch (Modular RAG). ### Structure Runtime(RUN) Build your own AI agents, pipelines, and workflows. Real-time interfaces, transactional processes, batch workloads. Plug them into client applications. [Learn More](/cloud) Learn more... ============= [#### Griptape Rules. No, not like that.](/blog/griptape-rules-no-not-like-that) [#### New Features in Griptape Framework 1.3](/blog/new-features-in-griptape-framework-1-3) [#### Announcing Griptape Framework v1.2 with Structured Output](/blog/announcing-griptape-framework-v1-2-wi th-structured-output) [AI Blog](/blog) Be part of our community. ========================= ### Join our social channels for the latest news, tutorials, and exclusive insights. Our partners ============ Resources [Docs](https://docs.griptape.ai/latest/)[Learning]( https://learn.griptape.ai/latest/)[Github](https:// github.com/griptape-ai/griptape)[Blog](/blog) Company [Brand Guidelines](/brand-guidelines)[Careers](/team)[Crun chbase](https://www.crunchbase.com/organization/gri ptape)[AI Glossary](/ai-glossary) Contact [hello@griptape.ai](mailto:hello@griptpae.ai)[press @griptape.ai](mailto:press@griptpae.ai)[careers@gri ptape.ai](mailto:careers@griptpae.ai) About [Privacy Policy](/privacy-policy)[Terms of Service](/terms-of-service) © Griptape, Inc [Sitemap](https://www.griptape.ai/sitemap.xml) [02/27/25 20:24:43] INFO PromptTask c8efa9f397014ff480eccaeaa5791978 Output: Here are the email addresses found on the Griptape.ai website: 1. [hello@griptape.ai](mailto:hello@griptape.ai) 2. [press@griptape.ai](mailto:press@griptape.ai) 3. [careers@griptape.ai](mailto:careers@griptape.ai)
Trafilatura
Info
This driver requires the drivers-web-scraper-trafilatura extra.
The TrafilaturaWebScraperDriver scrapes text from a webpage using the Trafilatura library.
Example of using TrafilaturaWebScraperDriver directly:
from griptape.drivers.web_scraper.trafilatura import TrafilaturaWebScraperDriver

driver = TrafilaturaWebScraperDriver()

driver.scrape_url("https://griptape.ai")