Experimental library for leveraging GPT for web scraping.
scrapeghost
scrapeghost is an experimental library for scraping websites using OpenAI's GPT API.
Source: https://github.com/jamesturk/scrapeghost
Documentation: https://jamesturk.github.io/scrapeghost/
Issues: https://github.com/jamesturk/scrapeghost/issues
Use at your own risk. This library makes expensive API calls (about $0.36 for a GPT-4 call on a moderately sized page). Cost estimates are based on the OpenAI pricing page and are not guaranteed to be accurate.
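The cost warning above comes down to simple arithmetic: tokens sent and received, multiplied by a per-token rate. The following is a minimal sketch of that calculation, not scrapeghost's own code. `TokenTally` is a hypothetical helper, and the rates are placeholders that should be checked against the OpenAI pricing page.

```python
from dataclasses import dataclass


@dataclass
class TokenTally:
    """Running totals of tokens sent and received (hypothetical helper)."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, prompt: int, completion: int) -> None:
        """Record the token counts reported for one API call."""
        self.prompt_tokens += prompt
        self.completion_tokens += completion

    def cost(self, prompt_rate: float, completion_rate: float) -> float:
        """Estimated cost in dollars, given rates per 1,000 tokens."""
        return (
            self.prompt_tokens / 1000 * prompt_rate
            + self.completion_tokens / 1000 * completion_rate
        )


# Placeholder rates in dollars per 1,000 tokens (assumptions, not current prices).
PROMPT_RATE = 0.03
COMPLETION_RATE = 0.06

tally = TokenTally()
tally.add(prompt=5000, completion=500)  # one call on a mid-sized page
tally.add(prompt=3000, completion=500)  # a second, smaller call
estimate = tally.cost(PROMPT_RATE, COMPLETION_RATE)
print(f"{tally.prompt_tokens} in, {tally.completion_tokens} out: ${estimate:.2f}")
```

Keeping the rates as parameters rather than constants inside the class makes it easy to re-run the estimate when prices or models change.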
Features
This library provides a convenient interface for web scraping with GPT.
Python-based schema definition - Define the shape of the data you want to extract as any Python object.
- Future versions will support optional validation that the response matches the schema.
Token Reduction - Fewer tokens means lower costs, faster responses, and staying under the API's token limits.
- Automatic HTML cleaning - Remove unnecessary HTML tags and attributes to reduce the size of the HTML sent to the model.
- CSS and XPath selectors - Pre-filter the HTML to send to the model by writing a single CSS or XPath selector.
- Auto-splitting - Optionally split the HTML into multiple calls to the model, each of a specified length.
Cost Controls - Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
- Future versions will allow setting a budget and stopping the scraper if the budget is exceeded.
Model Options - Works with GPT-3.5-Turbo or GPT-4, and allows passing additional parameters to the model to customize behavior.
- Support for automatic fallbacks (e.g., use the cheaper GPT-3.5-Turbo by default and fall back to GPT-4 if needed).
Error Handling & Logging - Detailed logging and error handling to help debug issues.
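To make the token-reduction ideas in the feature list concrete, here is a rough sketch of HTML cleaning followed by splitting, using only the Python standard library. It is not scrapeghost's implementation: `Cleaner`, `clean_html`, `split_pieces`, the `DROP_TAGS` and `KEEP_ATTRS` sets, and the use of character counts as a stand-in for token counts are all assumptions made for illustration. Selector-based pre-filtering is left out.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"script", "style", "svg", "noscript"}  # assumed: never useful to the model
KEEP_ATTRS = {"href"}  # assumed: links are worth keeping, other attributes are not


class Cleaner(HTMLParser):
    """Rebuild HTML without DROP_TAGS subtrees and without most attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # cleaned pieces: tags and text, in document order
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            kept = "".join(
                f' {k}="{escape(v or "")}"' for k, v in attrs if k in KEEP_ATTRS
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Drop whitespace-only text and anything inside a dropped subtree.
        if self.skip_depth == 0 and data.strip():
            self.out.append(escape(data.strip()))


def clean_html(html: str) -> list[str]:
    """Return the cleaned document as a list of tag and text pieces."""
    parser = Cleaner()
    parser.feed(html)
    parser.close()
    return parser.out


def split_pieces(pieces: list[str], max_chars: int) -> list[str]:
    """Greedily pack pieces into chunks of at most max_chars characters.

    A single piece longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


page = """
<html><head><style>p {color: red}</style><script>track()</script></head>
<body>
  <div class="row" id="r1"><a href="/a" class="x">Alice</a></div>
  <div class="row" id="r2"><a href="/b" class="x">Bob</a></div>
</body></html>
"""

pieces = clean_html(page)
cleaned = "".join(pieces)
chunks = split_pieces(pieces, max_chars=40)
print(len(page), "->", len(cleaned), "characters in", len(chunks), "chunks")
```

Splitting on tag boundaries keeps each chunk from cutting a tag in half, though a chunk can still begin or end partway through an element. A real splitter would likely count tokens with the model's tokenizer rather than characters.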
Hashes for scrapeghost-0.3.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ab0cfb544dd1b715a4a7982024242edb8191dd56bb14d7213546206848d44f64
MD5 | b463e039ebff65b2e67bf9c855b8de82
BLAKE2b-256 | c64147200196d77114fe61f60645fd7fc650b33a2b49a3d30da1e8fcc793fc73