| Name | Modified | Size |
|---|---|---|
| README.md | 2026-04-05 | 4.7 kB |
| Release v0.4.4 source code.tar.gz | 2026-04-05 | 2.3 MB |
| Release v0.4.4 source code.zip | 2026-04-05 | 2.4 MB |
A new update with important spider improvements and bug fixes 🎉
🚀 New Stuff and quality of life changes
- Added robots.txt compliance to the Spider framework with a new `robots_txt_obey` option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including `Disallow`, `Crawl-delay`, and `Request-rate` directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
- Added robots.txt cache pre-warming so all `start_urls` domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
- Added a new `robots_disallowed_count` stat to `CrawlStats` to track how many requests were blocked by robots.txt rules during a crawl.
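As a rough, self-contained sketch of the behavior described above — per-domain caching of parsed rules plus a blocked-request counter — here is a minimal illustration. It uses the stdlib `urllib.robotparser` instead of the library's actual parser, and the `RobotsCache` class and its methods are hypothetical names invented for this example; only `robots_disallowed_count` comes from the release notes.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Canned robots.txt so the sketch needs no network access; a real crawler
# would fetch https://{domain}/robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

class RobotsCache:
    """Hypothetical helper: one parsed robots.txt per domain, reused all crawl."""

    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self._parsers = {}              # domain -> parsed robots.txt
        self.robots_disallowed_count = 0

    def _parser_for(self, url):
        domain = urlparse(url).netloc
        if domain not in self._parsers:  # parse once, cache for the crawl
            rp = RobotFileParser()
            rp.parse(ROBOTS_TXT.splitlines())
            self._parsers[domain] = rp
        return self._parsers[domain]

    def allowed(self, url):
        rp = self._parser_for(url)
        if not rp.can_fetch(self.user_agent, url):
            self.robots_disallowed_count += 1   # track blocked requests
            return False
        return True

cache = RobotsCache()
print(cache.allowed("https://example.com/page"))       # allowed
print(cache.allowed("https://example.com/private/x"))  # blocked by Disallow
print(cache.robots_disallowed_count)
print(cache._parser_for("https://example.com/").crawl_delay("*"))
```

The same parser object also answers `Crawl-delay` queries, which is why caching it per domain avoids re-fetching robots.txt on every request.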
Check it out on the website.
🐛 Bug Fixes
- Fixed a critical MRO issue with `ProxyRotator` where the `_build_context_with_proxy` stub was shadowing the real implementation from child classes, causing proxy rotation to always raise `NotImplementedError` (Fixes #215). Thanks @yetval
- Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
- Fixed a missing type assertion in the static fetcher where `curl_cffi` could return `None` from `session.request()`, causing downstream errors.
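The MRO bug above follows a common pattern in multiple inheritance: an abstract-style stub placed before the concrete implementation in the base-class list wins method resolution. This is a hypothetical minimal reproduction of that pattern — the class names echo the release notes but are not the library's real hierarchy:

```python
class ContextBuilderStub:
    """Base that declares the hook as a not-implemented stub."""
    def _build_context_with_proxy(self, proxy):
        raise NotImplementedError

class PlaywrightContextBuilder:
    """Concrete implementation the rotator is supposed to inherit."""
    def _build_context_with_proxy(self, proxy):
        return f"context(proxy={proxy})"

# Buggy ordering: the stub comes first in the MRO, so every call
# resolves to it and raises NotImplementedError.
class BuggyRotator(ContextBuilderStub, PlaywrightContextBuilder):
    pass

# Fixed ordering: the concrete implementation precedes the stub.
class FixedRotator(PlaywrightContextBuilder, ContextBuilderStub):
    pass

print([c.__name__ for c in BuggyRotator.__mro__])  # stub before implementation
print(FixedRotator()._build_context_with_proxy("http://127.0.0.1:8080"))
```

Inspecting `__mro__` as above is usually the quickest way to confirm which definition a cooperative hierarchy will actually dispatch to.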
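The missing-assertion fix boils down to guarding an `Optional` return at the fetch boundary instead of letting `None` propagate downstream. A hedged sketch with a stand-in for the real `session.request()` (the `request`/`fetch` names here are hypothetical):

```python
from typing import Optional

# Stand-in for a request call that, per the notes, can return None
# in some failure modes instead of a response object.
def request(url: str) -> Optional[str]:
    return None if "unreachable" in url else f"<Response [{url}]>"

def fetch(url: str) -> str:
    response = request(url)
    # The fix: fail loudly here rather than passing None to downstream code.
    if response is None:
        raise RuntimeError(f"no response returned for {url}")
    return response

print(fetch("https://example.com"))
```

Raising at the boundary keeps the error next to its cause; the original bug surfaced later as confusing attribute errors on `None`.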
Other
- Updated dependencies, so expect the latest fingerprints and other upstream improvements.
- Added `protego` as a new dependency under the `fetchers` optional group for robots.txt parsing.
🙏 Special thanks to the community for all the continuous testing and feedback