Release v0.4.8

A big spider update that takes the crawling framework to the next level 🕷️

> [!NOTE]
> Follow us on X for daily tips and tricks

🚀 New Stuff and quality-of-life changes

  • Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` that pulls URLs out of a `Response`, with a lot of filtering controls (check the docs).

    ```python
    from scrapling.spiders import LinkExtractor

    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    ```
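    As a usage sketch: the extraction call below is an assumption for illustration (the method name isn't confirmed here, so check the docs for the exact API), but the flow is to fetch a page and hand the `Response` to the extractor:

    ```python
    from scrapling.fetchers import Fetcher
    from scrapling.spiders import LinkExtractor

    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    page = Fetcher.get("https://example.com/")

    # Hypothetical call: `extract_links` is an assumed method name for
    # pulling the matching URLs out of a Response.
    for url in extractor.extract_links(page):
        print(url)
    ```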

  • Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback (check the docs).

    ```python
    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

    class QuotesSpider(CrawlSpider):
        name = "blog"
        start_urls = ["https://quotes.toscrape.com/"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]

        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
    ```

  • Added a `SitemapSpider` template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and offers a lot of controls and options; discovered URLs are dispatched through the crawl rules, just as shown above for `CrawlSpider` (check the docs).

    ```python
    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]

        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
    ```

  • Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, making the adaptive feature far less likely to return unrelated elements. When nothing crosses the threshold, a warning now reports the top score it did see, so you can lower the percentage deliberately if needed, as in the sketch below.
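    A minimal sketch of tuning that threshold; it assumes the similarity cutoff is exposed as a `percentage` keyword alongside `adaptive` on the selection methods (the exact keyword may differ, so check the docs):

    ```python
    from scrapling.fetchers import Fetcher

    page = Fetcher.get("https://quotes.toscrape.com/")

    # Default: relocation rejects any candidate scoring below 40% similarity.
    quotes = page.css(".quote", adaptive=True)

    # If the warning reports a top score under 40, you can deliberately opt
    # in to a looser threshold (keyword name assumed for illustration):
    quotes = page.css(".quote", adaptive=True, percentage=30)
    ```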

  • Updated all browsers and fingerprints. Run `scrapling install --force` after upgrading to refresh them.

🐛 Bug Fixes

  • Fixed `Fetcher.configure(...)` settings not being applied to per-request calls; the same fix was applied to `AsyncFetcher`.
  • Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
  • Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.

Docs

  • Refreshed older code examples across the documentation to match the current version.
  • Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors
