Release v0.4.8

A big spider update that takes the crawling framework to the next level 🕷️

> [!NOTE]
> Follow us on X for daily tips and tricks

🚀 New Stuff and quality-of-life changes

  • Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` that pulls URLs out of a `Response`, with a lot of filtering controls (check the docs).

    ```python
    from scrapling.spiders import LinkExtractor

    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    ```
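    As a usage sketch: the extraction call below is an assumption for illustration (the method name isn't confirmed here, so check the docs for the exact API), but the flow is to fetch a page and hand the `Response` to the extractor:

    ```python
    from scrapling.fetchers import Fetcher
    from scrapling.spiders import LinkExtractor

    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
    page = Fetcher.get("https://example.com/")

    # Hypothetical call: `extract_links` is an assumed method name for
    # pulling the matching URLs out of a Response.
    for url in extractor.extract_links(page):
        print(url)
    ```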

  • Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback (check the docs).

    ```python
    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

    class QuotesSpider(CrawlSpider):
        name = "blog"
        start_urls = ["https://quotes.toscrape.com/"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]

        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
    ```

  • Added a `SitemapSpider` template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and offers a lot of controls and options; discovered URLs are dispatched through the crawl rules, just as shown above for `CrawlSpider` (check the docs).

    ```python
    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]

        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]

        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
    ```

  • Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, making the adaptive feature far less likely to return unrelated elements. When nothing crosses the threshold, a warning now reports the top score it did see, so you can lower the percentage deliberately if needed, as in the sketch below.
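    A minimal sketch of tuning that threshold; it assumes the similarity cutoff is exposed as a `percentage` keyword alongside `adaptive` on the selection methods (the exact keyword may differ, so check the docs):

    ```python
    from scrapling.fetchers import Fetcher

    page = Fetcher.get("https://quotes.toscrape.com/")

    # Default: relocation rejects any candidate scoring below 40% similarity.
    quotes = page.css(".quote", adaptive=True)

    # If the warning reports a top score under 40, you can deliberately opt
    # in to a looser threshold (keyword name assumed for illustration):
    quotes = page.css(".quote", adaptive=True, percentage=30)
    ```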

  • Updated all browsers and fingerprints. Run `scrapling install --force` after upgrading to refresh them.

🐛 Bug Fixes

  • Fixed `Fetcher.configure(...)` settings not being applied to per-request calls; the same fix was applied to `AsyncFetcher`.
  • Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
  • Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.

Docs

  • Refreshed older code examples across the documentation to match the current version.
  • Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors
