🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update
January 28, 2025 • 10 min read
Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
🎯 What's New at a Glance
- Adaptive Crawling: Your crawler now learns and adapts to website patterns
- Virtual Scroll Support: Complete content extraction from infinite scroll pages
- Link Preview with 3-Layer Scoring: Intelligent link analysis and prioritization
- Async URL Seeder: Discover thousands of URLs in seconds with intelligent filtering
- PDF Parsing: Extract data from PDF documents
- Performance Optimizations: Significant speed and memory improvements
🧠 Adaptive Crawling: Intelligence Through Pattern Learning
The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.
Technical Deep-Dive
The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores
:::python
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig,
    AdaptiveCrawler, AdaptiveConfig, CrawlState,
)

# Initialize with custom learning parameters
config = AdaptiveConfig(
    confidence_threshold=0.7,   # Min confidence to use learned patterns
    max_history=100,            # Remember last 100 crawls per domain
    learning_rate=0.2,          # How quickly to adapt to changes
    patterns_per_page=3,        # Patterns to learn per page type
    extraction_strategy='css'   # 'css' or 'xpath'
)

adaptive_crawler = AdaptiveCrawler(config)

# First crawl - crawler learns the structure
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://news.example.com/article/12345",
        config=CrawlerRunConfig(
            adaptive_config=config,
            extraction_hints={  # Optional hints to speed up learning
                "title": "article h1",
                "content": "article .body-content"
            }
        )
    )

    # Crawler identifies and stores patterns
    if result.success:
        state = adaptive_crawler.get_state("news.example.com")
        print(f"Learned {len(state.patterns)} patterns")
        print(f"Confidence: {state.avg_confidence:.2%}")

    # Subsequent crawls - use the learned patterns
    result2 = await crawler.arun(
        "https://news.example.com/article/67890",
        config=CrawlerRunConfig(adaptive_config=config)
    )
    # Automatically extracts using learned patterns!
Expected Real-World Impact:
- News Aggregation: Maintain 95%+ extraction accuracy even as news sites update their templates
- E-commerce Monitoring: Track product changes across hundreds of stores without constant maintenance
- Research Data Collection: Build robust academic datasets that survive website redesigns
- Reduced Maintenance: Cut selector update time by 80% for frequently-changing sites
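To make the learning parameters concrete, here is a minimal sketch (my illustration, not Crawl4AI's internal code) of how a pattern's confidence score could evolve: an exponential moving average driven by `learning_rate`, gated by `confidence_threshold` before a learned selector is trusted.
:::python
# Hypothetical illustration of the learning parameters above -- not Crawl4AI's internals.
def update_confidence(confidence: float, success: bool, learning_rate: float = 0.2) -> float:
    """Nudge a pattern's confidence toward 1.0 on success, toward 0.0 on failure."""
    target = 1.0 if success else 0.0
    return (1 - learning_rate) * confidence + learning_rate * target

confidence = 0.5                                   # start undecided about a learned selector
for outcome in [True, True, True, True, False, True]:
    confidence = update_confidence(confidence, outcome)

# Compare against confidence_threshold before trusting the learned pattern
print(f"confidence={confidence:.2f}, trust learned pattern: {confidence >= 0.7}")
# -> confidence=0.71, trust: True (one failure dents the score; recent successes recover it)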
🌊 Virtual Scroll: Complete Content Capture
The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.
Implementation Details
:::python
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, JsonCssExtractionStrategy,
)

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,               # Number of scrolls
    scroll_by="container_height",  # Smart scrolling by container size
    wait_after_scroll=1.0,         # Let content load
    capture_method="incremental",  # Capture new content on each scroll
    deduplicate=True               # Remove duplicate elements
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                 # Fixed pixel scrolling
    wait_after_scroll=1.5,         # Images need time
    stop_on_no_change=True         # Smart stopping
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",       # Viewport-based scrolling
    wait_after_scroll=0.5,
    wait_for_selector=".article-card",  # Wait for specific elements
    timeout=30000                  # Max 30 seconds total
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )

print(f"Captured {len(result.extracted_content['tweets'])} tweets")
Key Capabilities:
- DOM Recycling Awareness: Detects and handles virtual DOM element recycling
- Smart Scroll Physics: Three modes - container height, page height, or fixed pixels
- Content Preservation: Captures content before it's destroyed
- Intelligent Stopping: Stops when no new content appears
- Memory Efficient: Streams content instead of holding everything in memory
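Conceptually, this boils down to a scroll / capture / deduplicate loop that preserves each rendered element before the virtualized container recycles it. Here is a rough standalone sketch of that idea using Playwright directly (the selector, scroll distance, and wait times are placeholder assumptions); Crawl4AI's `VirtualScrollConfig` handles this for you inside the crawler.
:::python
# Standalone sketch of incremental capture with deduplication -- not Crawl4AI's code.
import hashlib
from playwright.async_api import async_playwright

async def capture_virtual_scroll(url: str, container_selector: str, scroll_count: int = 20):
    seen: set[str] = set()      # hashes of elements already captured (deduplication)
    captured: list[str] = []    # preserved HTML snippets, in the order they appeared
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        for _ in range(scroll_count):
            # Grab whatever the virtualized container currently renders.
            items = await page.eval_on_selector_all(
                f"{container_selector} > *", "els => els.map(e => e.outerHTML)"
            )
            new_items = 0
            for html in items:
                key = hashlib.sha1(html.encode()).hexdigest()
                if key not in seen:
                    seen.add(key)
                    captured.append(html)
                    new_items += 1
            if new_items == 0:                   # intelligent stopping: nothing new appeared
                break
            await page.mouse.wheel(0, 800)       # fixed-pixel scroll, like scroll_by=800
            await page.wait_for_timeout(1000)    # wait_after_scroll
        await browser.close()
    return captured

# Example: asyncio.run(capture_virtual_scroll("https://example.com/feed", ".article-feed"))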
Expected Real-World Impact:
- Social Media Analysis: Capture entire Twitter threads with hundreds of replies, not just top 10
- E-commerce Scraping: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- News Aggregation: Get all articles from modern news sites, not just above-the-fold content
- Research Applications: Complete data extraction from academic databases using virtual pagination
🔗 Link Preview: Intelligent Link Analysis and Scoring
The Problem: You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.
My Solution: I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
The Three-Layer Scoring System
:::python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LinkPreviewConfig

# Configure intelligent link analysis
link_config = LinkPreviewConfig(
    # What to analyze
    include_internal=True,
    include_external=True,
    max_links=100,                # Analyze top 100 links
    # Relevance scoring
    query="machine learning tutorials",  # Your interest
    score_threshold=0.3,          # Minimum relevance score
    # Performance
    concurrent_requests=10,       # Parallel processing
    timeout_per_link=5000,        # 5s per link
    # Advanced scoring weights
    scoring_weights={
        "intrinsic": 0.3,         # Link quality indicators
        "contextual": 0.5,        # Relevance to query
        "popularity": 0.2         # Link prominence
    }
)

# Use in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://tech-blog.example.com",
        config=CrawlerRunConfig(
            link_preview_config=link_config,
            score_links=True
        )
    )

# Access scored and sorted links
for link in result.links["internal"][:10]:  # Top 10 internal links
    print(f"Score: {link['total_score']:.3f}")
    print(f"  Intrinsic: {link['intrinsic_score']:.1f}/10")   # Position, attributes
    print(f"  Contextual: {link['contextual_score']:.1f}/1")  # Relevance to query
    print(f"  URL: {link['href']}")
    print(f"  Title: {link['head_data']['title']}")
    print(f"  Description: {link['head_data']['meta']['description'][:100]}...")
Scoring Components:
- Intrinsic Score (0-10): Based on link quality indicators
  - Position on page (navigation, content, footer)
  - Link attributes (rel, title, class names)
  - Anchor text quality and length
  - URL structure and depth
- Contextual Score (0-1): Relevance to your query
  - Semantic similarity using embeddings
  - Keyword matching in link text and title
  - Meta description analysis
  - Content preview scoring
- Total Score: Weighted combination for final ranking
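Putting the layers together, ranking comes down to a weighted sum. Here is a minimal sketch of that combination using the `scoring_weights` from the config above; normalizing the intrinsic score from its 0-10 scale and treating popularity as a 0-1 value are my assumptions, not the library's documented formula.
:::python
# Hypothetical illustration of the weighted combination -- exact formula is an assumption.
def total_score(intrinsic: float, contextual: float, popularity: float,
                weights: dict[str, float]) -> float:
    """Combine the per-layer scores into one ranking value in [0, 1]."""
    return (
        weights["intrinsic"] * (intrinsic / 10)   # intrinsic is reported on a 0-10 scale
        + weights["contextual"] * contextual      # contextual is already 0-1
        + weights["popularity"] * popularity      # link prominence, assumed 0-1
    )

weights = {"intrinsic": 0.3, "contextual": 0.5, "popularity": 0.2}
print(f"{total_score(intrinsic=7.5, contextual=0.82, popularity=0.4, weights=weights):.3f}")
# -> 0.715: a well-placed link that closely matches the query ranks near the top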
Expected Real-World Impact:
- Research Efficiency: Find relevant papers 10x faster by following only high-score links
- Competitive Analysis: Automatically identify important pages on competitor sites
- Content Discovery: Build topic-focused crawlers that stay on track
- SEO Audits: Identify and prioritize high-value internal linking opportunities
🎣 Async URL Seeder: Automated URL Discovery at Scale
The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
My Solution: I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
Technical Architecture
:::python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Basic discovery - find all product pages
seeder_config = SeedingConfig(
    # Discovery sources
    source="sitemap+cc",        # Sitemap + Common Crawl
    # Filtering
    pattern="*/product/*",      # URL pattern matching
    ignore_patterns=["*/reviews/*", "*/questions/*"],
    # Validation
    live_check=True,            # Verify URLs are alive
    max_urls=5000,              # Stop at 5000 URLs
    # Performance
    concurrency=100,            # Parallel requests
    hits_per_sec=10             # Rate limiting
)

seeder = AsyncUrlSeeder(seeder_config)
urls = await seeder.discover("https://shop.example.com")

# Advanced: Relevance-based discovery
research_config = SeedingConfig(
    source="crawl+sitemap",     # Deep crawl + sitemap
    pattern="*/blog/*",         # Blog posts only
    # Content relevance
    extract_head=True,          # Get meta tags
    query="quantum computing tutorials",
    scoring_method="bm25",      # Or "semantic" (coming soon)
    score_threshold=0.4,        # High relevance only
    # Smart filtering
    filter_nonsense_urls=True,  # Remove .xml, .txt, etc.
    min_content_length=500,     # Skip thin content
    force=True                  # Bypass cache
)

# Discover with progress tracking
discovered = []
async for batch in seeder.discover_iter("https://physics-blog.com", research_config):
    discovered.extend(batch)
    print(f"Found {len(discovered)} relevant URLs so far...")

# Results include scores and metadata
for url_data in discovered[:5]:
    print(f"URL: {url_data['url']}")
    print(f"Score: {url_data['score']:.3f}")
    print(f"Title: {url_data['title']}")
Discovery Methods:
- Sitemap Mining: Parses robots.txt and all linked sitemaps
- Common Crawl: Queries the Common Crawl index for historical URLs
- Intelligent Crawling: Follows links with smart depth control
- Pattern Analysis: Learns URL structures and generates variations
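To illustrate the sitemap-mining source, here is a simplified standalone sketch of the general technique (not AsyncUrlSeeder's actual implementation, and it skips details such as sitemap index files and gzipped sitemaps): read robots.txt for Sitemap: entries, then collect the <loc> URLs from each sitemap.
:::python
# Simplified sketch of sitemap mining -- not AsyncUrlSeeder's actual code.
import urllib.request
import xml.etree.ElementTree as ET

def discover_from_sitemaps(domain: str) -> list[str]:
    """Read robots.txt for Sitemap: lines, then collect <loc> URLs from each sitemap."""
    robots = urllib.request.urlopen(f"https://{domain}/robots.txt").read().decode()
    sitemap_urls = [line.split(":", 1)[1].strip()
                    for line in robots.splitlines()
                    if line.lower().startswith("sitemap:")]
    urls: list[str] = []
    for sitemap_url in sitemap_urls:
        tree = ET.fromstring(urllib.request.urlopen(sitemap_url).read())
        # Sitemap XML namespaces vary; match any element whose tag ends in "loc".
        urls.extend(el.text for el in tree.iter() if el.tag.endswith("loc") and el.text)
    return urls

# print(len(discover_from_sitemaps("example.com")), "URLs discovered")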
Expected Real-World Impact:
- Migration Projects: Discover 10,000+ URLs from legacy sites in under 60 seconds
- Market Research: Map entire competitor ecosystems automatically
- Academic Research: Build comprehensive datasets without manual URL collection
- SEO Audits: Find every indexable page with content scoring
- Content Archival: Ensure no content is left behind during site migrations
⚡ Performance Optimizations
This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
What We Optimized
:::python
# Before v0.7.0 (slow)
results = []
for url in urls:
    result = await crawler.arun(url)
    results.append(result)

# After v0.7.0 (fast)
# Automatic batching and connection pooling
results = await crawler.arun_batch(
    urls,
    config=CrawlerRunConfig(
        # New performance options
        batch_size=10,                    # Process 10 URLs concurrently
        reuse_browser=True,               # Keep browser warm
        eager_loading=False,              # Load only what's needed
        streaming_extraction=True,        # Stream large extractions
        # Optimized defaults
        wait_until="domcontentloaded",    # Faster than networkidle
        exclude_external_resources=True,  # Skip third-party assets
        block_ads=True                    # Ad blocking built-in
    )
)

# Memory-efficient streaming for large crawls
async for result in crawler.arun_stream(large_url_list):
    # Process results as they complete
    await process_result(result)
    # Memory is freed after each iteration
Performance Gains:
- Startup Time: 70% faster browser initialization
- Page Loading: 40% reduction with smart resource blocking
- Extraction: 3x faster with compiled CSS selectors
- Memory Usage: 60% reduction with streaming processing
- Concurrent Crawls: Handle 5x more parallel requests
📄 PDF Support
PDF extraction is now natively supported in Crawl4AI.
:::python
# Extract data from PDF documents
result = await crawler.arun(
    "https://example.com/report.pdf",
    config=CrawlerRunConfig(
        pdf_extraction=True,
        extraction_strategy=JsonCssExtractionStrategy({
            # Works on the converted PDF structure
            "title": {"selector": "h1", "type": "text"},
            "sections": {"selector": "h2", "type": "list"}
        })
    )
)
🔧 Important Changes
Breaking Changes
- link_extractor renamed to link_preview (better reflects functionality)
- Minimum Python version now 3.9
- CrawlerConfig split into CrawlerRunConfig and BrowserConfig
Migration Guide
:::python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig, CacheMode
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
🤖 Coming Soon: Intelligent Web Automation
I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:
- Crawl Agents: Autonomous crawlers that understand your goals and adapt their strategies
- Auto JS Generation: Automatic JavaScript code generation for complex interactions
- Smart Form Handling: Intelligent form detection and filling
- Context-Aware Actions: Crawlers that understand page context and make decisions
These features are under active development and will revolutionize how we approach web automation. Stay tuned!
🚀 Get Started
:::bash
pip install crawl4ai==0.7.0
Check out the updated documentation.
Questions? Issues? I'm always listening:
- GitHub: github.com/unclecode/crawl4ai
- Discord: discord.gg/crawl4ai
- Twitter: @unclecode
Happy crawling! 🕷️
P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.