Trafilatura

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast, it runs in production on millions of documents.

Features

Web crawling and text discovery
Seamless and parallel processing, online and offline
Robust and efficient extraction
Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml)
URLs, HTML files or parsed HTML trees usable as input
Efficient and polite processing of download queues

Project Samples

Project Activity

See All Activity >

License

GNU General Public License version 3.0 (GPLv3)

Follow Trafilatura

Trafilatura Web Site

Other Useful Business Software

Ship Agents Faster

Transform your applications and workflows into powerful agentic systems at global scale.

Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.

Get Started Free

Rate This Project

User Reviews

Be the first to post a review of Trafilatura!

Additional Project Details

Programming Language

Python

Related Categories

Python Web Scrapers

Registered

2023-04-12

Similar Business Software

Apify

Apify is a full-stack web scraping and automation platform helping anyone get value from the web. At its core is Apify Store, a marketplace with over 10,000 Actors where developers build, publish, and monetize automation tools. Actors are serverless cloud programs that extract data, automate...

See Software
Oxylabs

Oxylabs is a market leader in web intelligence with enterprise-grade, ethical, and compliant solutions. Its proxy infrastructure spans one of the largest global networks, offering residential, ISP, mobile, datacenter, & dedicated datacenter proxies, along with Web Unblocker – an AI-driven...

See Software
Gaffa

Gaffa is a web scraping and browser automation API that gives developers full, real-browser control with a single API call no headless browsers, proxies, CAPTCHA handling, or scaling infrastructure to manage. JavaScript rendering is handled by default, so pages load exactly as they would for a...

See Software
Bright Data

Bright Data is the world's #1 web data, proxies, & data scraping solutions platform. Fortune 500 companies, academic institutions and small businesses all rely on Bright Data's products, network and solutions to retrieve crucial public web data in the most efficient, reliable and flexible...

See Software
NetNut

Get ready to experience unmatched control and insights with our user-friendly dashboard tailored to your needs. Monitor and adjust your proxies with just a few clicks. Track your usage and performance with detailed statistics. Our team is devoted to providing customers with proxy solutions...

See Software
Price2Spy

Price2Spy makes automatic price adjustments easy to perform saving your most valuable resource - time, allowing your pricing team to focus on strategic planning and management. Since 2010, we have provided pricing intelligence for retailers and brands in 40+ countries, helping them smoothly...

See Software

Report inappropriate content

Trafilatura

Python & command-line tool to gather text on the Web

Get an email when there's a new version of Trafilatura

Features

Project Samples

Project Activity

Categories

License

Follow Trafilatura

User Reviews

Additional Project Details

Programming Language

Related Categories

Registered