unfluff

unfluff is a Node.js library that automatically extracts the main content from an HTML document, stripping away navigation bars, ads, footers, and other boilerplate to leave the body text, metadata (title, author, date), and other useful fields. It is aimed at content analysis, web scraping, dataset building, and repurposing article text for downstream processing such as machine learning or summarization. The API is simple: you pass in raw HTML and it returns a structured object with the extracted text and other fields. It can cache intermediate representations so that extracting several fields from the same document does not repeat work. Language support is best for English; the repository notes that languages such as Chinese, Arabic, and Korean may not be well supported. Because of its simplicity and focused purpose, it can serve as a building block in backend services or CLI tools.
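
The "raw HTML in, structured object out" flow described above can be sketched as a small wrapper. This is a minimal illustration, not part of unfluff itself: summarizeArticle is a hypothetical helper, and the extractor is passed in as a parameter on the assumption that it behaves like the function exported by the unfluff package (called with an HTML string and a language code, returning an object with fields such as title and text).

```javascript
// Sketch: reduce an unfluff-style extraction result to a small, predictable record.
// `extract` is ASSUMED to behave like the function exported by the unfluff package:
//   extract(html, language) -> object with fields such as `title` and `text`.
// `summarizeArticle` is a hypothetical helper name, not part of the library.
function summarizeArticle(extract, html, language = 'en') {
  const data = extract(html, language);
  // Guard against a missing text field so the helper never throws on sparse pages.
  const text = (data.text || '').trim();
  return {
    title: data.title || null,
    wordCount: text === '' ? 0 : text.split(/\s+/).length,
    preview: text.slice(0, 80),
  };
}

// Usage with the real library (requires `npm install unfluff`):
//   const unfluff = require('unfluff');
//   console.log(summarizeArticle(unfluff, rawHtml));
```

Passing the extractor in as an argument keeps the helper easy to test with a stub and makes it straightforward to swap in a different extraction library later.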

Features

Extracts main textual content (body) from an HTML document
Parses and returns metadata (title, author, date, detected language, etc.)
Caches intermediate representations for performance when extracting multiple fields
CLI and module support: can be installed globally or used programmatically
Suitable for dataset building, article scraping, and republishing workflows
Open source under the Apache-2.0 license, easy to integrate into Node.js stacks
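
The caching feature listed above can be illustrated with a generic memoization sketch. This is not unfluff's actual implementation; makeLazy and the toy field functions are hypothetical names used only to show the idea that each field is computed on first request and that shared intermediate work (here, a pretend parse step) runs only once.

```javascript
// Hypothetical helper: turns a map of field-computing functions into
// lazy accessors whose results are cached after the first call.
function makeLazy(computers) {
  const cache = new Map();
  const lazy = {};
  for (const [name, compute] of Object.entries(computers)) {
    lazy[name] = () => {
      if (!cache.has(name)) {
        // Each computer receives the lazy object so it can reuse other cached fields.
        cache.set(name, compute(lazy));
      }
      return cache.get(name);
    };
  }
  return lazy;
}

// Toy "document": the expensive parse step is shared by both fields.
let parseCount = 0;
const doc = makeLazy({
  parsed: () => {
    parseCount += 1; // counts how often the expensive step actually runs
    return { title: 'Example', body: 'Some body text' };
  },
  title: (self) => self.parsed().title,
  text: (self) => self.parsed().body,
});
```

Calling doc.title() and then doc.text() triggers the parse step a single time; repeated calls return the cached values. A caller that only needs the title never pays for fields it does not request.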

License

Apache License V2.0

Additional Project Details

Registered

3 days ago

Automatically extract body content (and other cool stuff) from HTML
