Newspaper4k is a Python library for extracting, processing, and analyzing news articles from websites. It is an actively maintained fork of the original newspaper3k library, which had stopped receiving updates, and aims to keep the ecosystem maintained while adding improvements and bug fixes. The library downloads web pages, extracts the main article content, and collects associated metadata such as titles, authors, images, and publication dates. Its natural language processing features can generate summaries and identify keywords from the extracted text. It supports both single-article extraction and full news site processing: users can build a source that represents an entire publication and iterate through its articles. Compatibility with the original project is maintained so that existing newspaper3k code continues to work with minimal changes.
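The single-article workflow described above (download, parse, optional NLP) might look roughly like the sketch below. It assumes the `Article` class and its `download()`, `parse()`, and `nlp()` methods carried over from newspaper3k. The helper names `article_to_record` and `fetch_article` are hypothetical, introduced only for illustration, and the URL in the usage comment is a placeholder.

```python
def article_to_record(article):
    """Copy the commonly used fields of a parsed article into a plain dict.

    Hypothetical helper: it only reads attributes, so it works with any
    object exposing the same field names.
    """
    publish_date = getattr(article, "publish_date", None)
    if hasattr(publish_date, "isoformat"):
        # Dates are typically datetime objects; store them as ISO strings.
        publish_date = publish_date.isoformat()
    return {
        "title": getattr(article, "title", ""),
        "authors": list(getattr(article, "authors", None) or []),
        "publish_date": publish_date,
        "top_image": getattr(article, "top_image", ""),
        "text": getattr(article, "text", ""),
        # keywords and summary are only populated after nlp() has run.
        "keywords": list(getattr(article, "keywords", None) or []),
        "summary": getattr(article, "summary", ""),
    }


def fetch_article(url, run_nlp=False):
    """Download and parse one article, optionally running the NLP step."""
    # Imported lazily so the helper above stays usable without the library.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the HTML
    article.parse()     # extract text and metadata
    if run_nlp:
        # Assumption: the NLTK tokenizer data the NLP step relies on
        # has already been installed.
        article.nlp()
    return article_to_record(article)


# Example usage (placeholder URL, requires network access):
# record = fetch_article("https://example.com/some-news-story", run_nlp=True)
# print(record["title"], record["keywords"])
```

Returning a plain dict keeps the extracted fields easy to serialize or store, independent of the library's own article object.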
Features
- Extracts full article text, titles, authors, and publication dates
- Retrieves images, videos, and other metadata from news pages
- Supports keyword extraction and article summarization using NLP
- Processes individual articles or entire news websites as sources
- Provides a Python API and command-line interface for scraping tasks
- Maintains compatibility with the original newspaper3k library
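
Processing an entire news website as a source could be sketched as follows. This assumes `newspaper.build()` returns a source object with an `articles` list whose items carry a `url` attribute, as in newspaper3k. The helper names `collect_article_urls` and `list_site_articles` are hypothetical, and the site URL in the usage comment is a placeholder.

```python
def collect_article_urls(source, limit=10):
    """Return up to `limit` unique article URLs from a built source.

    Hypothetical helper: it only needs an object with an `articles`
    iterable whose items have a `url` attribute.
    """
    seen = set()
    urls = []
    for article in source.articles:
        if len(urls) >= limit:
            break
        url = article.url
        if url in seen:
            continue  # skip duplicate links to the same article
        seen.add(url)
        urls.append(url)
    return urls


def list_site_articles(site_url, limit=10):
    """Build a source for a news site and list some of its article URLs."""
    # Imported lazily so the helper above stays usable without the library.
    import newspaper

    source = newspaper.build(site_url)  # crawls the site for article links
    return collect_article_urls(source, limit=limit)


# Example usage (placeholder URL, requires network access):
# for url in list_site_articles("https://example.com", limit=5):
#     print(url)
```

Each URL collected this way can then be downloaded and parsed individually as a single article.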