html source extractor free download

Showing 1819 open source projects for "html source extractor"

1

html-metadata

MetaData html scraper and parser for Node.js (supports Promises

The aim of this library is to be a comprehensive source for extracting all HTML-embedded metadata. Currently, it supports Schema.org microdata using a third-party library, a native BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS implementation, and some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag, or meta description tags).

Downloads: 0 This Week

Last Update: 2025-04-30
See Project
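To illustrate the kind of extraction html-metadata describes (general tags such as the title and meta description, plus Open Graph and Twitter tags), here is a minimal Python sketch using only the standard library. It is not the html-metadata API, which is a Node.js library. The names `MetadataParser` and `extract_metadata` and the shape of the returned dictionary are assumptions made for this example.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses "property"; most other schemes use "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content is not None:
                self.meta[key] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Hypothetical helper: group metadata by scheme (illustrative layout)."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    general = {"title": parser.title.strip()}
    if "description" in parser.meta:
        general["description"] = parser.meta["description"]
    # Strip the scheme prefix ("og:" or "twitter:") from each key.
    open_graph = {k[3:]: v for k, v in parser.meta.items() if k.startswith("og:")}
    twitter = {k[8:]: v for k, v in parser.meta.items() if k.startswith("twitter:")}
    return {"general": general, "openGraph": open_graph, "twitter": twitter}
```

Schemes such as Dublin Core or Highwire Press could probably be handled the same way by filtering on their own key prefixes. JSON-LD and microdata need a different approach because they are not stored in meta tags.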
2

Trafilatura

Python & command-line tool to gather text on the Web

...Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). ...

Downloads: 0 This Week

Last Update: 2024-12-03
See Project
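The idea of dropping recurring elements (headers, footers, navigation) while keeping the main text, as described for Trafilatura, can be sketched in Python with the standard library. This is a deliberately naive tag-based filter, not Trafilatura's actual algorithm or API. The names `NOISE_TAGS`, `MainTextParser`, and `extract_main_text` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements whose text is treated as recurring noise in this sketch.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Keeps text that is not nested inside any of the NOISE_TAGS."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and skip empty text nodes.
        text = " ".join(data.split())
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Hypothetical helper: return the remaining text, one chunk per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A filter like this leans toward recall: it keeps everything outside the listed tags, so sidebars or link lists marked up with plain div elements would still get through.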
3

jsoup

Java library for working with real-world HTML

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree. The parser will make...

Downloads: 0 This Week

Last Update: 2026-01-01
See Project
4

crawley

The unix-way web crawler

Crawls web pages and prints any link it can find. Fast HTML SAX parser (powered by golang.org/x/net/html). Small (below 1500 SLOC), idiomatic, 100% test-covered codebase. Grabs most useful resource URLs (pics, videos, audios, forms, etc.). Found URLs are streamed to stdout and guaranteed to be unique (with fragments omitted). Scan depth (limited by starting host and path, 0 by default) can be configured. Can crawl rules and sitemaps from robots.txt. Brute mode: scan HTML comments for...

Downloads: 10 This Week

Last Update: 2026-03-14
See Project
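The link-collection behavior described for crawley (resource URLs gathered from a page, made unique, with fragments omitted) can be sketched in Python with the standard library. crawley itself is written in Go, so this is an illustration of the technique rather than its implementation. The names `URL_ATTRS`, `LinkParser`, and `extract_links`, and the choice of attributes, are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Attributes that commonly carry resource URLs (an illustrative subset).
URL_ATTRS = {"href", "src", "action", "poster"}


class LinkParser(HTMLParser):
    """Collects absolute, fragment-free URLs in first-seen order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.seen = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative URLs, then drop the "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute and absolute not in self.seen:
                    self.seen.add(absolute)
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Hypothetical helper: return unique links found in one page."""
    parser = LinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A crawler would call something like `extract_links` on each fetched page and queue any result that stays within the starting host and path, up to the configured depth.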
5

Happy DOM

Happy DOM is a JavaScript implementation of a web browser

Happy DOM is a JavaScript implementation of a web browser without its graphical user interface. It includes many web standards from WHATWG DOM and HTML. The goal of Happy DOM is to emulate enough of a web browser to be useful for testing, scraping web sites, and server-side rendering. Happy DOM focuses heavily on performance and can be used as an alternative to JSDOM. Happy DOM now supports Declarative Shadow DOM which can be used for server-side rendering of web components. This package...

Downloads: 6 This Week

Last Update: 5 days ago
See Project
6

Maxun

Small event-delegation library for decoupling event binding and handli

Maxun named JsAction by Google serves as a lightweight event delegation library built in JavaScript. It allows developers to separate the logic of binding events from the code that handles those events, helping to keep DOM event wiring cleaner and more maintainable. It is archived and marked as read-only, indicating that the project is no longer actively maintained or intended for production use. The README states that ongoing development has migrated into a larger framework under the...

Downloads: 47 This Week

Last Update: 2026-03-10
See Project
7

geckodriver

WebDriver for Firefox

geckodriver is an implementation of WebDriver, and WebDriver can be used for widely different purposes. How you invoke geckodriver largely depends on your use case. If you are using geckodriver through Selenium, you must ensure that you have version 3.11 or greater. Because geckodriver implements the W3C WebDriver standard and not the same Selenium wire protocol older drivers are using, you may experience incompatibilities and migration problems when making the switch from FirefoxDriver to...

Downloads: 75 This Week

Last Update: 2025-02-25
See Project
8

single-file-cli

CLI tool to save complete web pages as single self-contained HTML file

SingleFile CLI is an open source command-line tool designed to save complete web pages as a single self-contained HTML file. It captures the rendered page in a headless browser and embeds all required resources directly into the output document, including stylesheets, scripts, images, and fonts. By consolidating every dependency into one file, it allows users to preserve a faithful copy of a web page that can be viewed offline without requiring external assets.

Downloads: 9 This Week

Last Update: 2026-03-11
See Project
9

Eruda

Console for mobile browsers

With Eruda you can display JavaScript logs, check DOM state, show request status, show localStorage and cookie information, show the URL and user agent info, include frequently used snippets, and view HTML, JS, and CSS sources. The JavaScript file is quite large (about 100 KB gzipped) and is therefore not suitable for inclusion in mobile pages. It's recommended to make sure Eruda is loaded only when eruda is set to true. During initialization, a configuration object can be passed in. If the container element is not set, Eruda appends an element directly under the html root element. ...

Downloads: 21 This Week

Last Update: 2025-06-15
See Project
10

Spider

High-performance Rust web crawler and scraper for large-scale data

Spider is a high-performance web crawler and web scraping library written in Rust that enables developers to crawl and index websites efficiently. It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large...

Downloads: 6 This Week

Last Update: 2026-03-31
See Project
11

Beacon

Open-source Content Management System (CMS)

Beacon is a modern open-source CMS built with Phoenix LiveView, offering fast server-rendered HTML for content-heavy pages with LiveView interactivity layered on top. It includes runtime content reloading, SEO-optimized rendering, and an admin interface (Beacon LiveAdmin) for managing pages, layouts, and components in a cluster-friendly setup. Developed by DockYard, Beacon aims to deliver high performance content sites fully within the Elixir ecosystem.

Downloads: 5 This Week

Last Update: 2025-07-10
See Project
12

miniblink49

Lighter, faster browser kernel of blink to integrate HTML UI in apps

miniblink is an open source, single-file browser control based on Chromium, and currently the smallest known one. Through its exported pure C interface, a browser control can be created in a few lines of code. It can be called from C++, C#, Delphi, and other languages. It embeds Node.js and supports Electron (with Nodejs, can run...

Downloads: 9 This Week

Last Update: 2025-12-13
See Project
13

openvpn-monitor

openvpn-monitor is a web based OpenVPN monitor

openvpn-monitor is a simple Python program that generates HTML displaying the status of an OpenVPN server, including all current connections. It uses the OpenVPN management console. It typically runs on the same host as the OpenVPN server, but it does not need to. It is a web-based monitor that shows current connection information, such as users, location, and data transferred.

Downloads: 1 This Week

Last Update: 2025-01-02
See Project
14

goclone

Fast CLI tool for cloning entire websites for local browsing offline

goclone is a command-line utility designed to download and mirror complete websites to a local directory for offline access. It retrieves HTML pages, stylesheets, JavaScript files, images, and other assets from a target site and stores them on the user’s computer. It preserves the original site’s structure by maintaining relative links between pages, allowing the mirrored copy to function similarly to the live version when opened locally. Once a site has been cloned, users can browse the...

Downloads: 21 This Week

Last Update: 2026-03-11
See Project
15

Winter

Free, open-source, self-hosted CMS platform based on the Laravel PHP

...Build intricate websites with little more than HTML, CSS and JavaScript through a beautiful, user-friendly and easy backend.

Downloads: 7 This Week

Last Update: 2026-02-20
See Project
16

Browserless

The headless Chrome/Chromium driver on top of Puppeteer

Browserless is an open-source headless browser automation library and service built on top of Puppeteer that simplifies the process of running and scaling Chromium-based browser tasks in production environments. It provides a high-level API for interacting with headless Chrome, allowing developers to perform operations such as generating PDFs, capturing screenshots, extracting text or HTML, and automating web navigation.

Downloads: 6 This Week

Last Update: 4 days ago
See Project
17

newpipeextractor

Library for extracting streaming site data without official APIs

...It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. Each supported service is implemented through its own extractor components that conform to a common interface, enabling consistent access to data across different platforms.

Downloads: 3 This Week

Last Update: 2026-04-10
See Project
18

Lighthouse

Automated auditing, performance metrics, & best practices for the web

Lighthouse is an open-source, automated tool that analyzes and audits web apps and web pages in order to improve their quality. Lighthouse collects modern performance metrics and insights on developer best practices; auditing for performance, accessibility, SEO and more. After auditing it produces a report either in JSON or HTML. Included in the report is a reference doc that explains the importance of the audit and how to fix the problem areas, which you can use to improve the web app or web page. ...

Downloads: 6 This Week

Last Update: 2026-04-07
See Project
19

reveal.js

The HTML Presentation Framework

reveal.js is a framework for creating beautiful interactive presentations using HTML. It comes with a wide range of features, including nested slides, auto-sliding, touch navigation, Markdown support, PDF export, speaker notes, theming and more. It also comes with a JavaScript API that allows you to control various other options, and a list of plugins that can be used to extend reveal.js further. reveal.js currently offers full support for any recently released version of the following...

Downloads: 9 This Week

Last Update: 2026-04-11
See Project
20

WebMagic

A scalable web crawler framework for Java

WebMagic is a simple but scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler. WebMagic has a simple core with high flexibility and a simple API for HTML extraction. It also provides annotations with POJOs to customize a crawler, and no configuration is needed. Some other...

Downloads: 4 This Week

Last Update: 2025-02-10
See Project
21

Jackett

API Support for your favorite torrent trackers

Jackett works as a proxy server: it translates queries from apps (Sonarr, Radarr, SickRage, CouchPotato, Mylar3, Lidarr, DuckieTV, qBittorrent, Nefarious, etc.) into tracker-site-specific HTTP queries, parses the HTML or JSON response, and then sends results back to the requesting software. This allows for getting recent uploads (like RSS) and performing searches. Jackett is a single repository of maintained indexer scraping & translation logic, removing the burden from other apps. Trackers...

Downloads: 195 This Week

Last Update: 1 day ago
See Project
22

Dillo

Dillo, a multi-platform graphical web browser

...Its goals include enabling web access on old or constrained hardware, using slow or unreliable network connections, minimizing dependencies, and avoiding many of the complexities and overheads of modern full-featured browsers. It omits many modern features (notably JavaScript), instead focusing on rendering HTML (mostly older/standardized subsets), images, and some CSS, while keeping the codebase small. It is free/open source under GPL-3.0.

Downloads: 17 This Week

Last Update: 2025-09-11
See Project
23

ScrapeGraphAI

Python scraper based on AI

ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.

Downloads: 14 This Week

Last Update: 2026-04-09
See Project
24

TiddlyWiki

A self-contained JavaScript wiki for the browser, Node.js, AWS Lambda

TiddlyWiki5 is a mature, self-contained open-source personal wiki application and non-linear notebook implemented entirely in JavaScript that runs in the browser or a Node.js environment, letting users create, organize, and interlink small pieces of content called tiddlers without the need for a server backend or traditional hierarchical pages. Its entire application — including content, interface, and logic — can live in a single HTML file that users open and edit directly in a web browser, making it portable, offline-capable, and easy to share or archive without dependencies. ...

Downloads: 2 This Week

Last Update: 2026-01-25
See Project
25

TinyStatus

Tiny status page generated by a Python script

TinyStatus is a simple, customizable status page generator that allows you to monitor the status of various services and display them on a clean, responsive web page.

Downloads: 5 This Week

Last Update: 2024-10-24
See Project