Showing 149 open source projects for "html source extractor"

  • 1
    Trafilatura

    Python & command-line tool to gather text on the Web

    ...Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). ...
    Downloads: 0 This Week
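As a rough illustration of the noise-reduction idea described above, the following standard-library sketch drops text nested inside elements that usually hold recurring page furniture. It is not Trafilatura's algorithm or API; the tag list and the names `NOISE_TAGS`, `MainTextParser`, and `extract_main_text` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags treated as recurring noise (an assumption for this sketch; real
# extractors rely on much richer heuristics than a fixed tag list).
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text outside noise elements, one chunk per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A fixed tag list favors precision over recall: anything the page author wrapped in an unexpected element is kept, and valid text inside an `aside` is lost.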
  • 2
    openvpn-monitor

    openvpn-monitor is a web based OpenVPN monitor

    openvpn-monitor is a simple Python program that generates HTML displaying the status of an OpenVPN server, including all current connections. It uses the OpenVPN management console. It typically runs on the same host as the OpenVPN server, although it does not need to. It shows current connection information such as users, location, and data transferred.
    Downloads: 4 This Week
  • 3
    TinyStatus

    Tiny status page generated by a Python script

    TinyStatus is a simple, customizable status page generator that allows you to monitor the status of various services and display them on a clean, responsive web page.
    Downloads: 8 This Week
  • 4
    ScrapeGraphAI

    Python scraper based on AI

    ScrapeGraphAI extracts content from websites and local documents using an LLM. It is a web scraping Python library that uses an LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.
    Downloads: 14 This Week
  • 5
    mitmproxy

    A free and open source interactive HTTPS proxy

    mitmproxy is an open source, interactive SSL/TLS-capable intercepting HTTP proxy, with a console interface fit for HTTP/1, HTTP/2, and WebSockets. It's the ideal tool for penetration testers and software developers, able to debug, test, and make privacy measurements. It can intercept, inspect, modify and replay web traffic, and can even prettify and decode a variety of message types. Its web-based interface mitmweb gives you an experience similar to Chrome's DevTools, with the addition of...
    Downloads: 20 This Week
  • 6
    Scrapy

    A fast, high-level web crawling and web scraping framework

    Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...
    Downloads: 28 This Week
  • 7
    Offline HTML Viewer

    Fast offline HTML viewer for opening local HTML files on Windows

    Echo Offline Viewer is a lightweight offline HTML viewer for Windows designed to open and browse local HTML files without requiring an internet connection or a full web browser. The application provides a simple and clean interface for viewing offline web pages, making it useful for archived websites, documentation, and locally stored HTML content. Key advantages include fast startup, minimal system resource usage, and a fully read-only design that ensures files and system data remain...
    Downloads: 59 This Week
  • 8
    changedetection.io

    The best free open source website change detection and restock service

    Loved by smart shoppers, data journalists, research engineers, data scientists, security researchers, and more. From simply monitoring website pages that have a change (such as watching prices, and restocking notifications), to deep inspection such as PDF text support, JSON and XML monitoring, and extensive text triggers. Monitor out-of-stock products and get alerts when those products are back in stock, get restock alerts via Discord, Slack, email, and many other platforms. Using the...
    Downloads: 12 This Week
  • 9
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website....
    Downloads: 8 This Week
  • 10
    Pelican

    Static site generator that supports Markdown and reST syntax

    Pelican is a static site generator that requires no database or server-side logic. Features include: chronological content (e.g., articles, blog posts) as well as static pages; integration with external services; site themes (created using Jinja2 templates); publication of articles in multiple languages; generation of Atom and RSS feeds; code syntax highlighting via Pygments; import of existing content from WordPress, Dotclear, or RSS feeds; fast rebuild times due to content caching and selective output writing...
    Downloads: 0 This Week
  • 11
    Twisted

    Event-driven networking engine written in Python

    Twisted is an event-based framework for internet applications, supporting Python 3.6+. It includes modules for many different purposes. Twisted supports all major system event loops, select (all platforms), poll (most POSIX platforms), epoll (Linux), kqueue (FreeBSD, macOS), IOCP (Windows), and various GUI event loops (GTK+2/3, Qt, wxWidgets). Third-party reactors can plug into Twisted, and provide support for additional event loops.
    Downloads: 1 This Week
  • 12
    FinalRecon

    All-in-one Python web reconnaissance tool for fast target analysis

    FinalRecon is an all-in-one web reconnaissance tool written in Python that helps security professionals gather information about a target website quickly and efficiently. It combines multiple reconnaissance techniques into a single command-line utility so users do not need to run several separate tools to collect similar data. FinalRecon focuses on providing a fast overview of a web target while maintaining accuracy in the collected results. It includes modules for gathering server...
    Downloads: 3 This Week
  • 13
    Grab Framework Project

    Web Scraping Framework

    Grab is a Python framework for building web scrapers. With Grab you can build web scrapers of various complexity, from simple 5-line scripts to complex asynchronous website crawlers processing millions of web pages. Grab provides an API for performing network requests and for handling the received content, e.g. interacting with the DOM tree of the HTML document. The single request/response API allows you to build a network request, perform it, and work with the received content. The API is...
    Downloads: 1 This Week
  • 14
    CyberScraper 2077

    A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

    CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.
    Downloads: 0 This Week
  • 15
    LinkChecker

    Check links in web documents or full websites

    LinkChecker is a free, GPL-licensed website validator. LinkChecker checks links in web documents or full websites. It runs on Python 3 systems, requiring Python 3.8 or later. The version in the pip repository may be old; to find out how to get the latest code, plus platform-specific information and other advice, see doc/install.txt in the source code archive. If you do not want to install any additional libraries/dependencies, you can use the Docker image which is published on GitHub...
    Downloads: 0 This Week
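The first step of any link check is gathering the links in a document and resolving them against the page URL. The sketch below shows that step only, using the Python standard library; it does not use LinkChecker's code or API, performs no network requests, and the names `LinkCollector` and `collect_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Gather href/src values and resolve them against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references and drop the #fragment part,
                # since fragments do not change which resource is fetched.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def collect_links(html, base_url):
    """Return unique absolute URLs in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Each returned URL could then be requested and its status recorded; that part is left out here because it depends on network access and on how politely the checker should behave.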
  • 16
    Whakerexa

    A minimalist and lightweight web kit for accessible contents

    Whakerexa provides a lightweight, modular set of CSS and JavaScript tools for building accessible, consistent, and customizable web interfaces. It is intended to make creating accessible web content as simple as possible, and to minimize the use of CSS classes in order to keep HTML code readable. It was designed to be easily customizable, allowing users to adjust properties such as fonts, colors, borders, etc. Most of the properties are stored in variables, which makes it possible to redefine them and obtain a custom style. It can be combined with WhakerPy, an open source library to create dynamic HTML content: <https://whakerpy.sf.net>. ...
    Downloads: 6 This Week
  • 17
    WhakerKit

    A seamless toolkit to manage dynamic websites and shared documents

    WhakerKit is a versatile toolkit for building websites with both static and dynamic HTML pages, developed by Brigitte Bigi, CNRS. WhakerKit offers seamless management of public and authenticated access, and simplifies document sharing for collaborative environments. It is based on the following technologies: python >= 3.9; (optional) PyJWT and ldap3 for authentication (install with pip); WhakerPy >= 1.3: <https://whakerpy.sourceforge.io> (install with pip); Whakerexa >=...
    Downloads: 0 This Week
  • 18
    Buku

    Powerful command-line bookmark manager. Your mini web!

    buku is a powerful bookmark manager written in Python3 and SQLite3. buku fetches the title of a bookmarked web page and stores it along with any additional comments and tags. You can use your favourite editor to compose and update bookmarks. With multiple search options, including regex and a deep scan mode (particularly for URLs), it can find any bookmark instantly. Multiple search results can be opened in the browser at once. Though a terminal utility, it's possible to add bookmarks...
    Downloads: 28 This Week
  • 19

    Bookmark to url

    Save browser bookmarks to single shortcut file

    Python program that converts all browser bookmarks into individual Windows .url shortcuts, organized in their respective folders and with corresponding icons. Simply export your bookmarks to an HTML file and open it. Known bugs: link names containing special characters can cause problems with icons, and links in the main root are copied into the last folder.
    Downloads: 2 This Week
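The conversion described above can be sketched with the standard library as follows. This is an independent illustration, not the project's code: it ignores folders and icons, the names `BookmarkParser`, `safe_filename`, and `to_shortcuts` are hypothetical, and replacing characters that Windows forbids in file names is only one possible way to handle the special-character problem.

```python
import re
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs from an exported bookmarks HTML file."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # URL of the <a> element currently open, if any
        self._title = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._title).strip(), self._href))
            self._href = None


def safe_filename(title):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", title).strip()
    return (cleaned or "bookmark") + ".url"


def to_shortcuts(bookmarks_html):
    """Map each shortcut file name to the text of its .url file."""
    parser = BookmarkParser()
    parser.feed(bookmarks_html)
    parser.close()
    return {
        safe_filename(title): "[InternetShortcut]\nURL=" + url + "\n"
        for title, url in parser.links
    }
```

Because the result is keyed by file name, two bookmarks with the same title would overwrite each other; a fuller version would add a numeric suffix.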
  • 20
    dirhunt

    Web crawler that finds hidden web directories without brute force

    Dirhunt is an open source security tool designed to discover web directories and analyze website structures without relying on brute-force techniques. Instead of sending large numbers of guess-based requests, it operates as a specialized crawler that intelligently explores websites to identify accessible or hidden directories. Dirhunt can detect directories that expose “Index Of” listings, which may reveal files and other resources that were not intended to be publicly visible. ...
    Downloads: 8 This Week
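One simple way to recognize an "Index Of" listing is to inspect the page title. The check below is a minimal sketch of that heuristic and makes no claim about how Dirhunt itself detects listings; the function name `looks_like_index_listing` is hypothetical.

```python
import re

# Matches the contents of the first <title> element, case-insensitively.
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_index_listing(html):
    """Return True when the page title starts with 'Index of'."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return False
    return match.group(1).strip().lower().startswith("index of")
```

Servers can customize or disable the default listing title, so a title check alone will miss some listings.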
  • 21
    AutoScraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

    This project is made for automatic web scraping to make scraping easy. It gets a URL or the HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, URL or any HTML tag value of that page. It learns the scraping rules and returns similar elements. Then you can use this learned object with new URLs to get similar content or the exact same element of those new pages.
    Downloads: 2 This Week
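The learn-then-reuse workflow described above can be illustrated with a deliberately naive standard-library sketch. It is not AutoScraper's algorithm or API: here a "rule" is just the (tag, class attribute) pair of the element that directly contains a sample text, and the names `TextIndexer`, `learn_rules`, and `apply_rules` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextIndexer(HTMLParser):
    """Record ((tag, class), text) for each piece of text in the page."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_elements.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_elements:
            self.records.append((self.open_elements[-1], text))


def index_text(html):
    indexer = TextIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.records


def learn_rules(html, samples):
    """Return the (tag, class) pairs of elements holding a sample text."""
    wanted = set(samples)
    return {rule for rule, text in index_text(html) if text in wanted}


def apply_rules(html, rules):
    """Return every text whose enclosing element matches a learned rule."""
    return [text for rule, text in index_text(html) if rule in rules]
```

Rules learned on one page can be applied to another page with the same markup, which mirrors reusing a learned object on new URLs; fetching those URLs is left out of the sketch.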
  • 22
    Amazon Braket Ocean Plugin

    A Python plugin for using Ocean with Amazon Braket

    The Amazon Braket Ocean Plugin is an open-source library in Python that provides a framework that you can use to interact with Ocean tools on top of Amazon Braket. Before you begin working with the Amazon Braket Ocean Plugin, make sure that you've installed or configured the following prerequisites. Download and install Python 3.7.2 or greater from Python.org. If you are using Windows, choose Add Python to environment variables before you begin the installation. Make sure that your AWS...
    Downloads: 0 This Week
  • 23
    HTTP Prompt

    An interactive command-line HTTP and API testing client

    HTTP Prompt is an interactive command-line HTTP client featuring autocomplete and syntax highlighting. You'll never have to memorize whole commands and HTTP headers, thanks to autocomplete with fuzzy matching. Improve readability by rendering JSON, HTML and commands with 27 built-in color themes, borrowed from Pygments. Designed to work with and built on top of HTTPie, HTTP Prompt makes a perfect companion for HTTPie. Cookie-based authentication made easy as incoming cookies are...
    Downloads: 0 This Week
  • 24
    ruia

    Async Python framework for fast and flexible web scraping spiders

    Ruia is an asynchronous web scraping micro-framework built for Python that focuses on simplicity, speed, and flexibility when creating web crawlers. Ruia is powered by Python’s asyncio library along with aiohttp, enabling developers to perform concurrent network requests efficiently and scrape data from websites with minimal overhead. Ruia follows a “write less, run faster” philosophy, emphasizing concise code and streamlined spider development. It provides a structured approach to building...
    Downloads: 8 This Week
  • 25
    Requests-HTML

    Pythonic HTML Parsing for Humans

    This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. When using this library you automatically get full JavaScript support (using Chromium, thanks to pyppeteer). CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery). XPath Selectors, for the faint of heart. Mocked user-agent (like a real web browser). Automatic following of redirects. Connection pooling and cookie persistence. The Requests experience you know and love, with magical parsing...
    Downloads: 1 This Week