Search Results for "extract website content"

Showing 965 open source projects for "extract website content"

1

Website Stalker

Track changes on websites via git

This tool checks all the websites listed in its config. When a change is detected, the new site is added to a git commit. It can then be inspected via normal git tooling. The config describes a list of sites. Each site has a URL. Additionally, each site can have editors which are used before saving the file. Each editor manipulates the content of the URL.

Downloads: 0 This Week

Last Update: 2025-10-03
See Project
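
The Website Stalker workflow described above (fetch a site, run its editors over the content, keep the result only when it changed) could be sketched roughly as follows. This is a minimal illustration and not the tool's actual implementation: the `Editor` type, `strip_blank_lines`, and `save_if_changed` are hypothetical names, and the git commit step is left out.

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumption: an "editor" is modeled as a function from content to content.
Editor = Callable[[str], str]


def strip_blank_lines(text: str) -> str:
    """Hypothetical example editor: drop empty or whitespace-only lines."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def save_if_changed(content: str, editors: Iterable[Editor], target: Path) -> bool:
    """Apply the editors in order, then write `target` only if the result differs.

    Returns True when the file was written, i.e. a change was detected.
    """
    for edit in editors:
        content = edit(content)
    if target.exists() and target.read_text(encoding="utf-8") == content:
        return False
    target.write_text(content, encoding="utf-8")
    return True
```

In a setup like this, a `True` result would be the point at which the saved file is added to a git commit, so that the history can be inspected with normal git tooling.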
2

ScrapeGraphAI

Python scraper based on AI

Extract content from websites and local documents using LLMs. ScrapeGraphAI is a Python web scraping library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.

Downloads: 5 This Week

Last Update: 6 days ago
See Project
3

PdfPig

Read and extract text and other content from PDFs in C#

This project allows users to read and extract text and other content from PDF files. In addition, the library can be used to create simple PDF documents containing text and geometrical shapes.

Downloads: 4 This Week

Last Update: 2026-03-22
See Project
4

watercrawl

AI-ready web crawler that extracts and structures website content

WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. ...

Downloads: 0 This Week

Last Update: 2026-03-11
See Project
5

Java Design Patterns Website

Next generation website for Java Design Patterns

This project is the VuePress-powered web front end for the well-known “java-design-patterns” project, which documents classic and modern design patterns in Java. Its purpose is to present that large body of content as a browsable, fast, static website with organized navigation, search, and pattern categorization. Instead of reading patterns only in GitHub markdown, users can consume them in a more pleasant documentation format with sections, sidebars, and themed pages. The site structure makes it easier to discover related patterns, see intent and applicability, and jump between creational, structural, and behavioral groups. ...

Downloads: 1 This Week

Last Update: 2026-02-08
See Project
6

Article Extractor

Extract the main article from a given URL with Node.js

A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.

Downloads: 0 This Week

Last Update: 2025-09-04
See Project
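
To make the idea of "removing clutter like ads and navigation elements" concrete, here is a small sketch in Python using only the standard library. It is a simple tag-based heuristic chosen for illustration, not the algorithm Article Extractor uses; `SKIPPED_TAGS`, `MainTextParser`, and `extract_main_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is treated as clutter.
SKIPPED_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # how many skipped elements are currently open
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the remaining text chunks joined by single spaces."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Real article extractors typically go further than this, for example by scoring blocks on text density, but the skip-by-tag pass shows the basic shape of the cleanup.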
7

Epublifier

Converts some webnovels to epub format

A tool to convert website-based books or lists of pages to ePub format to read on your eReader/Kindle/etc.

Downloads: 0 This Week

Last Update: 2025-07-31
See Project
8

Reader LLM

Convert any URL to an LLM-friendly input with a simple prefix

...In addition to converting individual pages, the service can perform web searches and return relevant content that can be ingested directly by AI systems. The tool relies on specialized models and parsing techniques to handle complex HTML structures and extract meaningful content while preserving important context.

Downloads: 4 This Week

Last Update: 2026-04-16
See Project
9

Microweber

Drag and Drop Website Builder and CMS with E-commerce

...Its revolutionary Real-Time Text Writing & Editing feature alongside its Drag and Drop feature means the user experience is significantly improved, and users are able to achieve a visually appealing website and easier content management with a lot less time and effort.

1 Review

Downloads: 4 This Week

Last Update: 2025-08-14
See Project
10

TikTok MCP

Model Context Protocol (MCP) with TikTok integration

The TikTok MCP integrates TikTok access into AI applications like Claude AI via TikNeuron. It enables analysis and interaction with TikTok content to determine virality factors and extract video content.

Downloads: 0 This Week

Last Update: 2026-02-27
See Project
11

Spider

High-performance Rust web crawler and scraper for large-scale data

...It focuses on speed, concurrency, and reliability by using asynchronous and multi-threaded processing to handle large volumes of web pages. It can rapidly crawl websites to collect links, retrieve page content, and extract structured information from HTML documents. Spider can operate concurrently across many pages, allowing it to gather large datasets in a short period of time. Spider also provides mechanisms for subscribing to crawl events so developers can process page data such as URLs, status codes, or HTML content as it is discovered. ...

Downloads: 2 This Week

Last Update: 2026-03-31
See Project
12

PyPDF

A pure-python PDF library capable of splitting, merging, cropping

pypdf is a pure Python library for working with PDF files, allowing developers to split, merge, rotate, encrypt, and extract content from PDFs. It’s an actively maintained fork of PyPDF2, improving performance, compatibility, and support for modern PDF standards. Suitable for both automation scripts and full-featured applications, pypdf handles PDFs without requiring external dependencies.

Downloads: 5 This Week

Last Update: 2026-04-15
See Project
13

Keiyoushi Extensions

Extension repository for Mihon and variants

...It includes a wide variety of sources covering different languages, regions, and content types, making it highly adaptable for global audiences. The repository is actively maintained by contributors who update sources to keep up with website changes and ensure continued functionality. It also enforces certain standards for extension development, promoting consistency and reliability across different providers.

Downloads: 47 This Week

Last Update: 5 days ago
See Project
14

ldif-extract

Extract selected entries from LDIF files, like grep

ldif-extract is a small grep-like tool to extract and convert data from LDIF files. It can be used standalone or in a pipe together with other tools such as ldapsearch.

Downloads: 0 This Week

Last Update: 2026-01-10
See Project
15

LLM Scraper

Extract structured data from webpages using LLM-powered scraping

LLM Scraper is a TypeScript library designed to extract structured data from webpages using large language models. Instead of relying on fragile HTML selectors or manual parsing rules, the tool interprets webpage content with language models and converts it into structured data according to a defined schema. Developers can specify the data structure using tools such as Zod or JSON Schema, enabling the model to extract relevant information directly into typed objects. ...

Downloads: 1 This Week

Last Update: 3 days ago
See Project
16

Skyvern

Automate browser-based workflows with LLMs and Computer Vision

Skyvern uses a combination of computer vision and AI to understand content on a webpage, making it adaptable to any website. Skyvern takes instructions in natural language, allowing it to execute complex objectives with simple commands. Skyvern is an API-first product. Workflows execute in the cloud, allowing it to run hundreds of workflows at the same time. Skyvern's AI decisions come with built-in explanations, providing clear summaries and justifications for every action. ...

Downloads: 1 This Week

Last Update: 2026-04-14
See Project
17

Dendrite

Tools to build web AI agents that can authenticate

Dendrite Python SDK is a toolkit for building web AI agents that can authenticate, interact with, and extract data from any website, facilitating web automation tasks.

Downloads: 0 This Week

Last Update: 2025-01-29
See Project
18

Astro

The web framework for content-driven websites

Astro powers the world's fastest marketing sites, blogs, e-commerce websites, and more. Astro improves website performance by rendering components on the server, sending lightweight HTML to the browser with zero unnecessary JavaScript overhead. Astro was designed to work with your content, no matter where it lives. Load data from your file system, external API, or your favorite CMS. Extend Astro with your favorite tools. Bring your own JavaScript UI components, CSS libraries, themes, integrations, and more. ...

Downloads: 3 This Week

Last Update: 2 days ago
See Project
19

Laravel Sharp

Laravel 10+ Content management framework

Sharp is a content management framework: a toolset that helps build a CMS section in a website, with some rules in mind. The public website should not have any knowledge of the CMS; the CMS is a part of the system, not the center of it. In fact, removing the CMS should not have any effect on the project. Content administrators should work with their data and terminology, not CMS terms.

Downloads: 2 This Week

Last Update: 2026-04-01
See Project
20

Hugo

The world’s fastest framework for building websites

Hugo is a popular, fast and flexible open source static site generator written in Go. It’s designed for speed and flexibility, while also being very easy to use. Hugo has the amazing ability to render a typical, moderately-sized website in just a fraction of a second. It takes Hugo around 1 millisecond to render each piece of content, making it the fastest tool of its kind. Hugo supports unlimited content types, and ships with pre-made templates to make SEO, analytics and many other functions quick and easy to achieve. It’s got a robust theming system, capable of producing even the most complex websites. ...

Downloads: 6 This Week

Last Update: 2026-04-08
See Project
21

sm-website

Content management system. Goal is to make it possible to rapidly finn

Downloads: 0 This Week

Last Update: 2026-03-16
See Project
22

Planet

Build and host decentralized blogs and websites on your Mac

...Did you know that you can use an Ethereum Name (ENS) to set up a website? It's true! You can use the Content Hash field, just like you would use an A or CNAME record for a traditional domain name. The standard for this is EIP-1577, and the Content Hash field can accept a few different values.

Downloads: 0 This Week

Last Update: 2026-03-30
See Project
23

model-viewer

Easily display interactive 3D models on the web and in AR

Easily display interactive 3D models on the web & in AR. Use our Editor to test your 3D models and download a starter website. Generate your own 3D Twitter card for any website. :focus-visible is an as-yet unimplemented web platform feature that enables content authors to style a component on the condition that it received focus in such a way that suggests the focus state should be visibly evident. The :focus-visible capability has not been implemented in any stable browsers yet. ...

Downloads: 8 This Week

Last Update: 2026-03-11
See Project
24

Chandra

OCR model for complex documents with layout-aware structured outputs

Chandra is an advanced OCR model designed to extract and structure information from complex documents such as tables, forms, handwritten notes, and mathematical content. It focuses on preserving full document layout, meaning that extracted text is accompanied by positional metadata like bounding boxes for each element. Chandra supports multiple output formats including Markdown, HTML, and JSON, making it suitable for downstream processing and integration into data pipelines. ...

Downloads: 2 This Week

Last Update: 2026-03-18
See Project
25

Link-Preview-JS

Extract web links information: title, description, images, videos, etc

link-preview-js is a lightweight TypeScript library that extracts metadata from URLs or HTML content to generate rich link previews. By parsing Open Graph tags and other metadata, it retrieves information such as titles, descriptions, images, and videos. Designed primarily for Node.js and mobile environments, it facilitates the creation of link previews similar to those found on social media platforms.

Downloads: 0 This Week

Last Update: 2025-11-20
See Project
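
The Open Graph parsing mentioned in the link-preview-js entry can be illustrated with a short Python sketch built on the standard library's `html.parser`. This is not the library's own code or API (link-preview-js is a TypeScript library); `OpenGraphParser` and `extract_open_graph` are hypothetical names, and only `<meta property="og:...">` tags are handled.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> tags into a dict."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property, minus the "og:" prefix.
            self.metadata.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> dict[str, str]:
    """Return Open Graph properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

A preview card could then be assembled from keys such as `title`, `description`, and `image`, with fallbacks to other metadata when a page has no Open Graph tags.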