Search Results for "data extraction tool for website"

Showing 238 open source projects for "data extraction tool for website"

1

AI-Crawler

Crawl a website starting from a URL, find relevant pages

AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.

Downloads: 2 This Week

Last Update: 2026-04-02
2

watercrawl

AI-ready web crawler that extracts and structures website content

WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. ...

Downloads: 0 This Week

Last Update: 2026-03-11
3

Firecrawl

Turn entire websites into LLM-ready markdown or structured data

Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each.

Downloads: 3 This Week

Last Update: 2026-04-10
4

WebHarvest - web data extraction tool

Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text-processing technologies to easily extract useful data from arbitrary web pages.

14 Reviews

Downloads: 4 This Week

Last Update: 2025-10-27
5

Parsera

Lightweight library for scraping websites with LLMs

Scrape data from any website with only a link and column descriptions. Parsera is a tool designed to scrape web content, specifically handling poorly structured or messy websites.

Downloads: 0 This Week

Last Update: 2025-10-08
6

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 0 This Week

Last Update: 2025-06-09
7

pdfly

CLI tool to extract (meta)data from PDF and manipulate PDF files

A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.

Downloads: 0 This Week

Last Update: 2025-10-13
8

webclaw

Fast, local-first web content extraction for LLMs

webclaw is a high-performance web content extraction tool designed specifically for AI agents and large language models, focusing on delivering clean, structured data instead of raw HTML. It is built in Rust and operates without a headless browser, using advanced techniques such as TLS fingerprinting to bypass common scraping barriers and mimic real browser behavior. The tool addresses a major inefficiency in AI workflows by removing irrelevant elements like navigation menus, ads, and scripts, significantly reducing token usage when feeding data into language models. ...

Downloads: 0 This Week

Last Update: 2 days ago
9

OpenDataLoader PDF

PDF Parser for AI-ready data. Automate PDF accessibility

OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing...

Downloads: 15 This Week

Last Update: 5 days ago
10

MinerU

A high-quality tool for converting PDF to Markdown and JSON

MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.

Downloads: 10 This Week

Last Update: 2 days ago
11

ContextGem

ContextGem: Effortless LLM extraction from documents

ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.

Downloads: 0 This Week

Last Update: 2026-03-16
12

Browser Agent

AI Browser Agent is an advanced Browser AI tool

Browser Agent Python is an AI-powered browser automation tool developed by Oxylabs that enables users to control web interactions through natural language instead of traditional scripting. The tool allows developers to describe tasks in plain English, such as navigating pages, clicking elements, filling forms, and extracting data, and the system executes those actions as if a human were interacting with the browser. It is designed to simplify complex automation workflows by removing the need...

Downloads: 1 This Week

Last Update: 2026-04-02
13

ClatScope

OSINT reconnaissance tool for IP, domain, email, and username lookups

ClatScope is a Python-based OSINT (open source intelligence) utility designed to gather and analyze publicly available information from multiple online sources. It is primarily aimed at investigators, cybersecurity professionals, penetration testers, and researchers who need a centralized platform for reconnaissance tasks. It integrates with numerous public APIs and internet services to retrieve detailed data about IP addresses, domains, email addresses, phone numbers, usernames, and other...

Downloads: 16 This Week

Last Update: 2026-03-07
14

py-pdf-parser

A Python tool to help extract information from structured PDFs

py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.

Downloads: 1 This Week

Last Update: 2025-04-28
15

mtail

Extract internal monitoring data from application logs

Extract internal monitoring data from application logs for collection in a time-series database. mtail is a tool for extracting metrics from application logs to be exported into a time-series database or time-series calculator for alerting and dashboarding. It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, so that system operators do not need to patch those applications to instrument them or write custom extraction code for every such application. ...

Downloads: 6 This Week

Last Update: 2024-08-08
16

Trafilatura

Python & command-line tool to gather text on the Web

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text-processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Downloads: 1 This Week

Last Update: 2024-12-03
17

Web RPA

Web Robotic Process Automation Tool

Web RPA is a browser automation framework designed to perform robotic process automation tasks directly within web environments. It enables users to automate repetitive actions such as form filling, data extraction, and workflow execution through programmable scripts. The system focuses on simplicity and flexibility, allowing automation without requiring complex infrastructure. It supports interaction with web elements, navigation flows, and dynamic content handling, making it suitable for scraping and automation scenarios. WebRPA can be integrated into larger systems or used as a standalone tool for automating browser-based operations. ...

Downloads: 0 This Week

Last Update: 2 days ago
18

audioFlux

A library for audio and music analysis, feature extraction

A library for audio and music analysis and feature extraction. Can be used for deep learning, pattern recognition, signal processing, bioinformatics, statistics, finance, etc. audioFlux is a deep learning tool library for audio and music analysis and feature extraction. It supports dozens of time-frequency analysis transformation methods and hundreds of corresponding time-domain and frequency-domain feature combinations. It can be provided to deep learning networks for training and is used to...

Downloads: 0 This Week

Last Update: 2024-08-09
19

JS Analyzer

Burp Suite extension for JavaScript static analysis

JS Analyzer is a powerful static analysis tool implemented as a Burp Suite extension that helps security researchers and web developers automatically uncover important artifacts in JavaScript files during web application testing. It parses JavaScript responses intercepted by Burp Suite and intelligently extracts API endpoints, full URLs (including cloud storage links), secrets like API keys or tokens, and email addresses while filtering out noise from irrelevant code patterns. The extension...

Downloads: 3 This Week

Last Update: 2026-01-28
20

news-please

Python tool for crawling and extracting structured data from news sites

news-please is an open source news crawler and information extraction tool designed to collect and structure articles from online news websites. It provides an integrated pipeline that crawls news sites, retrieves article pages, and extracts structured information such as headlines, authors, publication dates, and article text. news-please can recursively follow internal links and read RSS feeds to gather both recent and archived articles from a news outlet when given only the root URL of a site. ...

Downloads: 3 This Week

Last Update: 2 days ago
21

Unredact

A simple tool for reading in poorly redacted documents

Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...

Downloads: 21 This Week

Last Update: 2026-02-03
22

web-access

Skill for installing full networking capabilities for Claude Code

web-access is a tool designed to give AI agents structured and controlled access to web content, enabling them to retrieve, navigate, and process information from online sources in real time. It abstracts common web interactions such as page loading, data extraction, and navigation into reusable functions that can be invoked by agents. The system emphasizes safety and control, likely including mechanisms to manage permissions, rate limits, and content filtering. ...

Downloads: 0 This Week

Last Update: 3 days ago
23

FinalRecon

All-in-one Python web reconnaissance tool for fast target analysis

FinalRecon is an all-in-one web reconnaissance tool written in Python that helps security professionals gather information about a target website quickly and efficiently. It combines multiple reconnaissance techniques into a single command-line utility so users do not need to run several separate tools to collect similar data. FinalRecon focuses on providing a fast overview of a web target while maintaining accuracy in the collected results.

Downloads: 3 This Week

Last Update: 3 hours ago
24

Article Extractor

Extract the main article from a given URL with Node.js

A Node.js library for extracting main content from web articles, removing unnecessary clutter like ads and navigation elements.

Downloads: 0 This Week

Last Update: 2025-09-04
25

Reader LLM

Convert any URL to an LLM-friendly input with a simple prefix

Reader LLM is an open-source tool designed to convert web content into formats that are easier for large language models to process. The system works by transforming a webpage into a clean text or Markdown representation that removes unnecessary formatting and highlights the core information within the page. Developers can use a simple URL prefix to retrieve a version of a webpage that has been optimized for machine consumption, making it suitable for use in AI agents or retrieval-augmented...

Downloads: 2 This Week

Last Update: 2026-04-16

© 2026 Slashdot Media. All Rights Reserved.
