
Search Results for "extraction" - Page 3


Showing 219 open source projects for "extraction"

1

X-osint

Open source OSINT tool for gathering data on emails, phones, and IPs

X-osint is an open source intelligence framework designed to collect and analyze publicly available information from multiple sources. It focuses on gathering useful and credible data about entities such as phone numbers, email addresses, and IP addresses using a range of automated OSINT techniques. It provides investigators and researchers with a centralized interface for running information-gathering tasks that would normally require multiple separate tools. X-osint can also perform...

Downloads: 41 This Week

Last Update: 1 day ago
2

HeartMuLa

A Family of Open Sourced Music Foundation Models

...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.

Downloads: 17 This Week

Last Update: 3 days ago
3

Unredact

A simple tool for reading in poorly redacted documents

Unredact is a specialized tool that attempts to reconstruct redacted or obscured text in images, PDFs, or screenshots using a combination of image processing and generative AI inference to suggest plausible completions of blurred, black-boxed, or jumbled content. Unlike traditional optical character recognition (OCR), which only reads visible text, Unredact focuses on inferring missing content where redaction has been applied by analyzing surrounding context, font characteristics, and...

Downloads: 16 This Week

Last Update: 2026-02-03
4

yt-dlp

A youtube-dl fork with additional features and fixes

yt-dlp is a youtube-dl fork based on the now inactive youtube-dlc. The main focus of this project is adding new features and patches while also keeping up to date with the original project.

Downloads: 618 This Week

Last Update: 2026-03-17
5

TextFSM

Python module for parsing semi-structured text into Python tables

...By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template defines states, transitions, and regex patterns that determine how to interpret text line by line, enabling precise extraction of key information from varied sources. This modular approach allows users to maintain a library of templates for different data formats, improving automation in network operations and system administration. Widely used in network automation workflows, TextFSM integrates easily with Python scripts, making it an essential tool for engineers.

Downloads: 0 This Week

Last Update: 2025-10-11
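
The TextFSM description above mentions states, transitions, and regex patterns applied line by line. The following is a minimal standard-library sketch of that general idea only; it does not use TextFSM's template syntax or API, and the rule table `RULES`, the function `parse`, and the sample input are all hypothetical names invented for illustration.

```python
import re

# Hypothetical rule table (not TextFSM syntax):
# state name -> list of (compiled pattern, next state, emit a record?).
RULES = {
    "Start": [
        (re.compile(r"^Interface (?P<name>\S+)"), "Interface", False),
    ],
    "Interface": [
        (re.compile(r"^\s+MTU (?P<mtu>\d+)"), "Start", True),
    ],
}


def parse(text):
    """Walk the text line by line, following the rules of the current state."""
    state = "Start"
    current = {}   # fields captured for the record being built
    rows = []      # completed records
    for line in text.splitlines():
        for pattern, next_state, emit in RULES[state]:
            match = pattern.search(line)
            if match is None:
                continue
            # Store named groups, then optionally close the record.
            current.update(match.groupdict())
            if emit:
                rows.append(current)
                current = {}
            state = next_state
            break  # first matching rule wins; move on to the next line
    return rows


# Made-up sample input for the sketch.
SAMPLE = """\
Interface eth0
  MTU 1500
Interface eth1
  MTU 9000
"""

if __name__ == "__main__":
    for row in parse(SAMPLE):
        print(row)
```

Keeping the rules in a table rather than in code is what makes such templates reusable: supporting a new text format means writing a new table, not a new parser.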
6

DINOv3

Reference PyTorch implementation and models for DINOv3

DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...

Downloads: 15 This Week

Last Update: 2026-03-30
7

River ML

Online machine learning in Python

River is a Python library for online machine learning. It aims to be the most user-friendly library for doing machine learning on streaming data. River is the result of a merger between creme and scikit-multiflow.

Downloads: 0 This Week

Last Update: 2025-11-13
8

pyLoad

The free and open-source Download Manager written in pure Python

...It uses a plugin-driven architecture that supports hundreds of hosters, link decrypters, and extensions that extend its capabilities. pyLoad includes a modern web-based interface that allows users to remotely manage downloads from a browser, enabling full control over queues, links, and download settings. The system supports features such as premium account integration, automated captcha solving, and link extraction from container files or encrypted link lists.

Downloads: 14 This Week

Last Update: 2026-03-13
9

GLM-OCR

Accurate × Fast × Comprehensive

GLM-OCR is an open-source multimodal optical character recognition (OCR) model built on a GLM-V encoder–decoder foundation that brings robust, accurate document understanding to complex real-world layouts and modalities. Designed to handle text recognition, table parsing, formula extraction, and general information retrieval from documents containing mixed content, GLM-OCR excels across major benchmarks while remaining highly efficient with a relatively compact parameter size (~0.9B), enabling deployment in high-concurrency services and edge environments. The model’s multimodal capabilities allow it to reason across image and text content holistically, capturing structured and unstructured information from pages that include dense tables, seals, code snippets, and varied document graphics. ...

Downloads: 20 This Week

Last Update: 5 days ago
10

spaCy models

Models for the spaCy Natural Language Processing (NLP) library

...The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. If your application needs to process entire web dumps, spaCy is the library you want to be using. Since its release in 2015, spaCy has become an industry standard with a huge ecosystem. Choose from a variety of plugins, integrate with your machine learning stack and build custom components and workflows.

Downloads: 13 This Week

Last Update: 2026-03-18
11

Scrapy

A fast, high-level web crawling and web scraping framework

Scrapy is a fast, open source, high-level framework for crawling websites and extracting structured data from these websites. Portable and written in Python, it can run on Windows, Linux, macOS and BSD. Scrapy is powerful, fast and simple, and also easily extensible. Simply write the rules to extract the data, and add new functionality if you wish without having to touch the core. Scrapy does the rest, and can be used in a number of applications. It can be used for data mining, monitoring...

Downloads: 25 This Week

Last Update: 4 days ago
12

Transformers

State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX

...Using pre-trained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities. Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. Images, for tasks like image classification, object detection, and segmentation. Audio, for tasks like speech recognition and audio classification. Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and then share them with the community on our model hub. ...

Downloads: 23 This Week

Last Update: 4 days ago
13

MarkPDFDown

A high-quality PDF to Markdown tool based on large language models

MarkPDFDown is an open-source document processing tool designed to convert PDF files into structured Markdown output that can be easily used for documentation, content pipelines, and AI processing workflows. The project focuses on extracting text, formatting, and structural information from complex PDF documents and transforming that information into clean Markdown that preserves the original hierarchy of headings, paragraphs, tables, and lists. By producing Markdown rather than raw text,...

Downloads: 10 This Week

Last Update: 2026-03-06
14

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22
15

Docling

Get your documents ready for gen AI

...The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines. Docling is designed to run efficiently on commodity hardware and can be used both as a Python API and a command-line tool. Its modular architecture allows developers to extend functionality and integrate specialized models for tasks such as OCR and audio transcription. ...

Downloads: 6 This Week

Last Update: 3 days ago
16

Prompt Engineering Interactive Tutorial

Anthropic's Interactive Prompt Engineering Tutorial

...The course leans heavily on realistic failure modes (ambiguity, hallucination, brittle instructions) and shows how to iteratively debug prompts the way you would debug code. Lessons include building prompts from scratch for common tasks like extraction, classification, transformation, and step-by-step reasoning, with checkpoints that let you compare your outputs against solid baselines. You’ll also practice advanced patterns such as tool use, constrained generation, and response validation so outputs are trustworthy and machine-consumable.

Downloads: 0 This Week

Last Update: 2025-10-06
17

ClatScope

OSINT reconnaissance tool for IP, domain, email, and username lookups

ClatScope is a Python-based OSINT (open source intelligence) utility designed to gather and analyze publicly available information from multiple online sources. It is primarily aimed at investigators, cybersecurity professionals, penetration testers, and researchers who need a centralized platform for reconnaissance tasks. It integrates with numerous public APIs and internet services to retrieve detailed data about IP addresses, domains, email addresses, phone numbers, usernames, and other...

Downloads: 8 This Week

Last Update: 2026-03-07
18

RAG Anything

RAG-Anything: All-in-One RAG Framework

RAG-Anything is an open-source unified framework that extends the Retrieval-Augmented Generation (RAG) paradigm to fully multimodal document and knowledge retrieval, enabling systems to ingest, parse, represent, and query rich content that includes text, images, tables, formulas, and other structured or visual elements. Traditional RAG systems are typically limited to text and cannot effectively work across heterogeneous document layouts, but RAG-Anything addresses this by modeling...

Downloads: 8 This Week

Last Update: 2026-03-24
19

MegaParse

File Parser optimised for LLM Ingestion with no loss

...It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.

Downloads: 1 This Week

Last Update: 2025-02-14
20

Nano PDF Editor

Edit PDF files with Nano Banana

Nano PDF Editor is a minimalist, portable PDF viewer and toolkit that focuses on simplicity, speed, and ease of integration for applications that need basic PDF rendering without heavy dependencies. It provides core functionality such as page navigation, zooming, text selection, and rendering directly to native graphics surfaces, making it suitable for lightweight PDF viewing scenarios on desktop or embedded platforms. Designed to be easily embedded into larger software projects, Nano-PDF...

Downloads: 7 This Week

Last Update: 2026-02-05
21

HunyuanOCR

OCR expert VLM powered by Hunyuan's native multimodal architecture

HunyuanOCR is an open-source, end-to-end OCR (optical character recognition) Vision-Language Model (VLM) developed by Tencent-Hunyuan. It is designed to unify the entire OCR pipeline (detection, recognition, layout parsing, information extraction, translation, and even subtitle or structured output generation) into a single model inference instead of a cascade of separate tools. Despite being fairly lightweight (about 1 billion parameters), it delivers state-of-the-art performance across a wide variety of OCR tasks, outperforming many traditional OCR systems and even other multimodal models on benchmark suites. ...

Downloads: 0 This Week

Last Update: 5 days ago
22

Unrud Video Downloader

Download videos from websites like YouTube and many others

...The application supports a wide range of features, including downloading entire playlists, handling private or password-protected content, and automatically selecting optimal formats based on user preferences. It also allows users to convert videos into audio files such as MP3, making it useful for media extraction workflows. The software is distributed across multiple platforms, including Linux package managers and containerized environments, ensuring broad accessibility. It includes configuration options and debugging capabilities for advanced users who want more control over the download process.

Downloads: 15 This Week

Last Update: 4 days ago
23

Matcha-TTS

A fast TTS architecture with conditional flow matching

Matcha-TTS is a non-autoregressive neural text-to-speech architecture that uses conditional flow matching to generate speech quickly while maintaining natural quality. It models speech as an ODE-based generative process, and conditional flow matching lets it reach high-quality audio in only a few synthesis steps, which greatly reduces latency compared to score-matching diffusion approaches. The model is fully probabilistic, so it can generate diverse realizations of the same text while still...

Downloads: 15 This Week

Last Update: 2025-11-28
24

Director

AI video agents framework for next-gen video interactions

Director is a video database management system designed to organize, search, and retrieve large collections of video content efficiently.

Downloads: 0 This Week

Last Update: 2025-01-29
25

LangExtract

A Python library for extracting structured information

...LangExtract supports a wide range of models, including Google Gemini, OpenAI GPT, and local LLMs via Ollama, making it adaptable to different deployment environments and compliance needs. The system excels at handling long documents using optimized chunking, multi-pass extraction, and parallel processing to ensure both high recall and structured consistency.

Downloads: 10 This Week

Last Update: 5 days ago