extract free download

Showing 126 open source projects for "extract"

View related business solutions

Artificial Intelligence Linux Clear Filters & Widen Search

Try Google Cloud Risk-Free With $300 in Credit
No hidden charges. No surprise bills. Cancel anytime.

Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.

Start Free
Full-stack observability with actually useful AI | Grafana Cloud
Our generous forever free tier includes the full platform, including the AI Assistant, for 3 users with 10k metrics, 50GB logs, and 50GB traces.

Built on open standards like Prometheus and OpenTelemetry, Grafana Cloud includes Kubernetes Monitoring, Application Observability, Incident Response, plus the AI-powered Grafana Assistant. Get started with our generous free tier today.

Create free account
1

LangChain Extract

Did you say you like data?

LangChain Extract is an open-source reference application designed to demonstrate how large language models can be used to extract structured data from unstructured text and document files. The project implements a lightweight web service that allows developers to define extraction schemas and apply them to various sources such as plain text, HTML, or PDF documents.

Downloads: 0 This Week

Last Update: 2026-03-09
See Project
2

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API

text-extract-api is an open-source service designed to extract readable text from a wide variety of document formats through a simple API interface. The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
3

ScrapeGraphAI

Python scraper based on AI

...ScrapeGraphAI is a web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Just say which information you want to extract and the library will do it for you.

Downloads: 1 This Week

Last Update: 2026-03-24
See Project
4

LLM Scraper

Extract structured data from webpages using LLM-powered scraping

LLM Scraper is a TypeScript library designed to extract structured data from webpages using large language models. Instead of relying on fragile HTML selectors or manual parsing rules, the tool interprets webpage content with language models and converts it into structured data according to a defined schema. Developers can specify the data structure using tools such as Zod or JSON Schema, enabling the model to extract relevant information directly into typed objects.

Downloads: 1 This Week

Last Update: 4 days ago
See Project
MongoDB Atlas runs apps anywhere
Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.

Start Free
5

Kor

LLM

This is a half-baked prototype that “helps” you extract structured data from text using LLMs. Specify the schema of what should be extracted and provide some examples. Kor will generate a prompt, send it to the specified LLM and parse out the output. You might even get results back.

Downloads: 0 This Week

Last Update: 2024-07-20
See Project
6

AUTOMATIC1111 Stable Diffusion web UI

Stable Diffusion web UI

AUTOMATIC1111's stable-diffusion-webui is a powerful, user-friendly web interface built on the Gradio library that allows users to easily interact with Stable Diffusion models for AI-powered image generation. Supporting both text-to-image (txt2img) and image-to-image (img2img) generation, this open-source UI offers a rich feature set including inpainting, outpainting, attention control, and multiple advanced upscaling options. With a flexible installation process across Windows, Linux, and...

1 Review

Downloads: 341 This Week

Last Update: 2025-06-02
See Project
7

Chandra

OCR model for complex documents with layout-aware structured outputs

Chandra is an advanced OCR model designed to extract and structure information from complex documents such as tables, forms, handwritten notes, and mathematical content. It focuses on preserving full document layout, meaning that extracted text is accompanied by positional metadata like bounding boxes for each element. Chandra supports multiple output formats including Markdown, HTML, and JSON, making it suitable for downstream processing and integration into data pipelines.

Downloads: 2 This Week

Last Update: 2026-03-18
See Project
8

docext

An on-premises, OCR-free unstructured data extraction

...This allows the system to detect and extract structured elements such as tables, signatures, key fields, and layout information while maintaining semantic understanding of the document content. The toolkit can also convert complex documents into structured markdown representations that preserve formatting and contextual relationships.

Downloads: 2 This Week

Last Update: 2026-03-12
See Project
9

GraphRAG

A modular graph-based Retrieval-Augmented Generation (RAG) system

The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

Downloads: 4 This Week

Last Update: 2026-03-27
See Project
Gemini 3 and 200+ AI Models on One Platform
Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

Build generative AI apps with Vertex AI. Switch between models without switching platforms.

Start Free
10

Scribe.js

JavaScript OCR and text extraction for images and PDFs

Scribe.js is a JavaScript library that provides Optical Character Recognition (OCR) and text extraction capabilities for both images and PDF documents, aimed at developers who want to build OCR features directly into their applications. The library can take image files (such as PNG or JPEG) and recognize the text they contain, and it can also extract text from PDF files that either already contain text or are image-based scans, using modern web standards and WebAssembly under the hood. In addition to simple text extraction, Scribe.js supports writing or injecting a high-quality invisible text layer back into PDFs, effectively making them searchable and improving usability for indexing or accessibility. ...

Downloads: 2 This Week

Last Update: 2026-03-14
See Project
11

TikTok MCP

Model Context Protocol (MCP) with TikTok integration

The TikTok MCP integrates TikTok access into AI applications like Claude AI via TikNeuron. It enables analysis and interaction with TikTok content to determine virality factors and extract video content.

Downloads: 3 This Week

Last Update: 2026-02-27
See Project
12

Umi-OCR

OCR software, free and offline

Umi-OCR is a free and open-source optical character recognition (OCR) tool designed to provide fast, offline text extraction from images, screenshots, PDFs, and more without requiring a network connection. It includes a highly efficient offline OCR engine with built-in multilingual recognition libraries, so users can extract text across multiple languages with high accuracy directly on their machines. The software supports flexible usage patterns including screenshot capture OCR, batch processing of large sets of images or documents, PDF parsing, QR code detection, and layout-aware paragraph output. Users can interact with Umi-OCR through a graphical interface, command-line options, or HTTP interfaces, making it adaptable to both casual desktop usage and programmatic automation. ...

Downloads: 44 This Week

Last Update: 2026-01-15
See Project
13

Voice-Pro

Comprehensive Gradio WebUI for audio processing

Voice-Pro is the best gradio WebUI for transcription, translation and text-to-speech. It can be easily installed with one click. Create a virtual environment using Miniconda, running completely separate from the Windows system (fully portable). Supports real-time transcription and translation, as well as batch mode.

1 Review

Downloads: 30 This Week

Last Update: 2025-12-05
See Project
14

Stagehand

An AI web browsing framework focused on simplicity and extensibility

...Each Stagehand function takes in an atomic instruction, such as act("click the login button") or extract("find the red shoes"), generates the appropriate Playwright code to accomplish that instruction, and executes it.

Downloads: 0 This Week

Last Update: 4 days ago
See Project
15

deepfakes_faceswap

Deepfakes Software For All

Faceswap is the leading free and open source multi-platform deepfakes software. When faceswapping was first developed and published, the technology was groundbreaking, it was a huge step in AI development. It was also completely ignored outside of academia because the code was confusing and fragmentary. It required a thorough understanding of complicated AI techniques and took a lot of effort to figure it out. Until one individual brought it together into a single, cohesive collection.

Downloads: 11 This Week

Last Update: 2025-12-21
See Project
16

PaperAI

Semantic search and workflows for medical/scientific papers

PaperAI is an open-source framework for searching and analyzing scientific papers, particularly useful for researchers looking to extract insights from large-scale document collections.

Downloads: 2 This Week

Last Update: 2025-07-01
See Project
17

Gitingest

Create prompt-friendly codebase digests from any Git repository URL

...The generated output is optimized for prompt usage, helping AI models understand codebases more effectively without requiring manual file aggregation. In addition to producing the code digest, Gitingest also calculates statistics about the extracted content such as repository structure, total size of the extract, and token count. Gitingest can be used as a command line utility or integrated directly into Python applications.

Downloads: 0 This Week

Last Update: 2026-03-13
See Project
18

Sparrow

Structured data extraction and instruction calling with ML, LLM

Sparrow is an open-source platform designed to extract structured information from documents, images, and other unstructured data sources using machine learning and large language models. The system focuses on transforming complex documents such as invoices, receipts, forms, and scanned pages into structured formats like JSON that can be processed by downstream applications. It combines several components, including OCR pipelines, vision-language models, and LLM-based reasoning modules to identify and extract meaningful data fields from heterogeneous document layouts. ...

Downloads: 0 This Week

Last Update: 2026-03-04
See Project
19

Verba

Retrieval Augmented Generation (RAG) chatbot powered by Weaviate

Welcome to Verba: The Golden RAGtriever, a community-driven open-source application designed to offer an end-to-end, streamlined, and user-friendly interface for Retrieval-Augmented Generation (RAG) out of the box. In just a few easy steps, explore your datasets and extract insights with ease, either locally with Ollama and Huggingface or through LLM providers such as Anthrophic, Cohere, and OpenAI. This project is built with and for the community, please be aware that it might not be maintained with the same urgency as other Weaviate production applications.

Downloads: 1 This Week

Last Update: 2025-07-14
See Project
20

LiteParse

A fast, helpful, and open-source document parser

LiteParse is an open-source lightweight parsing library designed to extract structured data from unstructured text using large language models in an efficient and cost-effective manner. It focuses on simplifying the process of turning raw text into structured outputs such as JSON by providing a streamlined interface for prompt-based parsing. The system is designed to minimize overhead, making it suitable for applications where performance and cost are critical considerations.

Downloads: 5 This Week

Last Update: 2 days ago
See Project
21

AI-Crawler

Crawl a website starting from a URL, find relevant pages

...Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure. Users can define their data requirements in plain English, and the system will interpret those instructions to crawl a domain and extract structured data. The tool supports output formats such as JSON and Markdown, and it can generate or accept schemas to ensure that extracted data is structured according to application needs. It is designed as a low-code solution, reducing the complexity of building and maintaining custom scraping pipelines.

Downloads: 5 This Week

Last Update: 3 days ago
See Project
22

Scriberr

Self-hosted AI audio transcription

...The application includes a polished user interface that simplifies the management of recordings, transcripts, and annotations, making it suitable for both casual users and professionals handling large volumes of audio. Beyond transcription, Scriberr also integrates features such as summarization, tagging, and interaction with language models, allowing users to extract insights from conversations or meetings efficiently.

Downloads: 7 This Week

Last Update: 2026-03-19
See Project
23

Dendrite

Tools to build web AI agents that can authenticate

Dendrite Python SDK is a toolkit for building web AI agents that can authenticate, interact with, and extract data from any website, facilitating web automation tasks.

Downloads: 1 This Week

Last Update: 2025-01-29
See Project
24

deepdoctection

A Repo For Document AI

DeepDoctection is a document AI framework that applies deep learning techniques to analyze and extract structured data from scanned documents, PDFs, and images. deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated frameworks for fine-tuning, evaluating and running models. ...

Downloads: 3 This Week

Last Update: 5 days ago
See Project
25

AgentQL MCP

Model Context Protocol server that integrates AgentQL's data

The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project