Showing 779 open source projects for "extraction"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • Try Google Cloud Risk-Free With $300 in Credit Icon
    Try Google Cloud Risk-Free With $300 in Credit

    No hidden charges. No surprise bills. Cancel anytime.

    Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.
    Start Free
  • 1
    MegaParse

    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    ...It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    AWS Encryption SDK for Dafny

    AWS Encryption SDK for Dafny

    AWS Encryption SDK for Dafny

    ...Refer to the specification for how to install duvet in order to generate reports. By default duvet_report will extract the spec only if it cannot find the compliance directory in the specification repo, but will re-use a previous extraction if it exists.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 3
    spacy-llm

    spacy-llm

    Integrating LLMs into structured NLP pipelines

    ...With only a few (and sometimes no) examples, an LLM can be prompted to perform custom NLP tasks such as text categorization, named entity recognition, coreference resolution, information extraction and more. This package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, no training data required.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    Weaviate

    Weaviate

    Weaviate is a cloud-native, modular, real-time vector search engine

    ...Weaviate in detail: Weaviate is a low-latency vector search engine with out-of-the-box support for different media types (text, images, etc.). It offers Semantic Search, Question-Answer-Extraction, Classification, Customizable Models (PyTorch/TensorFlow/Keras), and more. Built from scratch in Go, Weaviate stores both objects and vectors, allowing for combining vector search with structured filtering with the fault-tolerance of a cloud-native database, all accessible through GraphQL, REST, and various language clients.
    Downloads: 12 This Week
    Last Update:
    See Project
  • Go From AI Idea to AI App Fast Icon
    Go From AI Idea to AI App Fast

    One platform to build, fine-tune, and deploy ML models. No MLOps team required.

    Access Gemini 3 and 200+ models. Build chatbots, agents, or custom models with built-in monitoring and scaling.
    Try Free
  • 5
    DotnetSpider

    DotnetSpider

    Lightweight .NET framework for fast web crawling and data scraping

    DotnetSpider is a web crawling and data extraction framework built on the .NET Standard platform. It is designed to help developers create efficient and scalable crawlers for collecting structured data from websites. It provides a high-level API that simplifies the process of defining spiders, managing requests, and extracting content from web pages. Developers can create custom spiders by extending base classes and configuring pipelines that handle downloading, parsing, and storing collected data. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    nw_wrld

    nw_wrld

    nw_wrld is an event-driven sequencer for triggering visuals

    nw_wrld is a procedurally generated world-building engine tailored for game developers and interactive storytellers who want to craft rich, random yet coherent environments without hand-crafting every detail. It uses noise functions and modular terrain algorithms to generate expansive maps, diverse biomes, and layered features like rivers, mountain ranges, forests, and resource nodes. The system is designed to be extensible, letting developers plug in new generation rules or tweak parameters...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 7
    Docspell

    Docspell

    Assist in organizing your piles of documents

    Docspell is a personal document organizer. Or sometimes called a "Document Management System" (DMS). You'll need a scanner to convert your papers into files. Docspell can then assist in organizing the resulting mess. It can unify your files from scanners, emails, and other sources. It is targeted for home use, i.e. families, households, and also for smaller groups/companies. You can associate tags, set correspondent,s and lots of other predefined and custom metadata. If your documents are...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 8
    AutoClip

    AutoClip

    AI-powered video clipping and highlight generation

    AutoClip is an open-source, AI-powered video processing system designed to automate the extraction of “highlight” segments from full-length videos — ideal for creators who want to generate bite-sized clips, compilations, or highlight reels without manually sifting through hours of footage. The system supports downloading videos from major platforms (e.g. YouTube, Bilibili), or accepting local uploads, and then applies AI analysis to identify segments worth clipping based on content (e.g. high energy moments, speech, or other heuristics). ...
    Downloads: 10 This Week
    Last Update:
    See Project
  • 9
    Smile

    Smile

    Statistical machine intelligence and learning engine

    Smile is a fast and comprehensive machine learning engine. With advanced data structures and algorithms, Smile delivers the state-of-art performance. Compared to this third-party benchmark, Smile outperforms R, Python, Spark, H2O, xgboost significantly. Smile is a couple of times faster than the closest competitor. The memory usage is also very efficient. If we can train advanced machine learning models on a PC, why buy a cluster? Write applications quickly in Java, Scala, or any JVM...
    Downloads: 5 This Week
    Last Update:
    See Project
  • Custom VMs From 1 to 96 vCPUs With 99.95% Uptime Icon
    Custom VMs From 1 to 96 vCPUs With 99.95% Uptime

    General-purpose, compute-optimized, or GPU/TPU-accelerated. Built to your exact specs.

    Live migration and automatic failover keep workloads online through maintenance. One free e2-micro VM every month.
    Try Free
  • 10
    OpenMed

    OpenMed

    Open source healthcare AI

    OpenMed is an open-source healthcare AI and medical NLP toolkit designed to turn clinical text into structured insights using transformer-based models and production-oriented interfaces. Its core purpose is to provide specialized medical entity extraction, PII detection and de-identification, assertion-aware analysis, and related healthcare text processing capabilities without locking users into a proprietary platform. The project includes a curated registry of more than a dozen medical NER models focused on areas such as diseases, drugs, anatomy, genes, and protected health information, and it is built to support both research and deployment scenarios. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    Markdownify MCP Server

    Markdownify MCP Server

    Convert files and web content into clean, usable Markdown easily

    ...It supports formats such as PDFs, images, audio with transcription, DOCX, XLSX, and PPTX, along with web sources like YouTube transcripts, Bing results, and general webpages. Markdownify MCP is designed to simplify content extraction and make data easier to read, share, and reuse in structured workflows. Developers can install dependencies, build, and run the server locally, then extend functionality by modifying its TypeScript-based tools and server logic. It also allows retrieval of existing Markdown files, making it useful for documentation, research, and AI-assisted workflows. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 12
    rep+

    rep+

    Burp-style HTTP Repeater for Chrome DevTools with built‑in AI

    rep+ is a lightweight browser extension for Chrome DevTools that brings a Burp Suite-style HTTP repeater directly into the developer console, enhanced with built-in AI to help explain requests and suggest tests. It captures HTTP traffic from the inspected page without needing a proxy, allowing users to replay, modify, and analyze individual requests with fine-grained control over headers, bodies, and methods. The tool offers hierarchical grouping, tagging, and filtering of captured requests...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 13
    RAG Anything

    RAG Anything

    RAG-Anything: All-in-One RAG Framework

    RAG-Anything is an open-source unified framework that extends the Retrieval-Augmented Generation (RAG) paradigm to fully multimodal document and knowledge retrieval, enabling systems to ingest, parse, represent, and query rich content that includes text, images, tables, formulas, and other structured or visual elements. Traditional RAG systems are typically limited to text and cannot effectively work across heterogeneous document layouts, but RAG-Anything addresses this by modeling...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 14
    wal2json

    wal2json

    JSON output plugin for changeset extraction

    wal2json is an output plugin for logical decoding. It means that the plugin have access to tuples produced by INSERT and UPDATE. Also, UPDATE/DELETE old row versions can be accessed depending on the configured replica identity. Changes can be consumed using the streaming protocol (logical replication slots) or by a special SQL API. format version 1 produces a JSON object per transaction. All of the new/old tuples are available in the JSON object. Also, there are options to include properties...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    Scweet

    Scweet

    Scrape tweets, profiles, followers and following from Twitter/X

    ...Instead of depending on deprecated unauthenticated scraping methods, it works by using X’s web GraphQL API together with authenticated browser cookies, which gives it a more current and practical approach for data extraction. The project supports a broad set of collection patterns, including searches by keyword, hashtag, user, date range, engagement thresholds, language, and location, making it useful for research, monitoring, and data gathering workflows. It is built for both local use and higher-volume runs, with support for proxies, dedicated accounts, and multi-account cookie handling to improve reliability at scale. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 16
    Eko

    Eko

    Build Production-ready Agentic Workflow with Natural Language

    Eko (Eko Keeps Operating) is a JavaScript framework designed for building production-ready agent-based workflows using natural language commands. It allows developers to create automated agents that can handle complex workflows in both computer and browser environments. With a focus on high development efficiency, Eko simplifies the creation of multi-step workflows, enabling users to integrate and automate tasks across platforms. It provides a unified interface for managing agents, offering...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 17
    eos

    eos

    A lightweight 3D Morphable Face Model library in modern C++

    eos is a lightweight 3D Morphable Face Model fitting library that provides basic functionality to use face models, as well as camera and shape fitting functionality. It's written in modern C++11/14. MorphableModel and PcaModel classes to represent 3DMMs, with basic operations like draw_sample(). Supports the Surrey Face Model (SFM), 4D Face Model (4DFM), Basel Face Model (BFM) 2009 and 2017, and the Liverpool-York Head Model (LYHM) out-of-the-box.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 18
    Dungbeetle

    Dungbeetle

    A distributed job server

    Dungbeetle is a metadata and data lineage tracking tool developed by Zerodha to map and visualize how data flows across systems. It helps teams maintain data transparency by tracking dependencies between databases, tables, and reports, offering a centralized view of data pipelines. Dungbeetle is designed to enhance observability and trust in analytics ecosystems.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    AudioMuse-AI

    AudioMuse-AI

    AudioMuse-AI is an Open Source Dockerized environment

    AudioMuse-AI is an open-source system designed to automatically generate playlists and analyze music libraries using artificial intelligence and audio signal processing techniques. The platform runs locally in a Dockerized environment and performs detailed sonic analysis on audio files to understand characteristics such as tempo, mood, and acoustic similarity. By analyzing the underlying audio content rather than relying on external metadata services, the system can organize large personal...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 20
    Open Semantic Search

    Open Semantic Search

    Open source semantic search and text analytics for large document sets

    Open Semantic Search is an open source research and analytics platform designed for searching, analyzing, and exploring large collections of documents using semantic search technologies. It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and enrich the data with extracted information such as named entities and metadata. It also supports optical character recognition to extract text from images and scanned documents, including images embedded inside PDF files. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 21
    Adversarial Robustness Toolbox

    Adversarial Robustness Toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for ML security

    ...ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, sci-kit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, generation, certification, etc.).
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    ANTLR

    ANTLR

    Parser generator to read, process, or translate structured text

    ...Twitter search uses ANTLR for query parsing, with over 2 billion queries a day. The languages for Hive and Pig, the data warehouse and analysis systems for Hadoop, both use ANTLR. Lex Machina uses ANTLR for information extraction from legal texts. Oracle uses ANTLR within SQL Developer IDE and their migration tools. NetBeans IDE parses C++ with ANTLR. The HQL language in the Hibernate object-relational mapping framework is built with ANTLR.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 23
    spider_collection

    spider_collection

    Collection of Python web scraping scripts for data extraction tasks

    spider_collection is a collection of Python web crawler scripts created primarily for experimentation, learning, and practical scraping tasks. spider_collection gathers multiple independent spiders designed to collect data from different platforms and services, demonstrating a variety of scraping techniques and workflows. These crawlers make use of common Python scraping tools such as requests, parsel, BeautifulSoup, and the Scrapy framework to extract structured information from web pages....
    Downloads: 2 This Week
    Last Update:
    See Project
  • 24
    Spider

    Spider

    High-performance Rust web crawler and scraper for large-scale data

    ...It supports advanced capabilities such as headless browser rendering, background crawling tasks, and configurable rules that control crawl depth or ignored paths. These capabilities make the project suitable for building search indexers, data extraction pipelines, & SEO analysis tools.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 25
    Reader LLM

    Reader LLM

    Convert any URL to an LLM-friendly input with a simple prefix

    Reader LLM is an open-source tool designed to convert web content into formats that are easier for large language models to process. The system works by transforming a webpage into a clean text or Markdown representation that removes unnecessary formatting and highlights the core information within the page. Developers can use a simple URL prefix to retrieve a version of a webpage that has been optimized for machine consumption, making it suitable for use in AI agents or retrieval-augmented...
    Downloads: 2 This Week
    Last Update:
    See Project
MongoDB Logo MongoDB