Showing 112 open source projects for "extraction"

View related business solutions
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • Ship Agents Faster Icon
    Ship Agents Faster

    Transform your applications and workflows into powerful agentic systems at global scale.

    Gemini Enterprise Agent Platform lets you rapidly build, scale, govern and optimize production-ready agents grounded in your organization's data. The platform enables developers to build custom or pre-built agents for virtually any use case. New customers get $300 in free credits.
    Get Started Free
  • 1
    Video-subtitle-extractor

    Video-subtitle-extractor

    A GUI tool for extracting hard-coded subtitle (hardsub) from videos

    Video hard subtitle extraction, generate srt file. There is no need to apply for a third-party API, and text recognition can be implemented locally. A deep learning-based video subtitle extraction framework, including subtitle region detection and subtitle content extraction. A GUI tool for extracting hard-coded subtitles (hardsub) from videos and generating srt files.
    Downloads: 72 This Week
    Last Update:
    See Project
  • 2
    Sparrow

    Sparrow

    Structured data extraction and instruction calling with ML, LLM

    ...It combines several components, including OCR pipelines, vision-language models, and LLM-based reasoning modules to identify and extract meaningful data fields from heterogeneous document layouts. The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    Unstract

    Unstract

    No-code LLM Platform to launch APIs and ETL Pipelines

    Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Error to trace to log to deploy. One click. No SSH. Icon
    Error to trace to log to deploy. One click. No SSH.

    Catch the cause before the pager goes off.

    AppSignal links every error to the trace, the trace to the log, the log to the deploy that shipped it.
    Free 30 days.
  • 5
    NeMo Retriever Library

    NeMo Retriever Library

    Document content and metadata extraction microservice

    ...It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient handling of large datasets. It supports multiple extraction strategies for different document formats, balancing accuracy and throughput depending on the use case. Additionally, it can generate embeddings for extracted content and integrate with vector databases like Milvus, making it well-suited for retrieval-augmented generation pipelines.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 6
    NLP

    NLP

    Open source NLP guide with models, methods, and real use cases

    ...It explains how machines process and understand human language, combining theory with practical examples. Its covers core NLP concepts such as text representation, feature extraction, and model evaluation, alongside hands-on implementations using tools like Word2Vec, TF-IDF, and FastText. It also introduces topic modeling with LDA, keyword extraction techniques, and document similarity methods. NLP extends into real-world applications, including sentiment analysis and text classification, helping readers connect concepts to use cases. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 7
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 0 This Week
    Last Update:
    See Project
  • 8
    text-extract-api

    text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parse API

    ...The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models. Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    MinerU

    MinerU

    A high-quality tool for convert PDF to Markdown and JSON

    MinerU is an open-source, high-quality document extraction toolkit focused on converting PDFs (and other document formats) into structured Markdown and JSON. It leverages OCR and layout analysis to preserve semantic structure and metadata, ideal for research and data science workflows.
    Downloads: 2 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 10
    pyAudioAnalysis

    pyAudioAnalysis

    Python Audio Analysis Library: Feature Extraction, Classification

    ...The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. It also includes utilities for visualizing audio features and analyzing patterns within sound recordings, which can be useful in applications such as speech recognition, music classification, and acoustic event detection. Because the library integrates machine learning algorithms with signal processing tools, it enables researchers to develop complete audio analysis pipelines using a single framework.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    video2robot

    video2robot

    End-to-end pipeline converting generative videos

    video2robot is an end-to-end open-source pipeline that converts generative video or prompt-driven motion content into executable humanoid robot motion sequences, enabling researchers and developers to go from high-level action descriptions or videos to robot-ready motion data. The pipeline supports both prompt-to-video generation using models like Veo/Sora and video upload processing, followed by human pose extraction through a 3D pose model and retargeting of that motion to robot joints using a general motion retargeting system. This workflow allows users to generate robot motion files that specify joint angles, root positions, and orientations that can be deployed on supported robot platforms (e.g., Unitree models). Video2robot includes scripts for each stage of the pipeline (generation, extraction, conversion, visualization) and can run as a CLI or through a basic web UI.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    MiroFish

    MiroFish

    A Simple and Universal Swarm Intelligence Engine

    MiroFish is a next-generation artificial intelligence prediction engine that leverages multi-agent technology and swarm-intelligence simulation to model, simulate, and forecast complex real-world scenarios. The system extracts “seed” information from sources such as breaking news, policy documents, and market signals to construct a high-fidelity digital parallel world populated by thousands of virtual agents with independent memory and behavior rules. Users can inject variables or conditions...
    Downloads: 190 This Week
    Last Update:
    See Project
  • 13
    fireworks-tech-graph

    fireworks-tech-graph

    Claude Code skill for generating production-quality SVG+PNG technical

    ...It aims to transform unstructured information into interconnected graphs that can be queried and analyzed for insights, making it easier to understand complex ecosystems such as software stacks or research fields. The system likely leverages AI techniques for entity extraction, relationship mapping, and graph construction, enabling automated knowledge organization. It can be used to power recommendation systems, research tools, or intelligent assistants that require contextual understanding of technical topics. The project emphasizes scalability and adaptability, allowing it to handle large datasets and evolving knowledge bases. ...
    Downloads: 3 This Week
    Last Update:
    See Project
  • 14
    Umi-OCR

    Umi-OCR

    OCR software, free and offline

    Umi-OCR is a free and open-source optical character recognition (OCR) tool designed to provide fast, offline text extraction from images, screenshots, PDFs, and more without requiring a network connection. It includes a highly efficient offline OCR engine with built-in multilingual recognition libraries, so users can extract text across multiple languages with high accuracy directly on their machines. The software supports flexible usage patterns including screenshot capture OCR, batch processing of large sets of images or documents, PDF parsing, QR code detection, and layout-aware paragraph output. ...
    Downloads: 72 This Week
    Last Update:
    See Project
  • 15
    DINOv2

    DINOv2

    PyTorch code and models for the DINOv2 self-supervised learning

    ...The core promise is that a single pretrained backbone can transfer well to many downstream tasks—from linear probing on classification to retrieval, detection, and segmentation—often requiring little or no fine-tuning. The repository includes code for training, evaluating, and feature extraction, with utilities to run k-NN or linear evaluation baselines to assess representation quality. Pretrained checkpoints cover multiple model sizes so practitioners can trade accuracy for speed and memory depending on their deployment constraints.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 16
    WanGP

    WanGP

    AI video generator optimized for low VRAM and older GPUs use

    ...Wan2GP provides a full web-based interface that simplifies interaction with complex generative pipelines, making it easier to configure prompts, models, and rendering settings. It also integrates a wide range of utilities such as prompt enhancement, mask editing, motion design, and extraction tools for pose, depth, and flow data to support advanced video workflows.
    Downloads: 60 This Week
    Last Update:
    See Project
  • 17
    Kaldi

    Kaldi

    kaldi-asr/kaldi is the official location of the Kaldi project

    ...Kaldi is designed for researchers who need a highly customizable environment to experiment with new algorithms, as well as for practitioners who want robust, production-ready ASR pipelines. It includes extensive tools for data preparation, feature extraction, acoustic and language modeling, decoding, and evaluation. With its modular design, Kaldi allows users to adapt the system to a wide range of languages and domains. As one of the most influential projects in speech recognition, it has become a foundation for much of the modern work in ASR.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 18
    Auto-Deep-Research

    Auto-Deep-Research

    Your Fully-Automated Personal AI Assistant

    ...Users provide a research topic or multifaceted goal, and the system autonomously breaks the objective down into subtasks like literature collection, critical summarization, cross-comparison, citation extraction, metric evaluation, and structured writing. Auto-Deep-Research integrates retrieval from academic and web sources, processes document corpora for relevance and key insights, and organizes outputs into coherent chapters or sections according to research standards. It also embeds validation loops, where intermediate drafts are self-checked for consistency, coverage, and alignment with sound reasoning practices, reducing reliance on raw generation alone.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    Paper2Slides

    Paper2Slides

    From Paper to Presentation in One Click

    ...It is designed to replace the repetitive work of turning dense technical documents into presentation-friendly structure by extracting key points, figures, and data into a coherent visual narrative. The system supports multiple input formats, so you can process PDFs and common office documents rather than being locked to a single file type. It uses an extraction approach intended to capture critical insights comprehensively, including important visuals and data points that often get missed in naive summarization. A major focus is traceability: generated slide content is designed to remain linked back to the source material so you can verify accuracy and reduce information drift. It also offers styling flexibility, letting you use built-in themes or describe a custom design direction in natural language for themed outputs.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 20
    docext

    docext

    An on-premises, OCR-free unstructured data extraction

    docext is a document intelligence toolkit that uses vision-language models to extract structured information from documents such as PDFs, forms, and scanned images. The system is designed to operate entirely on-premises, allowing organizations to process sensitive documents without relying on external cloud services. Unlike traditional document processing pipelines that rely heavily on optical character recognition, docext leverages multimodal AI models capable of understanding both visual...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Bespoke Curator

    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    ...It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question answering, or structured information extraction tasks. Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    VLMEvalKit

    VLMEvalKit

    Open-source evaluation toolkit of large multi-modality models (LMMs)

    ...VLMEvalKit supports generation-based evaluation methods, allowing models to produce textual responses to visual inputs while measuring performance through techniques such as exact matching or language-model-assisted answer extraction.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Python Client For NLP Cloud

    Python Client For NLP Cloud

    NLP Cloud serves high performance pre-trained or custom models for NER

    NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, dialogue summarization, paraphrasing, intent classification, product description and ad generation, chatbot, grammar and spelling correction, keywords and keyphrases extraction, text generation, image generation, blog post generation, source code generation, question answering, automatic speech recognition, machine translation, language detection, semantic search, semantic similarity, tokenization, POS tagging, embeddings, and dependency parsing. It is ready for production, served through a REST API. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    DINOv3

    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...
    Downloads: 26 This Week
    Last Update:
    See Project
  • 25
    Docling

    Docling

    Get your documents ready for gen AI

    ...The project focuses on converting and parsing many document formats into a unified structured representation that downstream systems can easily consume. It supports advanced PDF understanding, including layout detection, table extraction, and reading order analysis, enabling high-fidelity document intelligence pipelines. Docling is designed to run efficiently on commodity hardware and can be used both as a Python API and a command-line tool. Its modular architecture allows developers to extend functionality and integrate specialized models for tasks such as OCR and audio transcription. ...
    Downloads: 13 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next