283 projects for "extraction" with 1 filter applied:

  • 1
    zpdf

    Zero-copy PDF text extraction library written in Zig

    zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches.
    Downloads: 1 This Week
  • 2
    text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parse API

    ...The project focuses on converting complex files such as PDFs, images, scanned documents, and office files into structured plain text that can be processed by downstream applications or language models. Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. ...
    Downloads: 5 This Week
  • 3
    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 7 This Week
  • 4
    DocStrange

    Extract and convert data from any document: images, PDFs, Word docs

    DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services.
    Downloads: 0 This Week
  • 5
    Extractous

    Fast and efficient unstructured data extraction

    Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...
    Downloads: 0 This Week
  • 6
    AI-Crawler

    Crawl a website starting from a URL, find relevant pages

    AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.
    Downloads: 7 This Week
  • 7
    NLP

    Open source NLP guide with models, methods, and real use cases

    ...It explains how machines process and understand human language, combining theory with practical examples. It covers core NLP concepts such as text representation, feature extraction, and model evaluation, alongside hands-on implementations using tools like Word2Vec, TF-IDF, and FastText. It also introduces topic modeling with LDA, keyword extraction techniques, and document similarity methods. NLP extends into real-world applications, including sentiment analysis and text classification, helping readers connect concepts to use cases. ...
    Downloads: 10 This Week
  • 8
    Scribe.js

    JavaScript OCR and text extraction for images and PDFs

    ...In addition to simple text extraction, Scribe.js supports writing or injecting a high-quality invisible text layer back into PDFs, effectively making them searchable and improving usability for indexing or accessibility. It is written in modern ECMAScript Modules (ESM), so it can be imported in both browser and Node.js environments without a build step, though browser usage requires same-origin hosting of the files.
    Downloads: 11 This Week
  • 9
    pyAudioAnalysis

    Python Audio Analysis Library: Feature Extraction, Classification

    ...The project provides a collection of tools that allow developers to extract meaningful features from audio files and use those features for classification, segmentation, and analysis. The library supports multiple audio processing workflows, including feature extraction from raw audio signals, training of machine learning models, and automatic audio segmentation. It also includes utilities for visualizing audio features and analyzing patterns within sound recordings, which can be useful in applications such as speech recognition, music classification, and acoustic event detection. Because the library integrates machine learning algorithms with signal processing tools, it enables researchers to develop complete audio analysis pipelines using a single framework.
    Downloads: 3 This Week
  • 10
    MiroFish

    A Simple and Universal Swarm Intelligence Engine

    MiroFish is a next-generation artificial intelligence prediction engine that leverages multi-agent technology and swarm-intelligence simulation to model, simulate, and forecast complex real-world scenarios. The system extracts “seed” information from sources such as breaking news, policy documents, and market signals to construct a high-fidelity digital parallel world populated by thousands of virtual agents with independent memory and behavior rules. Users can inject variables or conditions...
    Downloads: 1,284 This Week
  • 11
    watercrawl

    AI-ready web crawler that extracts and structures website content

    WaterCrawl is an open source web crawling and data extraction platform designed to transform website content into structured data suitable for machine learning and AI workflows. It enables developers and researchers to crawl web pages, extract meaningful information, and convert it into formats that are easier to process and analyze. It provides a modern crawling system that can automatically navigate links, control crawl depth, and collect content from targeted sections of a website. ...
    Downloads: 7 This Week
  • 12
    dude uncomplicated data extraction

    dude uncomplicated data extraction: A simple framework

    Dude is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to make it easy to build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax. Dude is currently in Pre-Alpha, so expect breaking changes. You can run your scraper from the command line by passing URLs, an output filename of your choice, and the paths to your Python scripts to the dude scrape command.
    Downloads: 0 This Week
  • 13
    Scanopy

    Clean network diagrams, One-time setup, zero upkeep

    Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. Developers can compose custom pipelines that chain together transforms, filters, and exporters, enabling automation of tedious data preparation steps and accelerating insights with minimal code. The system places a premium on extensibility, allowing contributors to add new extractors or analysis modules tailored to specific industries or datasets. ...
    Downloads: 18 This Week
  • 14
    Symfony DomCrawler

    Eases DOM navigation for HTML and XML documents

    Symfony DomCrawler is a PHP component that provides powerful tools for navigating and extracting data from HTML and XML documents. It allows developers to parse, filter, and manipulate web pages using CSS selectors and XPath expressions. DomCrawler is widely used for web scraping, testing, and processing structured content, and integrates well with other Symfony components like BrowserKit.
    Downloads: 3 This Week
  • 15
    JavaScript Obfuscator

    A powerful obfuscator for JavaScript and Node.js

    JavaScript Obfuscator is a Node.js library and CLI that transforms readable JavaScript into hardened, difficult-to-reverse code. It applies techniques such as identifier mangling, string array extraction/encoding, control-flow flattening, dead-code injection, and numeric literal transformations to disguise intent. Advanced options include self-defending code, domain locking, debug/console protection, and property key transformation, allowing you to tailor defenses to your threat model. The tool supports source maps and granular “threshold”/whitelist settings so you can balance protection with performance and debuggability. ...
    Downloads: 17 This Week
  • 16
    LangChain Extract

    Did you say you like data?

    LangChain Extract is an open-source reference application designed to demonstrate how large language models can be used to extract structured data from unstructured text and document files. The project implements a lightweight web service that allows developers to define extraction schemas and apply them to various sources such as plain text, HTML, or PDF documents. Built using FastAPI and the LangChain framework, the application exposes a REST API that can process documents and return structured outputs that match user-defined JSON schemas. Developers can create reusable “extractors” that define what type of information should be pulled from a document, along with example prompts that improve extraction quality through in-context learning.
    Downloads: 1 This Week
  • 17
    video2robot

    End-to-end pipeline converting generative videos

    video2robot is an end-to-end open-source pipeline that converts generative video or prompt-driven motion content into executable humanoid robot motion sequences, enabling researchers and developers to go from high-level action descriptions or videos to robot-ready motion data. The pipeline supports both prompt-to-video generation using models like Veo/Sora and video upload processing, followed by human pose extraction through a 3D pose model and retargeting of that motion to robot joints using a general motion retargeting system. This workflow allows users to generate robot motion files that specify joint angles, root positions, and orientations that can be deployed on supported robot platforms (e.g., Unitree models). Video2robot includes scripts for each stage of the pipeline (generation, extraction, conversion, visualization) and can run as a CLI or through a basic web UI.
    Downloads: 0 This Week
  • 18
    Tamagui

    Style React fast with 100% parity on React Native

    ...Tamagui also includes a full UI kit with both styled and unstyled components, enabling flexible design system creation. Its compiler performs advanced optimizations such as CSS extraction, tree flattening, and dead code elimination, reducing bundle size and improving rendering speed. The system includes robust theming capabilities with support for design tokens, responsive props, and dynamic themes like dark mode.
    Downloads: 10 This Week
  • 19
    Hacks

    A collection of hacks and one-off scripts

    ...Rather than being a single cohesive application, it serves as a repository of practical command-line tools that can be used independently or combined into workflows. The scripts cover a wide range of tasks, including URL manipulation, parameter replacement, data extraction, and reconnaissance automation. Many of the tools in the repository are designed for efficiency and simplicity, enabling users to perform complex operations with minimal overhead. It is particularly popular among security researchers and developers who need quick, flexible solutions for niche problems.
    Downloads: 1 This Week
  • 20
    MatImage

    Image Processing library for Matlab

    matImage is an open-source MATLAB library for image processing and analysis. It provides a variety of tools for image enhancement, segmentation, and feature extraction. It’s especially useful for users working on biomedical images or those needing detailed image analysis in MATLAB.
    Downloads: 0 This Week
  • 21
    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    ...It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question answering, or structured information extraction tasks. Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.
    Downloads: 5 This Week
  • 22
    MemPalace

    The highest-scoring AI memory system ever benchmarked

    MemPalace is an open-source AI memory system designed to solve one of the most persistent limitations of large language models: the loss of context between sessions. Instead of relying on summarization or selective extraction like most memory tools, it takes a radically different approach by storing conversations in their entirety and making them retrievable through structured organization and semantic search. The system is inspired by the classical “memory palace” mnemonic technique, organizing information into hierarchical spaces such as wings, rooms, and halls, which allows AI agents to navigate past knowledge in a more contextual and intuitive way. ...
    Downloads: 220 This Week
  • 23
    Magnitude

    Vision AI browser agent for automation, testing, and extraction

    ...This approach allows the agent to generalize better across complex and modern websites, making it more robust than traditional selector-based automation tools. Browser Agent by Magnitude supports a wide range of capabilities including navigation, interaction, data extraction, and automated verification through built-in testing features. Developers can use it to automate repetitive web tasks, integrate services without APIs, or build advanced browser-based agents. It also provides flexible abstraction levels, allowing both high-level task execution and precise low-level control of actions like mouse movements and keyboard input.
    Downloads: 0 This Week
  • 24
    skycaiji

    Open source web scraping system for automated data collection tasks

    SkyCaiji is an open source web scraping and data collection system designed to gather information from websites through configurable extraction rules. It focuses on simplifying the process of building crawlers by allowing users to visually define scraping rules rather than writing complex code. It can collect structured or unstructured data from many types of webpages and automate the extraction process for large datasets. SkyCaiji is designed to run on a variety of hosting environments including local machines, shared hosting environments, and cloud servers. ...
    Downloads: 3 This Week
  • 25
    Wiseflow

    Enhance any agent's browser use skill

    Wiseflow is an open-source information extraction and knowledge discovery system designed to collect, filter, and organize valuable information from large volumes of online content. The platform continuously monitors specified sources such as websites, social platforms, and other digital channels to identify relevant data according to user-defined interests or topics. By combining web crawling, content parsing, and large language model analysis, the system extracts concise insights from raw information streams and converts them into structured data that can be stored or analyzed. ...
    Downloads: 3 This Week