Showing 452 open source projects for "data extraction"

View related business solutions
  • $300 Free Credits for Your Google Cloud Projects Icon
    $300 Free Credits for Your Google Cloud Projects

    Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

    Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.
    Start Free Trial
  • Enterprise-grade ITSM, for every business Icon
    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity.

    Freshservice is an intuitive, AI-powered platform that helps IT, operations, and business teams deliver exceptional service without the usual complexity. Automate repetitive tasks, resolve issues faster, and provide seamless support across the organization. From managing incidents and assets to driving smarter decisions, Freshservice makes it easy to stay efficient and scale with confidence.
    Try it Free
  • 1
    pdfly

    pdfly

    CLI tool to extract (meta)data from PDF and manipulate PDF files

    A Python library designed for manipulating PDF files with functionalities for extraction, transformation, and document generation.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 2
    ExtractThinker

    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 3
    Sparrow

    Sparrow

    Structured data extraction and instruction calling with ML, LLM

    ...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 4
    webclaw

    webclaw

    Fast, local-first web content extraction for LLMs

    webclaw is a high-performance web content extraction tool designed specifically for AI agents and large language models, focusing on delivering clean, structured data instead of raw HTML. It is built in Rust and operates without a headless browser, using advanced techniques such as TLS fingerprinting to bypass common scraping barriers and mimic real browser behavior. The tool addresses a major inefficiency in AI workflows by removing irrelevant elements like navigation menus, ads, and scripts, significantly reducing token usage when feeding data into language models. ...
    Downloads: 9 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 5
    OCRBase

    OCRBase

    MD/.JSON Document OCR and structured data extraction API

    OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    Firecrawl

    Firecrawl

    Turn entire websites into LLM-ready markdown or structured data

    Crawl and convert any website into LLM-ready markdown or structured data. Built by Mendable.ai and the Firecrawl community. Includes powerful scraping, crawling, and data extraction capabilities. Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap is required.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 7
    Unstract

    Unstract

    No-code LLM Platform to launch APIs and ETL Pipelines

    Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 8
    OpenDataLoader PDF

    OpenDataLoader PDF

    PDF Parser for AI-ready data. Automate PDF accessibility

    OpenDataLoader PDF is an open-source document processing system designed to convert complex PDF files into structured, AI-ready formats such as Markdown, JSON, and HTML while preserving layout, hierarchy, and semantic meaning. It focuses on enabling downstream use cases like retrieval-augmented generation (RAG), knowledge extraction, and document intelligence pipelines by maintaining accurate reading order and spatial metadata through bounding boxes. The tool combines deterministic parsing...
    Downloads: 19 This Week
    Last Update:
    See Project
  • 9
    ContextGem

    ContextGem

    ContextGem: Effortless LLM extraction from documents

    ContextGem is an open-source framework designed to simplify the extraction of structured data and insights from documents using large language models (LLMs). It provides a flexible, intuitive API that minimizes boilerplate code, enabling developers to build complex extraction workflows efficiently. ContextGem supports various document formats and integrates with multiple LLM providers, making it a versatile tool for tasks like contract analysis, anomaly detection, and information retrieval.​
    Downloads: 7 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 10
    X-Crawl

    X-Crawl

    Flexible Node.js AI-assisted crawler library

    A high-performance web crawling and scraping framework for Node.js, designed for large-scale data extraction.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 11
    AI-Crawler

    AI-Crawler

    Crawl a website starting from a URL, find relevant pages

    AI Crawler is an experimental AI-powered web crawling and data extraction tool that uses natural language prompts to guide the discovery and retrieval of relevant information across websites. Unlike traditional web scrapers that rely on static selectors and manual scripting, it uses AI to dynamically identify and prioritize pages based on user intent, making it more flexible and resilient to changes in website structure.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    Scanopy

    Scanopy

    Clean network diagrams, One-time setup, zero upkeep

    Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...
    Downloads: 11 This Week
    Last Update:
    See Project
  • 13
    tsfresh

    tsfresh

    Automatic extraction of relevant features from time series

    tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python....
    Downloads: 4 This Week
    Last Update:
    See Project
  • 14
    AgentQL MCP

    AgentQL MCP

    Model Context Protocol server that integrates AgentQL's data

    The AgentQL MCP Server is a Model Context Protocol (MCP) server that integrates AgentQL's data extraction capabilities, enabling users to extract structured data from web pages using natural language prompts. ​
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    watercrawl

    watercrawl

    AI-ready web crawler that extracts and structures website content

    ...WaterCrawl supports customizable extraction rules so users can focus only on relevant elements while ignoring unnecessary page components. WaterCrawl also offers real-time monitoring capabilities, allowing users to track crawling progress, performance metrics, and errors during large data collection jobs. Developers can integrate the tool into applications through a REST API and multiple client SDKs, enabling automated data pipelines and AI data preparation workflows.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 16
    FModel

    FModel

    Unreal Engine Archives Explorer

    FModel is a freeware application designed to explore Unreal Engine games archives, allowing users to delve into the assets and structures of games developed with Unreal Engine.
    Downloads: 82 This Week
    Last Update:
    See Project
  • 17
    newpipeextractor

    newpipeextractor

    Library for extracting streaming site data without official APIs

    ...It handles many low-level tasks involved in web data extraction, including parsing responses, managing platform-specific logic, and handling errors, allowing developers to focus on implementing application features rather than scraping mechanics. Each supported service is implemented through its own extractor components that conform to a common interface, enabling consistent access to data across different platforms.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 18
    Extractous

    Extractous

    Fast and efficient unstructured data extraction

    Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    BlockArrays.jl

    BlockArrays.jl

    BlockArrays for Julia

    A block array is a partition of an array into blocks or subarrays, see Wikipedia for a more extensive description. This package has two purposes. Firstly, it defines an interface for an AbstractBlockArray block arrays that can be shared among types representing different types of block arrays. The advantage to this is that it provides a consistent API for block arrays. Secondly, it also implements two different types of block arrays that follow the AbstractBlockArray interface. The type...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 20
    Bespoke Curator

    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    Curator is an open-source Python library designed to build synthetic data pipelines for training and evaluating machine learning models, particularly large language models. The system helps developers generate, transform, and curate high-quality datasets by combining automated generation with structured validation and filtering. It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question answering, or structured information extraction tasks. ...
    Downloads: 7 This Week
    Last Update:
    See Project
  • 21
    DocStrange

    DocStrange

    Extract and convert data from any document, images, pdfs, word doc

    DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 22
    mtail

    mtail

    Extract internal monitoring data from application logs

    Extract internal monitoring data from application logs for collection in a time-series database. mtail is a tool for extracting metrics from application logs to be exported into a timeseries database or timeseries calculator for alerting and dashboarding. It fills a monitoring niche by being the glue between applications that do not export their own internal state (other than via logs) and existing monitoring systems, such that system operators do not need to patch those applications to instrument them or writing custom extraction code for every such application. ...
    Downloads: 12 This Week
    Last Update:
    See Project
  • 23
    Ferret

    Ferret

    Declarative web scraping

    A web scraping system aiming to simplify data extraction from the web. ferret has a declarative query language that makes it easy to focus on the data that you need to get. ferret has the ability to scrape JS rendered pages, handle all page events, and emulate user interactions. the ferret was designed as a library from the ground up. it can be easily embedded into any Go application. ferret helps you to focus on the data you need using an easy-to-learn declarative language. ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages. ferret is extremely extensible, and creating custom functions and types is super easy. ferret allows users to focus on the data. ...
    Downloads: 4 This Week
    Last Update:
    See Project
  • 24
    Wiseflow

    Wiseflow

    Enhance any agent's browser use skill

    Wiseflow is an open-source information extraction and knowledge discovery system designed to collect, filter, and organize valuable information from large volumes of online content. The platform continuously monitors specified sources such as websites, social platforms, and other digital channels to identify relevant data according to user-defined interests or topics. By combining web crawling, content parsing, and large language model analysis, the system extracts concise insights from raw information streams and converts them into structured data that can be stored or analyzed. ...
    Downloads: 13 This Week
    Last Update:
    See Project
  • 25
    NeMo Retriever Library

    NeMo Retriever Library

    Document content and metadata extraction microservice

    ...It processes various document types by splitting them into components such as text, tables, charts, and images, and then applies OCR and contextual analysis to convert them into structured data formats. The system is built on NVIDIA NIM microservices, enabling high-performance parallel processing and efficient handling of large datasets. It supports multiple extraction strategies for different document formats, balancing accuracy and throughput depending on the use case. Additionally, it can generate embeddings for extracted content and integrate with vector databases like Milvus, making it well-suited for retrieval-augmented generation pipelines.
    Downloads: 2 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
Auth0 Logo