Showing 1407 open source projects for "data processing"

  • 1
    Data-Juicer

    Data processing for and with foundation models

    Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.
    Downloads: 0 This Week
  • 2
    Polymarket Data

    Polymarket Data Retriever that fetches, processes, and structures data

    Polymarket Data is a comprehensive data engineering pipeline designed to collect, process, and structure trading activity from the Polymarket prediction market ecosystem into analyzable datasets. The system operates as a multi-stage pipeline that integrates data from both off-chain APIs and on-chain event sources, enabling users to reconstruct full trading activity including markets, order events, and executed trades. It begins by fetching market metadata such as questions, outcomes, and...
    Downloads: 9 This Week
  • 3
    Data Formulator

    Create rich visualizations with AI

    To create rich visualizations, data analysts often need to iterate back and forth between data processing and chart specification. Doing so requires not only proficiency in data transformation and visualization tools but also effort to manage a branching history of many different versions of data and charts. Recent LLM-powered AI systems have greatly improved visualization authoring experiences, for example by mitigating manual data transformation barriers via LLMs' code generation ability. ...
    Downloads: 0 This Week
  • 4
    Synthetic Data Generator

    SDG is a specialized framework

    ...It also includes a data processing module capable of handling different data types, preprocessing columns, managing missing values, and converting formats automatically before model training.
    Downloads: 1 This Week
  • 5
    NYC Taxi Data

    Import public NYC taxi and for-hire vehicle (Uber, Lyft)

    The nyc-taxi-data repository is a rich dataset and exploratory project around New York City taxi trip records. It collects and preprocesses large-scale trip datasets (fares, pickup/dropoff, timestamps, locations, passenger counts) to enable data analysis, modeling, and visualization efforts. The project includes scripts and notebooks for cleaning and filtering the raw data, memory-efficient processing for large CSV/Parquet files, and aggregation workflows (e.g. trips per hour, heatmaps of pickups/dropoffs). ...
    Downloads: 3 This Week
  • 6
    Agentic Data Scientist

    An end-to-end Data Scientist

    ...Each agent is designed to independently call functions, interact with data sources, and adapt to uncertainties during processing, enabling iterative refinement of models without manual coordination. The framework supports interoperability with existing data tools and libraries, letting the agents leverage libraries like pandas, scikit-learn, and visualization frameworks to perform real computations rather than mock demonstrations.
    Downloads: 1 This Week
  • 7
    Kapacitor

    Open source framework for processing, monitoring, and alerting

    Open source framework for processing, monitoring, and alerting on time series data. Kapacitor is a real-time data processing engine for monitoring and alerting, specifically designed to work with time-series data from InfluxDB.
    Downloads: 2 This Week
  • 8
    Arroyo

    Distributed stream processing engine in Rust

    Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available.
    Downloads: 1 This Week
  • 9
    MeshLab

    The open source mesh processing system

    ...VCG can be used as a stand-alone large-scale automated mesh processing pipeline, while MeshLab makes it easy to experiment with its algorithms interactively. The open source system for processing and editing 3D triangular meshes. It provides a set of tools for editing, cleaning, healing, inspecting, rendering, texturing and converting meshes. It offers features for processing raw data produced by 3D digitization tools/devices and for preparing models for 3D printing.
    Downloads: 37 This Week
  • 10
    LAStools

    efficient tools for LiDAR processing

    LAStools is a collection of efficient, multi-core, scriptable tools for processing LiDAR data. It supports various formats, including LAS, LAZ, Terrasolid BIN, and ESRI Shapefiles, providing a comprehensive suite for LiDAR data management and analysis.
    Downloads: 17 This Week
  • 11
    pdfcpu

    A PDF processor written in Go

    pdfcpu is a PDF processing library written in Go that supports encryption. It provides both an API and a CLI. All versions up to PDF 1.7 (ISO-32000) are supported. This is an effort to build a comprehensive PDF processing library from the ground up in Go. Over time pdfcpu aims to support the standard range of PDF processing features and also any interesting use cases that may present themselves along the way. The main focus lies on strong support for batch processing and scripting via a...
    Downloads: 15 This Week
  • 12
    CyberChef

    A web app for encryption, encoding, compression and data analysis

    CyberChef, developed by GCHQ, is a versatile web application dubbed the "Cyber Swiss Army Knife." It enables users to perform a wide array of operations on data, including encryption, encoding, compression, and analysis, all within a browser interface.
    Downloads: 33 This Week
  • 13
    go-streams

    A lightweight stream processing library for Go

    A lightweight stream processing library for Go. go-streams provides a simple and concise DSL to build data pipelines. In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion.
    Downloads: 0 This Week
  • 14
    ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs

    ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.
    Downloads: 3 This Week
  • 15
    Broadway

    Concurrent and multi-stage data ingestion and data processing

    Broadway is a data processing library for Elixir designed to handle high-throughput, concurrent workloads with ease. It provides an abstraction for defining pipelines that consume data from sources like RabbitMQ, Kafka, Amazon SQS, or custom producers. Each pipeline is fault-tolerant and backpressure-aware, ensuring stable throughput even under load.
    Downloads: 3 This Week
  • 16
    Apache Spark

    A unified analytics engine for large-scale data processing

    Apache Spark is a unified engine for large-scale data processing, offering APIs for batch jobs, streaming, machine learning, and graph computation. It builds on resilient distributed datasets (RDDs) and the newer DataFrame/Dataset abstractions to provide fault-tolerant, in-memory computation across clusters. Spark’s execution engine handles scheduling, shuffles, caching, and data locality so users can focus on transformations rather than infrastructure plumbing. ...
    Downloads: 10 This Week
  • 17
    DeerFlow

    Deep Research framework, combining language models with tools

    DeerFlow is an open-source, community-driven “deep research” framework / multi-agent orchestration platform developed by ByteDance. It aims to combine the reasoning power of large language models (LLMs) with automated tool-use — such as web search, web crawling, Python execution, and data processing — to enable complex, end-to-end research workflows. Instead of a monolithic AI assistant, DeerFlow defines multiple specialized agents (e.g. “planner,” “searcher,” “coder,” “report generator”) that collaborate in a structured workflow, allowing tasks like literature reviews, data gathering, data analysis, code execution, and final report generation to be largely automated. ...
    Downloads: 572 This Week
  • 18
    Bytewax

    Python Stream Processing

    ...Bytewax is a Python framework and Rust distributed processing engine that uses a dataflow computational model to provide parallelizable stream processing and event processing capabilities similar to Flink, Spark, and Kafka Streams. You can use Bytewax for a variety of workloads, from moving data à la Kafka Connect all the way to advanced online machine learning workloads. Bytewax is not limited to streaming applications but excels anywhere that data can be distributed at the input and output.
    Downloads: 0 This Week
  • 19
    Numaflow

    Kubernetes-native platform to run massively parallel data/streaming

    Numaflow is a Kubernetes-native tool for running massively parallel stream processing. A Numaflow Pipeline is implemented as a Kubernetes custom resource and consists of one or more source, data processing, and sink vertices. Numaflow installs in a few minutes and is easier and cheaper to use for simple data processing applications than a full-featured stream processing platform.
    Downloads: 0 This Week
  • 20
    Easy3D

    Efficient library for processing 3D data

    Easy3D is a lightweight, easy-to-use, and efficient library for processing and rendering 3D data, implemented in C++ with Python bindings. It is designed for tasks such as 3D modeling, geometry processing, and rendering, emphasizing simplicity and efficiency. Easy3D serves as a valuable tool for research, education, and the development of sophisticated 3D applications, providing a solid foundation for handling 3D data.
    Downloads: 6 This Week
  • 21
    Pachyderm

    Data-Centric Pipelines and Data Versioning

    ...Pachyderm provides a powerful solution to optimize data processing, MLOps, and ML Lifecycles.
    Downloads: 1 This Week
  • 22
    fluentbit

    Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX

    Fluent Bit is a super-fast, lightweight, and highly scalable logging and metrics processor and forwarder. It is the preferred choice for cloud and containerized environments. A robust, lightweight, and portable architecture for high throughput with low CPU and memory usage from any data source to any destination. Proven across distributed cloud and container environments. Highly available with I/O handlers to store data for disaster recovery. Granular management of data parsing and routing....
    Downloads: 12 This Week
  • 23
    Best-of Python

    A ranked list of awesome Python open-source libraries

    ...Ranked list of awesome python libraries for web development. Correctly generate plurals, ordinals, indefinite articles; convert numbers. Libraries for loading, collecting, and extracting data from a variety of data sources and formats. Libraries for data batch- and stream-processing, workflow automation, job scheduling, and other data pipeline tasks.
    Downloads: 4 This Week
  • 24
    Reactor Core

    Non-Blocking Reactive Foundation for the JVM

    Reactor Core is a foundational library for building reactive applications in Java, providing a powerful API for asynchronous, non-blocking programming.
    Downloads: 4 This Week
  • 25
    GLM

    OpenGL Mathematics (GLM)

    ...This project isn't limited to GLSL features. An extension system, based on the GLSL extension conventions, provides extended capabilities: matrix transformations, quaternions, data packing, random numbers, noise, etc. This library works perfectly with OpenGL, but it also ensures interoperability with other third-party libraries and SDKs. It is a good candidate for software rendering (raytracing / rasterisation), image processing, physics simulations and any development context that requires a simple and convenient mathematics library. ...
    Downloads: 90 This Week