Best Open Source Linux Stream Processing Tools 2025

Stream Processing Tools for Linux

Stream Processing Linux Clear Filters

Browse free open source Stream Processing tools and projects for Linux below. Use the toggles on the left to filter open source Stream Processing tools by OS, license, language, programming language, and project status.

Gen AI apps are built with MongoDB Atlas
The database for AI-powered applications.

MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.

Start Free
Field Service Management Software | BlueFolder
Maximize technician productivity with intuitive field service software

Track all your service data in one easy-to-use system, enabling your team to move faster and generate more revenue for your bottom line.

Learn More
1

Benthos

Fancy stream processing made operationally mundane

Benthos is a high performance and resilient stream processor, able to connect various sources and sinks in a range of brokering patterns and perform hydration, enrichments, transformations and filters on payloads. It comes with a powerful mapping language, is easy to deploy and monitor, and ready to drop into your pipeline either as a static binary, docker image, or serverless function, making it cloud native as heck. Delivery guarantees can be a dodgy subject. Benthos processes and acknowledges messages using an in-process transaction model with no need for any disk persisted state, so when connecting to at-least-once sources and sinks it's able to guarantee at-least-once delivery even in the event of crashes, disk corruption, or other unexpected server faults. This behaviour is the default and free of caveats, which also makes deploying and scaling Benthos much simpler.

Downloads: 9 This Week

Last Update: 21 hours ago
See Project
2

Best-of Python

A ranked list of awesome Python open-source libraries

This curated list contains 390 awesome open-source projects with a total of 1.4M stars grouped into 28 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Contributions are very welcome! Ranked list of awesome python libraries for web development. Correctly generate plurals, ordinals, indefinite articles; convert numbers. Libraries for loading, collecting, and extracting data from a variety of data sources and formats. Libraries for data batch- and stream-processing, workflow automation, job scheduling, and other data pipeline tasks.

Downloads: 5 This Week

Last Update: 2 days ago
See Project
3

fluentbit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX

Fluent Bit is a super-fast, lightweight, and highly scalable logging and metrics processor and forwarder. It is the preferred choice for cloud and containerized environments. A robust, lightweight, and portable architecture for high throughput with low CPU and memory usage from any data source to any destination. Proven across distributed cloud and container environments. Highly available with I/O handlers to store data for disaster recovery. Granular management of data parsing and routing. Filtering and enrichment to optimize security and minimize cost. The lightweight, asynchronous design optimizes resource usage: CPU, memory, disk I/O, network. No more OOM errors! Integration with all your technology, cloud-native services, containers, streaming processors, and data backends. Fully event-driven design leverages the operating system API for performance and reliability. All operations to collect and deliver data are asynchronous.

Downloads: 4 This Week

Last Update: 2025-11-11
See Project
4

Riemann

A network event stream processing system, in Clojure

Riemann aggregates events from your servers and applications with a powerful stream processing language. Send an email for every exception in your app. Track the latency distribution of your web app. See the top processes on any host, by memory and CPU. Combine statistics from every Riak node in your cluster and forward to Graphite. Track user activity from second to second. Riemann streams are just functions which accept an event. Events are just structs with some common fields like :host and :service You can use dozens of built-in streams for filtering, altering, and combining events, or write your own. Since Riemann's configuration is a Clojure program, its syntax is concise, regular, and extendable. Configuration-as-code minimizes boilerplate and gives you the flexibility to adapt to complex situations.

Downloads: 3 This Week

Last Update: 2025-05-26
See Project
Turn traffic into pipeline and prospects into customers
For account executives and sales engineers looking for a solution to manage their insights and sales data

Docket is an AI-powered sales enablement platform designed to unify go-to-market (GTM) data through its proprietary Sales Knowledge Lake™ and activate it with intelligent AI agents. The platform helps marketing teams increase pipeline generation by 15% by engaging website visitors in human-like conversations and qualifying leads. For sales teams, Docket improves seller efficiency by 33% by providing instant product knowledge, retrieving collateral, and creating personalized documents. Built for GTM teams, Docket integrates with over 100 tools across the revenue tech stack and offers enterprise-grade security with SOC 2 Type II, GDPR, and ISO 27001 compliance. Customers report improved win rates, shorter sales cycles, and dramatically reduced response times. Docket’s scalable, accurate, and fast AI agents deliver reliable answers with confidence scores, empowering teams to close deals faster.

Learn More
5

Acl

A powerful server and network library, including coroutine

The Acl (Advanced C/C++ Library) project a is powerful multi-platform network communication library and service framework, supporting LINUX, WIN32, Solaris, FreeBSD, MacOS, AndroidOS, iOS. Many applications written by Acl run on these devices with Linux, Windows, iPhone and Android and serve billions of users. There are some important modules in Acl project, including network communcation, server framework, application protocols, multiple coders, etc. The common protocols such as HTTP/SMTP/ICMP//MQTT/Redis/Memcached/Beanstalk/Handler Socket are implemented in Acl, and the codec library such as XML/JSON/MIME/BASE64/UUCODE/QPCODE/RFC2047/RFC1035, etc., are also included in Acl. Acl also provides a unified abstract interface for popular databases such as Mysql, Postgresql, Sqlite. Using Acl library users can write database applications more easily, quickly and safely.

Downloads: 2 This Week

Last Update: 2025-07-26
See Project
6

Pathway

Python ETL framework for stream processing, real-time analytics, LLM

Pathway is an open-source framework designed for building real-time data applications using reactive and declarative paradigms. It enables seamless integration of live data streams and structured data into analytical pipelines with minimal latency. Pathway is especially well-suited for scenarios like financial analytics, IoT, fraud detection, and logistics, where high-velocity and continuously changing data is the norm. Unlike traditional batch processing frameworks, Pathway continuously updates the results of your data logic as new events arrive, functioning more like a database that reacts in real-time. It supports Python, integrates with modern data tools, and offers a deterministic dataflow model to ensure reproducibility and correctness.

Downloads: 2 This Week

Last Update: 2025-11-13
See Project
7

SageMaker Spark Container

Docker image used to run data processing workloads

Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing. The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.

Downloads: 2 This Week

Last Update: 2025-06-23
See Project
8

horizon

Horizon is a realtime, open-source backend for JavaScript apps

Horizon is an open-source developer platform for building sophisticated realtime apps. It provides a complete backend that makes it dramatically simpler to build, deploy, manage, and scale engaging JavaScript web and mobile apps. Horizon is extensible, integrates with the Node.js stack, and allows building modern, arbitrarily complex applications. While technologies like RethinkDB and WebSocket make it possible to build engaging realtime apps, empirically there is still too much friction for most developers. Building realtime apps now requires understanding and manually orchestrating multiple systems across the software stack, understanding distributed stream processing, and learning how to deploy and scale realtime systems. The learning curve is quite steep, and most of the initial work involves boilerplate code that is far removed from the primary task of building a realtime app.

Downloads: 2 This Week

Last Update: 2021-08-13
See Project
9

Akka

Build concurrent, distributed, and resilient message-driven apps

Build powerful reactive, concurrent, and distributed applications more easily. Akka is a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala. Actors and Streams let you build systems that scale up, using the resources of a server more efficiently, and out, using multiple servers. Building on the principles of The Reactive Manifesto Akka allows you to write systems that self-heal and stay responsive in the face of failures. Up to 50 million msg/sec on a single machine. Small memory footprint; ~2.5 million actors per GB of heap. Distributed systems without single points of failure. Load balancing and adaptive routing across nodes. Event Sourcing and CQRS with Cluster Sharding. Distributed Data for eventual consistency using CRDTs. Asynchronous non-blocking stream processing with backpressure.

Downloads: 1 This Week

Last Update: 4 days ago
See Project
Financial reporting cloud-based software.
For companies looking to automate their consolidation and financial statement function

The software is cloud based and automates complexities around consolidating and reporting for groups with multiple year ends, currencies and ERP systems with a slice and dice approach to reporting. While retaining the structure, control and validation needed in a financial reporting tool, we’ve managed to keep things flexible.

Learn More
10

Numaflow

Kubernetes-native platform to run massively parallel data/streaming

Numaflow is a Kubernetes-native tool for running massively parallel stream processing. A Numaflow Pipeline is implemented as a Kubernetes custom resource and consists of one or more source, data processing, and sink vertices. Numaflow installs in a few minutes and is easier and cheaper to use for simple data processing applications than a full-featured stream processing platform.

Downloads: 1 This Week

Last Update: 2025-11-11
See Project
11

ksqlDB

The database purpose-built for stream processing applications

Build applications that respond immediately to events. Craft materialized views over streams. Receive real-time push updates, or pull current state on demand. Seamlessly leverage your existing Apache Kafka® infrastructure to deploy stream-processing workloads and bring powerful new capabilities to your applications. Use a familiar, lightweight syntax to pack a powerful punch. Capture, process, and serve queries using only SQL. No other languages or services are required. ksqlDB enables you to build event streaming applications leveraging your familiarity with relational databases. Three categories are foundational to building an application: collections, stream processing, and queries. Streams are immutable, append-only sequences of events. They're useful for representing a series of historical facts. Tables are mutable collections of events. They let you represent the latest version of each value per key.

Downloads: 1 This Week

Last Update: 2021-12-21
See Project
12

text-dedup

All-in-one text de-duplication

text-dedup is a Python library that enables efficient deduplication of large text corpora by using MinHash and other probabilistic techniques to detect near-duplicate content. This is especially useful for NLP tasks where duplicated training data can skew model performance. text-dedup scales to billions of documents and offers tools for chunking, hashing, and comparing text efficiently with low memory usage. It supports Jaccard similarity thresholding, parallel execution, and flexible deduplication strategies, making it ideal for cleaning web-scraped data, language model training datasets, or document archives.

Downloads: 1 This Week

Last Update: 2025-04-08
See Project
13

DSPatch

The Refreshingly Simple C++ Dataflow Framework

Webite: http://flowbasedprogramming.com DSPatch, pronounced "dispatch", is a powerful C++ dataflow framework. DSPatch is not limited to any particular domain or data type, from reactive programming to stream processing, DSPatch's generic, object-oriented API allows you to create virtually any dataflow system imaginable. *See also:* DSPatcher ( https://github.com/MarcusTomlinson/DSPatcher ): A cross-platform graphical tool for building DSPatch circuits. DSPatchables ( https://github.com/MarcusTomlinson/DSPatchables ): A DSPatch component repository.

Downloads: 2 This Week

Last Update: 2020-02-15
See Project
14

Amadeus

Harmonious distributed data analysis in Rust

Amadeus is a high-performance, distributed data processing framework written in Rust, designed to offer an ergonomic and safe alternative to tools like Apache Spark. It provides both streaming and batch capabilities, allowing users to work with real-time and historical data at scale. Thanks to Rust’s memory safety and zero-cost abstractions, Amadeus delivers performance gains while reducing the complexity and bugs common in large-scale data pipelines. It emphasizes developer productivity through a fluent, expressive API and makes it easier to build composable and reliable data transformation pipelines without sacrificing speed or safety.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
15

Arroyo

Distributed stream processing engine in Rust

Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available.

Downloads: 0 This Week

Last Update: 2025-06-23
See Project
16

Biceps

An experimental CEP (Complex Event Processing) engine. It implements the event stream processing as a library embeddable in C++ and Perl. Since then it has been renamed to Triceps, so please look at the new location https://sourceforge.net/projects/t

Downloads: 0 This Week

Last Update: 2015-05-02
See Project
17

Bytewax

Python Stream Processing

Bytewax is a Python framework that simplifies event and stream processing. Because Bytewax couples the stream and event processing capabilities of Flink, Spark, and Kafka Streams with the friendly and familiar interface of Python, you can re-use the Python libraries you already know and love. Connect data sources, run stateful transformations, and write to various downstream systems with built-in connectors or existing Python libraries. Bytewax is a Python framework and Rust distributed processing engine that uses a dataflow computational model to provide parallelizable stream processing and event processing capabilities similar to Flink, Spark, and Kafka Streams. You can use Bytewax for a variety of workloads from moving data à la Kafka Connect style all the way to advanced online machine learning workloads. Bytewax is not limited to streaming applications but excels anywhere that data can be distributed at the input and output.

Downloads: 0 This Week

Last Update: 2024-11-25
See Project
18

CocoIndex

ETL framework to index data for AI, such as RAG

CocoIndex is an open-source framework designed for building powerful, local-first semantic search systems. It lets users index and retrieve content based on meaning rather than keywords, making it ideal for modern AI-based search applications. CocoIndex leverages vector embeddings and integrates with various models and frameworks, including OpenAI and Hugging Face, to provide high-quality semantic understanding. It’s built for transparency, ease of use, and local control over your search data, distinguishing itself from closed, black-box systems. The tool is suitable for developers working on personal knowledge bases, AI search interfaces, or private LLM applications.

Downloads: 0 This Week

Last Update: 7 days ago
See Project
19

Cosmos DB Spark

Apache Spark Connector for Azure Cosmos DB

Azure Cosmos DB Spark is the official connector for Azure CosmosDB and Apache Spark. The connector allows you to easily read to and write from Azure Cosmos DB via Apache Spark DataFrames in Python and Scala. It also allows you to easily create a lambda architecture for batch-processing, stream-processing, and a serving layer while being globally replicated and minimizing the latency involved in working with big data.

Downloads: 0 This Week

Last Update: 2023-12-21
See Project
20

Dataflow Java SDK

Google Cloud Dataflow provides a simple, powerful model

The Dataflow Java SDK is the open-source Java library that powers Apache Beam pipelines for Google Cloud Dataflow, a serverless and scalable platform for processing large datasets in both batch and stream modes. This SDK allows developers to write Beam-based pipelines in Java and execute them on Dataflow, taking advantage of features like autoscaling, dynamic work rebalancing, and fault-tolerant distributed processing. While it has been mostly succeeded by the unified Beam SDKs, it remains relevant for legacy systems and offers insight into the underlying mechanisms that power scalable data workflows on Google Cloud.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
21

FLOGO

Simplify building efficient & modern serverless functions and apps

Project Flogo is an ultra-light, Go-based open source ecosystem for building event-driven apps. Event-driven, you say? Yup, the notion of triggers and actions are leveraged to process incoming events. An action, a common interface, exposes key capabilities such as application integration, stream processing, etc. All capabilities within the Flogo Ecosystem have a few things in common, they all process events (in a manner suitable for the specific purpose) and they all implement the action interface exposed by Flogo Core. Integration Flows Application Integration process engine with conditional branching and a visual development environment. A simple pipeline-based stream processing action with event joining capabilities across multiple triggers & aggregation over time windows. Microgateway pattern for conditional, content-based routing, JWT validation, rate limiting, circuit breaking and other common patterns.

Downloads: 0 This Week

Last Update: 2023-01-24
See Project
22

Fondant

Production-ready data processing made easy and shareable

Fondant is a modular, pipeline-based framework designed to simplify the preparation of large-scale datasets for training machine learning models, especially foundation models. It offers an end-to-end system for ingesting raw data, applying transformations, filtering, and formatting outputs—all while remaining scalable and traceable. Fondant is designed with reproducibility in mind and supports containerized steps using Docker, making it easy to share and reuse data processing components. It’s built for use in research and production, empowering data scientists to streamline dataset curation and preprocessing workflows efficiently.

Downloads: 0 This Week

Last Update: 2025-04-08
See Project
23

GATES

A Middleware for Distrubted Data Stream Processing

Downloads: 0 This Week

Last Update: 2013-03-12
See Project
24

HStreamDB

HStreamDB is an open-source, cloud-native streaming database

HStreamDB is an open-source, cloud-native streaming database for IoT and beyond. Modernize your data stack for real-time applications. By subscribing to streams in HStreamDB, any update of the data stream will be pushed to your apps in real-time, and this promotes your apps to be more responsive. You can also replace message brokers with HStreamDB and everything you do with message brokers can be done better with HStreamDB. HStreamDB provides built-in support for event time-based stream processing. You can use your familiar SQL to perform basic filtering and transformation operations, statistics and aggregation based on multiple kinds of time windows and even joining between multiple streams. With connectors provided, you can easily integrate HStreamDB with other external systems, such as MQTT Broker, MySQL, Redis and ElasticSearch. More connectors will be added.

Downloads: 0 This Week

Last Update: 2024-04-26
See Project
25

Lithops

A multi-cloud framework for big data analytics

Lithops is an open-source serverless computing framework that enables transparent execution of Python functions across multiple cloud providers and on-prem infrastructure. It abstracts cloud providers like IBM Cloud, AWS, Azure, and Google Cloud into a unified interface and turns your Python functions into scalable, event-driven workloads. Lithops is ideal for data processing, ML inference, and embarrassingly parallel workloads, giving you the power of FaaS (Function-as-a-Service) without vendor lock-in. It also supports hybrid cloud setups, object storage access, and simple integration with Jupyter notebooks.

Downloads: 0 This Week

Last Update: 2025-09-13
See Project