Data Pipeline Tools for Linux

Browse free open source Data Pipeline tools and projects for Linux below.

  • 1
    Pentaho

    Pentaho offers a comprehensive data integration and analytics platform.

    Pentaho couples data integration with business analytics in a modern platform to easily access, visualize and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premise, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications all from within an intuitive and easy to use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/
    Downloads: 1,801 This Week
  • 2
    lakeFS

    lakeFS - Git-like capabilities for your object storage

    Increase data quality and reduce the painful cost of errors by applying data engineering best practices through Git-like operations on data. lakeFS is an open-source data version control system for data lakes. It enables zero-copy, isolated dev/test environments, continuous quality validation, atomic rollback of bad data, reproducibility, and more. Data is dynamic and changes over time; dealing with that without a data version control system is error-prone and labor-intensive. With lakeFS, your data lake is version controlled and you can time-travel between consistent snapshots of the lake. Test your ETLs on top of production data, in isolation, without copying anything. Safely experiment on full production data, collaborate on it with your team, and automate data quality checks within data pipelines.
    Downloads: 9 This Week
  • 3
    StarRocks

    StarRocks is a next-gen sub-second MPP database for full analytics

    StarRocks is the next generation of real-time SQL engines for enterprise analytics. Real-time analytics is notoriously difficult. Complex data pipelines and de-normalized tables have always been a necessary evil, and processing updates or deletes once data has arrived has not been possible until now. StarRocks solves these challenges and makes real-time analytics easy. Get amazing query performance on star or snowflake schemas directly. From canceled orders to updated items, your analytics applications can easily handle them with StarRocks. From streaming data to change data capture, StarRocks meets the data ingestion demands of real-time analytics. Scale storage and computing power horizontally and support tens of thousands of concurrent users. All of your BI tools work with StarRocks through standard SQL. StarRocks provides superior performance. It is also a unified OLAP engine covering most data analytics scenarios.
    Downloads: 7 This Week
  • 4
    Best-of Python

    A ranked list of awesome Python open-source libraries

    This curated list contains 390 awesome open-source projects with a total of 1.4M stars grouped into 28 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you would like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Contributions are very welcome! Categories include: Python libraries for web development; libraries that correctly generate plurals, ordinals, and indefinite articles and convert numbers; libraries for loading, collecting, and extracting data from a variety of data sources and formats; and libraries for data batch- and stream-processing, workflow automation, job scheduling, and other data pipeline tasks.
    Downloads: 6 This Week
  • 5
    rudderstack

    Privacy and Security focused Segment-alternative, in Golang

    Quickly deploy flexible, powerful customer data pipelines, then send the data to your entire stack without the engineering headache. Our complete toolset makes it easy to level up your customer data stack. Our 180+ integrations, along with custom webhook sources and destinations, save data teams hundreds of hours. Say goodbye to different versions of the truth: our SDKs track anonymous and known users at the source and reconcile users in your warehouse and SaaS tools. Go beyond event streaming and control all of your customer data on your own terms. Learn how we can help you build a customer data platform. RudderStack treats your data warehouse as a first-class citizen among destinations, with advanced features and configurable, near real-time sync. RudderStack is built API-first. It integrates seamlessly with the tools that developers already use and love.
    Downloads: 5 This Week
  • 6
    Conduit

    Conduit streams data between data stores. Kafka Connect replacement

    Conduit is a data streaming tool written in Go. It aims to provide the best user experience for building and running real-time data pipelines. Conduit comes with batteries included: it provides a UI, common connectors, processors, and observability data out of the box. Sync data between your production systems using an extensible, event-first experience with minimal dependencies that fits within your existing workflow. Eliminate the multi-step process you go through today. Just download the binary and start building. Conduit connectors give you the ability to pull and push data to any production datastore you need. If a datastore is missing, the simple SDK allows you to extend Conduit where you need it. Conduit pipelines listen for changes to a database, data warehouse, etc., and allow your data applications to act upon those changes in real time. Run it in a way that works for you: use it as a standalone service or orchestrate it within your infrastructure.
    Downloads: 4 This Week
  • 7
    Union Pandera

    Light-weight, flexible, expressive statistical data testing library

    The open-source framework for precision data testing for data scientists and ML engineers. Pandera provides a simple, flexible, and extensible data-testing framework for validating not only your data but also the functions that produce them. A simple, zero-configuration data testing framework for data scientists and ML engineers seeking correctness. Access a comprehensive suite of built-in tests, or easily create your own validation rules for your specific use cases. Validate the functions that produce your data by automatically generating test cases for them. Integrate seamlessly with the Python ecosystem. Overcome the initial hurdle of defining a schema by inferring one from clean data, then refine it over time. Identify the critical points in your data pipeline, and validate data going in and out of them. Build confidence in the quality of your data by defining schemas for complex data objects.
    Downloads: 4 This Week
  • 8
    memphis

    Next-Generation Event Processing Platform

    Memphis enables building modern queue-based applications that require large volumes of streamed and enriched data, modern protocols, and zero ops, with up to 9x faster development, up to 46x lower costs, and significantly lower dev time for data-oriented developers and data engineers. Queues and brokers are a mission-critical component of modern application architecture and should be as highly available and stable as possible. Memphis aims to provide great performance while maintaining efficient resource consumption; to increase observability through integrations with third-party monitoring tools, real-time notifications, and stream lineage, and thereby reduce troubleshooting time; and to enable rapid development and ultra-short time-to-production.
    Downloads: 4 This Week
  • 9
    Datapipe

    Real-time, incremental ETL library for ML with record-level depend

    Datapipe is a real-time, incremental ETL library for Python with record-level dependency tracking. Datapipe is designed to streamline the creation of data processing pipelines. It excels in scenarios where data is continuously changing, requiring pipelines to adapt and process only the modified data efficiently. This library tracks dependencies for each record in the pipeline, ensuring minimal and efficient data processing.
    Downloads: 29 This Week
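
    The idea of reprocessing only modified records can be illustrated with a small sketch. This is not Datapipe's API; the functions `record_hash` and `incremental_run` and the in-memory `seen` store are hypothetical names chosen for illustration, and change detection by content hash is an assumption about one possible approach.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Return a stable fingerprint of a record's content."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def incremental_run(records: dict, seen: dict, transform) -> dict:
    """Apply `transform` only to records whose content changed since the last run.

    `records` maps record id -> record dict. `seen` maps record id -> last
    fingerprint and is updated in place. Returns the outputs of the records
    that were actually reprocessed.
    """
    outputs = {}
    for rid, rec in records.items():
        fingerprint = record_hash(rec)
        if seen.get(rid) != fingerprint:
            outputs[rid] = transform(rec)
            seen[rid] = fingerprint
    return outputs


def double(rec: dict) -> int:
    """Hypothetical transform used only for this example."""
    return rec["x"] * 2


seen_hashes = {}
source = {"a": {"x": 1}, "b": {"x": 2}}

first = incremental_run(source, seen_hashes, double)   # processes a and b
source["b"] = {"x": 5}
second = incremental_run(source, seen_hashes, double)  # processes only b
```

    On the second run only record "b" is transformed, because the stored fingerprint for "a" still matches its content.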
  • 10
    Backstage

    Backstage is an open platform for building developer portals

    Powered by a centralized software catalog, Backstage restores order to your infrastructure and enables your product teams to ship high-quality code quickly, without compromising autonomy. At Spotify, we've always believed in the speed and ingenuity that comes from having autonomous development teams. But as we learned firsthand, the faster you grow, the more fragmented and complex your software ecosystem becomes. And then everything slows down again. By centralizing services and standardizing your tooling, Backstage streamlines your development environment from end to end. Instead of restricting autonomy, standardization frees your engineers from infrastructure complexity. So you can return to building and scaling, quickly and safely. Every team can see all the services they own and related resources (deployments, data pipelines, pull request status, etc.)
    Downloads: 3 This Week
  • 11
    Elementary

    Open-source data observability for analytics engineers

    Elementary is an open-source data observability solution for data and analytics engineers. Monitor your dbt project and data in minutes, and be the first to know of data issues. Gain immediate visibility, detect data issues, send actionable alerts, and understand the impact and root cause. Generate a data observability report, then host it or share it with your team. Monitor data quality metrics, freshness, volume, and schema changes, including anomaly detection. Elementary data monitors are configured and executed like native tests in your dbt project. dbt artifacts and run and test results are uploaded and modeled into tables as part of your runs. Get informative notifications on data issues, schema changes, and model and test failures. Inspect upstream and downstream dependencies to understand the impact and root cause of data issues.
    Downloads: 3 This Week
  • 12
    Luigi

    Python module that helps you build complex pipelines of batch jobs

    Luigi is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more. The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks and automate them, and failures will happen. These tasks can be anything, but are typically long-running things like Hadoop jobs, dumping data to/from databases, or running machine learning algorithms. You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you can use. It includes support for running Python MapReduce jobs in Hadoop, as well as Hive and Pig jobs. It also comes with file system abstractions for HDFS and local files that ensure all file system operations are atomic.
    Downloads: 3 This Week
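
    Dependency resolution of the kind described above can be sketched with the Python standard library alone. This is not Luigi's API: the task names, the `dependencies` mapping, and `run_pipeline` are hypothetical, and the sketch only records which tasks would run rather than executing real jobs. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}


def run_pipeline(deps: dict, already_done: set) -> list:
    """Visit tasks in dependency order, skipping those already complete."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in already_done:
            continue  # output already exists, so nothing to redo
        executed.append(task)  # a real scheduler would run the task here
    return executed


# "extract" finished in an earlier run, so only the downstream tasks remain.
order = run_pipeline(dependencies, already_done={"extract"})
```

    Skipping completed tasks is what lets a failed pipeline be resumed without repeating earlier work.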
  • 13
    The Tengo Language

    A fast script language for Go

    Tengo is a small, dynamic, fast, secure script language for Go. Tengo is fast and secure because it is compiled and executed as bytecode on a stack-based VM written in native Go. It is securely embeddable and extensible. The compiler and runtime are written in native Go (no external deps or cgo). It is executable as a standalone language / REPL. Use cases: rules engine, state machine, data pipeline, transpiler. If you need to evaluate a simple expression, you can use the Eval function instead.
    Downloads: 3 This Week
  • 14
    AutoGluon

    AutoGluon: AutoML for Image, Text, and Tabular Data

    AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data. Intended for both ML beginners and experts, AutoGluon enables you to quickly prototype deep learning and classical ML solutions for your raw data with a few lines of code. Automatically utilize state-of-the-art techniques (where appropriate) without expert knowledge. Leverage automatic hyperparameter tuning, model selection/ensembling, architecture search, and data processing. Easily improve/tune your bespoke models and data pipelines, or customize AutoGluon for your use-case. AutoGluon is modularized into sub-modules specialized for tabular, text, or image data. You can reduce the number of dependencies required by solely installing a specific sub-module via: python3 -m pip install <submodule>.
    Downloads: 2 This Week
  • 15
    DataGym.ai

    Open source annotation and labeling tool for image and video assets

    DATAGYM enables data scientists and machine learning experts to label images up to 10x faster. AI-assisted annotation tools reduce manual labeling effort, give you more time to finetune ML models and speed up your go to market of new products. Accelerate your computer vision projects by cutting down data preparation time up to 50%. A machine learning model is only as good as its training data. DATAGYM is an end-to-end workbench to create, annotate, manage, and export the right training data for your computer vision models. Your image data can be imported into DATAGYM from your local machine, from any public image URL or directly from an AWS cloud S3 bucket. Machine learning teams spend up to 80% of their time on data preparation. DATAGYM provides AI-powered annotation functions to help you accelerate your labeling task. The Pre-Labeling feature enables turbo-labeling – it processes thousands of images in the background within a very short time.
    Downloads: 2 This Week
  • 16
    Dolphin Scheduler

    A distributed and extensible workflow scheduler platform

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs out of the box. It has a decentralized multi-master and multi-worker design, supports HA by itself, and handles overload. All process definition operations are visualized, key information about a process is visible at a glance, and deployment is one-click. It supports multi-tenancy and many task types, e.g., spark, flink, hive, mr, shell, python, sub_process, as well as custom task types. Scheduling is distributed, and the overall scheduling capability increases linearly with the scale of the cluster.
    Downloads: 2 This Week
  • 17
    Kestra

    Kestra is an infinitely scalable orchestration and scheduling platform

    Build reliable workflows, blazingly fast, deploy in just a few clicks. Kestra is an open-source, event-driven orchestrator that simplifies data operations and improves collaboration between engineers and business users. By bringing Infrastructure as Code best practices to data pipelines, Kestra allows you to build reliable workflows and manage them with confidence. Thanks to the declarative YAML interface for defining orchestration logic, everyone who benefits from analytics can participate in the data pipeline creation process. The UI automatically adjusts the YAML definition any time you make changes to a workflow from the UI or via an API call. Therefore, the orchestration logic is defined declaratively in code, even if some workflow components are modified in other ways.
    Downloads: 2 This Week
  • 18
    go-streams

    A lightweight stream processing library for Go

    A lightweight stream processing library for Go. go-streams provides a simple and concise DSL to build data pipelines. In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.
    Downloads: 2 This Week
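
    The pipeline definition above, where the output of one element is the input of the next, can be sketched with chained generators. The sketch is written in Python for illustration and does not use the go-streams API; the stage names `source`, `keep_even`, and `square` are hypothetical.

```python
def source(items):
    """First stage: emit items one at a time."""
    for item in items:
        yield item


def keep_even(stream):
    """Middle stage: pass through only even numbers."""
    for value in stream:
        if value % 2 == 0:
            yield value


def square(stream):
    """Last stage: square each value it receives."""
    for value in stream:
        yield value * value


# Stages are connected in series; each one pulls from the stage before it,
# so only one item needs to be in flight between stages at a time.
pipeline = square(keep_even(source(range(6))))
result = list(pipeline)
```

    Because generators are lazy, nothing runs until `list` pulls values through the chain, which mirrors how elements of a pipeline process data as it arrives.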
  • 19
    Mage.ai

    Build, run, and manage data pipelines for integrating data

    Open-source data pipeline tool for transforming and integrating data. The modern replacement for Airflow. Effortlessly integrate and synchronize data from 3rd party sources. Build real-time and batch pipelines to transform data using Python, SQL, and R. Run, monitor, and orchestrate thousands of pipelines without losing sleep. Have you met anyone who said they loved developing in Airflow? That’s why we designed an easy developer experience that you’ll enjoy. Each step in your pipeline is a standalone file containing modular code that’s reusable and testable with data validations. No more DAGs with spaghetti code. Start developing locally with a single command or launch a dev environment in your cloud using Terraform. Write code in Python, SQL, or R in the same data pipeline for ultimate flexibility.
    Downloads: 1 This Week
  • 20
    gusty

    Making DAG construction easier

    gusty allows you to control your Airflow DAGs, Task Groups, and Tasks with greater ease. gusty manages collections of tasks, represented as any number of YAML, Python, SQL, Jupyter Notebook, or R Markdown files. A directory of task files is instantly rendered into a DAG by passing a file path to gusty's create_dag function. gusty also manages dependencies (within one DAG) and external dependencies (dependencies on tasks in other DAGs) for each task file you define. All you have to do is provide a list of dependencies or external_dependencies inside of a task file, and gusty will automatically set each task's dependencies and create external task sensors for any external dependencies listed. gusty works with both Airflow 1.x and Airflow 2.x, and has even more features, all of which aim to make the creation, management, and iteration of DAGs more fluid, so that you can intuitively design your DAG and build your tasks.
    Downloads: 1 This Week
  • 21
    whylogs

    The open standard for data logging

    whylogs is an open-source library for logging any kind of data. With whylogs, users can generate summaries of their datasets (called whylogs profiles), which they can use to track changes in their datasets, create data constraints to know whether their data looks the way it should, and quickly visualize key summary statistics about their datasets. whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, whylogs can accurately represent the data and enable all of the use cases described above.
    Downloads: 1 This Week
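
    As a rough illustration of what a statistical profile of a column might contain, the following sketch computes a few summary values with the standard library. It does not use the whylogs API; `profile_column` and the chosen fields are hypothetical, and a real profile captures far more than this.

```python
import statistics


def profile_column(values: list) -> dict:
    """Summarize a column: row count, missing values, and basic statistics."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# One missing value among four rows.
profile = profile_column([3.0, None, 5.0, 4.0])
```

    Comparing such summaries between two batches of data is one simple way to notice that a dataset has changed.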
  • 22
    CloverDX

    Design, automate, operate and publish data pipelines at scale

    Please visit www.cloverdx.com for the latest product versions. CloverDX is a data integration platform that can be used to transform, map, and manipulate data in batch and near-realtime modes. It supports various input/output formats (CSV, FIXLEN, Excel, XML, JSON, Parquet, Avro, EDI/X12, HL7, COBOL, LOTUS, etc.) and connects to RDBMS/JMS/Kafka/SOAP/REST/LDAP/S3/HTTP/FTP/ZIP/TAR. CloverDX offers 100+ specialized components, which can be further extended by creating "macros" (subgraphs) and libraries, shareable with third parties. Simple data manipulation jobs can be created visually. More complex business logic can be implemented using Clover's domain-specific language CTL, in Java, or in languages like Python or JavaScript. Its DataServices functionality lets you quickly turn data pipelines into REST API endpoints. The platform lets you easily scale your data jobs across multiple cores or nodes/machines. It supports Docker/Kubernetes deployments and offers AWS/Azure images in their respective marketplaces.
    Downloads: 4 This Week
  • 23
    apache spark data pipeline osDQ

    osDQ subproject for creating Apache Spark based data pipelines using JSON

    This is an offshoot of the open source data quality (osDQ) project, https://sourceforge.net/projects/dataquality/. This subproject creates Apache Spark based data pipelines in which JSON metadata files drive data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode. Get a JSON example at https://github.com/arrahtech/osdq-spark

    How to run: unzip the zip file, then run the command for your platform.

    Windows:
        java -cp .\lib\*;osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

    Mac / UNIX:
        java -cp ./lib/*:./osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json

    On Windows, you need a Hadoop distribution unzipped on a local drive and HADOOP_HOME set. Also copy winutils.exe from here into HADOOP_HOME\bin
    Downloads: 2 This Week
  • 24
    Alluxio

    Open Source Data Orchestration for the Cloud

    Alluxio is the world’s first open source data orchestration technology for analytics and AI for the cloud. It bridges the gap between computation frameworks and storage systems, bringing data from the storage tier closer to the data driven applications. This enables applications to connect to numerous storage systems through a common interface. It makes data local, more accessible and as elastic as compute.
    Downloads: 0 This Week
  • 25
    Apache SeaTunnel

    SeaTunnel is a distributed, high-performance data integration platform

    SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can stably and efficiently synchronize tens of billions of data records every day, and has been used in production at nearly 100 companies. There are hundreds of commonly used data sources, many with incompatible versions, and more data sources keep appearing as new technologies emerge. It is difficult for users to find a tool that can fully and quickly support all of them. Data synchronization needs to support various scenarios such as offline full synchronization, offline incremental synchronization, CDC, real-time synchronization, and full database synchronization. Existing data integration and data synchronization tools often require vast computing resources or JDBC connection resources to complete real-time synchronization of large numbers of small tables.
    Downloads: 0 This Week