Data Pipeline Tools for Windows

View 19 business solutions

Browse free open source Data Pipeline tools and projects for Windows below. Use the toggles on the left to filter open source Data Pipeline tools by OS, license, language, programming language, and project status.

  • Keep company data safe with Chrome Enterprise Icon
    Keep company data safe with Chrome Enterprise

    Protect your business with AI policies and data loss prevention in the browser

    Make AI work your way with Chrome Enterprise. Block unapproved sites and set custom data controls that align with your company's policies.
    Download Chrome
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 1
    Pentaho

    Pentaho

    Pentaho offers comprehensive data integration and analytics platform.

    Pentaho couples data integration with business analytics in a modern platform to easily access, visualize and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premise, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications all from within an intuitive and easy to use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/
    Leader badge
    Downloads: 2,239 This Week
    Last Update:
    See Project
  • 2
    Backstage

    Backstage

    Backstage is an open platform for building developer portals

    Powered by a centralized software catalog, Backstage restores order to your infrastructure and enables your product teams to ship high-quality code quickly, without compromising autonomy. At Spotify, we've always believed in the speed and ingenuity that comes from having autonomous development teams. But as we learned firsthand, the faster you grow, the more fragmented and complex your software ecosystem becomes. And then everything slows down again. By centralizing services and standardizing your tooling, Backstage streamlines your development environment from end to end. Instead of restricting autonomy, standardization frees your engineers from infrastructure complexity. So you can return to building and scaling, quickly and safely. Every team can see all the services they own and related resources (deployments, data pipelines, pull request status, etc.)
    Downloads: 14 This Week
    Last Update:
    See Project
  • 3
    Apache Kafka

    Apache Kafka

    Mirror of Apache Kafka

    Apache Kafka is a distributed streaming platform built around durable, partitioned logs called topics, enabling high-throughput, fault-tolerant event pipelines. Producers append records to partitions, brokers replicate them for durability, and consumer groups read them at their own pace while balancing work across instances. The commit/offset model and retention policies support patterns from real-time processing to event sourcing and audit trails. Exactly-once processing semantics, idempotent producers, and transactions help prevent duplicates across complex dataflows. Kafka Streams and Kafka Connect extend the core: Streams provides a library for stateful stream processing within applications, while Connect standardizes integration with external systems. With horizontal scalability, strong ordering guarantees within partitions, and mature tooling, Kafka serves as the backbone for event-driven architectures across analytics, microservices, and data integration.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 4
    Kestra

    Kestra

    Kestra is an infinitely scalable orchestration and scheduling platform

    Build reliable workflows, blazingly fast, deploy in just a few clicks. Kestra is an open-source, event-driven orchestrator that simplifies data operations and improves collaboration between engineers and business users. By bringing Infrastructure as Code best practices to data pipelines, Kestra allows you to build reliable workflows and manage them with confidence. Thanks to the declarative YAML interface for defining orchestration logic, everyone who benefits from analytics can participate in the data pipeline creation process. The UI automatically adjusts the YAML definition any time you make changes to a workflow from the UI or via an API call. Therefore, the orchestration logic is defined declaratively in code, even if some workflow components are modified in other ways.
    Downloads: 7 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 5
    Microsoft Integration

    Microsoft Integration

    Microsoft Integration, Azure, Power Platform, Office 365 and much more

    Microsoft Integration, Azure, BAPI, Office 365 and much more Stencils Pack it’s a Visio package that contains fully resizable Visio shapes (symbols/icons) that will help you to visually represent On-premise, Cloud or Hybrid Integration and Enterprise architectures scenarios (BizTalk Server, API Management, Logic Apps, Service Bus, Event Hub…), solutions diagrams and features or systems that use Microsoft Azure and related cloud and on-premises technologies in Visio 2016/2013.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 6
    Mage.ai

    Mage.ai

    Build, run, and manage data pipelines for integrating data

    Open-source data pipeline tool for transforming and integrating data. The modern replacement for Airflow. Effortlessly integrate and synchronize data from 3rd party sources. Build real-time and batch pipelines to transform data using Python, SQL, and R. Run, monitor, and orchestrate thousands of pipelines without losing sleep. Have you met anyone who said they loved developing in Airflow? That’s why we designed an easy developer experience that you’ll enjoy. Each step in your pipeline is a standalone file containing modular code that’s reusable and testable with data validations. No more DAGs with spaghetti code. Start developing locally with a single command or launch a dev environment in your cloud using Terraform. Write code in Python, SQL, or R in the same data pipeline for ultimate flexibility.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 7
    Best-of Python

    Best-of Python

    A ranked list of awesome Python open-source libraries

    This curated list contains 390 awesome open-source projects with a total of 1.4M stars grouped into 28 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Contributions are very welcome! Ranked list of awesome python libraries for web development. Correctly generate plurals, ordinals, indefinite articles; convert numbers. Libraries for loading, collecting, and extracting data from a variety of data sources and formats. Libraries for data batch- and stream-processing, workflow automation, job scheduling, and other data pipeline tasks.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 8
    DataKit

    DataKit

    Connect processes into powerful data pipelines

    Connect processes into powerful data pipelines with a simple git-like filesystem interface. DataKit is a tool to orchestrate applications using a Git-like dataflow. It revisits the UNIX pipeline concept, with a modern twist: streams of tree-structured data instead of raw text. DataKit allows you to define complex build pipelines over version-controlled data. DataKit is currently used as the coordination layer for HyperKit, the hypervisor component of Docker for Mac and Windows, and for the DataKitCI continuous integration system. src contains the main DataKit service. This is a Git-like database to which other services can connect. ci contains DataKitCI, a continuous integration system that uses DataKit to monitor repositories and store build results. The easiest way to use DataKit is to start both the server and the client in containers.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 9
    AutoGluon

    AutoGluon

    AutoGluon: AutoML for Image, Text, and Tabular Data

    AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data. Intended for both ML beginners and experts, AutoGluon enables you to quickly prototype deep learning and classical ML solutions for your raw data with a few lines of code. Automatically utilize state-of-the-art techniques (where appropriate) without expert knowledge. Leverage automatic hyperparameter tuning, model selection/ensembling, architecture search, and data processing. Easily improve/tune your bespoke models and data pipelines, or customize AutoGluon for your use-case. AutoGluon is modularized into sub-modules specialized for tabular, text, or image data. You can reduce the number of dependencies required by solely installing a specific sub-module via: python3 -m pip install <submodule>.
    Downloads: 1 This Week
    Last Update:
    See Project
  • Run applications fast and securely in a fully managed environment Icon
    Run applications fast and securely in a fully managed environment

    Cloud Run is a fully-managed compute platform that lets you run your code in a container directly on top of scalable infrastructure.

    Run frontend and backend services, batch jobs, deploy websites and applications, and queue processing workloads without the need to manage infrastructure.
    Try for free
  • 10
    Conduit

    Conduit

    Conduit streams data between data stores. Kafka Connect replacement

    Conduit is a data streaming tool written in Go. It aims to provide the best user experience for building and running real-time data pipelines. Conduit comes with batteries included, it provides a UI, common connectors, processors and observability data out of the box. Sync data between your production systems using an extensible, event-first experience with minimal dependencies that fit within your existing workflow. Eliminate the multi-step process you go through today. Just download the binary and start building. Conduit connectors give you the ability to pull and push data to any production datastore you need. If a datastore is missing, the simple SDK allows you to extend Conduit where you need it. Conduit pipelines listen for changes to a database, data warehouse, etc., and allows your data applications to act upon those changes in real-time. Run it in a way that works for you; use it as a standalone service or orchestrate it within your infrastructure.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    CueLake

    CueLake

    Use SQL to build ELT pipelines on a data lakehouse

    With CueLake, you can use SQL to build ELT (Extract, Load, Transform) pipelines on a data lakehouse. You write Spark SQL statements in Zeppelin notebooks. You then schedule these notebooks using workflows (DAGs). To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg). To transform data, you write SQL statements to create views and tables in your data lakehouse. CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    DataGym.ai

    DataGym.ai

    Open source annotation and labeling tool for image and video assets

    DATAGYM enables data scientists and machine learning experts to label images up to 10x faster. AI-assisted annotation tools reduce manual labeling effort, give you more time to finetune ML models and speed up your go to market of new products. Accelerate your computer vision projects by cutting down data preparation time up to 50%. A machine learning model is only as good as its training data. DATAGYM is an end-to-end workbench to create, annotate, manage, and export the right training data for your computer vision models. Your image data can be imported into DATAGYM from your local machine, from any public image URL or directly from an AWS cloud S3 bucket. Machine learning teams spend up to 80% of their time on data preparation. DATAGYM provides AI-powered annotation functions to help you accelerate your labeling task. The Pre-Labeling feature enables turbo-labeling – it processes thousands of images in the background within a very short time.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    Orchest

    Orchest

    Build data pipelines, the easy way

    Code, run and monitor your data pipelines all from your browser! From idea to scheduled pipeline in hours, not days. Interactively build your data science pipelines in our visual pipeline editor. Versioned as a JSON file. Run scripts or Jupyter notebooks as steps in a pipeline. Python, R, Julia, JavaScript, and Bash are supported. Parameterize your pipelines and run them periodically on a cron schedule. Easily install language or system packages. Built on top of regular Docker container images. Creation of multiple instances with up to 8 vCPU & 32 GiB memory. A free Orchest instance with 2 vCPU & 8 GiB memory. Simple data pipelines with Orchest. Each step runs a file in a container. It's that simple! Spin up services whose lifetime spans across the entire pipeline run. Easily define your dependencies to run on any machine. Run any subset of the pipeline directly or periodically.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 14
    lakeFS

    lakeFS

    lakeFS - Git-like capabilities for your object storage

    Increase data quality and reduce the painful cost of errors. Data engineering best practices using git-like operations on data. lakeFS is an open-source data version control for data lakes. It enables zero-copy Dev / Test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more. Data is dynamic, it changes over time. Dealing with that without a data version control system is error-prone and labor-intensive. With lakeFS, your data lake is version controlled and you can easily time-travel between consistent snapshots of the lake. Easier ETL testing - test your ETLs on top of production data, in isolation, without copying anything. Safely experiment and test on full production data. Easily Collaborate on production data with your team. Automate data quality checks within data pipelines.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    whylogs

    whylogs

    The open standard for data logging

    whylogs is an open-source library for logging any kind of data. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to track changes in their dataset Create data constraints to know whether their data looks the way it should. Quickly visualize key summary statistics about their datasets. whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, we are able to accurately represent the data and enable all of the use cases described in the introduction.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    CloverDX

    CloverDX

    Design, automate, operate and publish data pipelines at scale

    Please, visit www.cloverdx.com for latest product versions. Data integration platform; can be used to transform/map/manipulate data in batch and near-realtime modes. Suppors various input/output formats (CSV,FIXLEN,Excel,XML,JSON,Parquet, Avro,EDI/X12,HL7,COBOL,LOTUS, etc.). Connects to RDBMS/JMS/Kafka/SOAP/Rest/LDAP/S3/HTTP/FTP/ZIP/TAR. CloverDX offers 100+ specialized components which can be further extended by creation of "macros" - subgraphs - and libraries, shareable with 3rd parties. Simple data manipulation jobs can be created visually. More complex business logic can be implemented using Clover's domain-specific-language CTL, in Java or languages like Python or JavaScript. Through its DataServices functionality, it allows to quickly turn data pipelines into REST API endpoints. The platform allows to easily scale your data job across multiple cores or nodes/machines. Supports Docker/Kubernetes deployments and offers AWS/Azure images in their respective marketplace
    Downloads: 3 This Week
    Last Update:
    See Project
  • 17
    Apache SeaTunnel

    Apache SeaTunnel

    SeaTunnel is a distributed, high-performance data integration platform

    SeaTunnel is a very easy-to-use ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of data stably and efficiently every day, and has been used in the production of nearly 100 companies. There are hundreds of commonly-used data sources of which versions are incompatible. With the emergence of new technologies, more data sources are appearing. It is difficult for users to find a tool that can fully and quickly support these data sources. Data synchronization needs to support various synchronization scenarios such as offline-full synchronization, offline-incremental synchronization, CDC, real-time synchronization, and full database synchronization. Existing data integration and data synchronization tools often require vast computing resources or JDBC connection resources to complete real-time synchronization of massive small tables.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    During the exploration phase of a machine learning project, a data scientist tries to find the optimal pipeline for his specific use case. This usually involves applying standard data cleaning steps, creating or selecting useful features, trying out different models, etc. Testing multiple pipelines requires many lines of code, and writing it all in the same notebook often makes it long and cluttered. On the other hand, using multiple notebooks makes it harder to compare the results and to keep an overview. On top of that, refactoring the code for every test can be quite time-consuming. How many times have you conducted the same action to pre-process a raw dataset? How many times have you copy-and-pasted code from an old repository to re-use it in a new use case? ATOM is here to help solve these common issues. The package acts as a wrapper of the whole machine learning pipeline, helping the data scientist to rapidly find a good model for his problem.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    BitSail

    BitSail

    BitSail is a distributed high-performance data integration engine

    BitSail is ByteDance's open source data integration engine which is based on distributed architecture and provides high performance. It supports data synchronization between multiple heterogeneous data sources, and provides global data integration solutions in batch, streaming, and incremental scenarios. At present, it serves almost all business lines in ByteDance, such as Douyin, Toutiao, etc., and synchronizes hundreds of trillions of data every day. BitSail has been widely used and supports hundreds of trillions of large traffic. At the same time, it has been verified in various scenarios such as the cloud native environment of the volcano engine and the on-premises private cloud environment.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20

    CCDLAB

    A FITS image data viewer & reducer, and UVIT Data Reduction Pipeline.

    CCDLAB is a FITS image data viewer, reducer, and UVIT Data Pipeline. The latest CCDLAB installer can be downloaded here: https://github.com/user29A/CCDLAB/releases The Visual Studio 2017 project files can be found here: https://github.com/user29A/CCDLAB/ Those may not be the latest code files as code is generally updated a few times a week. If you want the latest project files then let me know.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Datapipe

    Datapipe

    Real-time, incremental ETL library for ML with record-level depend

    Datapipe is a real-time, incremental ETL library for Python with record-level dependency tracking. Datapipe is designed to streamline the creation of data processing pipelines. It excels in scenarios where data is continuously changing, requiring pipelines to adapt and process only the modified data efficiently. This library tracks dependencies for each record in the pipeline, ensuring minimal and efficient data processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    Dolphin Scheduler

    Dolphin Scheduler

    A distributed and extensible workflow scheduler platform

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available `out of the box`. Dedicated to solving the complex task dependencies in data processing, making the scheduler system out of the box for data processing. Decentralized multi-master and multi-worker, HA is supported by itself, overload processing. All process definition operations are visualized, Visualization process defines key information at a glance, One-click deployment. Support multi-tenant. Support many task types e.g., spark,flink,hive, mr, shell, python, sub_process. Support custom task types, Distributed scheduling, and the overall scheduling capability will increase linearly with the scale of the cluster.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Elementary

    Elementary

    Open-source data observability for analytics engineers

    Elementary is an open-source data observability solution for data & analytics engineers. Monitor your dbt project and data in minutes, and be the first to know of data issues. Gain immediate visibility, detect data issues, send actionable alerts, and understand the impact and root cause. Generate a data observability report, host it or share with your team. Monitoring of data quality metrics, freshness, volume and schema changes, including anomaly detection. Elementary data monitors are configured and executed like native tests in dbt your project. Uploading and modeling of dbt artifacts, run and test results to tables as part of your runs. Get informative notifications on data issues, schema changes, models and tests failures. Inspect upstream and downstream dependencies to understand impact and root cause of data issues.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    GenStage

    GenStage

    Producer and consumer actors with back-pressure for Elixir

    GenStage is a specification and set of behaviours for building demand-driven data pipelines on the BEAM. It formalizes the roles of producers, consumers, and producer-consumers, using back-pressure so that fast producers don’t overwhelm downstream stages. Developers implement callbacks like handle_demand and handle_events to control how items are emitted, transformed, and consumed across asynchronous boundaries. Because stages are OTP processes, you gain fault tolerance, supervised restarts, and concurrency tuned via configurable demand and partitioning. GenStage underpins higher-level libraries like Flow and Broadway, but it can also be used directly for custom pipelines where timing and throughput matter. Its clear separation of concerns encourages testable, composable stages that can be rearranged as requirements evolve. In production, this leads to predictable, resilient dataflows for event ingestion, batching, and parallel processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    Luigi

    Luigi

    Python module that helps you build complex pipelines of batch jobs

    Luigi is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more. The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive, and Pig, jobs. It also comes with file system abstractions for HDFS, and local files that ensures all file system operations are atomic.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next