Data Pipeline Tools for Windows

View 19 business solutions

Browse free open source Data Pipeline tools and projects for Windows below. Use the toggles on the left to filter open source Data Pipeline tools by OS, license, language, programming language, and project status.

  • Level Up Your Cyber Defense with External Threat Management Icon
    Level Up Your Cyber Defense with External Threat Management

    See every risk before it hits. From exposed data to dark web chatter. All in one unified view.

    Move beyond alerts. Gain full visibility, context, and control over your external attack surface to stay ahead of every threat.
    Try for Free
  • Gen AI apps are built with MongoDB Atlas Icon
    Gen AI apps are built with MongoDB Atlas

    The database for AI-powered applications.

    MongoDB Atlas is the developer-friendly database used to build, scale, and run gen AI and LLM-powered apps—without needing a separate vector database. Atlas offers built-in vector search, global availability across 115+ regions, and flexible document modeling. Start building AI apps faster, all in one place.
    Start Free
  • 1
    Pentaho

    Pentaho

    Pentaho offers comprehensive data integration and analytics platform.

    Pentaho couples data integration with business analytics in a modern platform to easily access, visualize and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premise, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications all from within an intuitive and easy to use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/
    Leader badge
    Downloads: 2,380 This Week
    Last Update:
    See Project
  • 2
    Backstage

    Backstage

    Backstage is an open platform for building developer portals

    Powered by a centralized software catalog, Backstage restores order to your infrastructure and enables your product teams to ship high-quality code quickly, without compromising autonomy. At Spotify, we've always believed in the speed and ingenuity that comes from having autonomous development teams. But as we learned firsthand, the faster you grow, the more fragmented and complex your software ecosystem becomes. And then everything slows down again. By centralizing services and standardizing your tooling, Backstage streamlines your development environment from end to end. Instead of restricting autonomy, standardization frees your engineers from infrastructure complexity. So you can return to building and scaling, quickly and safely. Every team can see all the services they own and related resources (deployments, data pipelines, pull request status, etc.)
    Downloads: 13 This Week
    Last Update:
    See Project
  • 3
    Apache Kafka

    Apache Kafka

    Mirror of Apache Kafka

    Apache Kafka is a distributed streaming platform built around durable, partitioned logs called topics, enabling high-throughput, fault-tolerant event pipelines. Producers append records to partitions, brokers replicate them for durability, and consumer groups read them at their own pace while balancing work across instances. The commit/offset model and retention policies support patterns from real-time processing to event sourcing and audit trails. Exactly-once processing semantics, idempotent producers, and transactions help prevent duplicates across complex dataflows. Kafka Streams and Kafka Connect extend the core: Streams provides a library for stateful stream processing within applications, while Connect standardizes integration with external systems. With horizontal scalability, strong ordering guarantees within partitions, and mature tooling, Kafka serves as the backbone for event-driven architectures across analytics, microservices, and data integration.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 4
    lakeFS

    lakeFS

    lakeFS - Git-like capabilities for your object storage

    Increase data quality and reduce the painful cost of errors. Data engineering best practices using git-like operations on data. lakeFS is an open-source data version control for data lakes. It enables zero-copy Dev / Test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more. Data is dynamic, it changes over time. Dealing with that without a data version control system is error-prone and labor-intensive. With lakeFS, your data lake is version controlled and you can easily time-travel between consistent snapshots of the lake. Easier ETL testing - test your ETLs on top of production data, in isolation, without copying anything. Safely experiment and test on full production data. Easily Collaborate on production data with your team. Automate data quality checks within data pipelines.
    Downloads: 7 This Week
    Last Update:
    See Project
  • Resolve Support Tickets 2x Faster​ with ServoDesk Icon
    Resolve Support Tickets 2x Faster​ with ServoDesk

    Full access to Enterprise features. No credit card required.

    What if You Could Automate 90% of Your Repetitive Tasks in Under 30 Days? At ServoDesk, we help businesses like yours automate operations with AI, allowing you to cut service times in half and increase productivity by 25% - without hiring more staff.
    Try ServoDesk for free
  • 5
    Microsoft Integration

    Microsoft Integration

    Microsoft Integration, Azure, Power Platform, Office 365 and much more

    Microsoft Integration, Azure, BAPI, Office 365 and much more Stencils Pack it’s a Visio package that contains fully resizable Visio shapes (symbols/icons) that will help you to visually represent On-premise, Cloud or Hybrid Integration and Enterprise architectures scenarios (BizTalk Server, API Management, Logic Apps, Service Bus, Event Hub…), solutions diagrams and features or systems that use Microsoft Azure and related cloud and on-premises technologies in Visio 2016/2013.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 6
    Best-of Python

    Best-of Python

    A ranked list of awesome Python open-source libraries

    This curated list contains 390 awesome open-source projects with a total of 1.4M stars grouped into 28 categories. All projects are ranked by a project-quality score, which is calculated based on various metrics automatically collected from GitHub and different package managers. If you like to add or update projects, feel free to open an issue, submit a pull request, or directly edit the projects.yaml. Contributions are very welcome! Ranked list of awesome python libraries for web development. Correctly generate plurals, ordinals, indefinite articles; convert numbers. Libraries for loading, collecting, and extracting data from a variety of data sources and formats. Libraries for data batch- and stream-processing, workflow automation, job scheduling, and other data pipeline tasks.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 7
    StarRocks

    StarRocks

    StarRocks is a next-gen sub-second MPP database for full analytics

    StarRocks is the next generation of real-time SQL engines for enterprise analytics. Real-time analytics is notoriously difficult. Complex data pipelines and de-normalized tables have always been a necessary evil. Processing any updates or deletes once data arrives has not been possible- until now. StarRocks solves these challenges and makes real-time analytics easy. Get amazing query performance on Star or Snowflake Schemas directly. From canceled orders to updated items, your analytics applications can easily handle them with StarRocks. From streaming data to change data capture, StarRocks meets the data ingestion demands of real-time analytics. Scale storage and computing power horizontally and support tens of thousands of concurrent users. All of your BI tools work with StarRocks through standard SQL. StarRocks provides superior performance. It is also a unified OLAP covering most data analytics scenarios.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 8
    rudderstack

    rudderstack

    Privacy and Security focused Segment-alternative, in Golang

    Quickly deploy flexible, powerful customer data pipelines, then send the data to your entire stack—without the engineering headache. Our complete toolset makes it easy to level-up your customer data stack. Spare your data engineers the headache. Our 180+ integrations, along with custom webhook sources and destinations, save data teams hundred of hours. Say goodbye to different versions of the truth. Our SDKs track anonymous and known users at the source and reconcile users in your warehouse and SaaS tools. Go beyond event streaming and control all of your customer data on your own terms. Learn how we can help you build a customer data platform. RudderStack treats your data warehouse as a first-class citizen among destinations, with advanced features and configurable, near real-time sync. RudderStack is built API-first. It integrates seamlessly with the tools that the developers already use and love.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 9
    Dolphin Scheduler

    Dolphin Scheduler

    A distributed and extensible workflow scheduler platform

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available `out of the box`. Dedicated to solving the complex task dependencies in data processing, making the scheduler system out of the box for data processing. Decentralized multi-master and multi-worker, HA is supported by itself, overload processing. All process definition operations are visualized, Visualization process defines key information at a glance, One-click deployment. Support multi-tenant. Support many task types e.g., spark,flink,hive, mr, shell, python, sub_process. Support custom task types, Distributed scheduling, and the overall scheduling capability will increase linearly with the scale of the cluster.
    Downloads: 2 This Week
    Last Update:
    See Project
  • Grafana: The open and composable observability platform Icon
    Grafana: The open and composable observability platform

    Faster answers, predictable costs, and no lock-in built by the team helping to make observability accessible to anyone.

    Grafana is the open source analytics & monitoring solution for every database.
    Learn More
  • 10
    Kestra

    Kestra

    Kestra is an infinitely scalable orchestration and scheduling platform

    Build reliable workflows, blazingly fast, deploy in just a few clicks. Kestra is an open-source, event-driven orchestrator that simplifies data operations and improves collaboration between engineers and business users. By bringing Infrastructure as Code best practices to data pipelines, Kestra allows you to build reliable workflows and manage them with confidence. Thanks to the declarative YAML interface for defining orchestration logic, everyone who benefits from analytics can participate in the data pipeline creation process. The UI automatically adjusts the YAML definition any time you make changes to a workflow from the UI or via an API call. Therefore, the orchestration logic is defined declaratively in code, even if some workflow components are modified in other ways.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    CloverDX

    CloverDX

    Design, automate, operate and publish data pipelines at scale

    Please, visit www.cloverdx.com for latest product versions. Data integration platform; can be used to transform/map/manipulate data in batch and near-realtime modes. Suppors various input/output formats (CSV,FIXLEN,Excel,XML,JSON,Parquet, Avro,EDI/X12,HL7,COBOL,LOTUS, etc.). Connects to RDBMS/JMS/Kafka/SOAP/Rest/LDAP/S3/HTTP/FTP/ZIP/TAR. CloverDX offers 100+ specialized components which can be further extended by creation of "macros" - subgraphs - and libraries, shareable with 3rd parties. Simple data manipulation jobs can be created visually. More complex business logic can be implemented using Clover's domain-specific-language CTL, in Java or languages like Python or JavaScript. Through its DataServices functionality, it allows to quickly turn data pipelines into REST API endpoints. The platform allows to easily scale your data job across multiple cores or nodes/machines. Supports Docker/Kubernetes deployments and offers AWS/Azure images in their respective marketplace
    Downloads: 13 This Week
    Last Update:
    See Project
  • 12
    DataGym.ai

    DataGym.ai

    Open source annotation and labeling tool for image and video assets

    DATAGYM enables data scientists and machine learning experts to label images up to 10x faster. AI-assisted annotation tools reduce manual labeling effort, give you more time to finetune ML models and speed up your go to market of new products. Accelerate your computer vision projects by cutting down data preparation time up to 50%. A machine learning model is only as good as its training data. DATAGYM is an end-to-end workbench to create, annotate, manage, and export the right training data for your computer vision models. Your image data can be imported into DATAGYM from your local machine, from any public image URL or directly from an AWS cloud S3 bucket. Machine learning teams spend up to 80% of their time on data preparation. DATAGYM provides AI-powered annotation functions to help you accelerate your labeling task. The Pre-Labeling feature enables turbo-labeling – it processes thousands of images in the background within a very short time.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 13
    DataKit

    DataKit

    Connect processes into powerful data pipelines

    Connect processes into powerful data pipelines with a simple git-like filesystem interface. DataKit is a tool to orchestrate applications using a Git-like dataflow. It revisits the UNIX pipeline concept, with a modern twist: streams of tree-structured data instead of raw text. DataKit allows you to define complex build pipelines over version-controlled data. DataKit is currently used as the coordination layer for HyperKit, the hypervisor component of Docker for Mac and Windows, and for the DataKitCI continuous integration system. src contains the main DataKit service. This is a Git-like database to which other services can connect. ci contains DataKitCI, a continuous integration system that uses DataKit to monitor repositories and store build results. The easiest way to use DataKit is to start both the server and the client in containers.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 14
    whylogs

    whylogs

    The open standard for data logging

    whylogs is an open-source library for logging any kind of data. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to track changes in their dataset Create data constraints to know whether their data looks the way it should. Quickly visualize key summary statistics about their datasets. whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, we are able to accurately represent the data and enable all of the use cases described in the introduction.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 15
    Datapipe

    Datapipe

    Real-time, incremental ETL library for ML with record-level depend

    Datapipe is a real-time, incremental ETL library for Python with record-level dependency tracking. Datapipe is designed to streamline the creation of data processing pipelines. It excels in scenarios where data is continuously changing, requiring pipelines to adapt and process only the modified data efficiently. This library tracks dependencies for each record in the pipeline, ensuring minimal and efficient data processing.
    Leader badge
    Downloads: 3 This Week
    Last Update:
    See Project
  • 16
    Apache SeaTunnel

    Apache SeaTunnel

    SeaTunnel is a distributed, high-performance data integration platform

    SeaTunnel is a very easy-to-use ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of data stably and efficiently every day, and has been used in the production of nearly 100 companies. There are hundreds of commonly-used data sources of which versions are incompatible. With the emergence of new technologies, more data sources are appearing. It is difficult for users to find a tool that can fully and quickly support these data sources. Data synchronization needs to support various synchronization scenarios such as offline-full synchronization, offline-incremental synchronization, CDC, real-time synchronization, and full database synchronization. Existing data integration and data synchronization tools often require vast computing resources or JDBC connection resources to complete real-time synchronization of massive small tables.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    AutoGluon

    AutoGluon

    AutoGluon: AutoML for Image, Text, and Tabular Data

    AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data. Intended for both ML beginners and experts, AutoGluon enables you to quickly prototype deep learning and classical ML solutions for your raw data with a few lines of code. Automatically utilize state-of-the-art techniques (where appropriate) without expert knowledge. Leverage automatic hyperparameter tuning, model selection/ensembling, architecture search, and data processing. Easily improve/tune your bespoke models and data pipelines, or customize AutoGluon for your use-case. AutoGluon is modularized into sub-modules specialized for tabular, text, or image data. You can reduce the number of dependencies required by solely installing a specific sub-module via: python3 -m pip install <submodule>.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    Automated Tool for Optimized Modelling

    During the exploration phase of a machine learning project, a data scientist tries to find the optimal pipeline for his specific use case. This usually involves applying standard data cleaning steps, creating or selecting useful features, trying out different models, etc. Testing multiple pipelines requires many lines of code, and writing it all in the same notebook often makes it long and cluttered. On the other hand, using multiple notebooks makes it harder to compare the results and to keep an overview. On top of that, refactoring the code for every test can be quite time-consuming. How many times have you conducted the same action to pre-process a raw dataset? How many times have you copy-and-pasted code from an old repository to re-use it in a new use case? ATOM is here to help solve these common issues. The package acts as a wrapper of the whole machine learning pipeline, helping the data scientist to rapidly find a good model for his problem.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    BitSail

    BitSail

    BitSail is a distributed high-performance data integration engine

    BitSail is ByteDance's open source data integration engine which is based on distributed architecture and provides high performance. It supports data synchronization between multiple heterogeneous data sources, and provides global data integration solutions in batch, streaming, and incremental scenarios. At present, it serves almost all business lines in ByteDance, such as Douyin, Toutiao, etc., and synchronizes hundreds of trillions of data every day. BitSail has been widely used and supports hundreds of trillions of large traffic. At the same time, it has been verified in various scenarios such as the cloud native environment of the volcano engine and the on-premises private cloud environment.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20

    CCDLAB

    A FITS image data viewer & reducer, and UVIT Data Reduction Pipeline.

    CCDLAB is a FITS image data viewer, reducer, and UVIT Data Pipeline. The latest CCDLAB installer can be downloaded here: https://github.com/user29A/CCDLAB/releases The Visual Studio 2017 project files can be found here: https://github.com/user29A/CCDLAB/ Those may not be the latest code files as code is generally updated a few times a week. If you want the latest project files then let me know.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    Conduit

    Conduit

    Conduit streams data between data stores. Kafka Connect replacement

    Conduit is a data streaming tool written in Go. It aims to provide the best user experience for building and running real-time data pipelines. Conduit comes with batteries included, it provides a UI, common connectors, processors and observability data out of the box. Sync data between your production systems using an extensible, event-first experience with minimal dependencies that fit within your existing workflow. Eliminate the multi-step process you go through today. Just download the binary and start building. Conduit connectors give you the ability to pull and push data to any production datastore you need. If a datastore is missing, the simple SDK allows you to extend Conduit where you need it. Conduit pipelines listen for changes to a database, data warehouse, etc., and allows your data applications to act upon those changes in real-time. Run it in a way that works for you; use it as a standalone service or orchestrate it within your infrastructure.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    CueLake

    CueLake

    Use SQL to build ELT pipelines on a data lakehouse

    With CueLake, you can use SQL to build ELT (Extract, Load, Transform) pipelines on a data lakehouse. You write Spark SQL statements in Zeppelin notebooks. You then schedule these notebooks using workflows (DAGs). To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg). To transform data, you write SQL statements to create views and tables in your data lakehouse. CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    Elementary

    Elementary

    Open-source data observability for analytics engineers

    Elementary is an open-source data observability solution for data & analytics engineers. Monitor your dbt project and data in minutes, and be the first to know of data issues. Gain immediate visibility, detect data issues, send actionable alerts, and understand the impact and root cause. Generate a data observability report, host it or share with your team. Monitoring of data quality metrics, freshness, volume and schema changes, including anomaly detection. Elementary data monitors are configured and executed like native tests in dbt your project. Uploading and modeling of dbt artifacts, run and test results to tables as part of your runs. Get informative notifications on data issues, schema changes, models and tests failures. Inspect upstream and downstream dependencies to understand impact and root cause of data issues.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    GenStage

    GenStage

    Producer and consumer actors with back-pressure for Elixir

    GenStage is a specification and set of behaviours for building demand-driven data pipelines on the BEAM. It formalizes the roles of producers, consumers, and producer-consumers, using back-pressure so that fast producers don’t overwhelm downstream stages. Developers implement callbacks like handle_demand and handle_events to control how items are emitted, transformed, and consumed across asynchronous boundaries. Because stages are OTP processes, you gain fault tolerance, supervised restarts, and concurrency tuned via configurable demand and partitioning. GenStage underpins higher-level libraries like Flow and Broadway, but it can also be used directly for custom pipelines where timing and throughput matter. Its clear separation of concerns encourages testable, composable stages that can be rearranged as requirements evolve. In production, this leads to predictable, resilient dataflows for event ingestion, batching, and parallel processing.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    Luigi

    Luigi

    Python module that helps you build complex pipelines of batch jobs

    Luigi is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more. The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive, and Pig, jobs. It also comes with file system abstractions for HDFS, and local files that ensures all file system operations are atomic.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next