Best Open Source Linux Data Integration Tools 2026

Data Integration Tools for Linux

Data Integration Linux Clear Filters

Browse free open source Data Integration tools and projects for Linux below. Use the toggles on the left to filter open source Data Integration tools by OS, license, language, programming language, and project status.

$300 Free Credits for Your Google Cloud Projects
Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.

Start Free Trial
MongoDB Atlas runs apps anywhere
Deploy in 115+ regions with the modern database for every enterprise.

MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.

Start Free
1

Pentaho

Pentaho offers comprehensive data integration and analytics platform.

Pentaho couples data integration with business analytics in a modern platform to easily access, visualize and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premise, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications all from within an intuitive and easy to use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/

69 Reviews

Downloads: 967 This Week

Last Update: 2025-02-06
See Project
2

Pentaho Data Integration

Pentaho Data Integration ( ETL ) a.k.a Kettle

Pentaho Data Integration uses the Maven framework. Project distribution archive is produced under the assemblies module. Core implementation, database dialog, user interface, PDI engine, PDI engine extensions, PDI core plugins, and integration tests. Maven, version 3+, and Java JDK 1.8 are requisites. Use of the Pentaho checkstyle format (via mvn checkstyle:check and reviewing the report) and developing working Unit Tests helps to ensure that pull requests for bugs and improvements are processed quickly. In addition to the unit tests, there are integration tests that test cross-module operation.

Downloads: 68 This Week

Last Update: 2021-11-08
See Project
3

Dagster

An orchestration platform for the development, production

Dagster is an orchestration platform for the development, production, and observation of data assets. Dagster as a productivity platform: With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early. Dagster as a robust orchestration engine: Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally. Dagster as a unified control plane: The ‘single plane of glass’ data teams love to use. Rein in the chaos and maintain control over your data as the complexity scales. Centralize your metadata in one tool with built-in observability, diagnostics, cataloging, and lineage. Spot any issues and identify performance improvement opportunities.

Downloads: 36 This Week

Last Update: 2026-07-09
See Project
4

nango

A single API for all your integrations.

Nango is a single API to interact with all other external APIs. It should be the only API you need to integrate to your app. Nango is an open-source solution for integrating third-party APIs with applications, simplifying API authentication, data syncing, and management.

Downloads: 11 This Week

Last Update: 2 days ago
See Project
Cut Data Warehouse Costs by 54%
Easily migrate from Snowflake, Redshift, or Databricks with free tools.

BigQuery delivers 54% lower TCO with exabyte scale and flexible pricing. Free migration tools handle the SQL translation automatically.

Try Free
5

nichenetr

NicheNet: predict active ligand-target links between interacting cells

nichenetr: the R implementation of the NicheNet method. The goal of NicheNet is to study intercellular communication from a computational perspective. NicheNet uses human or mouse gene expression data of interacting cells as input and combines this with a prior model that integrates existing knowledge on ligand-to-target signaling paths. This allows to predict ligand-receptor interactions that might drive gene expression changes in cells of interest. This model of prior information on potential ligand-target links can then be used to infer active ligand-target links between interacting cells. NicheNet prioritizes ligands according to their activity (i.e., how well they predict observed changes in gene expression in the receiver cell) and looks for affected targets with high potential to be regulated by these prioritized ligands.

Downloads: 9 This Week

Last Update: 2024-09-05
See Project
6

Common Core Ontologies

The Common Core Ontology Repository

The Common Core Ontologies (CCO) comprise twelve ontologies that are designed to represent and integrate taxonomies of generic classes and relations across all domains of interest. CCO is a mid-level extension of Basic Formal Ontology (BFO), an upper-level ontology framework widely used to structure and integrate ontologies in the biomedical domain (Arp, et al., 2015). BFO aims to represent the most generic categories of entity and the most generic types of relations that hold between them, by defining a small number of classes and relations. CCO then extends from BFO in the sense that every class in CCO is asserted to be a subclass of some class in BFO, and that CCO adopts the generic relations defined in BFO (e.g., has_part) (Smith and Grenon, 2004). Accordingly, CCO classes and relations are heavily constrained by the BFO framework, from which it inherits much of its basic semantic relationships.

Downloads: 8 This Week

Last Update: 2026-04-04
See Project
7

Airbyte

Data integration platform for ELT pipelines from APIs, databases

We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes. Moving critical data with Airbyte is as easy and reliable as flipping on a switch. Our teams process more than 300 billion rows each month for ambitious businesses of all sizes. Enable your data engineering teams to focus on projects that are more valuable to your business. Building and maintaining custom connectors have become 5x easier with Airbyte. With an average response rate of 10 minutes or less and a Customer Satisfaction score of 96/100, our team is ready to support your data integration journey all over the world.

Downloads: 6 This Week

Last Update: 2025-10-15
See Project
8

Gradle Docker Compose Plugin

Simplifies usage of Docker Compose for integration testing

The Gradle Docker Compose Plugin by Avast integrates Docker Compose lifecycle management into Gradle builds. It allows developers to define and manage Docker containers required for integration testing or local development directly from their Gradle build scripts. This plugin automates the startup and shutdown of services, supports container health checks, and enables tight integration between application code and containerized services, enhancing reproducibility and automation in development pipelines.

Downloads: 6 This Week

Last Update: 2026-02-03
See Project
9

Recap

Recap tracks and transform schemas across your whole application

Recap is a schema language and multi-language toolkit to track and transform schemas across your whole application. Your data passes through web services, databases, message brokers, and object stores. Recap describes these schemas in a single language, regardless of which system your data passes through. Recap schemas can be defined in YAML, TOML, JSON, XML, or any other compatible language.

Downloads: 5 This Week

Last Update: 2025-12-30
See Project
Secure File Transfer for Windows with Cerberus by Redwood
Protect and share files over FTP/S, SFTP, HTTPS and SCP with the #1 rated Windows file transfer server.

Cerberus supports unlimited users and connections on a single IP, with built-in encryption, 2FA, and a browser-based web client — all deployable in under 15 minutes with a 25-day free trial.

Try for Free
10

Apache DevLake

Apache DevLake is an open-source dev data platform

Apache DevLake is an open-source dev data platform that ingests, analyzes, and visualizes the fragmented data from DevOps tools to extract insights for engineering excellence, developer experience, and community growth. Apache DevLake is designed for developer teams looking to make better sense of their development process and to bring a more data-driven approach to their own practices. You can ask Apache DevLake many questions regarding your development process. Just connect and query. Your Dev Data lives in many silos and tools. DevLake brings them all together to give you a complete view of your Software Development Life Cycle (SDLC). From DORA to scrum retros, DevLake implements metrics effortlessly with prebuilt dashboards supporting common frameworks and goals. DevLake fits teams of all shapes and sizes, and can be readily extended to support new data sources, metrics, and dashboards, with a flexible framework for data collection and transformation.

Downloads: 4 This Week

Last Update: 7 days ago
See Project
11

harmonypy

Integrate multiple high-dimensional datasets with fuzzy k-means

Harmony is an algorithm for integrating multiple high-dimensional datasets. harmonypy is a port of the harmony R package by Ilya Korsunsky. Harmony is a general-purpose R package with an efficient algorithm for integrating multiple data sets. It is especially useful for large single-cell datasets such as single-cell RNA-seq.

Downloads: 4 This Week

Last Update: 2026-04-24
See Project
12

PHPCI

PHPCI is a free and open source continuous integration tool

PHPCI is a continuous integration (CI) server designed specifically for PHP applications. It automates tasks such as testing, code quality checks, and deployment, helping developers maintain code consistency and detect issues early. PHPCI supports various plugins and tools, including PHPUnit, PHPMD, and Codeception, making it highly customizable for different project needs.

Downloads: 3 This Week

Last Update: 2025-05-19
See Project
13

Jitsu

Jitsu is an open-source Segment alternative

Jitsu is a fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days. Installing Jitsu is a matter of selecting your framework and adding few lines of code to your app. Jitsu is built to be framework agnostic, so regardless of your stack, we have a solution that'll work for your team. Connect data warehouse (Snowflake, Clickhouse, BigQuery, S3, Redshift ot Postgres) and query your data instantly. Jitsu can either stream data in real-time or send it in micro-batches (up to once a minute). Apply any transformation with Jitsu. Just write JavaScript code right in the UI to do anything with incoming data. And yes, the code editor supports code completion, debugging and many more. It feels like a full-featured IDE!

Downloads: 2 This Week

Last Update: 2026-07-09
See Project
14

PhantomJS-Node

PhantomJS integration module for NodeJS

PhantomJS-Node is a Node.js bridge to PhantomJS, enabling programmatic control of the headless browser for tasks like web scraping, automated testing, and page rendering.

Downloads: 2 This Week

Last Update: 2025-02-05
See Project
15

KubeRay

A toolkit to run Ray applications on Kubernetes

KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers several key components. KubeRay core: This is the official, fully-maintained component of KubeRay that provides three custom resource definitions, RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.

Downloads: 1 This Week

Last Update: 2026-06-18
See Project
16

PDI Data Vault framework

Data Vault loading automation using Pentaho Data Integration.

A metadata driven 'tool' to automate loading a designed Data Vault. It consists of a set of Pentaho Data Integration and database objects. Thel Virtual Machine (VMware) is a 64 bit Ubuntu Server 14.04, with MySQL (Percona Server) and PostgreSQL 9.4 as the database flavours and PDI version 5.2 CE. NB: Directory version_2.4 contains the most recent Virtual Machine. The readme.txt contains info about that VM.

Downloads: 17 This Week

Last Update: 2015-09-23
See Project
17

Open Source Data Quality and Profiling

World's first open source data quality & data preparation project

This project is dedicated to open source data quality and data preparation solutions. Data Quality includes profiling, filtering, governance, similarity check, data enrichment alteration, real time alerting, basket analysis, bubble chart Warehouse validation, single customer view etc. defined by Strategy. This tool is developing high performance integrated data management platform which will seamlessly do Data Integration, Data Profiling, Data Quality, Data Preparation, Dummy Data Creation, Meta Data Discovery, Anomaly Discovery, Data Cleansing, Reporting and Analytic. It also had Hadoop ( Big data ) support to move files to/from Hadoop Grid, Create, Load and Profile Hive Tables. This project is also known as "Aggregate Profiler" Resful API for this project is getting built as (Beta Version) https://sourceforge.net/projects/restful-api-for-osdq/ apache spark based data quality is getting built at https://sourceforge.net/projects/apache-spark-osdq/

8 Reviews

Downloads: 2 This Week

Last Update: 2021-01-20
See Project
18

EasyDataQuality for Pentaho Kettle

EasyDataQuality for Pentaho Data Integration in Kettle

EasyDQ plugins for Contact cleansing in Pentaho Data Integration in Kettle.

1 Review

Downloads: 4 This Week

Last Update: 2016-04-26
See Project
19

CloverDX

Design, automate, operate and publish data pipelines at scale

Please, visit www.cloverdx.com for latest product versions. Data integration platform; can be used to transform/map/manipulate data in batch and near-realtime modes. Suppors various input/output formats (CSV,FIXLEN,Excel,XML,JSON,Parquet, Avro,EDI/X12,HL7,COBOL,LOTUS, etc.). Connects to RDBMS/JMS/Kafka/SOAP/Rest/LDAP/S3/HTTP/FTP/ZIP/TAR. CloverDX offers 100+ specialized components which can be further extended by creation of "macros" - subgraphs - and libraries, shareable with 3rd parties. Simple data manipulation jobs can be created visually. More complex business logic can be implemented using Clover's domain-specific-language CTL, in Java or languages like Python or JavaScript. Through its DataServices functionality, it allows to quickly turn data pipelines into REST API endpoints. The platform allows to easily scale your data job across multiple cores or nodes/machines. Supports Docker/Kubernetes deployments and offers AWS/Azure images in their respective marketplace

4 Reviews

Downloads: 3 This Week

Last Update: 2023-05-04
See Project
20

PaloKettlePentahoDataIntegrationPDI

PaloKettlePlugin is for Pentaho Data Integration aka Kettle. It's a Cell Input und Output Step for Palo Molap. The first code was developed by mybiq/3A-Strategy, the PDI-3 version has been developed by Stratebi. Now by 3A-Strategy and Litebi for PDI

Downloads: 5 This Week

Last Update: 2015-08-03
See Project
21

Metl ETL Data Integration

Simple message-based, web-based ETL integration

Metl is a simple, web-based ETL tool that allows for data integrations including database, files, messaging, and web services. Supports RDBMS, SOAP, HTTP, FTP, SFTP, XML, FIXLEN, CSV, JSON, ZIP, and more. Metl implements scheduled integration tasks without the need for custom coding or heavy infrastructure. It can be deployed in the cloud or in an internal data center, and it was built to allow developers to extend it with custom components.

Downloads: 3 This Week

Last Update: 2022-01-21
See Project
22

Tarunyai

Tool is in development please donot download.Analytics tool

This tool is still in development phase please donot download for now. Tarunyai data integration prepares and blends data to create a complete picture of your business that drives actionable insights.

Downloads: 3 This Week

Last Update: 2016-06-13
See Project
23

XAware Data Integration Project

Create XML and JSON data services from any data source

Create services to integrate applications & move data of any type. Build data views across DBMS, SOAP, HTTP/REST, Salesforce, SAP, Microsoft, SharePoint, Text, LDAP, FTP sources to read, write & transfer data. Eclipse designer & run-time engine.

Downloads: 2 This Week

Last Update: 2016-04-06
See Project
24

Daffodil Replicator

Daffodil Replicator is a powerful Open Source Java tool for data integration, data migration and data protection in real time. It allows bi-directional data replication and synchronization between homogeneous / heterogeneous databases including Oracle, M

1 Review

Downloads: 1 This Week

Last Update: 2019-06-12
See Project
25

ordi

Ontology Representation and Data Integration (ORDI) is middleware framework to allow enterprise data integration via RDF-like data model.

1 Review

Downloads: 1 This Week

Last Update: 2013-04-29
See Project