Data Integration Tools for Linux

View 40 business solutions

Browse free open source Data Integration tools and projects for Linux below. Use the toggles on the left to filter open source Data Integration tools by OS, license, language, programming language, and project status.

  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • MongoDB Atlas | Run databases anywhere Icon
    MongoDB Atlas | Run databases anywhere

    Ensure the availability of your data with coverage across AWS, Azure, and GCP on MongoDB Atlas—the multi-cloud database for every enterprise.

    MongoDB Atlas allows you to build and run modern applications across 125+ cloud regions, spanning AWS, Azure, and Google Cloud. Its multi-cloud clusters enable seamless data distribution and automated failover between cloud providers, ensuring high availability and flexibility without added complexity.
    Learn More
  • 1
    Pentaho

    Pentaho

    Pentaho offers comprehensive data integration and analytics platform.

    Pentaho couples data integration with business analytics in a modern platform to easily access, visualize and explore data that impacts business results. Use it as a full suite or as individual components that are accessible on-premise, in the cloud, or on-the-go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications all from within an intuitive and easy to use graphical tool. The Pentaho Enterprise Edition Free Trial can be obtained from https://pentaho.com/download/
    Leader badge
    Downloads: 1,813 This Week
    Last Update:
    See Project
  • 2
    Pentaho Data Integration

    Pentaho Data Integration

    Pentaho Data Integration ( ETL ) a.k.a Kettle

    Pentaho Data Integration uses the Maven framework. Project distribution archive is produced under the assemblies module. Core implementation, database dialog, user interface, PDI engine, PDI engine extensions, PDI core plugins, and integration tests. Maven, version 3+, and Java JDK 1.8 are requisites. Use of the Pentaho checkstyle format (via mvn checkstyle:check and reviewing the report) and developing working Unit Tests helps to ensure that pull requests for bugs and improvements are processed quickly. In addition to the unit tests, there are integration tests that test cross-module operation.
    Downloads: 67 This Week
    Last Update:
    See Project
  • 3
    Airbyte

    Airbyte

    Data integration platform for ELT pipelines from APIs, databases

    We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes. Moving critical data with Airbyte is as easy and reliable as flipping on a switch. Our teams process more than 300 billion rows each month for ambitious businesses of all sizes. Enable your data engineering teams to focus on projects that are more valuable to your business. Building and maintaining custom connectors have become 5x easier with Airbyte. With an average response rate of 10 minutes or less and a Customer Satisfaction score of 96/100, our team is ready to support your data integration journey all over the world.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 4
    CellTypist

    CellTypist

    A tool for semi-automatic cell type classification, harmonization

    CellTypist is an automated tool for cell type classification, harmonization, and integration. Classification, transfer cell type labels from the reference to query dataset. Harmonization, match and harmonize cell types defined by independent datasets. integration, integrate cell and cell types with supervision from harmonization. CellTypist recapitulates cell type structure and biology of independent datasets. Regularised linear models with Stochastic Gradient Descent provide a fast and accurate prediction. Scalable and flexible. Python-based implementation is easy to integrate into existing pipelines. A community-driven encyclopedia for cell types.
    Downloads: 5 This Week
    Last Update:
    See Project
  • No-Nonsense Code-to-Cloud Security for Devs | Aikido Icon
    No-Nonsense Code-to-Cloud Security for Devs | Aikido

    Connect your GitHub, GitLab, Bitbucket, or Azure DevOps account to start scanning your repos for free.

    Aikido provides a unified security platform for developers, combining 12 powerful scans like SAST, DAST, and CSPM. AI-driven AutoFix and AutoTriage streamline vulnerability management, while runtime protection blocks attacks.
    Start for Free
  • 5
    Dagster

    Dagster

    An orchestration platform for the development, production

    Dagster is an orchestration platform for the development, production, and observation of data assets. Dagster as a productivity platform: With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early. Dagster as a robust orchestration engine: Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally. Dagster as a unified control plane: The ‘single plane of glass’ data teams love to use. Rein in the chaos and maintain control over your data as the complexity scales. Centralize your metadata in one tool with built-in observability, diagnostics, cataloging, and lineage. Spot any issues and identify performance improvement opportunities.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 6
    Gradle Docker Compose Plugin

    Gradle Docker Compose Plugin

    Simplifies usage of Docker Compose for integration testing

    The Gradle Docker Compose Plugin by Avast integrates Docker Compose lifecycle management into Gradle builds. It allows developers to define and manage Docker containers required for integration testing or local development directly from their Gradle build scripts. This plugin automates the startup and shutdown of services, supports container health checks, and enables tight integration between application code and containerized services, enhancing reproducibility and automation in development pipelines.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 7
    KubeRay

    KubeRay

    A toolkit to run Ray applications on Kubernetes

    KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers several key components. KubeRay core: This is the official, fully-maintained component of KubeRay that provides three custom resource definitions, RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 8
    nango

    nango

    A single API for all your integrations.

    Nango is a single API to interact with all other external APIs. It should be the only API you need to integrate to your app. Nango is an open-source solution for integrating third-party APIs with applications, simplifying API authentication, data syncing, and management.
    Downloads: 3 This Week
    Last Update:
    See Project
  • 9
    Alova.js

    Alova.js

    Workflow-Streamlined next-generation request tools

    Extremely streamline API integration workflow. Quickly find APIs in the editor, and enjoy full type hints even in js projects with the API code automatically generated by Alova's extension. Request in various complex scenes by one line of code. Automatically manage paging data, and data preloading, reduce unnecessary data refresh, improve fluency by 300%, and reduce coding difficulty by 50%. Send requests immediately by watching state changes, useful in tab switching and condition querying. Global interceptor that supports silent token refresh, as well as providing unified management of token-based login, logout, token assignment, and token refresh.
    Downloads: 2 This Week
    Last Update:
    See Project
  • Build Securely on AWS with Proven Frameworks Icon
    Build Securely on AWS with Proven Frameworks

    Lay a foundation for success with Tested Reference Architectures developed by Fortinet’s experts. Learn more in this white paper.

    Moving to the cloud brings new challenges. How can you manage a larger attack surface while ensuring great network performance? Turn to Fortinet’s Tested Reference Architectures, blueprints for designing and securing cloud environments built by cybersecurity experts. Learn more and explore use cases in this white paper.
    Download Now
  • 10
    Apache DevLake

    Apache DevLake

    Apache DevLake is an open-source dev data platform

    Apache DevLake is an open-source dev data platform that ingests, analyzes, and visualizes the fragmented data from DevOps tools to extract insights for engineering excellence, developer experience, and community growth. Apache DevLake is designed for developer teams looking to make better sense of their development process and to bring a more data-driven approach to their own practices. You can ask Apache DevLake many questions regarding your development process. Just connect and query. Your Dev Data lives in many silos and tools. DevLake brings them all together to give you a complete view of your Software Development Life Cycle (SDLC). From DORA to scrum retros, DevLake implements metrics effortlessly with prebuilt dashboards supporting common frameworks and goals. DevLake fits teams of all shapes and sizes, and can be readily extended to support new data sources, metrics, and dashboards, with a flexible framework for data collection and transformation.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 11
    Common Core Ontologies

    Common Core Ontologies

    The Common Core Ontology Repository

    The Common Core Ontologies (CCO) comprise twelve ontologies that are designed to represent and integrate taxonomies of generic classes and relations across all domains of interest. CCO is a mid-level extension of Basic Formal Ontology (BFO), an upper-level ontology framework widely used to structure and integrate ontologies in the biomedical domain (Arp, et al., 2015). BFO aims to represent the most generic categories of entity and the most generic types of relations that hold between them, by defining a small number of classes and relations. CCO then extends from BFO in the sense that every class in CCO is asserted to be a subclass of some class in BFO, and that CCO adopts the generic relations defined in BFO (e.g., has_part) (Smith and Grenon, 2004). Accordingly, CCO classes and relations are heavily constrained by the BFO framework, from which it inherits much of its basic semantic relationships.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 12
    Jitsu

    Jitsu

    Jitsu is an open-source Segment alternative

    Jitsu is a fully-scriptable data ingestion engine for modern data teams. Set-up a real-time data pipeline in minutes, not days. Installing Jitsu is a matter of selecting your framework and adding few lines of code to your app. Jitsu is built to be framework agnostic, so regardless of your stack, we have a solution that'll work for your team. Connect data warehouse (Snowflake, Clickhouse, BigQuery, S3, Redshift ot Postgres) and query your data instantly. Jitsu can either stream data in real-time or send it in micro-batches (up to once a minute). Apply any transformation with Jitsu. Just write JavaScript code right in the UI to do anything with incoming data. And yes, the code editor supports code completion, debugging and many more. It feels like a full-featured IDE!
    Downloads: 2 This Week
    Last Update:
    See Project
  • 13
    PHPCI

    PHPCI

    PHPCI is a free and open source continuous integration tool

    PHPCI is a continuous integration (CI) server designed specifically for PHP applications. It automates tasks such as testing, code quality checks, and deployment, helping developers maintain code consistency and detect issues early. PHPCI supports various plugins and tools, including PHPUnit, PHPMD, and Codeception, making it highly customizable for different project needs.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 14
    harmonypy

    harmonypy

    Integrate multiple high-dimensional datasets with fuzzy k-means

    Harmony is an algorithm for integrating multiple high-dimensional datasets. harmonypy is a port of the harmony R package by Ilya Korsunsky. Harmony is a general-purpose R package with an efficient algorithm for integrating multiple data sets. It is especially useful for large single-cell datasets such as single-cell RNA-seq.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 15
    Apache Hudi

    Apache Hudi

    Upserts, Deletes And Incremental Processing on Big Data

    Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics. Hudi provides efficient upserts, by mapping a given hoodie key (record key + partition path) consistently to a file id, via an indexing mechanism. This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    ExAws

    ExAws

    A flexible, easy to use set of clients AWS APIs for Elixir

    ExAws is a comprehensive Elixir client library for interfacing with AWS services. It provides low-level request builders for nearly all AWS APIs—like S3, EC2, Lambda, DynamoDB, SQS, SES, Route 53, and more—while supporting streaming, request configuration overrides, telemetry, flexible HTTP clients, and codecs. Its modular architecture enables importing only the services you need with separate packages (e.g., ex_aws_s3, ex_aws_ec2).
    Downloads: 1 This Week
    Last Update:
    See Project
  • 17
    Harmony Data Integration

    Harmony Data Integration

    Fast, sensitive and accurate integration of single-cell data

    Harmony is a general-purpose R package with an efficient algorithm for integrating multiple data sets. It is especially useful for large single-cell datasets such as single-cell RNA-seq. Harmony has been tested on R versions =4. Please consult the DESCRIPTION file for more details on required R packages. Harmony has been tested on Linux, OS X, and Windows platforms.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 18
    Hetionet

    Hetionet

    Hetionet: an integrative network of disease

    Hetionet is a hetnet — network with multiple node and edge (relationship) types — which encodes biology. The hetnet was designed for Project Rephetio, which aims to systematically identify why drugs work and predict new therapies for drugs. The JSON and Neo4j formats contain node and edge properties, which are absent in the TSV and matrix formats, including licensing information. Therefore the recommended formats are JSON and Neo4j. Our hetio package in Python reads the JSON format, but it is otherwise a simple yet new format. The Neo4j graph database has an established and thriving ecosystem. However, if you would like to access Hetionet without Neo4j, then we suggest the JSON format. The matrix format refers to HetMat archives, which store edge adjacency matrices on disk. Additional usage information is available at the corresponding download locations.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    Mara Pipelines

    Mara Pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts

    This package contains a lightweight data transformation framework with a focus on transparency and complexity reduction. Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code. PostgreSQL as a data processing engine. Extensive web ui. The web browser as the main tool for inspecting, running and debugging pipelines. GNU make semantics. Nodes depend on the completion of upstream nodes. No data dependencies or data flows. No in-app data processing: command line tools as the main tool for interacting with databases and data. Single machine pipeline execution based on Python's multiprocessing. No need for distributed task queues. Easy debugging and output logging. Cost based priority queues: nodes with higher cost (based on recorded run times) are run first.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    PANDORA

    PANDORA

    Revolutionizing Biomedical Research with Advanced Machine Learning

    PANDORA is a machine learning (ML) tool that can be used to integrate various data types, including clinical, transcriptome and microbiome data and find connections in large datasets. PANDORA can be easily installed using Docker, a pre-built version of the software can be pulled from DockerHub. In order to run a test instance of PANDORA, users will first need to prepare their local environment by downloading, installing, and configuring Docker. genular is a community behind SIMON an open-source Machine Learning KnowledgeDiscovery software, built by a vibrant community of people just like you! Join us and make SIMON even cooler! Exploratory analysis of machine learning results with the help of many different visualization techniques will give you instant insights into models and data.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 21
    PhantomJS-Node

    PhantomJS-Node

    PhantomJS integration module for NodeJS

    PhantomJS-Node is a Node.js bridge to PhantomJS, enabling programmatic control of the headless browser for tasks like web scraping, automated testing, and page rendering.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 22
    Recap

    Recap

    Recap tracks and transform schemas across your whole application

    Recap is a schema language and multi-language toolkit to track and transform schemas across your whole application. Your data passes through web services, databases, message brokers, and object stores. Recap describes these schemas in a single language, regardless of which system your data passes through. Recap schemas can be defined in YAML, TOML, JSON, XML, or any other compatible language.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 23
    nichenetr

    nichenetr

    NicheNet: predict active ligand-target links between interacting cells

    nichenetr: the R implementation of the NicheNet method. The goal of NicheNet is to study intercellular communication from a computational perspective. NicheNet uses human or mouse gene expression data of interacting cells as input and combines this with a prior model that integrates existing knowledge on ligand-to-target signaling paths. This allows to predict ligand-receptor interactions that might drive gene expression changes in cells of interest. This model of prior information on potential ligand-target links can then be used to infer active ligand-target links between interacting cells. NicheNet prioritizes ligands according to their activity (i.e., how well they predict observed changes in gene expression in the receiver cell) and looks for affected targets with high potential to be regulated by these prioritized ligands.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    scArches

    scArches

    Reference mapping for single-cell genomics

    Single-cell architecture surgery (scArches) is a package for reference-based analysis of single-cell data. scArches allows your single-cell query data to be analyzed by integrating it into a reference atlas. By mapping your data into an integrated reference you can transfer cell-type annotation from reference to query, identify disease states by mapping to healthy atlas, and advanced applications such as imputing missing data modalities or spatial locations.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 25
    Open Source Data Quality and Profiling

    Open Source Data Quality and Profiling

    World's first open source data quality & data preparation project

    This project is dedicated to open source data quality and data preparation solutions. Data Quality includes profiling, filtering, governance, similarity check, data enrichment alteration, real time alerting, basket analysis, bubble chart Warehouse validation, single customer view etc. defined by Strategy. This tool is developing high performance integrated data management platform which will seamlessly do Data Integration, Data Profiling, Data Quality, Data Preparation, Dummy Data Creation, Meta Data Discovery, Anomaly Discovery, Data Cleansing, Reporting and Analytic. It also had Hadoop ( Big data ) support to move files to/from Hadoop Grid, Create, Load and Profile Hive Tables. This project is also known as "Aggregate Profiler" Resful API for this project is getting built as (Beta Version) https://sourceforge.net/projects/restful-api-for-osdq/ apache spark based data quality is getting built at https://sourceforge.net/projects/apache-spark-osdq/
    Downloads: 5 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • Next
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.