Data Science Tools for Linux

View 17 business solutions

Browse free open source Data Science tools and projects for Linux below. Use the toggles on the left to filter open source Data Science tools by OS, license, language, programming language, and project status.

  • Secure File Transfer for Windows with Cerberus by Redwood Icon
    Secure File Transfer for Windows with Cerberus by Redwood

    Protect and share files over FTP/S, SFTP, HTTPS and SCP with the #1 rated Windows file transfer server.

    Cerberus supports unlimited users and connections on a single IP, with built-in encryption, 2FA, and a browser-based web client — all deployable in under 15 minutes with a 25-day free trial.
    Try for Free
  • $300 Free Credits for Your Google Cloud Projects Icon
    $300 Free Credits for Your Google Cloud Projects

    Start building on Google Cloud with $300 in free credits. No commitment, no credit card required until you're ready to scale.

    Launch your next project with $300 in free Google Cloud credits—no strings attached. Test, build, and deploy without risk. Use your credits across the entire Google Cloud platform to find what works best for your needs. After your credits are used, continue with always-free tier services. Only pay when you're ready to scale. Sign up in minutes and start exploring.
    Start Free Trial
  • 1
    RStudio

    RStudio

    RStudio is an integrated development environment (IDE) for R

    RStudio is a powerful, full-featured integrated development environment (IDE) tailored primarily for the R programming language but increasingly supportive of other languages like Python and Julia. It brings together console, editor, plotting, workspace, history, and file-management panes into a unified interface, helping data scientists, statisticians, and analysts to work more productively. The IDE is cross-platform: there are desktop versions for Windows, macOS and Linux, as well as a server version for remote or multi-user deployment via a web browser. In addition to code editing and execution, RStudio offers extensive support for reproducible research via R Markdown, notebooks, and integration with version control systems like Git and SVN. Package development is built in, with tooling for building, checking, and testing R packages, plus integration with documentation tools, CRAN submission workflows, and project templates.
    Downloads: 113 This Week
    Last Update:
    See Project
  • 2
    Quadratic

    Quadratic

    Data science spreadsheet with Python & SQL

    Quadratic enables your team to work together on data analysis to deliver better results, faster. You already know how to use a spreadsheet, but you’ve never had this much power before. Quadratic is a Web-based spreadsheet application that runs in the browser and as a native app (via Electron). Our goal is to build a spreadsheet that enables you to pull your data from its source (SaaS, Database, CSV, API, etc) and then work with that data using the most popular data science tools today (Python, Pandas, SQL, JS, Excel Formulas, etc). Quadratic has no environment to configure. The grid runs entirely in the browser with no backend service. This makes our grids completely portable and very easy to share. Quadratic has Python library support built-in. Bring the latest open-source tools directly to your spreadsheet. Quickly write code and see the output in full detail. No more squinting into a tiny terminal to see your data output.
    Downloads: 21 This Week
    Last Update:
    See Project
  • 3
    AWS SDK for pandas

    AWS SDK for pandas

    Easy integration with Athena, Glue, Redshift, Timestream, Neptune

    aws-sdk-pandas (formerly AWS Data Wrangler) bridges pandas with the AWS analytics stack so DataFrames flow seamlessly to and from cloud services. With a few lines of code, you can read from and write to Amazon S3 in Parquet/CSV/JSON/ORC, register tables in the AWS Glue Data Catalog, and query with Amazon Athena directly into pandas. The library abstracts efficient patterns like partitioning, compression, and vectorized I/O so you get performant data lake operations without hand-rolling boilerplate. It also supports Redshift, OpenSearch, and other services, enabling ETL tasks that blend SQL engines and Python transformations. Operational helpers handle IAM, sessions, and concurrency while exposing knobs for encryption, versioning, and catalog consistency. The result is a productive workflow that keeps your analytics in Python while leveraging AWS-native storage and query engines at scale.
    Downloads: 13 This Week
    Last Update:
    See Project
  • 4
    Milvus

    Milvus

    Vector database for scalable similarity search and AI applications

    Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. Average latency measured in milliseconds on trillion vector datasets. Rich APIs designed for data science workflows. Consistent user experience across laptop, local cluster, and cloud. Embed real-time search and analytics into virtually any application. Milvus’ built-in replication and failover/failback features ensure data and applications can maintain business continuity in the event of a disruption. Component-level scalability makes it possible to scale up and down on demand.
    Downloads: 11 This Week
    Last Update:
    See Project
  • Build Agents and Models on One Platform Icon
    Build Agents and Models on One Platform

    Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

    Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.
    Try It Free
  • 5
    ggplot2

    ggplot2

    An implementation of the Grammar of Graphics in R

    ggplot2 is a system written in R for declaratively creating graphics. It is based on The Grammar of Graphics, which focuses on following a layered approach to describe and construct visualizations or graphics in a structured manner. With ggplot2 you simply provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it will take care of the rest. ggplot2 is over 10 years old and is used by hundreds of thousands of people all over the world for plotting. In most cases using ggplot2 starts with supplying a dataset and aesthetic mapping (with aes()); adding on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), and faceting specifications (like facet_wrap()); and finally, coordinating systems. ggplot2 has a rich ecosystem of community-maintained extensions for those looking for more innovation. ggplot2 is a part of the tidyverse, an ecosystem of R packages designed for data science.
    Downloads: 11 This Week
    Last Update:
    See Project
  • 6
    LIFELINES

    LIFELINES

    Survival analysis in Python

    LIFELINES is a pure Python library for survival analysis, a statistical field focused on modeling time until an event occurs. It can be used for traditional cases like medical survival time, but also for business and product questions such as churn, subscription length, equipment failure, and customer retention. The library includes estimators such as Kaplan-Meier, Nelson-Aalen, and regression-based survival models. It is designed to be accessible to Python users and works well with common scientific computing workflows. Built-in plotting methods and datasets help users explore survival curves and compare groups visually. It is a practical tool for analysts, researchers, and data scientists who need event-time modeling without leaving Python.
    Downloads: 10 This Week
    Last Update:
    See Project
  • 7
    Metaflow

    Metaflow

    A framework for real-life data science

    Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 8
    gophernotes

    gophernotes

    The Go kernel for Jupyter notebooks and nteract

    gophernotes is a Go kernel for Jupyter notebooks and nteract. It lets you use Go interactively in a browser-based notebook or desktop app. Use gophernotes to create and share documents that contain live Go code, equations, visualizations and explanatory text. These notebooks, with the live Go code, can then be shared with others via email, Dropbox, GitHub and the Jupyter Notebook Viewer. Go forth and do data science, or anything else interesting, with Go notebooks! This project utilizes a Go interpreter called gomacro under the hood to evaluate Go code interactively. The gophernotes logo was designed by the brilliant Marcus Olsson and was inspired by Renee French's original Go Gopher design. If you have the JUPYTER_PATH environmental variable set or if you are using an older version of Jupyter, you may need to copy this kernel config to another directory.
    Downloads: 9 This Week
    Last Update:
    See Project
  • 9
    NannyML

    NannyML

    Detecting silent model failure. NannyML estimates performance

    NannyML is an open-source python library that allows you to estimate post-deployment model performance (without access to targets), detect data drift, and intelligently link data drift alerts back to changes in model performance. Built for data scientists, NannyML has an easy-to-use interface, and interactive visualizations, is completely model-agnostic, and currently supports all tabular classification use cases. NannyML closes the loop with performance monitoring and post deployment data science, empowering data scientist to quickly understand and automatically detect silent model failure. By using NannyML, data scientists can finally maintain complete visibility and trust in their deployed machine learning models. When the actual outcome of your deployed prediction models is delayed, or even when post-deployment target labels are completely absent, you can use NannyML's CBPE-algorithm to estimate model performance.
    Downloads: 8 This Week
    Last Update:
    See Project
  • Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure Icon
    Stop Cyber Threats with VM-Series Next-Gen Firewall on Azure

    Native application identity and user-based security for your Azure cloud

    Gain integrated visibility across all traffic in a single pass. Deploy Palo Alto Networks VM-Series to determine application identity and content while automating security policy updates via rich APIs.
    Get a free trial
  • 10
    PySyft

    PySyft

    Data science on data without acquiring a copy

    Most software libraries let you compute over the information you own and see inside of machines you control. However, this means that you cannot compute on information without first obtaining (at least partial) ownership of that information. It also means that you cannot compute using machines without first obtaining control over those machines. This is very limiting to human collaboration and systematically drives the centralization of data, because you cannot work with a bunch of data without first putting it all in one (central) place. The Syft ecosystem seeks to change this system, allowing you to write software which can compute over information you do not own on machines you do not have (total) control over. This not only includes servers in the cloud, but also personal desktops, laptops, mobile phones, websites, and edge devices. Wherever your data wants to live in your ownership, the Syft ecosystem exists to help keep it there while allowing it to be used privately.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 11
    CSAPP-Labs

    CSAPP-Labs

    Solutions and Notes for Labs of Computer Systems

    CSAPP-Labs is a repository that organizes and provides practical lab exercises corresponding to the famous textbook Computer Systems: A Programmer’s Perspective (CS:APP), helping students deepen their understanding of how computer systems work at the machine level. The exercises cover core topics such as data representation, assembly language, processor architecture, cache behavior, memory hierarchy, linking, and concurrency, contextualizing abstract concepts from the book in real code and experiments. Each lab is structured to include test programs, Makefiles, harnesses, and step-by-step instructions that guide students through hands-on interaction with low-level programming and system behavior. By actually building and debugging code that runs close to hardware, learners acquire intuition about performance trade-offs, bit-level manipulation, stack frame layout, and how compilers and OS features influence execution.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 12
    Cookiecutter Data Science

    Cookiecutter Data Science

    Project structure for doing and sharing data science work

    A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. When we think about data analysis, we often think just about the resulting reports, insights, or visualizations. While these end products are generally the main event, it's easy to focus on making the products look nice and ignore the quality of the code that generates them. Because these end products are created programmatically, code quality is still important! And we're not talking about bikeshedding the indentation aesthetics or pedantic formatting standards, ultimately, data science code quality is about correctness and reproducibility. It's no secret that good analyses are often the result of very scattershot and serendipitous explorations. Tentative experiments and rapidly testing approaches that might not work out are all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 13
    Nuclio

    Nuclio

    High-Performance Serverless event and data processing platform

    Nuclio is an open source and managed serverless platform used to minimize development and maintenance overhead and automate the deployment of data-science-based applications. Real-time performance running up to 400,000 function invocations per second. Portable across low laptops, edge, on-prem and multi-cloud deployments. The first serverless platform supporting GPUs for optimized utilization and sharing. Automated deployment to production in a few clicks from Jupyter notebook. Deploy one of the example serverless functions or write your own. The dashboard, when running outside an orchestration platform (e.g. Kubernetes or Swarm), will simply be deployed to the local docker daemon. The Getting Started With Nuclio On Kubernetes guide has a complete step-by-step guide to using Nuclio serverless functions over Kubernetes.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 14
    SageMaker Training Toolkit

    SageMaker Training Toolkit

    Train machine learning models within Docker containers

    Train machine learning models within a Docker container using Amazon SageMaker. Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. You can use Amazon SageMaker to simplify the process of building, training, and deploying ML models. To train a model, you can include your training script and dependencies in a Docker container that runs your training code. A container provides an effectively isolated environment, ensuring a consistent runtime and reliable training process. The SageMaker Training Toolkit can be easily added to any Docker container, making it compatible with SageMaker for training models. If you use a prebuilt SageMaker Docker image for training, this library may already be included. Write a training script (eg. train.py). Define a container with a Dockerfile that includes the training script and any dependencies.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 15
    XGBoost

    XGBoost

    Scalable and Flexible Gradient Boosting

    XGBoost is an optimized distributed gradient boosting library, designed to be scalable, flexible, portable and highly efficient. It supports regression, classification, ranking and user defined objectives, and runs on all major operating systems and cloud platforms. XGBoost works by implementing machine learning algorithms under the Gradient Boosting framework. It also offers parallel tree boosting (GBDT, GBRT or GBM) that can quickly and accurately solve many data science problems. XGBoost can be used for Python, Java, Scala, R, C++ and more. It can run on a single machine, Hadoop, Spark, Dask, Flink and most other distributed environments, and is capable of solving problems beyond billions of examples.
    Downloads: 7 This Week
    Last Update:
    See Project
  • 16
    ClearML

    ClearML

    Streamline your ML workflow

    ClearML is an open source platform that automates and simplifies developing and managing machine learning solutions for thousands of data science teams all over the world. It is designed as an end-to-end MLOps suite allowing you to focus on developing your ML code & automation, while ClearML ensures your work is reproducible and scalable. The ClearML Python Package for integrating ClearML into your existing scripts by adding just two lines of code, and optionally extending your experiments and other workflows with ClearML powerful and versatile set of classes and methods. The ClearML Server storing experiment, model, and workflow data, and supports the Web UI experiment manager, and ML-Ops automation for reproducibility and tuning. It is available as a hosted service and open source for you to deploy your own ClearML Server. The ClearML Agent for ML-Ops orchestration, experiment and workflow reproducibility, and scalability.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 17
    Great Expectations

    Great Expectations

    Always know what to expect from your data

    Great Expectations helps data teams eliminate pipeline debt, through data testing, documentation, and profiling. Software developers have long known that testing and documentation are essential for managing complex codebases. Great Expectations brings the same confidence, integrity, and acceleration to data science and data engineering teams. Expectations are assertions for data. They are the workhorse abstraction in Great Expectations, covering all kinds of common data issues. Expectations are a great start, but it takes more to get to production-ready data validation. Where are Expectations stored? How do they get updated? How do you securely connect to production data systems? How do you notify team members and triage when data validation fails? Great Expectations supports all of these use cases out of the box. Instead of building these components for yourself over weeks or months, you will be able to add production-ready validation to your pipeline in a day.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 18
    LIFETIMES

    LIFETIMES

    Lifetime value in Python

    LIFETIMES is a Python library for customer lifetime value and repeat purchase behavior modeling. It helps analysts estimate how frequently customers may return, how long they may remain active, and how much value they may generate over time. The library is built around probabilistic models commonly used in customer analytics, including transaction frequency and monetary value modeling. It is useful for ecommerce, subscription-adjacent businesses, retail analytics, and retention analysis. The repository is now archived, so it should be treated as a stable historical project rather than an actively developed package. For current work, its ideas remain valuable, but teams may want to consider newer successor tools for ongoing production use.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 19
    Positron

    Positron

    Positron, a next-generation data science IDE

    Positron is a next-generation integrated development environment (IDE) created by Posit PBC (formerly RStudio Inc) specifically tailored for data science workflows in Python, R, and multi-language ecosystems. It aims to unify exploratory data analysis, production code, and data-app authoring in a single environment so that data scientists move from “question → insight → application” without switching tools. Built on the open-source Code-OSS foundation, Positron provides a familiar coding experience along with specialized panes and tooling for variable inspection, data-frame viewing, plotting previews, and interactive consoles designed for analytical work. The IDE supports notebook and script workflows, integration of data-app frameworks (such as Shiny, Streamlit, Dash), database and cloud connections, and built-in AI-assisted capabilities to help write code, explore data, and build models.
    Downloads: 6 This Week
    Last Update:
    See Project
  • 20
    Dask

    Dask

    Parallel computing with task scheduling

    Dask is a Python library for parallel and distributed computing, designed to scale analytics workloads from single machines to large clusters. It integrates with familiar tools like NumPy, Pandas, and scikit-learn while enabling execution across cores or nodes with minimal code changes. Dask excels at handling large datasets that don’t fit into memory and is widely used in data science, machine learning, and big data pipelines.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 21
    NVIDIA Merlin

    NVIDIA Merlin

    Library providing end-to-end GPU-accelerated recommender systems

    NVIDIA Merlin is an open-source library that accelerates recommender systems on NVIDIA GPUs. The library enables data scientists, machine learning engineers, and researchers to build high-performing recommenders at scale. Merlin includes tools to address common feature engineering, training, and inference challenges. Each stage of the Merlin pipeline is optimized to support hundreds of terabytes of data, which is all accessible through easy-to-use APIs. For more information, see NVIDIA Merlin on the NVIDIA developer website. Transform data (ETL) for preprocessing and engineering features. Accelerate your existing training pipelines in TensorFlow, PyTorch, or FastAI by leveraging optimized, custom-built data loaders. Scale large deep learning recommender models by distributing large embedding tables that exceed available GPU and CPU memory. Deploy data transformations and trained models to production with only a few lines of code.
    Downloads: 5 This Week
    Last Update:
    See Project
  • 22
    Data Science Specialization

    Data Science Specialization

    Course materials for the Data Science Specialization on Coursera

    The Data Science Specialization Courses repository is a collection of materials that support the Johns Hopkins University Data Science Specialization on Coursera. It contains the source code and resources used throughout the specialization’s courses, covering a broad range of data science concepts and techniques. The repository is designed as a shared space for code examples, datasets, and instructional materials, helping learners follow along with lectures and assignments. It spans essential topics such as R programming, data cleaning, exploratory data analysis, statistical inference, regression models, machine learning, and practical data science projects. By providing centralized resources, the repo makes it easier for students to practice concepts and replicate examples from the curriculum. It also offers a structured view of how multiple disciplines—programming, statistics, and applied data analysis—come together in a professional workflow.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 23
    marimo

    marimo

    A reactive notebook for Python

    marimo is an open-source reactive notebook for Python, reproducible, git-friendly, executable as a script, and shareable as an app. marimo notebooks are reproducible, extremely interactive, designed for collaboration (git-friendly!), deployable as scripts or apps, and fit for modern Pythonista. Run one cell and marimo reacts by automatically running affected cells, eliminating the error-prone chore of managing the notebook state. marimo's reactive UI elements, like data frame GUIs and plots, make working with data feel refreshingly fast, futuristic, and intuitive. Version with git, run as Python scripts, import symbols from a notebook into other notebooks or Python files, and lint or format with your favorite tools. You'll always be able to reproduce your collaborators' results. Notebooks are executed in a deterministic order, with no hidden state, delete a cell and marimo deletes its variables while updating affected cells.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 24
    tsfresh

    tsfresh

    Automatic extraction of relevant features from time series

    tsfresh is a python package. It automatically calculates a large number of time series characteristics, the so called features. tsfresh is used to to extract characteristics from time series. Without tsfresh, you would have to calculate all characteristics by hand. With tsfresh this process is automated and all your features can be calculated automatically. Further tsfresh is compatible with pythons pandas and scikit-learn APIs, two important packages for Data Science endeavours in python. The extracted features can be used to describe or cluster time series based on the extracted characteristics. Further, they can be used to build models that perform classification/regression tasks on the time series. Often the features give new insights into time series and their dynamics.
    Downloads: 4 This Week
    Last Update:
    See Project
  • 25
    Rodeo

    Rodeo

    A data science IDE for Python

    A data science IDE for Python. RODEO, that is an open-source python IDE and has been brought up by the folks at yhat, is a development environment that is lightweight, intuitive and yet customizable to its very core and also contains all the features mentioned above that were searched for so long. It is just like your very own personal home base for exploration and interpretation of data that aims at Data Scientists and answers the main question, "Is there anything like RStudio for Python?" Rodeo makes it very easy for its users to explore what is created by them and also alongside allows the users to Inspect, interact, compare data frames, plots and even much more. It is an IDE that has been built especially for data science/Machine Learning in Python and you can also very simply think of it as a light weight alternative to the IPython Notebook.
    Downloads: 3 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • Next
Auth0 Logo