Best Open Source Linux Big Data Tools 2024

Big Data Tools for Linux

1

GnuCopy

GnuCopy is an open-source tool to copy and archive all your important data. It supports the major archive types, such as Zip and Tar, to guarantee an easy and secure exchange between all types of operating systems. Additionally, you can create profiles to blacklist or whitelist specific file types or folders to separate your big data stores for backups.
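The whitelist/blacklist profiles mentioned above can be pictured as a simple path filter. This is only a minimal sketch, assuming glob-style patterns: the function name `select_files` and the pattern semantics are hypothetical and are not GnuCopy's actual profile format.

```python
import fnmatch
from typing import Iterable, List, Sequence


def select_files(
    paths: Iterable[str],
    whitelist: Sequence[str] = (),
    blacklist: Sequence[str] = (),
) -> List[str]:
    """Return the paths a backup profile would keep.

    Hypothetical semantics: an empty whitelist admits everything,
    and a blacklist match always excludes a path.
    """
    selected = []
    for path in paths:
        # Keep only whitelisted paths when a whitelist is given.
        if whitelist and not any(fnmatch.fnmatchcase(path, p) for p in whitelist):
            continue
        # Drop anything that matches a blacklist pattern.
        if any(fnmatch.fnmatchcase(path, p) for p in blacklist):
            continue
        selected.append(path)
    return selected


candidates = ["data/report.csv", "data/tmp/cache.csv", "data/app.log"]
print(select_files(candidates, whitelist=["*.csv"], blacklist=["data/tmp/*"]))
```

The selected paths could then be handed to an archiver such as Python's `tarfile` or `zipfile` modules.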

Downloads: 1 This Week

Last Update: 2023-07-28
2

NCHC-Storm

NCHC's Storm Team

Shares the Storm applications developed by NCHC's Storm Team.

Downloads: 1 This Week

Last Update: 2022-12-21
3

apache spark data pipeline osDQ

An osDQ subproject for building Apache Spark-based data pipelines from JSON

This is an offshoot of the open source data quality (osDQ) project, https://sourceforge.net/projects/dataquality/. The subproject builds Apache Spark-based data pipelines in which a JSON metadata file drives data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode. A JSON example is available at https://github.com/arrahtech/osdq-spark

How to run: unzip the zip file, then run the command for your platform.

Windows:
java -cp .\lib\*;osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

Mac/UNIX:
java -cp ./lib/*:./osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json

On Windows, you need a Hadoop distribution unzipped on a local drive and HADOOP_HOME set. Also copy winutils.exe into HADOOP_HOME\bin

Downloads: 1 This Week

Last Update: 2019-01-20
4

.NET for Apache Spark

A free, open-source, and cross-platform big data analytics framework

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, reusing all the knowledge, skills, code, and libraries you already have as a .NET developer. .NET for Apache Spark runs on Windows, Linux, and macOS using .NET Core, or on Windows using .NET Framework. It also runs on all major cloud providers, including Azure HDInsight Spark, Amazon EMR Spark, and AWS and Azure Databricks.

Downloads: 0 This Week

Last Update: 2022-06-01
5

An introduction to Data Analysis in R

A guide for learning the basic tools of data analysis with R

An Introduction to Data Analysis in R [Book]. A guide for learning the basic tools of data analysis: process, visualize, and learn from your data using R programming. This repository holds the data sets needed for the book "An Introduction to Data Analysis in R", to be published in the Springer series Use R!. The book can be purchased in XXX. The book is meant as an introductory guide to manipulating data sets in the Big Data paradigm. One of its main goals is to take the analyst from the first contact with the data to the final conclusions and the presentation of results. We take into account the variety of fields where data analysis occurs nowadays. We pay special attention to the different ways of obtaining data and making it manageable before starting the analysis. The data analysis includes most of the basic visualization options and some advanced extras. Finally, basic statistics are used to learn from the processed data.

Downloads: 0 This Week

Last Update: 2020-02-08
6

Arroyo

Distributed stream processing engine in Rust

Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data. Unlike traditional batch processing, streaming engines can operate on both bounded and unbounded sources, emitting results as soon as they are available.
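As an illustration of a stateful computation that emits results as soon as they are available, here is a minimal single-process sketch of a tumbling-window count. It is plain Python and does not use Arroyo's API; the function name `tumbling_counts` and the assumption that events arrive in time order are choices made for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (event time in seconds, key)


def tumbling_counts(
    events: Iterable[Event], width: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive in non-decreasing time order, so a window can
    be emitted as soon as an event from a later window shows up.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)  # the operator's state
    for timestamp, key in events:
        start = timestamp - timestamp % width
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and reset the state.
            yield current_start, dict(counts)
            counts.clear()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        # Bounded input: flush the last open window at end of stream.
        yield current_start, dict(counts)


stream = [(0, "a"), (3, "b"), (4, "a"), (10, "a"), (12, "b"), (25, "a")]
for window_start, window_counts in tumbling_counts(stream, width=10):
    print(window_start, window_counts)
```

Because the function is a generator, the same code works on an unbounded source: each window is yielded when it closes, without waiting for the input to end.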

Downloads: 0 This Week

Last Update: 5 days ago
7

BEAR

CBR Meets Big Data

Case-based regression learner for big data. The package contains source and binary files for running BEAR's method. BEAR utilizes EAR4 and locality sensitive hashing in its implementation.
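To give a feel for how locality sensitive hashing can support case-based regression, the sketch below hashes feature vectors with random hyperplanes and predicts the mean target of the cases that land in the same bucket. It is an assumption-laden illustration, not BEAR's method: it does not implement EAR4, and all names (`make_hyperplanes`, `signature`, `build_index`, `predict`) are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Case = Tuple[Sequence[float], float]  # (feature vector, target value)


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `bits` random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vector: Sequence[float], planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
        for plane in planes
    )


def build_index(cases: Sequence[Case], planes: List[List[float]]) -> Dict[Tuple[int, ...], List[Case]]:
    """Group stored cases by signature so similar cases share a bucket."""
    buckets: Dict[Tuple[int, ...], List[Case]] = defaultdict(list)
    for features, target in cases:
        buckets[signature(features, planes)].append((features, target))
    return buckets


def predict(query: Sequence[float], buckets, planes) -> Optional[float]:
    """Average the targets of the cases in the query's bucket, if any."""
    matches = buckets.get(signature(query, planes), [])
    if not matches:
        return None
    return sum(target for _, target in matches) / len(matches)


planes = make_hyperplanes(dim=2, bits=8, seed=42)
index = build_index([([1.0, 2.0], 10.0), ([2.0, 4.0], 20.0), ([-1.0, -2.0], 99.0)], planes)
print(predict([4.0, 8.0], index, planes))
```

Vectors pointing in the same direction always share a signature, so a lookup only touches one bucket instead of scanning every stored case.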

Downloads: 0 This Week

Last Update: 2015-08-11
8

Big Sack

Big Sack: A lightweight Java Key/Value store with undo and disk cache.

Big Sack is a Java persistence mechanism that stores key/value pairs following the popular Big Data paradigms. It's a very simple and straightforward way to bridge the gap between in-memory data structures and long-term storage. It offers the convenience of the Java SDK TreeMap and TreeSet classes and is used in the same easy way, but it includes rollback through undo logging to checkpoint data, so the store does not wind up in an unknown state regardless of failures. Data storage in the exabyte range is possible using filesystem and/or memory-mapped IO. Three levels of configurable write-through caching at different granularities ensure performance.
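The idea of rollback through undo logging can be sketched in a few lines. The class below is an in-memory toy, not Big Sack's API or on-disk format: the name `UndoLogStore` and its methods are hypothetical, and sorted iteration merely imitates the ordering a TreeMap would give.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # marks "key did not exist before this write"


class UndoLogStore:
    """Key/value store that can roll back to the last checkpoint."""

    def __init__(self) -> None:
        self._data: Dict[Any, Any] = {}
        self._undo: List[Tuple[Any, Any]] = []  # (key, previous value)

    def put(self, key: Any, value: Any) -> None:
        # Record the old value before overwriting it.
        self._undo.append((key, self._data.get(key, _MISSING)))
        self._data[key] = value

    def remove(self, key: Any) -> None:
        if key in self._data:
            self._undo.append((key, self._data.pop(key)))

    def checkpoint(self) -> None:
        """Make all changes so far permanent by discarding the undo log."""
        self._undo.clear()

    def rollback(self) -> None:
        """Undo every change since the last checkpoint, newest first."""
        while self._undo:
            key, old = self._undo.pop()
            if old is _MISSING:
                self._data.pop(key, None)
            else:
                self._data[key] = old

    def items(self) -> List[Tuple[Any, Any]]:
        """Entries in key order, like iterating a sorted map."""
        return sorted(self._data.items())


store = UndoLogStore()
store.put("b", 2)
store.put("a", 1)
store.checkpoint()
store.put("a", 100)
store.remove("b")
store.put("c", 3)
store.rollback()
print(store.items())
```

Replaying the log in reverse order is what guarantees that a key written several times since the checkpoint ends up with its checkpointed value.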

Downloads: 0 This Week

Last Update: 2013-12-21
9

CherOS

CherOS is a GNU/Linux distribution based on antiX-22, which is in turn based on Debian. CherOS is 100% free for anyone who wants to use it and, if possible, share it. It was created to be a lightweight distro for Python programmers. CherOS includes tools for: databases; web, mobile, and desktop development; trading; data science; big data; artificial intelligence; machine learning; augmented reality; cybersecurity; ethical hacking; testing; and programming language development. The purpose of CherOS is to be a distro whose users do not have to bother searching for tools to program in Python and, above all, have at hand a system that is both lightweight and powerful. MINIMUM REQUIREMENTS: -Processor/CPU: 1 GHz with 1 core -RAM: 512 MB -Storage: 20 GB

Downloads: 0 This Week

Last Update: 2023-06-10
10

Chordalysis

Log-linear analysis (data modelling) for high-dimensional data

===== Project moved to https://github.com/fpetitjean/Chordalysis ===== Log-linear analysis is the statistical method used to capture multi-way relationships between variables. However, due to its exponential nature, previous approaches did not allow scale-up to more than a dozen variables. We present here Chordalysis, a log-linear analysis method for big data. Chordalysis exploits recent discoveries in graph theory by representing complex models as compositions of triangular structures, also known as chordal graphs. Chordalysis makes it possible to discover the structure of datasets with thousands of variables on a standard desktop computer. Associated papers at ICDM 2013, ICDM 2014 and SDM 2015 can be found at http://www.francois-petitjean.com/Research/ YourKit is supporting Chordalysis open source project with its full-featured Java Profiler. YourKit is the creator of innovative and intelligent tools for profiling Java and .NET applications. http://www.yourkit.com
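For readers unfamiliar with chordal graphs, the following sketch tests whether an undirected graph is chordal by repeatedly removing a simplicial vertex (one whose neighbours are all connected to each other). It illustrates the graph-theoretic notion only and is not Chordalysis code; the function name `is_chordal` and the adjacency-dict representation are assumptions.

```python
from itertools import combinations
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours (symmetric)


def is_chordal(graph: Graph) -> bool:
    """Return True if the undirected graph has a perfect elimination ordering.

    A graph is chordal exactly when its vertices can be removed one by one
    such that each removed vertex's remaining neighbours form a clique.
    """
    adjacency = {v: set(neighbours) for v, neighbours in graph.items()}
    while adjacency:
        simplicial = next(
            (
                v
                for v, neighbours in adjacency.items()
                if all(b in adjacency[a] for a, b in combinations(neighbours, 2))
            ),
            None,
        )
        if simplicial is None:
            # No simplicial vertex left: a chordless cycle remains.
            return False
        for neighbour in adjacency.pop(simplicial):
            adjacency[neighbour].discard(simplicial)
    return True


square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
square_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordal(square), is_chordal(square_with_chord))
```

This greedy check is simple rather than fast; linear-time algorithms exist for large graphs.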

Downloads: 0 This Week

Last Update: 2015-01-29
11

Cube Platform

Cube Platform is a decentralized grid computing system that uses the Pastry P2P protocol for communication between nodes. It is a big data store written in Java.

Downloads: 0 This Week

Last Update: 2013-04-23
12

Custom Apache Big data Distribution

A Custom Apache Distribution including Spark and Hadoop, for Windows.

This distribution has been customized to work out of the box. Just download it, unzip it, and set the Path variables for the bin folders, HADOOP_HOME, SPARK_HOME, and JAVA_HOME. That's it: you can then use Hadoop and Spark natively on Windows.

Downloads: 0 This Week

Last Update: 2020-03-11
13

ElasticJob

Distributed scheduled job framework

ElasticJob is a distributed scheduling solution consisting of two separate projects, ElasticJob-Lite and ElasticJob-Cloud. ElasticJob-Lite is a lightweight, decentralized solution that provides distributed task sharding services. ElasticJob-Cloud uses Mesos to manage and isolate resources. Both projects share a unified job API, so developers only need to code once and can deploy at will. Features: job sharding and high availability in distributed systems; scale-out for better throughput and efficiency; job processing capacity that scales flexibly with the allocated resources; job execution at a suitable time and on assigned resources; aggregation of the same job onto the same job executor; dynamic appending of resources to newly assigned jobs. With ElasticJob, developers no longer need to worry about non-functional requirements such as job scale-out, so they can focus more on business code.
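Job sharding means splitting one job into numbered shard items and spreading them over the running instances. The sketch below shows one plausible even split, with leftover shards going to the first instances; it is an assumption for illustration and not necessarily the strategy ElasticJob ships. The function name `assign_shards` and the instance names are hypothetical.

```python
from typing import Dict, List


def assign_shards(instances: List[str], shard_count: int) -> Dict[str, List[int]]:
    """Split shard items 0..shard_count-1 into contiguous, near-equal ranges."""
    if not instances:
        raise ValueError("at least one instance is required")
    base, extra = divmod(shard_count, len(instances))
    assignment: Dict[str, List[int]] = {}
    next_shard = 0
    for index, instance in enumerate(instances):
        # The first `extra` instances take one additional shard each.
        size = base + (1 if index < extra else 0)
        assignment[instance] = list(range(next_shard, next_shard + size))
        next_shard += size
    return assignment


print(assign_shards(["node-a", "node-b", "node-c"], 10))
# When an instance leaves, re-running the split rebalances its shards.
print(assign_shards(["node-a", "node-b"], 10))
```

Recomputing the assignment whenever the instance list changes is what lets capacity follow the allocated resources.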

Downloads: 0 This Week

Last Update: 2023-10-14
14

Faum

Fast Autonomous Unsupervised Multidimensional Classification

This is the proof-of-concept implementation of the FAUM clustering method. This implementation was used to produce the published results and is now released in the hope that it will be useful.

Downloads: 0 This Week

Last Update: 2024-02-02
15

Flamingo Project

Workflow Designer, Hive Editor, Pig Editor, File System Browser

Flamingo is an open-source Big Data platform that combines an Ajax rich web interface, workflow engine, workflow designer, MapReduce, Hive editor, and Pig editor.
1. Easy tool for big data
2. Comfortable to use with Hadoop ecosystem projects
3. Licensed under GPL v3
Supports Pig IDE, Hive IDE, HDFS browser, scheduler, Hadoop job monitoring, workflow engine, workflow designer, and MapReduce.

3 Reviews

Downloads: 0 This Week

Last Update: 2016-11-29
16

Fluid

Fluid, elastic data abstraction and acceleration for BigData/AI apps

Fluid provides elastic data abstraction and acceleration for BigData/AI applications in the cloud. It offers a DataSet abstraction over underlying heterogeneous data sources, with multidimensional management in a cloud environment. It enables dataset warmup and acceleration for data-intensive applications by using a distributed cache in Kubernetes with observability, portability, and scalability. It takes the characteristics of applications and data into account when scheduling cloud applications and datasets, to improve performance.

Downloads: 0 This Week

Last Update: 2024-08-30
17

GOBIG

GOBIG is a toolbox for detecting genetic variations. The project is intended to handle big data and, more importantly, to detect clusters of SNP variants. The toolbox is intended for use with both common and rare variants, for example to find the genetic map of genes that cause complex diseases.

Downloads: 0 This Week

Last Update: 2015-09-10
18

GridDB

GridDB is a next-generation open source database

A cyber-physical system collects a variety of data in physical space (the real world), analyzes it and converts it into knowledge in cyberspace, and feeds that knowledge back to the real world to revitalize industry and solve social problems. GridDB is an open database that enables real-time processing of the vast amounts of time-series data generated in physical space, which is necessary to realize a cyber-physical system. Its multi-model architecture supports various data stores, with time-series-oriented and pluggable data stores for efficient real-time processing and management of huge amounts of high-frequency time-series data. Various architectural innovations, such as in-memory orientation with "memory as the main unit and disk as the secondary unit" and event-driven design with minimal overhead, have been incorporated to achieve processing capabilities that can handle petabyte-scale applications.

Downloads: 0 This Week

Last Update: 2024-05-30
19

HSRA

Hadoop spliced read aligner for RNA-seq data

HSRA is a MapReduce-based parallel tool for mapping reads from RNA sequencing (RNA-seq) experiments. RNA-seq analyses typically begin by mapping reads to a reference genome to determine the location from which the reads originated, which is a very time-consuming step. This tool allows bioinformatics researchers to efficiently distribute their mapping tasks over the nodes of a cluster by combining a fast multithreaded spliced aligner (HISAT2) with Apache Hadoop, a distributed computing framework for scalable Big Data processing. HSRA currently supports single-end and paired-end read alignments from FASTQ/FASTA datasets. Moreover, the tool uses the Hadoop Sequence Parser (HSP) library to efficiently read input datasets stored on the Hadoop Distributed File System (HDFS), and it can process datasets compressed with the Gzip and BZip2 codecs.

Downloads: 0 This Week

Last Update: 2019-01-23
20

JuiceFS

JuiceFS is a distributed POSIX file system built on top of Redis

A POSIX-, HDFS-, and S3-compatible distributed file system for the cloud. JuiceFS is designed to bring the good old experience of local-disk file systems to the cloud. JuiceFS is POSIX compliant and fully compatible with HDFS and S3. Building or migrating cloud apps and sharing files across regions and clouds become easier than ever before. Whether it's a public cloud, private cloud, or hybrid cloud, JuiceFS is available on any cloud of your choice and delivers flexibility, availability, scalability, and strong consistency for your data-intensive applications. Purpose-built to serve big data scenarios such as self-driving model training, recommendation engines, and next-generation gene sequencing, JuiceFS specializes in high performance and easier management of tens of billions of files. We bring JuiceFS to developers with the hope that it will be easy to use, reliable, high-performance, and solve all your file storage problems in a cloud environment.

Downloads: 0 This Week

Last Update: 2024-09-02
21

LEACrypt

TTAK.KO-12.0223 Lightweight Encryption Algorithm Tool

The Lightweight Encryption Algorithm (also known as LEA) is a 128-bit block cipher developed by South Korea in 2013 to provide confidentiality in high-speed environments such as big data and cloud computing, as well as lightweight environments such as IoT devices and mobile devices. LEA is one of the cryptographic algorithms approved by the Korean Cryptographic Module Validation Program (KCMVP) and is the national standard of the Republic of Korea (KS X 3246). LEA is included in the ISO/IEC 29192-2:2019 standard (Information security - Lightweight cryptography - Part 2: Block ciphers). This project is licensed under the ISC License. Copyright © 2020-2021 ALBANESE Research Lab. Source code: https://github.com/pedroalbanese/leacrypt Visit: http://albanese.atwebpages.com

Downloads: 0 This Week

Last Update: 2022-12-16
22

LogicalSets

Integrated Comprehensive Data Architecture & Methodology

This is an advanced data architecture and methodology: a comprehensive Enterprise Resource Management System built on a reusable database with rules for customization. While being a data-driven transaction processing engine, the system has very advanced reporting capabilities. The design eliminates up to 90% of business logic because of the way the data is structured. It uses a concept called Table Sets, with a compound key that tells the programmer which tableset, which record, and which applet will view/edit the data. Developed in SAP PowerDesigner for (Sybase) SQL Anywhere. Don't let the date fool you, this system is ahead of its time.

Downloads: 0 This Week

Last Update: 2021-12-06
23

MapReduce Brazil

Aggregates MapReduce projects

Nowadays the production and storage of Big Data is common in both academia and enterprises. Processing this huge amount of data requires high-performance platforms and programming models such as MapReduce.
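For readers new to the MapReduce programming model, its three steps (map, shuffle, reduce) can be sketched in a single process with a word count. This is an illustration in plain Python, not Hadoop code; the function names are chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big Data tools"]))
```

On a cluster, the map and reduce calls run in parallel on different nodes and the shuffle moves data between them; the program structure stays the same.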

Downloads: 0 This Week

Last Update: 2015-08-26
24

MarDRe

MapReduce-based tool to remove duplicate DNA reads

MarDRe is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads through the clustering of single-end and paired-end sequences from FASTQ/FASTA datasets. This tool allows bioinformaticians to avoid analyzing unnecessary reads, reducing the time of subsequent procedures with the dataset. MarDRe is the Big Data counterpart of ParDRe, which employs HPC technologies (i.e., hybrid MPI/multithreading) to reduce runtime on multicore systems. Instead, MarDRe takes advantage of the MapReduce programming model to significantly improve ParDRe performance on distributed systems, especially on cloud-based infrastructures. Written in pure Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for Big Data processing.

Downloads: 0 This Week

Last Update: 2019-01-23
25

Modin

Scale your Pandas workflows by changing a single line of code

Scale your pandas workflow by changing a single line of code. Modin uses Ray, Dask or Unidist to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. It is not necessary to know in advance the available hardware resources in order to use Modin. Additionally, it is not necessary to specify how to distribute or place data. Modin acts as a drop-in replacement for pandas, which means that you can continue using your previous pandas notebooks, unchanged, while experiencing a considerable speedup thanks to Modin, even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.

Downloads: 0 This Week

Last Update: 2024-09-11