Java Big Data Tools

Browse free open source Java Big Data Tools and projects below. Use the toggles on the left to filter open source Java Big Data Tools by OS, license, programming language, and project status.

  • 1
    MOA - Massive Online Analysis

    Big Data Stream Analytics Framework.

A framework for learning from a continuous supply of examples, i.e. a data stream. It includes classification, regression, clustering, outlier detection, and recommender systems. Related to the WEKA project and also written in Java, but designed to scale to adaptive, large-scale machine learning.
    Downloads: 89 This Week
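    As an illustration, a minimal test-then-train (prequential) loop with MOA's Java API might look like the sketch below. This is illustrative only: the class locations (moa.classifiers.trees.HoeffdingTree, moa.streams.generators.RandomRBFGenerator, com.yahoo.labs.samoa.instances.Instance) assume a recent MOA release and should be checked against the version in use.

    import com.yahoo.labs.samoa.instances.Instance;
    import moa.classifiers.trees.HoeffdingTree;
    import moa.streams.generators.RandomRBFGenerator;

    public class MoaSketch {
        public static void main(String[] args) {
            RandomRBFGenerator stream = new RandomRBFGenerator();
            stream.prepareForUse();                      // synthetic data stream

            HoeffdingTree learner = new HoeffdingTree();
            learner.setModelContext(stream.getHeader());
            learner.prepareForUse();

            int correct = 0, seen = 0;
            while (seen < 100_000 && stream.hasMoreInstances()) {
                Instance inst = stream.nextInstance().getData();
                if (learner.correctlyClassifies(inst)) { // test first ...
                    correct++;
                }
                learner.trainOnInstance(inst);           // ... then train
                seen++;
            }
            System.out.printf("Prequential accuracy: %.2f%%%n", 100.0 * correct / seen);
        }
    }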
  • 2
    Open Source Data Quality and Profiling

    World's first open source data quality & data preparation project

This project is dedicated to open source data quality and data preparation solutions. Data quality includes profiling, filtering, governance, similarity checks, data enrichment and alteration, real-time alerting, basket analysis, bubble-chart warehouse validation, single customer view, and more, as defined by strategy. The tool aims to be a high-performance integrated data management platform that seamlessly covers data integration, data profiling, data quality, data preparation, dummy data creation, metadata discovery, anomaly discovery, data cleansing, reporting, and analytics. It also has Hadoop (Big Data) support to move files to and from a Hadoop grid and to create, load, and profile Hive tables. This project is also known as "Aggregate Profiler". A RESTful API for this project is being built (beta version) at https://sourceforge.net/projects/restful-api-for-osdq/ and an Apache Spark based data quality module is being built at https://sourceforge.net/projects/apache-spark-osdq/
    Downloads: 64 This Week
  • 3
    Apache HBase

    Get random, realtime read/write access to your Big Data

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables, billions of rows by millions of columns, atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable ("Bigtable: A Distributed Storage System for Structured Data" by Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. It offers a Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options; support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX; and convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
    Downloads: 6 This Week
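    As an illustration, random read/write access through the standard HBase Java client might look like the sketch below; it assumes a table named "demo" with column family "cf" already exists and that hbase-site.xml is on the classpath (both are placeholders).

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("demo"))) {

                // Random write: one cell in row "row-1"
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
                table.put(put);

                // Random read: fetch the same cell back
                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"))));
            }
        }
    }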
  • 4
    Apache RocketMQ

    Distributed messaging and streaming platform with low latency

Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability. It supports messaging patterns including publish/subscribe, request/reply, and streaming; financial-grade transactional messages; and built-in fault tolerance and high-availability configuration options based on DLedger. It offers a variety of cross-language clients (Java, C/C++, Python, Go), pluggable transport protocols (TCP, SSL, AIO), built-in message tracing with OpenTracing support, and versatile big-data and streaming ecosystem integration. Other features include message retroactivity by time or offset, reliable FIFO and strictly ordered messaging in the same queue, efficient pull and push consumption models, million-level message accumulation capacity in a single queue, multiple messaging protocols such as JMS and OpenMessaging, a flexible distributed scale-out deployment architecture, and a lightning-fast batch message exchange system.
    Downloads: 1 This Week
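    As an illustration, sending a message with the RocketMQ Java producer client might look like the sketch below; the name-server address, topic, tag, and group name are placeholders.

    import org.apache.rocketmq.client.producer.DefaultMQProducer;
    import org.apache.rocketmq.client.producer.SendResult;
    import org.apache.rocketmq.common.message.Message;

    public class RocketMqProducerSketch {
        public static void main(String[] args) throws Exception {
            DefaultMQProducer producer = new DefaultMQProducer("demo-producer-group");
            producer.setNamesrvAddr("localhost:9876");   // placeholder name server
            producer.start();

            Message msg = new Message("DemoTopic", "TagA",
                    "Hello RocketMQ".getBytes(java.nio.charset.StandardCharsets.UTF_8));
            SendResult result = producer.send(msg);      // synchronous send
            System.out.println(result.getSendStatus());

            producer.shutdown();
        }
    }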
  • 5
    HugeGraph

A graph database that supports 100+ billion data elements

HugeGraph is a convenient, efficient, and adaptable graph database compatible with the Apache TinkerPop3 framework and the Gremlin query language. HugeGraph offers fast import performance for graphs with more than 10 billion vertices and edges, provides millisecond-level OLTP query capability, and can be integrated into big data platforms like Hadoop or Spark for OLAP analysis. Its main scenarios include correlation search, fraud detection, and knowledge graphs. It not only supports the Gremlin graph query language and a RESTful API but also provides APIs for commonly used graph algorithms. To help users easily implement various queries and analyses, HugeGraph comes with a full range of accessory tools and supports distributed storage, data replication, horizontal scaling, and many built-in storage-engine backends.
    Downloads: 1 This Week
  • 6
    MyCAT

    Active, high-performance open source database middleware

MyCAT is open-source software, "a large database cluster" oriented to enterprises. MyCAT is an enhanced database that can act as a replacement for MySQL and supports transactions and ACID. Regarded as an enterprise-grade MySQL cluster, MyCAT can take the place of an expensive Oracle cluster. MyCAT is also a new type of database, resembling a SQL Server integrated with in-memory cache technology, NoSQL technology, and HDFS big data. As a modern enterprise database product, MyCAT combines the traditional database with the new distributed data warehouse. In a word, MyCAT is a fresh new database middleware. MyCAT's objective is to smoothly migrate current stand-alone databases and applications to the cloud at low cost and to solve the bottleneck problems caused by rapid growth in data storage and business scale.
    Downloads: 1 This Week
  • 7
    geometry-api-java

    The Esri Geometry API for Java enables developers to write apps

The Esri Geometry API for Java can be used to enable spatial data processing in third-party data-processing solutions. Developers of custom MapReduce-based applications for Hadoop can use this API for spatial processing of data in the Hadoop system. The API is also used by the Hive UDFs and can be used by developers building geometry functions for third-party applications such as Cassandra, HBase, Storm, and many other Java-based "big data" applications.
    Downloads: 1 This Week
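    As an illustration, a simple point-in-polygon check with the Esri Geometry API for Java (com.esri.core.geometry) might look like the sketch below; WGS84 (WKID 4326) is assumed as the spatial reference and the coordinates are placeholders.

    import com.esri.core.geometry.GeometryEngine;
    import com.esri.core.geometry.Point;
    import com.esri.core.geometry.Polygon;
    import com.esri.core.geometry.SpatialReference;

    public class GeometrySketch {
        public static void main(String[] args) {
            SpatialReference wgs84 = SpatialReference.create(4326);

            // Unit square around the origin
            Polygon square = new Polygon();
            square.startPath(-1, -1);
            square.lineTo(-1, 1);
            square.lineTo(1, 1);
            square.lineTo(1, -1);

            Point p = new Point(0.5, 0.25);
            System.out.println("contains = " + GeometryEngine.contains(square, p, wgs84));
        }
    }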
  • 8
    BIRT Report Designer

    Open Source Reporting & Data Visualization Platform

    BIRT is an open source technology platform used to create data visualizations and reports that can be embedded into rich client and web applications. Developers who use BIRT Designer are able to access information from multiple data sources easily and quickly in order to create reports and applications with stunning data visualizations. Actuate now provides a free report server, BIRT iHub F-Type, to deploy BIRT content so developers don't have to build their own infrastructure. With a flexible Open Data Access framework, developers can write custom data drivers to access data from any source, including Big Data sources like Apache Hadoop, Cassandra, and MongoDB, along with all traditional relational databases, Flat Files, XML data streams, and data stored in proprietary systems. Built for embedding, BIRT includes APIs for data access, chart generation, output formats, content execution, and integration within larger applications.
    Downloads: 10 This Week
  • 9
    Universal Java Matrix Package

    sparse and dense matrix, linear algebra, visualization, big data

    The Universal Java Matrix Package (UJMP) is an open source Java library which provides sparse and dense matrix classes, as well as a large number of calculations for linear algebra such as matrix multiplication or matrix inverse. Operations such as mean, correlation, standard deviation, replacement of missing values or the calculation of mutual information are supported, too. The Universal Java Matrix Package provides various visualization methods, import and export filters for a large number of file formats, and even the possibility to link to JDBC databases. Multi-dimensional matrices as well as generic matrices with a specified object type are supported and very large matrices can be handled even when they do not fit into memory.
    Downloads: 12 This Week
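    As an illustration, basic dense-matrix operations with UJMP (org.ujmp.core.Matrix) might look like the sketch below, following the usage shown in the project's own examples; method names should be verified against the release in use.

    import org.ujmp.core.Matrix;

    public class UjmpSketch {
        public static void main(String[] args) {
            // 3x3 dense matrix with a few values on the diagonal
            Matrix dense = Matrix.Factory.zeros(3, 3);
            dense.setAsDouble(1.0, 0, 0);
            dense.setAsDouble(2.0, 1, 1);
            dense.setAsDouble(4.0, 2, 2);

            Matrix transpose = dense.transpose();    // basic linear algebra
            Matrix product = dense.mtimes(transpose);
            Matrix inverse = dense.inv();

            System.out.println(product);
            System.out.println(inverse);
        }
    }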
  • 10
    giServer

giServer, the easy-to-use and extensible batch and integration server

The giServer is an easy-to-use integration server for process automation and event-driven or scheduled execution of batch jobs. Instead of complex XML configuration files, an elaborate GUI for batch job management is included. Some possible usage scenarios are: automatic processing of incoming data files, Big Data applications, process automation, data mining/aggregation applications, automatic reporting, and processing and analysis of database records.
    Downloads: 6 This Week
  • 11

    X10

    Performance and Productivity at Scale

    X10 is a class-based, strongly-typed, garbage-collected, object-oriented language. To support concurrency and distribution, X10 uses the Asynchronous Partitioned Global Address Space programming model (APGAS). This model introduces two key concepts -- places and asynchronous tasks -- and a few mechanisms for coordination. With these, APGAS can express both regular and irregular parallelism, message-passing-style and active-message-style computations, fork-join and bulk-synchronous parallelism. Both its modern, type-safe sequential core and simple programming model for concurrency and distribution contribute to making X10 a high-productivity language in the HPC and Big Data spaces. User productivity is further enhanced by providing tools such as an Eclipse-based IDE (X10DT). Implementations of X10 are available for a wide variety of hardware and software platforms ranging from laptops, to commodity clusters, to supercomputers.
    Downloads: 5 This Week
  • 12

    Big Sack

    Big Sack: A lightweight Java Key/Value store with undo and disk cache.

Big Sack is a Java persistence mechanism that allows storage of key/value pairs following the popular Big Data paradigms. It's a very simple and straightforward way to bridge the gap between in-memory data structures and long-term storage. It has the convenience of the Java SDK TreeMap and TreeSet classes and is used the same easy way, but it includes rollback through undo logging to checkpoint data, so it does not wind up in an unknown state regardless of failures. Data storage in the exabyte range is possible using filesystem and/or memory-mapped IO. Three levels of configurable write-through caching at different granularities ensure performance.
    Downloads: 1 This Week
  • 13
    DSTK - DataScience ToolKit

    DSTK - DataScience ToolKit for All of Us

DSTK - DataScience ToolKit is open source, free software for statistical analysis, data visualization, text analysis, and predictive analytics. A newer version with a smaller file size can be found at https://sourceforge.net/projects/dstk3/ It is designed to be straightforward and easy to use, and familiar to SPSS users. While JASP offers more statistical features, DSTK aims to be a broad solution workbench, including text analysis and predictive analytics features. Of course you may use JASP for advanced data editing and RapidMiner for advanced prediction modeling. DSTK is written in C#, Java, and Python to interface with R, NLTK, and Weka. It can be expanded with plugins using R scripts. We have also created plugins for more statistical functions, and for Big Data analytics with Microsoft Azure HDInsight (Spark server) via Livy. License: R, RStudio, NLTK, SciPy, SKLearn, MatPlotLib, Weka, ... each has its own license.
    Downloads: 1 This Week
  • 14
    Apache Hudi

    Upserts, Deletes And Incremental Processing on Big Data

Apache Hudi (pronounced "hoodie") stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow, old-school batch data processing with a powerful new incremental processing framework for low-latency, minute-level analytics. Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) consistently to a file id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
    Downloads: 0 This Week
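    As an illustration, an upsert through Hudi's Spark datasource might look like the sketch below (Java Spark API); it assumes the hudi-spark bundle is on the classpath, and the input path, table path, and column names (uuid, region, ts) are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class HudiUpsertSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-upsert-sketch")
                    .master("local[*]")
                    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .getOrCreate();

            // Assumed input: records with columns uuid, region, ts, payload
            Dataset<Row> updates = spark.read().json("/tmp/input/updates.json");

            updates.write().format("hudi")
                    .option("hoodie.table.name", "demo_table")
                    .option("hoodie.datasource.write.recordkey.field", "uuid")        // record key
                    .option("hoodie.datasource.write.partitionpath.field", "region")  // partition path
                    .option("hoodie.datasource.write.precombine.field", "ts")         // keep latest version on collision
                    .option("hoodie.datasource.write.operation", "upsert")
                    .mode(SaveMode.Append)
                    .save("/tmp/hudi/demo_table");
        }
    }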
  • 15
    Apache InLong

    Apache InLong - a one-stop integration framework for massive data

Apache InLong is a one-stop integration framework for massive data that provides automatic, secure, and reliable data transmission capabilities. InLong supports both batch and stream data processing at the same time, which offers great power to build data analysis, modeling, and other real-time applications based on streaming data. InLong (应龙) is a divine beast in Chinese mythology who guides the river into the sea, and it is regarded as a metaphor for the InLong system reporting data streams. InLong was originally built at Tencent, where it has served online businesses for more than 8 years, to support massive data reporting services (more than 80 trillion pieces of data per day) in big data scenarios. The entire platform integrates five modules: Ingestion, Convergence, Caching, Sorting, and Management, so that the business only needs to provide data sources, data service quality, data landing clusters, and data landing formats.
    Downloads: 0 This Week
  • 16

    Chordalysis

    Log-linear analysis (data modelling) for high-dimensional data

===== Project moved to https://github.com/fpetitjean/Chordalysis ===== Log-linear analysis is the statistical method used to capture multi-way relationships between variables. However, due to its exponential nature, previous approaches did not scale beyond about a dozen variables. We present here Chordalysis, a log-linear analysis method for big data. Chordalysis exploits recent discoveries in graph theory by representing complex models as compositions of triangular structures, also known as chordal graphs. Chordalysis makes it possible to discover the structure of datasets with thousands of variables on a standard desktop computer. Associated papers at ICDM 2013, ICDM 2014, and SDM 2015 can be found at http://www.francois-petitjean.com/Research/ YourKit supports the Chordalysis open source project with its full-featured Java Profiler. YourKit is the creator of innovative and intelligent tools for profiling Java and .NET applications. http://www.yourkit.com
    Downloads: 0 This Week
  • 17
Cube Platform is a decentralized grid computing system that uses the P2P Pastry protocol for communication between nodes. It is a big data store written in Java.
    Downloads: 0 This Week
  • 18
    ElasticJob

    Distributed scheduled job framework

ElasticJob is a distributed scheduling solution consisting of two separate projects, ElasticJob-Lite and ElasticJob-Cloud. ElasticJob-Lite is a lightweight, decentralized solution that provides distributed task sharding services; ElasticJob-Cloud uses Mesos to manage and isolate resources. Both projects use a unified job API, so developers only need to code once and can deploy at will. ElasticJob supports job sharding and high availability in distributed systems and scales out for throughput and efficiency improvements: job processing capacity is flexible and scalable with the allocation of resources. Jobs execute at suitable times on assigned resources, the same job is aggregated to the same job executor, and resources are appended to newly assigned jobs dynamically. Using ElasticJob, developers no longer need to worry about non-functional requirements such as job scale-out, so they can focus more on business coding.
    Downloads: 0 This Week
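    As an illustration, a sharded SimpleJob scheduled with ElasticJob-Lite might look like the sketch below, following the ElasticJob 3.x API; the ZooKeeper address, namespace, job name, and cron expression are placeholders.

    import org.apache.shardingsphere.elasticjob.api.JobConfiguration;
    import org.apache.shardingsphere.elasticjob.api.ShardingContext;
    import org.apache.shardingsphere.elasticjob.lite.api.bootstrap.impl.ScheduleJobBootstrap;
    import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperConfiguration;
    import org.apache.shardingsphere.elasticjob.reg.zookeeper.ZookeeperRegistryCenter;
    import org.apache.shardingsphere.elasticjob.simple.job.SimpleJob;

    public class ElasticJobSketch {
        // Each shard (0..2) processes its own slice of the work.
        public static class DemoJob implements SimpleJob {
            @Override
            public void execute(ShardingContext context) {
                System.out.println("shard " + context.getShardingItem()
                        + " of " + context.getShardingTotalCount());
            }
        }

        public static void main(String[] args) {
            ZookeeperRegistryCenter regCenter = new ZookeeperRegistryCenter(
                    new ZookeeperConfiguration("localhost:2181", "elasticjob-demo"));
            regCenter.init();

            JobConfiguration jobConfig = JobConfiguration.newBuilder("demoJob", 3)
                    .cron("0/30 * * * * ?")   // run every 30 seconds
                    .build();

            new ScheduleJobBootstrap(regCenter, new DemoJob(), jobConfig).schedule();
        }
    }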
  • 19
    Flamingo Project

    Workflow Designer, Hive Editor, Pig Editor, File System Browser

Flamingo is an open-source Big Data platform that combines an Ajax rich web interface, a workflow engine, a workflow designer, MapReduce, a Hive editor, and a Pig editor. 1. An easy tool for big data. 2. Comfortable to use with Hadoop ecosystem projects. 3. Based on the GPL v3 license. It supports a Pig IDE, Hive IDE, HDFS browser, scheduler, Hadoop job monitoring, workflow engine, workflow designer, and MapReduce.
    Downloads: 0 This Week
  • 20
    Genie

    Distributed Big Data Orchestration Service

Genie is a completely open source distributed job orchestration engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop, and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.
    Downloads: 0 This Week
  • 21

    HSRA

    Hadoop spliced read aligner for RNA-seq data

HSRA is a MapReduce-based parallel tool for mapping reads from RNA sequencing (RNA-seq) experiments. RNA-seq analyses typically begin by mapping reads to a reference genome in order to determine the location from which the reads originated, which is a very time-consuming step. This tool allows bioinformatics researchers to efficiently distribute their mapping tasks over the nodes of a cluster by combining a fast multithreaded spliced aligner (HISAT2) with Apache Hadoop, a distributed computing framework for scalable Big Data processing. HSRA currently supports single-end and paired-end read alignments from FASTQ/FASTA datasets. Moreover, our tool uses the Hadoop Sequence Parser (HSP) library (link above) to efficiently read input datasets stored on the Hadoop Distributed File System (HDFS), and is able to process datasets compressed with the Gzip and BZip2 codecs.
    Downloads: 0 This Week
  • 22
The idea behind this project is to create and provide a library of big data structures, such as trees, lists, and others.
    Downloads: 0 This Week
  • 23

    MarDRe

    MapReduce-based tool to remove duplicate DNA reads

MarDRe is a de novo MapReduce-based parallel tool to remove duplicate and near-duplicate DNA reads through the clustering of single-end and paired-end sequences from FASTQ/FASTA datasets. This tool allows bioinformaticians to avoid analyzing unnecessary reads, reducing the time of subsequent procedures on the dataset. MarDRe is the Big Data counterpart of ParDRe (link above), which employs HPC technologies (i.e., hybrid MPI/multithreading) to reduce runtime on multicore systems. Instead, MarDRe takes advantage of the MapReduce programming model to significantly improve ParDRe's performance on distributed systems, especially on cloud-based infrastructures. Written in pure Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for Big Data processing.
    Downloads: 0 This Week
  • 24
    ODD Platform

    First open-source data discovery and observability platform

Unlock the power of big data with the OpenDataDiscovery Platform. Experience seamless end-to-end insights, powered by unprecedented observability and trust, from ingestion to production, while building your ideal tech stack. Democratize data and accelerate insights. Find data that fits your use case and discover hints left by your peers to leverage existing knowledge. Explore tags, ownership details, links to other sources, and other information to shorten and simplify the data discovery phase. Forget unnerved stakeholders and wasting time digging for the root cause when data fails: with ODD's automatic company-wide ingestion-to-product lineage you'll have answers in seconds and stakeholders won't need to wait. Sleep well, knowing all your data is in check. Forget manual testing, days of debugging, and weeks of worrying; know the impact of each code change with automatic testing. Enjoy lineage and alerts powered by data quality information.
    Downloads: 0 This Week
  • 25
    Vespa

    The open big data serving engine

Make AI-driven decisions using your data, in real time, at any scale, with unbeatable performance. Vespa is a full-featured text search engine and supports both regular text search and fast approximate vector search (ANN). This makes it easy to create high-performing search applications at any scale, whether you want to use traditional techniques or a modern vector-based approach. You can even combine both approaches efficiently in the same query, something no other engine can do. Recommendation, personalization, and targeting involve evaluating recommender models over content items to select the best ones. Vespa lets you build applications which do this online, typically combining fast vector search and filtering with evaluation of machine-learned models over the items. This makes it possible to make recommendations specifically for each user or situation, using completely up-to-date information.
    Downloads: 0 This Week