spark free download - SourceForge

Showing 172 open source projects for "spark"

1

Apache Spark

A unified analytics engine for large-scale data processing

...With Spark Streaming (microbatches) and Structured Streaming, it delivers low-latency event processing suitable for real-time analytics. The built-in MLlib library provides scalable machine learning algorithms, while GraphX enables graph computations integrated with data pipelines. Spark supports multiple languages—Scala, Java, Python, R—and connects with many storage systems like HDFS, S3, Cassandra, and streaming platforms like Kafka, making it a versatile choice for big data workloads in analytics, ETL, and data science.

Downloads: 15 This Week

Last Update: 2026-07-11
2

Spark Joy

2000+ ways to add design flair, user delight, and whimsy to products

Spark Joy is a large curated collection of resources for adding visual polish, personality, and enjoyable interactions to digital products. It gathers more than 2,000 tools, examples, libraries, articles, and references rather than providing a single software package. The collection spans CSS frameworks, layouts, typography, color systems, backgrounds, icons, diagrams, illustrations, and individual interface elements.

Downloads: 1 This Week

Last Update: 3 days ago
3

Spark NLP

State of the Art Natural Language Processing

Spark NLP is an open source library that delivers scalable large language models (LLMs) for Natural Language Processing (NLP). The full code base is open under the Apache 2.0 license, including pre-trained models and pipelines. The project describes itself as the only NLP library built natively on Apache Spark and the most widely used NLP library in the enterprise. Spark ML provides a set of machine learning applications that can be built using two main components, estimators and transformers. ...

Downloads: 1 This Week

Last Update: 2026-06-24
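
The estimator and transformer components mentioned in the description above can be illustrated with a minimal plain-Python sketch. This is not the Spark ML or Spark NLP API: the class names `Lowercase`, `VocabularyEstimator`, and `VocabularyModel` are hypothetical and exist only to show the pattern, in which a transformer maps data to data and an estimator is fitted on data to produce a transformer.

```python
from typing import Dict, List


class Lowercase:
    """Transformer (hypothetical): maps documents to documents, no fitting needed."""

    def transform(self, docs: List[str]) -> List[str]:
        return [doc.lower() for doc in docs]


class VocabularyModel:
    """Transformer produced by fitting: encodes tokens as integer ids."""

    def __init__(self, index: Dict[str, int]) -> None:
        self.index = index

    def transform(self, docs: List[str]) -> List[List[int]]:
        # Tokens that were not seen during fitting are dropped.
        return [[self.index[tok] for tok in doc.split() if tok in self.index]
                for doc in docs]


class VocabularyEstimator:
    """Estimator (hypothetical): learns a vocabulary and returns a fitted model."""

    def fit(self, docs: List[str]) -> VocabularyModel:
        vocab = sorted({tok for doc in docs for tok in doc.split()})
        return VocabularyModel({tok: i for i, tok in enumerate(vocab)})


# Fit on training documents, then reuse the fitted model on new text.
train = Lowercase().transform(["Spark NLP", "spark ML"])
model = VocabularyEstimator().fit(train)
encoded = model.transform(Lowercase().transform(["ML Spark rocks"]))
```

Here `fit` is the only step that learns from data; everything else, including the fitted model, is a plain `transform` call, which is what makes such components easy to chain.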
4

Spark TTS

Spark-TTS Inference Code

Spark TTS is an open-source, PyTorch-based text-to-speech inference system that leverages large language models to produce highly natural, intelligible speech from text input. It uses an efficient single-stream architecture where speech tokens are directly reconstructed from the predictions of an LLM, removing the need for external acoustic models or complex vocoders and making the generation pipeline cleaner and faster.

Downloads: 0 This Week

Last Update: 2026-02-04
5

SageMaker Spark Container

Docker image used to run data processing workloads

...The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.

Downloads: 0 This Week

Last Update: 2026-06-10
6

.NET for Apache Spark

A free, open-source, and cross-platform big data analytics framework

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations.

Downloads: 0 This Week

Last Update: 2026-02-13
7

sparkmagic

Jupyter magics and kernels for working with remote Spark clusters

Sparkmagic is a set of tools for interactively working with remote Spark clusters in Jupyter notebooks. Sparkmagic interacts with remote Spark clusters through a REST server. It offers automatic visualization of SQL queries in the PySpark, Spark and SparkR kernels, with an easy visual interface for constructing visualizations interactively, no code required. It can also capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib). ...

Downloads: 2 This Week

Last Update: 2025-05-28
8

Apache Kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway

Apache Kyuubi™ is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through a Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines. This "out-of-the-box" model minimizes the barriers and costs for end-users to use Spark on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.

Downloads: 3 This Week

Last Update: 2026-03-26
9

Synapse Machine Learning

Simple and distributed Machine Learning

...These tools enable powerful and highly-scalable predictive and analytical models for a variety of data sources. SynapseML also brings new networking capabilities to the Spark Ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. For production-grade deployment, the Spark Serving project enables high throughput, sub-millisecond latency web services, backed by your Spark cluster.

Downloads: 1 This Week

Last Update: 2026-04-04
10

sparklyr

R interface for Apache Spark

sparklyr is an R package that provides seamless interfacing with Apache Spark clusters—either local or remote—while letting users write code in familiar R paradigms. It supplies a dplyr-compatible backend, Spark machine learning pipelines, SQL integration, and I/O utilities to manipulate and analyze large datasets distributed across cluster environments.

Downloads: 1 This Week

Last Update: 2026-06-19
11

fugue

A unified interface for distributed computing

Fugue is a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.

Downloads: 0 This Week

Last Update: 2026-02-20
12

Deequ

Deequ is a library built on top of Apache Spark

Deequ is a library built atop Apache Spark that enables defining “unit tests for data” — that is, formal constraints or checks on datasets to ensure data quality along dimensions such as completeness, uniqueness, value ranges, correlations, etc. It can scale to large datasets (billions of rows) by translating those data checks into Spark jobs. Deequ supports advanced features like a metrics repository for storing computed statistics over time, anomaly detection of data quality metrics, and the suggestion of likely constraints automatically for new datasets. ...

Downloads: 2 This Week

Last Update: 2026-06-18
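
The idea of "unit tests for data" from the Deequ description can be sketched in plain Python. This is only an illustration of the concept, not Deequ's API (Deequ runs such checks as Spark jobs): the helpers `is_complete`, `is_unique`, `in_range`, and `run_checks` are hypothetical names, and the sample rows are made up for the example.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


def is_complete(rows: List[Row], column: str) -> bool:
    """Completeness: no row has a missing (None) value in the column."""
    return all(row.get(column) is not None for row in rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """Uniqueness: no non-null value appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


def in_range(rows: List[Row], column: str, low: float, high: float) -> bool:
    """Value range: every non-null value lies within [low, high]."""
    return all(low <= row[column] <= high
               for row in rows if row.get(column) is not None)


def run_checks(rows: List[Row],
               checks: Dict[str, Callable[[List[Row]], bool]]) -> Dict[str, bool]:
    """Evaluate each named constraint and report pass (True) or fail (False)."""
    return {name: check(rows) for name, check in checks.items()}


# Made-up sample data with a duplicate id, a missing price, and an outlier.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": None},
    {"id": 2, "price": 120.0},
]
results = run_checks(rows, {
    "id is complete": lambda rs: is_complete(rs, "id"),
    "id is unique": lambda rs: is_unique(rs, "id"),
    "price is complete": lambda rs: is_complete(rs, "price"),
    "price in [0, 100]": lambda rs: in_range(rs, "price", 0.0, 100.0),
})
```

With this sample, only "id is complete" passes; the other three constraints fail, which is the kind of report a data quality check is meant to surface before the data is used downstream.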
13

Cassandra Spark Connector

Apache Spark to Apache Cassandra connector

The Apache Cassandra Spark Connector allows Spark jobs (RDDs or DataFrames/Datasets) to read from and write to Cassandra tables. Compatible with Apache Cassandra (v2.1+), Spark 1.0–3.5, and Scala 2.11–2.13, it supports mapping Cassandra rows to Scala case classes, saving results back to Cassandra, and executing arbitrary CQL within Spark applications.

Downloads: 0 This Week

Last Update: 2025-08-04
14

Sail

A drop-in Apache Spark replacement written in Rust

...It is built entirely in Rust, eliminating JVM overhead and enabling predictable performance, fast startup times, and improved memory safety compared to traditional big data frameworks. Sail is compatible with the Spark Connect protocol, which means existing Spark SQL and DataFrame workloads can run without code changes, making adoption seamless for teams already using Spark-based pipelines. The framework is designed to operate across a variety of environments, including local machines, Kubernetes clusters, and cloud deployments, allowing flexible scaling based on workload requirements. ...

Downloads: 0 This Week

Last Update: 2026-07-07
15

Alire

Command-line tool from the Alire project and supporting library

Alire is a source-based package manager for the Ada and SPARK programming languages. It facilitates the building and sharing of projects within the Ada community, allowing developers to easily manage dependencies and publish their own libraries or programs. Alire aims to streamline the development process for Ada and SPARK by providing a standardized approach to package management.

Downloads: 5 This Week

Last Update: 2026-05-25
16

almond

A Scala kernel for Jupyter

...Call them from notebooks… or from your own libraries. Several plotting libraries are already available to plot things from notebooks, such as plotly-scala or Vegas. Load the Spark version of your choice, create a Spark session, and start using it from your notebooks.

Downloads: 3 This Week

Last Update: 2026-02-03
17

Apache Sedona

Cluster computing framework for processing large-scale geospatial data

Apache Sedona™ is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines. According to our benchmark and third-party research papers, Sedona runs 2X - 10X faster than other Spark-based geospatial data systems on computation-intensive query workloads. ...

Downloads: 0 This Week

Last Update: 2026-04-21
18

LakeSoul

An end-to-end, realtime and cloud native Lakehouse framework

LakeSoul is a high-performance, unified table storage framework for big data lakes, supporting both streaming and batch data in a single format. Built on top of Apache Spark and leveraging Apache Arrow and Parquet, LakeSoul provides ACID transactions, schema evolution, and time travel. It is designed for large-scale data lake architectures that require consistency, efficiency, and easy integration with modern data stacks.

Downloads: 1 This Week

Last Update: 2025-09-26
19

XGBoost

Scalable and Flexible Gradient Boosting

...It also offers parallel tree boosting (GBDT, GBRT or GBM) that can quickly and accurately solve many data science problems. XGBoost can be used for Python, Java, Scala, R, C++ and more. It can run on a single machine, Hadoop, Spark, Dask, Flink and most other distributed environments, and is capable of solving problems beyond billions of examples.

Downloads: 9 This Week

Last Update: 2026-06-18
20

Zingg

Scalable master data management and identity resolution

...The project is designed for data engineering and analytics teams working on customer 360, supplier 360, deduplication, fuzzy matching, data quality, and golden record workflows. Zingg runs on Apache Spark and can scale to large data lake, warehouse, and cloud platform environments. It supports configuration-driven pipelines where users define input data, match fields, training data, models, and output destinations. Its main value is helping organizations unify fragmented records into reliable entity clusters while keeping the process trainable, explainable, and repeatable.

Downloads: 6 This Week

Last Update: 2026-05-22
21

Daft

Distributed DataFrame for Python designed for the cloud

...Underneath its Python API, Daft is implemented in Rust. Rust powers Daft's vectorized execution and async I/O, which the project says allows Daft to outperform frameworks such as Spark.

Downloads: 1 This Week

Last Update: 1 day ago
22

Population Shift Monitoring

Monitor the stability of a Pandas or Spark dataframe

popmon is a package for checking the stability of a dataset. It works with both pandas and Spark datasets. popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc., using monitoring business rules. ...

Downloads: 4 This Week

Last Update: 2026-01-09
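
The approach described for popmon (histograms per time-slice, compared against a reference) can be sketched in plain Python. This is not popmon's API, and it swaps in total variation distance with a fixed threshold in place of the statistical tests and business rules popmon itself uses. The function names, bin width, threshold, and sample records are all assumptions made for the illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def histogram(values: List[float], bin_width: float) -> Dict[int, float]:
    """Normalized histogram: bin index -> fraction of values in that bin."""
    if not values:
        return {}
    counts = Counter(int(v // bin_width) for v in values)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def total_variation(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Total variation distance between two normalized histograms (0 to 1)."""
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


def flag_shifts(records: List[Tuple[str, float]], reference_slice: str,
                bin_width: float, threshold: float) -> Dict[str, bool]:
    """Flag each time-slice whose histogram is far from the reference slice."""
    by_slice: Dict[str, List[float]] = defaultdict(list)
    for time_slice, value in records:
        by_slice[time_slice].append(value)
    reference = histogram(by_slice[reference_slice], bin_width)
    return {s: total_variation(histogram(vals, bin_width), reference) > threshold
            for s, vals in sorted(by_slice.items())}


# Made-up (time-slice, value) records; the March values drift upward.
records = [
    ("2024-01", 1.0), ("2024-01", 2.0), ("2024-01", 11.0), ("2024-01", 12.0),
    ("2024-02", 3.0), ("2024-02", 4.0), ("2024-02", 13.0), ("2024-02", 14.0),
    ("2024-03", 21.0), ("2024-03", 22.0), ("2024-03", 23.0), ("2024-03", 12.0),
]
flags = flag_shifts(records, "2024-01", bin_width=10.0, threshold=0.3)
```

In this sample, January and February fall into the same bins in the same proportions, so only the March slice is flagged as shifted relative to the January reference.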
23

Apache Iceberg

Apache Iceberg

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables, at the same time. The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on adding row-level deletes and upserts, and integration work with new engines like Flink and Hive. The Iceberg format specification is being actively updated and is open for comment. ...

Downloads: 2 This Week

Last Update: 2026-05-19
24

Bytewax

Python Stream Processing

Bytewax is a Python framework and Rust distributed processing engine that simplifies event and stream processing. It uses a dataflow computational model to provide parallelizable stream and event processing capabilities similar to Flink, Spark, and Kafka Streams, behind the familiar interface of Python, so you can reuse the Python libraries you already know. Connect data sources, run stateful transformations, and write to various downstream systems with built-in connectors or existing Python libraries. ...

Downloads: 1 This Week

Last Update: 2024-11-25
25

Volcano

A Cloud Native Batch System (Project under CNCF)

...It provides a suite of mechanisms that are commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, Ray, PyTorch, MPI, etc., which Volcano integrates with. Volcano builds upon a decade and a half of experience running a wide variety of high-performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open-source community. As of June 2021, Volcano was in wide use around the world across industries such as Internet, Cloud, Finance, Manufacturing, and Medical. ...

Downloads: 122 This Week

Last Update: 2026-06-27