Showing 170 open source projects for "spark"

  • 1
    Apache Spark

    A unified analytics engine for large-scale data processing

    ...With Spark Streaming (microbatches) and Structured Streaming, it delivers low-latency event processing suitable for real-time analytics. The built-in MLlib library provides scalable machine learning algorithms, while GraphX enables graph computations integrated with data pipelines. Spark supports multiple languages—Scala, Java, Python, R—and connects with many storage systems like HDFS, S3, Cassandra, and streaming platforms like Kafka, making it a versatile choice for big data workloads in analytics, ETL, and data science.
    Downloads: 10 This Week
  • 2
    Spark TTS

    Spark-TTS Inference Code

    Spark TTS is an open-source, PyTorch-based text-to-speech inference system that leverages large language models to produce highly natural, intelligible speech from text input. It uses an efficient single-stream architecture where speech tokens are directly reconstructed from the predictions of an LLM, removing the need for external acoustic models or complex vocoders and making the generation pipeline cleaner and faster.
    Downloads: 3 This Week
  • 3
    Spark NLP

    State of the Art Natural Language Processing

    Experience the power of large language models like never before, unleashing the full potential of Natural Language Processing (NLP) with Spark NLP, the open source library that delivers scalable LLMs. The full code base is open under the Apache 2.0 license, including pre-trained models and pipelines. The only NLP library built natively on Apache Spark. The most widely used NLP library in the enterprise. Spark ML provides a set of machine learning applications that can be built using two main components, estimators and transformers. ...
    Downloads: 1 This Week
  • 4
    Cassandra Spark Connector

    Apache Spark to Apache Cassandra connector

    The Apache Cassandra Spark Connector allows Spark jobs (RDDs or DataFrames/Datasets) to read from and write to Cassandra tables. Compatible with Apache Cassandra (v2.1+), Spark 1.0–3.5, and Scala 2.11–2.13, it supports mapping Cassandra rows to Scala case classes, saving results back to Cassandra, and executing arbitrary CQL within Spark applications.
    Downloads: 0 This Week
  • 5
    SageMaker Spark Container

    Docker image used to run data processing workloads

    ...The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.
    Downloads: 0 This Week
  • 6
    .NET for Apache Spark

    A free, open-source, and cross-platform big data analytics framework

    .NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL features of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations.
    Downloads: 0 This Week
  • 7
    Apache Kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway

    Apache Kyuubi™ is a distributed and multi-tenant gateway that provides serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through a Thrift JDBC/ODBC interface so that end users can manipulate large-scale data with pre-programmed and extensible Spark SQL engines. This "out-of-the-box" model minimizes the barriers and costs for end users to use Spark on the client side. On the server side, the multi-tenant architecture of the Kyuubi server and engines gives administrators a way to achieve computing resource isolation, data security, high availability, high client concurrency, etc.
    Downloads: 6 This Week
  • 8
    sparkmagic

    Jupyter magics and kernels for working with remote Spark clusters

    Sparkmagic is a set of tools for interactively working with remote Spark clusters in Jupyter notebooks. Sparkmagic interacts with remote Spark clusters through a REST server. Features include automatic visualization of SQL queries in the PySpark, Spark, and SparkR kernels, with an easy visual interface for constructing visualizations interactively and no code required, and the ability to capture the output of SQL queries as pandas DataFrames to interact with other Python libraries (e.g. matplotlib). ...
    Downloads: 0 This Week
  • 9
    fugue

    A unified interface for distributed computing

    Fugue is a unified interface for distributed computing that lets users execute Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.
    Downloads: 0 This Week
  • 10
    Synapse Machine Learning

    Simple and distributed Machine Learning

    ...These tools enable powerful and highly-scalable predictive and analytical models for a variety of data sources. SynapseML also brings new networking capabilities to the Spark Ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models. For production-grade deployment, the Spark Serving project enables high throughput, sub-millisecond latency web services, backed by your Spark cluster.
    Downloads: 0 This Week
  • 11
    sparklyr

    R interface for Apache Spark

    sparklyr is an R package that provides seamless interfacing with Apache Spark clusters—either local or remote—while letting users write code in familiar R paradigms. It supplies a dplyr-compatible backend, Spark machine learning pipelines, SQL integration, and I/O utilities to manipulate and analyze large datasets distributed across cluster environments.
    Downloads: 0 This Week
  • 12
    Sail

    A drop-in Apache Spark replacement written in Rust

    ...It is built entirely in Rust, eliminating JVM overhead and enabling predictable performance, fast startup times, and improved memory safety compared to traditional big data frameworks. Sail is compatible with the Spark Connect protocol, which means existing Spark SQL and DataFrame workloads can run without code changes, making adoption seamless for teams already using Spark-based pipelines. The framework is designed to operate across a variety of environments, including local machines, Kubernetes clusters, and cloud deployments, allowing flexible scaling based on workload requirements. ...
    Downloads: 1 This Week
  • 13
    Deequ

    Deequ is a library built on top of Apache Spark

    Deequ is a library built atop Apache Spark that enables defining “unit tests for data” — that is, formal constraints or checks on datasets to ensure data quality along dimensions such as completeness, uniqueness, value ranges, correlations, etc. It can scale to large datasets (billions of rows) by translating those data checks into Spark jobs. Deequ supports advanced features like a metrics repository for storing computed statistics over time, anomaly detection of data quality metrics, and the suggestion of likely constraints automatically for new datasets. ...
    Downloads: 0 This Week
  • 14
    Alire

    Command-line tool from the Alire project and supporting library

    Alire is a source-based package manager for the Ada and SPARK programming languages. It facilitates the building and sharing of projects within the Ada community, allowing developers to easily manage dependencies and publish their own libraries or programs. Alire aims to streamline the development process for Ada and SPARK by providing a standardized approach to package management.
    Downloads: 2 This Week
  • 15
    SageMaker Spark

    A Spark library for Amazon SageMaker

    SageMaker Spark depends on hadoop-aws-2.8.1. To run Spark applications that depend on SageMaker Spark, you need to build Spark with Hadoop 2.8. However, if you are running Spark applications on EMR, you can use Spark built with Hadoop 2.7.
    Downloads: 0 This Week
  • 16
    almond

    A Scala kernel for Jupyter

    ...Call them from notebooks… or from your own libraries. Several plotting libraries are already available to plot things from notebooks, such as plotly-scala or Vegas. Load the Spark version of your choice, create a Spark session, and start using it from your notebooks.
    Downloads: 0 This Week
  • 17
    Apache Sedona

    Cluster computing framework for processing large-scale geospatial data

    Apache Sedona™ is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines. According to our benchmark and third-party research papers, Sedona runs 2X - 10X faster than other Spark-based geospatial data systems on computation-intensive query workloads. ...
    Downloads: 0 This Week
  • 18
    Daft

    Distributed DataFrame for Python designed for the cloud

    ...Underneath its Python API, Daft is built in blazing fast Rust code. Rust powers Daft’s vectorized execution and async I/O, allowing Daft to outperform frameworks such as Spark.
    Downloads: 4 This Week
  • 19
    LakeSoul

    An end-to-end, realtime and cloud native Lakehouse framework

    LakeSoul is a high-performance, unified table storage framework for big data lakes, supporting both streaming and batch data in a single format. Built on top of Apache Spark and leveraging Apache Arrow and Parquet, LakeSoul provides ACID transactions, schema evolution, and time travel. It is designed for large-scale data lake architectures that require consistency, efficiency, and easy integration with modern data stacks.
    Downloads: 0 This Week
  • 20
    XGBoost

    Scalable and Flexible Gradient Boosting

    ...It also offers parallel tree boosting (GBDT, GBRT or GBM) that can quickly and accurately solve many data science problems. XGBoost can be used for Python, Java, Scala, R, C++ and more. It can run on a single machine, Hadoop, Spark, Dask, Flink and most other distributed environments, and is capable of solving problems beyond billions of examples.
    Downloads: 4 This Week
  • 21
    Volcano

    A Cloud Native Batch System (Project under CNCF)

    ...It provides a suite of mechanisms that are commonly required by many classes of batch and elastic workloads, including machine learning/deep learning, bioinformatics/genomics, and other "big data" applications. These types of applications typically run on generalized domain frameworks like TensorFlow, Spark, Ray, PyTorch, and MPI, which Volcano integrates with. Volcano builds upon a decade and a half of experience running a wide variety of high-performance workloads at scale using several systems and platforms, combined with best-of-breed ideas and practices from the open-source community. As of June 2021, Volcano has been widely used around the world in a variety of industries, such as Internet, Cloud, Finance, Manufacturing, and Medical. ...
    Downloads: 275 This Week
  • 22
    Bytewax

    Python Stream Processing

    Bytewax is a Python framework that simplifies event and stream processing. Because Bytewax couples the stream and event processing capabilities of Flink, Spark, and Kafka Streams with the friendly and familiar interface of Python, you can re-use the Python libraries you already know and love. Connect data sources, run stateful transformations, and write to various downstream systems with built-in connectors or existing Python libraries. Bytewax is a Python framework and Rust distributed processing engine that uses a dataflow computational model to provide parallelizable stream processing and event processing capabilities similar to Flink, Spark, and Kafka Streams. ...
    Downloads: 0 This Week
  • 23
    Apache Iceberg

    Apache Iceberg

    Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala to safely work with the same tables, at the same time. The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on adding row-level deletes and upserts, and integration work with new engines like Flink and Hive. The Iceberg format specification is being actively updated and is open for comment. ...
    Downloads: 0 This Week
  • 24
    Population Shift Monitoring

    Monitor the stability of a Pandas or Spark dataframe

    popmon is a package for checking the stability of a dataset. popmon works with both pandas and Spark datasets. popmon creates histograms of features binned in time slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, and categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. Using monitoring business rules, popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, and changing correlations. ...
    Downloads: 0 This Week
  • 25
    IoTDB

    Apache IoTDB

    Apache IoTDB (Database for Internet of Things) is an IoT-native database with high performance for data management and analysis, deployable on the edge and in the cloud. Thanks to its lightweight architecture, high performance, and rich feature set, together with its deep integration with Apache Hadoop, Spark, and Flink, Apache IoTDB can meet the requirements of massive data storage, high-speed data ingestion, and complex data analysis in industrial IoT. In a factory setting, there may be tens of devices on a LAN. IoTDB can be installed on a local controller server in the factory to receive data from those devices. ...
    Downloads: 10 This Week