Business Software for Apache Spark - Page 5

Top Software that integrates with Apache Spark as of August 2025 - Page 5

  • 1
    Amundsen

    Discover & trust data for your analysis and models. Be more productive by breaking silos. Get immediate context into the data and see how others are using it. Search for data within your organization by a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard. Build trust in data using automated and curated metadata, descriptions of tables and columns, other frequent users, when the table was last updated, statistics, a preview of the data if permitted, etc. Easy triage by linking the ETL job and code that generated the data. Update tables and columns with descriptions, reduce unnecessary back and forth about which table to use and what a column contains. See what data fellow co-workers frequently use, own or have bookmarked. Learn what most common queries for a table look like by seeing dashboards built on a given table.
  • 2
    Apache Kylin

    Apache Software Foundation

    Apache Kylin™ is an open source, distributed Analytical Data Warehouse for Big Data; it was designed to provide OLAP (Online Analytical Processing) capability in the big data era. By renovating the multi-dimensional cube and precalculation technology on Hadoop and Spark, Kylin is able to achieve near-constant query speed regardless of the ever-growing data volume. Reducing query latency from minutes to sub-second, Kylin brings online analytics back to big data. Kylin can analyze 10+ billion rows in less than a second. No more waiting on reports for critical decisions. Kylin connects data on Hadoop to BI tools like Tableau, PowerBI/Excel, MSTR, QlikSense, Hue and SuperSet, making BI on Hadoop faster than ever. As an Analytical Data Warehouse, Kylin offers ANSI SQL on Hadoop/Spark and supports most ANSI SQL query functions. Kylin can support thousands of interactive queries at the same time, thanks to the low resource consumption of each query.
  • 3
    Apache Zeppelin
    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more. The IPython interpreter provides a user experience comparable to Jupyter Notebook. This release includes note-level dynamic forms, a note revision comparator, and the ability to run paragraphs sequentially instead of the simultaneous paragraph execution of previous releases. The interpreter lifecycle manager automatically terminates the interpreter process on idle timeout, so resources are released when they're not in use.
  • 4
    Quantexa

    Uncover hidden risk and reveal new, unexpected opportunities with graph analytics across the customer lifecycle. Standard MDM solutions are not built for high volumes of distributed, disparate data generated by various applications and external sources. Traditional MDM probabilistic matching doesn't work well with siloed data sources. It misses connections, loses context, leads to inaccurate decision-making, and leaves business value on the table. An ineffective MDM solution affects everything from customer experience to operational performance. Without on-demand visibility of holistic payment patterns, trends and risk, your team can't make the right decisions quickly, compliance costs escalate, and you can't increase coverage fast enough. Your data isn't connected, so customers suffer fragmented experiences across channels, business lines and geographies. Attempts at personalized engagement fall short as they are based on partial, often outdated data.
  • 5
    witboost

    Agile Lab

    witboost is a modular, scalable, fast, efficient data management system that helps your company truly become data driven and reduce time-to-market, IT expenditures and overheads. witboost comprises a series of modules: building blocks that can work as standalone solutions to address and solve a single need or problem, or be combined to create the perfect data management ecosystem for your company. Each module improves a specific data engineering function, guaranteeing a blazingly fast and smooth implementation and thus dramatically reducing time-to-market, time-to-value and, consequently, the TCO of your data engineering infrastructure. Smart cities need digital twins to predict needs and avoid unforeseen problems, gathering data from thousands of sources and managing ever more complex telematics.
  • 6
    Occubee

    3SOFT

    The Occubee platform automatically converts large amounts of receipt data, information on thousands of products, and dozens of retail-specific factors into valuable sales and demand forecasts. In stores, Occubee forecasts sales individually for each product and generates replenishment commands. In warehouses, Occubee optimizes the availability of goods and allocated capital, and generates orders for suppliers. In the head office, Occubee provides real-time monitoring of sales processes and generates anomaly alerts and reports. Modern technologies for data collection and processing ensure automation of key business processes in the retail industry. Occubee fully responds to the needs of modern retail and fits in with the global megatrends related to the use of data in business.
  • 7
    Acxiom InfoBase
    Acxiom enables you to leverage comprehensive data for premium audiences and insights across the globe. Better understand, identify, and target ideal audiences by engaging and personalizing experiences across digital and offline channels. With marketing technology, identity resolution and digital connectivity converging in a “borderless digital world,” brands can now quickly locate data attributes, service availability and the digital footprint across the globe to fuel informed decisions. Acxiom is the global data leader with thousands of data attributes in more than 60 countries helping brands improve millions of customer experiences every day through meaningful data-driven insights, all while protecting consumer privacy. Understand, reach and engage audiences everywhere, maximize your media investments and power more personalized experiences. Reach audiences around the globe and deliver experiences that matter with Acxiom data.
  • 8
    Deeplearning4j

    DL4J takes advantage of the latest distributed computing frameworks, including Apache Spark and Hadoop, to accelerate training. On multi-GPU setups, it is equal to Caffe in performance. The libraries are completely open source (Apache 2.0) and maintained by the developer community and the Konduit team. Deeplearning4j is written in Java and is compatible with any JVM language, such as Scala, Clojure, or Kotlin. The underlying computations are written in C, C++, and CUDA. Keras serves as the Python API. Eclipse Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Apache Spark, DL4J brings AI to business environments for use on distributed GPUs and CPUs. There are a lot of parameters to adjust when you're training a deep-learning network. We've done our best to explain them, so that Deeplearning4j can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers.
  • 9
    PySpark

    PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as distributed SQL query engine. Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics.
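
    As a quick illustration of the APIs described above, here is a minimal sketch (with made-up sample data) that builds a DataFrame and runs the same query through both the DataFrame API and Spark SQL:

    ```python
    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session; the PySpark shell provides one as `spark`.
    spark = SparkSession.builder.appName("pyspark-example").getOrCreate()

    # A small DataFrame from in-memory rows (hypothetical sample data).
    df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

    # The DataFrame API and Spark SQL are two views of the same engine.
    df.filter(df.age > 40).show()

    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()
    ```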
  • 10
    Apache Kudu

    The Apache Software Foundation

    A Kudu cluster stores tables that look just like the tables you're used to from relational (SQL) databases. A table can be as simple as a binary key and value, or as complex as a few hundred different strongly-typed attributes. Just like SQL, every table has a primary key made up of one or more columns. This might be a single column like a unique user identifier, or a compound key such as a (host, metric, timestamp) tuple for a machine time-series database. Rows can be efficiently read, updated, or deleted by their primary key. Kudu's simple data model makes it a breeze to port legacy applications or build new ones: no need to worry about how to encode your data into binary blobs or make sense of a huge database full of hard-to-interpret JSON. Tables are self-describing, so you can use standard tools like SQL engines or Spark to analyze your data. Kudu's APIs are designed to be easy to use.
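
    To illustrate the Spark analysis mentioned above, here is a hedged sketch of reading a Kudu table from PySpark through the kudu-spark connector; the master address and table name are hypothetical, and the connector package is assumed to be on the classpath (e.g. via --packages org.apache.kudu:kudu-spark3_2.12):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kudu-example").getOrCreate()

    # Hypothetical Kudu master address and table name.
    metrics = (spark.read.format("kudu")
        .option("kudu.master", "kudu-master:7051")
        .option("kudu.table", "metrics")
        .load())

    # Rows are addressable by the table's primary key, e.g. a
    # (host, metric, timestamp) compound key for a time-series table.
    metrics.filter("host = 'web01' AND metric = 'cpu'").show()
    ```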
  • 11
    Apache Hudi

    Apache Software Foundation

    Hudi is a rich platform for building streaming data lakes with incremental data pipelines on a self-managing database layer, optimized for lake engines and regular batch processing. Hudi maintains a timeline of all actions performed on the table at different instants in time, which helps provide instantaneous views of the table while also efficiently supporting retrieval of data in the order of arrival. Hudi provides efficient upserts by consistently mapping a given hoodie key to a file id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
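
    As a sketch of the upsert path just described, the following PySpark snippet assumes the hudi-spark bundle is available (e.g. via --packages org.apache.hudi:hudi-spark3-bundle_2.12); the table name, fields, and path are hypothetical:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-example").getOrCreate()

    df = spark.createDataFrame(
        [("id-1", "2024-01-01 00:00:00", 42)], ["uuid", "ts", "value"]
    )

    # Upsert: Hudi maps each record key ("hoodie key") to a file group via its
    # index, so writing "id-1" again later updates the same file group.
    (df.write.format("hudi")
        .option("hoodie.table.name", "events")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")  # newest ts wins on key collisions
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("/tmp/hudi/events"))
    ```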
  • 12
    Retina

    Predict future value from day one. Retina is the customer intelligence solution that provides accurate customer lifetime value metrics early in the customer journey. Optimize marketing budgets in real-time, drive more predictable repeat revenue, and elevate brand equity with the most accurate CLV metrics. Align customer acquisition around CLV with improved targeting, ad relevance, conversion rates & customer loyalty. Build lookalike audiences based on your best customers. Focus on customer behavior instead of demographics. Pinpoint attributes that make leads more likely to convert. Uncover product features that drive valuable customer behavior. Create customer journeys that positively impact lifetime value. Implement changes to boost the value of your customer base. Using a sample of your customer data, Retina delivers individual customer lifetime value calculations to qualified customers before you buy.
  • 13
    Azure HDInsight
    Run popular open-source frameworks, including Apache Hadoop, Spark, Hive, Kafka, and more, using Azure HDInsight, a customizable, enterprise-grade service for open-source analytics. Effortlessly process massive amounts of data and get all the benefits of the broad open-source project ecosystem with the global scale of Azure. Easily migrate your big data workloads and processing to the cloud. Open-source projects and clusters are easy to spin up quickly without the need to install hardware or manage infrastructure. Big data clusters reduce costs through autoscaling and pricing tiers that allow you to pay for only what you use. Enterprise-grade security and industry-leading compliance, with more than 30 certifications, help protect your data. Optimized components for open-source technologies such as Hadoop and Spark keep you up to date.
  • 14
    IBM Intelligent Operations Center for Emergency Mgmt
    An incident and emergency management solution for daily operations, emergencies, and crisis situations. This command, control and communication (C3) solution uses data analytics technologies coupled with social and mobile technology to streamline and integrate preparation, response, recovery and mitigation of daily incidents, emergencies and disasters. IBM works with governments and public safety organizations worldwide to implement public safety technology solutions. Proven preparation techniques use the same technology to manage day-to-day community incidents as when responding to crisis situations. This familiarity helps ensure first responders and C3 staff can engage immediately and naturally in response, recovery and mitigation without needing access to special documentation and systems. This incident and emergency management solution integrates and correlates information sources to create a dynamic, near real-time geospatial framework for a common operating picture.
  • 15
    doolytic

    doolytic is leading the way in big data discovery, the convergence of data discovery, advanced analytics, and big data. doolytic is rallying expert BI users to the revolution in self-service exploration of big data, revealing the data scientist in all of us. doolytic is an enterprise software solution for native discovery on big data, based on best-of-breed, scalable, open-source technologies. Lightning performance on billions of records and petabytes of data. Structured, unstructured and real-time data from any source. Sophisticated advanced query capabilities for expert users, and integration with R for advanced and predictive applications. Search, analyze, and visualize data from any format and any source in real time with the flexibility of Elastic. Leverage the power of Hadoop data lakes with no latency and concurrency issues. doolytic solves common BI problems and enables big data discovery without clumsy and inefficient workarounds.
  • 16
    StreamFlux

    Fractal

    Data is crucial when it comes to building, streamlining and growing your business. However, getting the full value out of data can be a challenge: many organizations are faced with poor access to data, incompatible tools, spiraling costs and slow results. Simply put, leaders who can turn raw data into real results will thrive in today's landscape. The key to this is empowering everyone across your business to analyze, build and collaborate on end-to-end AI and machine learning solutions in one place, fast. Streamflux is a one-stop shop for your data analytics and AI challenges. Our self-serve platform allows you the freedom to build end-to-end data solutions, use models to answer complex questions, and assess user behaviors. Whether you're predicting customer churn and future revenue, or generating recommendations, you can go from raw data to genuine business impact in days, not months.
  • 17
    Pavilion HyperOS
    Powering the most performant, dense, scalable, and flexible storage platform in the universe. Pavilion HyperParallel File System™ provides the ability to scale across an unlimited number of Pavilion HyperParallel Flash Arrays™, providing 1.2 TB/s read and 900 GB/s write bandwidth with 200M IOPS at 25µs latency per rack. Uniquely capable of providing independent, linear scalability of both capacity and performance, Pavilion HyperOS 3 now provides global namespace support for both NFS and S3, enabling unlimited, linear scale across an unlimited number of Pavilion HyperParallel Flash Array systems. Take advantage of the power of the Pavilion HyperParallel Flash Array to enjoy unrivaled levels of performance and availability. The Pavilion HyperOS includes patent-pending technology to ensure that your data is always available, with performant access that legacy arrays cannot match.
  • 18
    Great Expectations

    Great Expectations is a shared, open standard for data quality. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling. We recommend deploying within a virtual environment. If you're not familiar with pip, virtual environments, notebooks, or git, you may want to check out the supporting resources. There are many amazing companies using Great Expectations these days. Check out some of our case studies with companies that we've worked closely with to understand how they are using Great Expectations in their data stack. Great Expectations Cloud is a fully managed SaaS offering. We're taking on new private alpha members; alpha members get first access to new features and input into the roadmap.
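
    A minimal sketch of the data-testing idea, using the classic Pandas-backed API (newer releases organize this around a Data Context instead); the sample frame is made up:

    ```python
    import pandas as pd
    import great_expectations as ge

    # Wrap a plain pandas frame so it gains expectation methods.
    df = ge.from_pandas(
        pd.DataFrame({"id": [1, 2, 3], "amount": [9.99, 5.00, None]})
    )

    # Each expectation is a declarative data test that reports success or failure.
    print(df.expect_column_values_to_be_not_null("id").success)           # True
    print(df.expect_column_values_to_be_between("amount", 0, 100).success)
    ```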
  • 19
    Spark Streaming

    Apache Software Foundation

    Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python. Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box, without any extra code on your part. By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. Build powerful interactive applications, not just analytics. Spark Streaming is developed as part of Apache Spark. It thus gets tested and updated with each Spark release. You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster resource managers. It also includes a local run mode for development. In production, Spark Streaming uses ZooKeeper and HDFS for high availability.
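
    A minimal sketch of the classic DStream word count in local run mode, closely following the standard example; the host and port are hypothetical (feed it with e.g. `nc -lk 9999`):

    ```python
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Local mode with two threads: one to receive, one to process.
    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
    ```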
  • 20
    5GSoftware

    Enabling cost-effective deployment of a high-performance, end-to-end, private 5G network for enterprises and communities. We provide a secure 5G overlay for bringing edge intelligence to existing enterprise networks. Deploy a 5G core with ease. Secure backhaul connectivity. Designed to scale on demand. Remote management and automated network orchestration. Management of data synchronization between edge and central locations. Cost-effective all-in-one 5G core for light users. Fully functional 5G core distributed in the cloud for heavy, enterprise use. Flexibility to add additional nodes as demand grows. Flexible early billing plan (minimum 6-month commitment needed). Full control of deployed nodes in the cloud. Flexible monthly/yearly billing cycle. The cloud 5G software platform enables a seamless overlay of a 5G core deployment on your existing or greenfield enterprise IT network to meet ultra-fast, low-latency connectivity needs while providing complete security.
  • 21
    Lightbits

    Lightbits Labs

    We help our customers achieve hyperscale efficiency and cost savings for their own private cloud or public cloud storage-as-a-service offering. With our software-defined block storage solution, Lightbits, customers scale their business effortlessly, accelerate IT operations, and reduce cost, at the speed of local flash. Break the dependency between compute and storage to allocate resources independently and bring the flexibility and efficiency of the cloud on-premises. Deliver low latency and high performance while guaranteeing high availability for your distributed databases and cloud-native applications such as SQL, NoSQL, and in-memory databases. With the constant growth of data in the always-available data center, one of the critical challenges is that applications and services running at scale must stay stateful as they migrate around the data center in order to keep services available and efficient in the presence of constant failures.
  • 22
    SQL

    SQL is a domain-specific programming language used for accessing, managing, and manipulating relational databases and relational database management systems.
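
    As a small illustration, the statements below exercise that access/manage/manipulate trio against Python's built-in sqlite3 module, purely to stay self-contained; the schema and data are made up:

    ```python
    import sqlite3

    # An in-memory relational database; the SQL itself is standard.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    con.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                    [("alice", 34), ("bob", 45)])

    # Query (access) the data declaratively.
    for row in con.execute("SELECT name FROM users WHERE age > 40 ORDER BY name"):
        print(row)  # ('bob',)

    # Manipulate it in place.
    con.execute("UPDATE users SET age = age + 1 WHERE name = 'alice'")
    ```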
  • 23
    AI Squared

    Empower data scientists and application developers to collaborate on ML projects. Build, load, optimize and test models and integrations before publishing to end-users for integration into live applications. Reduce data science workload and improve decision-making by storing and sharing ML models across the organization. Publish updates to automatically push changes to models in production. Drive efficiency by instantly providing ML-powered insights within any web-based business application. Our self-service, drag-and-drop browser extension enables analysts and business users to integrate models into any web-based application with zero code.
  • 24
    Deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. We are happy to receive feedback and contributions. Deequ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on a previous Spark version, please use a Deequ 1.x version (legacy version is maintained in legacy-spark-3.0 branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11 and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12. Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library.
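
    In that spirit, here is a hedged sketch of such a toy check using PyDeequ, the Python wrapper around Deequ (the native API is Scala); the sample data is made up:

    ```python
    from pyspark.sql import SparkSession, Row
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Pull in the Deequ jar matching the local Spark version.
    spark = (SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate())

    df = spark.createDataFrame([
        Row(id=1, name="Thingy A"),
        Row(id=2, name="Thingy B"),
        Row(id=3, name=None),
    ])

    # "Unit tests for data": declare constraints, then verify them in one pass.
    check = (Check(spark, CheckLevel.Error, "toy check")
        .hasSize(lambda n: n == 3)   # exactly three rows
        .isUnique("id")              # no duplicate ids
        .isComplete("id"))           # no missing ids

    result = VerificationSuite(spark).onData(df).addCheck(check).run()
    VerificationResult.checkResultsAsDataFrame(spark, result).show()
    ```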
  • 25
    Zepl

    Sync, search and manage all the work across your data science team. Zepl's powerful search lets you discover and reuse models and code. Use Zepl's enterprise collaboration platform to query data from Snowflake, Athena or Redshift and build your models in Python. Use pivoting and dynamic forms for enhanced interactions with your data using heatmap, radar, and Sankey charts. Zepl creates a new container every time you run your notebook, providing you with the same image each time you run your models. Invite team members to join a shared space and work together in real time, or simply leave comments on a notebook. Use fine-grained access controls to share your work: allow others to have read, edit, and run access to enable collaboration and distribution. All notebooks are auto-saved and versioned. You can name, manage and roll back all versions through an easy-to-use interface, and export seamlessly to GitHub.
  • 26
    Yottamine

    Our highly innovative machine learning technology is designed specifically to accurately predict financial time series where only a small number of training data points are available. Advanced AI is computationally demanding. YottamineAI leverages the cloud to eliminate the need to invest time and money in managing hardware, significantly shortening the time to benefit and yielding higher ROI. Strong encryption and protection of keys ensure trade secrets stay safe. We follow AWS best practices and utilize strong encryption to secure your data. We evaluate how your existing or future data can generate predictive analytics to help you make information-based decisions. If you need predictive analytics on a project basis, Yottamine Consulting Services provides project-based consulting to accommodate your data-mining needs.
  • 27
    RunCode

    RunCode offers online developer workspaces, which are environments that allow you to work on code projects in a web browser. These workspaces provide you with a full development environment, including a code editor, a terminal, and access to a range of tools and libraries. They are designed to be easy to use and allow you to get started quickly without the need to set up a local development environment on your own computer.
    Starting Price: $20/month/user
  • 28
    Amazon SageMaker Feature Store
    Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics. Features are used repeatedly by multiple teams and feature quality is critical to ensure a highly accurate model. Also, when features used to train models offline in batch are made available for real-time inference, it’s hard to keep the two feature stores synchronized. SageMaker Feature Store provides a secured and unified store for feature use across the ML lifecycle. Store, share, and manage ML model features for training and inference to promote feature reuse across ML applications. Ingest features from any data source including streaming and batch such as application logs, service logs, clickstreams, sensors, etc.
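
    A hedged sketch of creating a feature group for the playlist-style features described above and ingesting into it with the SageMaker Python SDK; the group name, S3 layout, and IAM role ARN are hypothetical, and valid AWS credentials are assumed:

    ```python
    import time
    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()

    # Hypothetical playlist features; an event-time column is required.
    df = pd.DataFrame({
        "listener_id": ["l-1", "l-2"],
        "song_rating": [4.5, 3.0],
        "listen_minutes": [37.0, 12.5],
        "event_time": [time.time()] * 2,
    })
    df["listener_id"] = df["listener_id"].astype("string")

    group = FeatureGroup(name="playlist-features", sagemaker_session=session)
    group.load_feature_definitions(data_frame=df)  # infer feature types from the frame
    group.create(
        s3_uri=f"s3://{session.default_bucket()}/feature-store",   # offline store
        record_identifier_name="listener_id",
        event_time_feature_name="event_time",
        role_arn="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
        enable_online_store=True,  # keep an online store for real-time inference
    )
    group.ingest(data_frame=df, max_workers=1, wait=True)
    ```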
  • 29
    Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data you want from a wide variety of data sources and import it quickly. Next, you can use the Data Quality and Insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations so you can quickly transform data without writing any code. Once you have completed your data preparation workflow, you can scale it to your full datasets using SageMaker data processing jobs, and then train, tune, and deploy your models.
  • 30
    Apache Mahout

    Apache Software Foundation

    Apache Mahout is a powerful, scalable, and versatile machine learning library designed for distributed data processing. It offers a comprehensive set of algorithms for various tasks, including classification, clustering, recommendation, and pattern mining. Built on top of the Apache Hadoop ecosystem, Mahout leverages MapReduce and Spark to enable data processing on large-scale datasets. Apache Mahout™ is a distributed linear algebra framework and mathematically expressive Scala DSL designed to let mathematicians, statisticians, and data scientists quickly implement their own algorithms. Apache Spark is the recommended out-of-the-box distributed back-end, and Mahout can be extended to other distributed backends. Matrix computations are a fundamental part of many scientific and engineering applications, including machine learning, computer vision, and data analysis. Apache Mahout is designed to handle large-scale data processing by leveraging the power of Hadoop and Spark.