Showing 20 open source projects for "apache"

  • 1
    Apache SeaTunnel

    SeaTunnel is a distributed, high-performance data integration platform

    SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data volumes. It can synchronize tens of billions of records per day, stably and efficiently, and is used in production at nearly 100 companies. There are hundreds of commonly used data sources, many with mutually incompatible versions, and new data sources keep appearing as technologies emerge. It is difficult for users to find a tool that can...
    Downloads: 1 This Week
    See Project
  • 2
    Dolphin Scheduler

    A distributed and extensible workflow scheduler platform

    Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces. It is dedicated to solving complex task dependencies in data pipelines and provides many types of jobs out of the box, making the scheduler ready to use for data processing.
    Downloads: 3 This Week
    See Project
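DolphinScheduler workflows are usually assembled in the visual DAG editor, but the project also ships a Python SDK. A rough sketch of the SDK's tutorial-style API follows; the workflow name, commands, and schedule are invented, and exact import paths vary between SDK releases, so treat this as an illustration rather than the canonical API:

```python
# pip install apache-dolphinscheduler -- sketch only; import paths differ
# across SDK versions (older releases call Workflow "ProcessDefinition").
from pydolphinscheduler.core.workflow import Workflow
from pydolphinscheduler.tasks.shell import Shell

with Workflow(name="nightly_etl", schedule="0 0 2 * * ? *") as wf:
    extract = Shell(name="extract", command="python extract.py")  # hypothetical scripts
    load = Shell(name="load", command="python load.py")
    extract >> load  # '>>' declares the dependency edge in the DAG
    wf.submit()      # register the workflow with the scheduler
```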
  • 3
    Conduit

    Conduit streams data between data stores; a Kafka Connect replacement

    Conduit is a data streaming tool written in Go. It aims to provide the best user experience for building and running real-time data pipelines. Conduit comes with batteries included: it provides a UI, common connectors, processors, and observability data out of the box. Sync data between your production systems using an extensible, event-first experience with minimal dependencies that fits within your existing workflow. Eliminate the multi-step process you go through today. Just download the...
    Downloads: 7 This Week
    See Project
  • 4
    Kestra

    Kestra is an infinitely scalable orchestration and scheduling platform

    Build reliable workflows blazingly fast and deploy them in just a few clicks. Kestra is an open-source, event-driven orchestrator that simplifies data operations and improves collaboration between engineers and business users. By bringing Infrastructure as Code best practices to data pipelines, Kestra allows you to build reliable workflows and manage them with confidence. Thanks to the declarative YAML interface for defining orchestration logic, everyone who benefits from analytics can participate...
    Downloads: 1 This Week
    See Project
  • 5
    Luigi

    Python module that helps you build complex pipelines of batch jobs

    Luigi is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command-line integration, and much more. The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes: you want to chain many tasks and automate them, and failures will happen. These tasks can be anything, but are typically long-running things like Hadoop...
    Downloads: 3 This Week
    See Project
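The dependency model is easiest to see in code. A minimal sketch with invented file names and toy logic: each task declares its inputs via requires() and its artifact via output(), and Luigi runs whatever is missing, in order.

```python
import datetime
import luigi


class Extract(luigi.Task):
    """Writes a raw CSV; stands in for pulling from a real source."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"raw-{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,value\n1,10\n2,20\n")


class Aggregate(luigi.Task):
    """Depends on Extract; Luigi schedules upstream tasks first."""
    date = luigi.DateParameter()

    def requires(self):
        return Extract(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"total-{self.date}.txt")

    def run(self):
        with self.input().open() as f:
            rows = f.read().splitlines()[1:]  # skip the header row
        total = sum(int(r.split(",")[1]) for r in rows)
        with self.output().open("w") as f:
            f.write(str(total))


if __name__ == "__main__":
    # local_scheduler avoids needing the central luigid daemon for a demo.
    luigi.build([Aggregate(date=datetime.date.today())], local_scheduler=True)
```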
  • 6
    Backstage

    Backstage is an open platform for building developer portals

    Powered by a centralized software catalog, Backstage restores order to your infrastructure and enables your product teams to ship high-quality code quickly, without compromising autonomy. At Spotify, we've always believed in the speed and ingenuity that come from having autonomous development teams. But as we learned firsthand, the faster you grow, the more fragmented and complex your software ecosystem becomes, and then everything slows down again. By centralizing services and...
    Downloads: 1 This Week
    See Project
  • 7
    Alluxio

    Open Source Data Orchestration for the Cloud

    Alluxio is the world's first open source data orchestration technology for analytics and AI in the cloud. It bridges the gap between computation frameworks and storage systems, bringing data from the storage tier closer to data-driven applications. This enables applications to connect to numerous storage systems through a common interface, and it makes data local, more accessible, and as elastic as compute.
    Downloads: 0 This Week
    See Project
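One way to picture "a common interface" is through Alluxio's POSIX/FUSE integration, where the unified namespace appears as an ordinary directory tree. The mount point and dataset path below are assumptions for illustration, not Alluxio defaults:

```python
# Sketch assuming an Alluxio FUSE mount at /mnt/alluxio (hypothetical path).
from pathlib import Path

data_dir = Path("/mnt/alluxio/warehouse/events")  # invented dataset location

# The same stdlib code works whether the bytes behind the path live in S3,
# HDFS, or Alluxio's memory tier -- locality is Alluxio's concern, not ours.
for part in sorted(data_dir.glob("*.csv")):
    with part.open() as f:
        print(part.name, f.readline().rstrip())  # print each file's header row
```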
  • 8
    lakeFS

    lakeFS - Git-like capabilities for your object storage

    Increase data quality and reduce the painful cost of errors. Apply data engineering best practices using Git-like operations on data. lakeFS is an open-source data version control system for data lakes. It enables zero-copy dev/test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more. Data is dynamic; it changes over time. Dealing with that without a data version control system is error-prone and labor-intensive. With lakeFS, your data lake is...
    Downloads: 2 This Week
    See Project
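Because lakeFS exposes an S3-compatible endpoint, the branch-per-experiment idea can be sketched with plain boto3: the repository maps to a bucket and the branch is the first path component of the key. The endpoint, credentials, and the my-repo/dev names below are all assumptions:

```python
import boto3

# Hypothetical lakeFS S3 gateway endpoint and lakeFS-issued credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA-EXAMPLE",
    aws_secret_access_key="example-secret",
)

# Write to an isolated 'dev' branch of the 'my-repo' repository.
s3.put_object(Bucket="my-repo", Key="dev/events/2024-01-01.parquet", Body=b"...")

# Readers pinned to 'main' don't see the object until the branch is merged.
listing = s3.list_objects_v2(Bucket="my-repo", Prefix="main/events/")
print([obj["Key"] for obj in listing.get("Contents", [])])
```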
  • 9
    Elementary

    Open-source data observability for analytics engineers

    Elementary is an open-source data observability solution for data and analytics engineers. Monitor your dbt project and data in minutes, and be the first to know about data issues. Gain immediate visibility, detect data issues, send actionable alerts, and understand the impact and root cause. Generate a data observability report, host it, or share it with your team. It monitors data quality metrics, freshness, volume, and schema changes, including anomaly detection. Elementary data monitors are...
    Downloads: 0 This Week
    See Project
  • 10
    whylogs

    The open standard for data logging

    whylogs is an open-source library for logging any kind of data. With whylogs, users can generate summaries of their datasets (called whylogs profiles) to track changes in their datasets, create data constraints to check whether their data looks the way it should, and quickly visualize key summary statistics. whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond...
    Downloads: 0 This Week
    See Project
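A minimal sketch of the profiling flow, assuming the v1 why.log API and an invented DataFrame:

```python
import pandas as pd
import whylogs as why  # pip install whylogs

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 15.00, 3.50]})

results = why.log(df)    # build a whylogs profile of the dataset
view = results.view()    # a statistical summary, not the raw rows
print(view.to_pandas())  # per-column counts, types, and distributions
```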
  • 11
    Mage.ai

    Build, run, and manage data pipelines for integrating data

    Open-source data pipeline tool for transforming and integrating data; the modern replacement for Airflow. Effortlessly integrate and synchronize data from third-party sources. Build real-time and batch pipelines to transform data using Python, SQL, and R. Run, monitor, and orchestrate thousands of pipelines without losing sleep. Have you met anyone who said they loved developing in Airflow? That's why we designed an easy developer experience that you'll enjoy. Each step in your pipeline is a...
    Downloads: 0 This Week
    See Project
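In Mage, each pipeline step is a decorated function in its own block file. A rough sketch of that style with invented data; real blocks live in separate files generated by the UI, and the import guards mirror Mage's generated templates:

```python
import pandas as pd

# Mage's generated block templates guard these imports this way.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@data_loader
def load_orders(**kwargs) -> pd.DataFrame:
    # A real block would read from an API, database, or file here.
    return pd.DataFrame({"order_id": [1, 2], "amount": [25.0, 40.0]})


@transformer
def add_tax(df: pd.DataFrame, **kwargs) -> pd.DataFrame:
    # The upstream block's output arrives as the first argument.
    df["amount_with_tax"] = df["amount"] * 1.08
    return df
```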
  • 12
    AutoGluon

    AutoGluon: AutoML for Image, Text, and Tabular Data

    AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on automated stack ensembling, deep learning, and real-world applications spanning image, text, and tabular data. Intended for both ML beginners and experts, AutoGluon lets you quickly prototype deep learning and classical ML solutions for your raw data with a few lines of code, automatically applying state-of-the-art techniques (where appropriate) without requiring expert knowledge. Leverage automatic hyperparameter tuning,...
    Downloads: 0 This Week
    See Project
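The "few lines of code" claim maps to the tabular quick-start pattern. A sketch assuming hypothetical train.csv/test.csv files with a 'class' column to predict:

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")  # hypothetical files; any pandas-style
test = TabularDataset("test.csv")    # table with a label column works

# fit() handles model selection, hyperparameters, and stack ensembling.
predictor = TabularPredictor(label="class").fit(train)
predictions = predictor.predict(test)
print(predictor.leaderboard(test))   # compare the trained models' scores
```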
  • 13
    PipeRider

    Code review for data in dbt

    PipeRider automatically compares your data to highlight differences in impacted downstream dbt models, so you can merge your pull requests with confidence. PipeRider can profile your dbt models and report information such as basic data composition, quantiles, histograms, text length, top categories, and more. It integrates with dbt metrics and presents the time-series data of metrics in the report. PipeRider generates a static HTML report each time it runs, which can be viewed...
    Downloads: 0 This Week
    See Project
  • 14
    Tributary

    Streaming reactive and dataflow graphs in Python

    Tributary is a library for constructing dataflow graphs in Python. Unlike many other DAG libraries in Python (Airflow, Luigi, Prefect, Dagster, Dask, Kedro, etc.), tributary is not designed with data/ETL pipelines or scheduling in mind. Instead, tributary is more similar to libraries like mdf, loman, pyungo, streamz, or pyfunctional, in that it is designed to be used as the implementation for a data model. One such example is the greeks library, which leverages tributary to build data models...
    Downloads: 0 This Week
    See Project
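Tributary's own API is richer than this, but the data-model idea (values that recompute their dependents whenever an input ticks) can be sketched in plain Python. The Node class below is a toy invented for illustration, not tributary's API:

```python
class Node:
    """Toy dataflow node: pushes recomputation to dependents on change."""

    def __init__(self, func=None, *inputs):
        self.func, self.inputs, self.dependents = func, list(inputs), []
        self.value = None
        for node in self.inputs:
            node.dependents.append(self)

    def set(self, value):  # for source nodes
        self.value = value
        for dep in self.dependents:
            dep.recompute()

    def recompute(self):
        self.value = self.func(*(n.value for n in self.inputs))
        for dep in self.dependents:
            dep.recompute()


price, quantity = Node(), Node()
notional = Node(lambda p, q: (p or 0) * (q or 0), price, quantity)

price.set(101.5)       # notional recomputes on every tick of either input
quantity.set(200)
print(notional.value)  # 20300.0
```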
  • 15
    BitSail

    BitSail is a distributed high-performance data integration engine

    BitSail is ByteDance's open-source data integration engine, built on a distributed architecture for high performance. It supports data synchronization between multiple heterogeneous data sources and provides global data integration solutions for batch, streaming, and incremental scenarios. At present, it serves almost all business lines within ByteDance, such as Douyin and Toutiao, and synchronizes hundreds of trillions of records every day. BitSail has been widely used and...
    Downloads: 0 This Week
    See Project
  • 16
    CueLake

    Use SQL to build ELT pipelines on a data lakehouse

    ...To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg). To transform data, you write SQL statements to create views and tables in your data lakehouse. CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.
    Downloads: 0 This Week
    See Project
  • 17
    DataKit

    Connect processes into powerful data pipelines

    Connect processes into powerful data pipelines with a simple Git-like filesystem interface. DataKit is a tool to orchestrate applications using a Git-like dataflow. It revisits the UNIX pipeline concept with a modern twist: streams of tree-structured data instead of raw text. DataKit allows you to define complex build pipelines over version-controlled data. DataKit is currently used as the coordination layer for HyperKit, the hypervisor component of Docker for Mac and Windows, and for the...
    Downloads: 0 This Week
    See Project
  • 18
    apache spark data pipeline osDQ

    osDQ offshoot for creating Apache Spark based data pipelines using JSON

    This is an offshoot of the open source data quality (osDQ) project, https://sourceforge.net/projects/dataquality/. This subproject creates an Apache Spark based data pipeline in which a JSON metadata file is used to run data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode. A JSON example is available at https://github.com/arrahtech/osdq-spark. To run it, unzip the zip file; on Windows: java -cp ....
    Downloads: 1 This Week
    See Project
  • 19
    Apache Kafka

    Mirror of Apache Kafka

    Apache Kafka is a distributed streaming platform built around durable, partitioned logs called topics, enabling high-throughput, fault-tolerant event pipelines. Producers append records to partitions, brokers replicate them for durability, and consumer groups read them at their own pace while balancing work across instances. The commit/offset model and retention policies support patterns from real-time processing to event sourcing and audit trails.
    Downloads: 2 This Week
    See Project
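The producer/consumer split described above looks like this with the kafka-python client (one of several client libraries; the broker address and topic name are assumptions):

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: append JSON records to the 'events' topic (hypothetical name).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": 42, "action": "login"})
producer.flush()  # block until the broker acknowledges the writes

# Consumer: join the 'analytics' group; offsets are tracked per group, so a
# restarted consumer resumes where its group left off.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```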
  • 20
    Cuelake

    Use SQL to build ELT pipelines on a data lakehouse

    ...To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg). To transform data, you write SQL statements to create views and tables in your data lakehouse. CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.
    Downloads: 0 This Week
    See Project