lakeFS Integrations

Looker

Google

Looker, Google Cloud’s business intelligence platform, enables you to chat with your data. Organizations turn to Looker for self-service and governed BI, to build custom applications with trusted metrics, or to bring Looker modeling to their existing environment. The result is improved data engineering efficiency and true business transformation. Looker is reinventing business intelligence for the modern company. Looker works the way the web does: browser-based, its unique modeling language lets any employee leverage the work of your best data analysts. Operating 100% in-database, Looker capitalizes on the newest, fastest analytic databases—to get real results, in real time.

20 Ratings

View Software

Amazon Web Services (AWS)

Amazon

Whether you're looking for compute power, database storage, content delivery, or other functionality, AWS has the services to help you build sophisticated applications with increased flexibility, scalability and reliability. Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, offering over 175 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster. AWS has significantly more services, and more features within those services, than any other cloud provider–from infrastructure technologies like compute, storage, and databases–to emerging technologies, such as machine learning and artificial intelligence, data lakes and analytics, and Internet of Things. This makes it faster, easier, and more cost effective to move your existing applications to the cloud.

13 Ratings

View Software

Amazon S3

Amazon

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize your data and configure finely-tuned access controls to meet your specific business, organizational, and compliance requirements. Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world. Scale your storage resources up and down to meet fluctuating demands, without upfront investments or resource procurement cycles. Amazon S3 is designed for 99.999999999% (11 9’s) of data durability.

6 Ratings

View Software

Google Cloud Storage

Google

Object storage for companies of all sizes. Store any amount of data. Retrieve it as often as you’d like. Configure your data with Object Lifecycle Management (OLM) to automatically transition to lower-cost storage classes when it meets the criteria you specify, such as when it reaches a certain age or when you’ve stored a newer version of the data. Cloud Storage has an ever-growing list of storage bucket locations where you can store your data with multiple automatic redundancy options. Whether you are optimizing for split-second response time, or creating a robust disaster recovery plan, customize where and how you store your data. Storage Transfer Service and Transfer Service for on-premises data offer two highly performant, online pathways to Cloud Storage—both with the scalability and speed you need to simplify the data transfer process. For offline data transfer our Transfer Appliance is a shippable storage server.

4 Ratings

View Software

Jupyter Notebook

Project Jupyter

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

3 Ratings

View Software

Amazon Athena

Amazon

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.

2 Ratings

View Software

Apache Hive

Apache Software Foundation

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Previously it was a subproject of Apache® Hadoop®, but has now graduated to become a top-level project of its own. We encourage you to learn about the project and contribute your expertise. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API.

1 Rating

View Software

Apache Kafka

The Apache Software Foundation

Apache Kafka® is an open-source, distributed streaming platform. Scale production clusters up to a thousand brokers, trillions of messages per day, petabytes of data, hundreds of thousands of partitions. Elastically expand and contract storage and processing. Stretch clusters efficiently over availability zones or connect separate clusters across geographic regions. Process streams of events with joins, aggregations, filters, transformations, and more, using event-time and exactly-once processing. Kafka’s out-of-the-box Connect interface integrates with hundreds of event sources and event sinks including Postgres, JMS, Elasticsearch, AWS S3, and more. Read, write, and process streams of events in a vast array of programming languages.

1 Rating

View Software

Amazon SES

Amazon

Amazon Simple Email Service (SES) is a cost-effective, flexible, and scalable email service that enables developers to send mail from within any application. You can configure Amazon SES quickly to support several email use cases, including transactional, marketing, or mass email communications. Amazon SES's flexible IP deployment and email authentication options help drive higher deliverability and protect sender reputation, while sending analytics measure the impact of each email. With Amazon SES, you can send email securely, globally, and at scale. Using either the Amazon SES console, APIs, or SMTP, you can configure email sending in minutes. Amazon SES also supports email receiving, enabling you to interact with your customers at scale. Regardless of use case or sending volume, you only pay for what you use with Amazon SES.

Starting Price: $0.10 per month

View Software

Azure Blob Storage

Microsoft

Massively scalable and secure object storage for cloud-native workloads, archives, data lakes, high-performance computing, and machine learning. Azure Blob Storage helps you create data lakes for your analytics needs, and provides storage to build powerful cloud-native and mobile apps. Optimize costs with tiered storage for your long-term data, and flexibly scale up for high-performance computing and machine learning workloads. Blob storage is built from the ground up to support the scale, security, and availability needs of mobile, web, and cloud-native application developers. Use it as a cornerstone for serverless architectures such as Azure Functions. Blob storage supports the most popular development frameworks, including Java, .NET, Python, and Node.js, and is the only cloud storage service that offers a premium, SSD-based object storage tier for low-latency and interactive scenarios.

Starting Price: $0.00099

View Software

Astro

Astronomer

For data teams looking to increase the availability of trusted data, Astronomer provides Astro, a modern data orchestration platform, powered by Apache Airflow, that enables the entire data team to build, run, and observe data pipelines-as-code. Astronomer is the commercial developer of Airflow, the de facto standard for expressing data flows as code, used by hundreds of thousands of teams across the world.

View Software

Databricks Data Intelligence Platform

Databricks

The Databricks Data Intelligence Platform allows your entire organization to use data and AI. It’s built on a lakehouse to provide an open, unified foundation for all data and governance, and is powered by a Data Intelligence Engine that understands the uniqueness of your data. The winners in every industry will be data and AI companies. From ETL to data warehousing to generative AI, Databricks helps you simplify and accelerate your data and AI goals. Databricks combines generative AI with the unification benefits of a lakehouse to power a Data Intelligence Engine that understands the unique semantics of your data. This allows the Databricks Platform to automatically optimize performance and manage infrastructure in ways unique to your business. The Data Intelligence Engine understands your organization’s language, so search and discovery of new data is as easy as asking a question like you would to a coworker.

View Software

SimpleKPI

Iceberg Software

Managing data doesn't need to be complicated. SimpleKPI offers everything you need to monitor and visualize your business metrics. It's brimming with easy-to-use features that reduce the time it takes getting to grips with your business performance. A simple, easy-to-use dashboard that takes complex data and turns it into accessible and understandable visuals. Create top-level summaries of your KPIs to share across your organization. Choose from various charts, graphs, league tables, and widgets that help communicate an accurate understanding of your data. Business decisions shouldn't be made blindly. That's why we've integrated powerful reporting into every aspect of SimpleKPI. Get both summary and detailed information that provides a complete picture of your progress towards targets and goals.

Starting Price: $99 per month

View Software

MinIO

MinIO's high-performance object storage suite is software defined and enables customers to build cloud-native data infrastructure for machine learning, analytics and application data workloads. MinIO object storage is fundamentally different. Designed for performance and the S3 API, it is 100% open-source. MinIO is ideal for large, private cloud environments with stringent security requirements and delivers mission-critical availability across a diverse range of workloads. MinIO is the world's fastest object storage server. With READ/WRITE speeds of 183 GB/s and 171 GB/s on standard hardware, object storage can operate as the primary storage tier for a diverse set of workloads ranging from Spark, Presto, TensorFlow, H2O.ai as well as a replacement for Hadoop HDFS. MinIO leverages the hard won knowledge of the web scalers to bring a simple scaling model to object storage. At MinIO, scaling starts with a single cluster which can be federated with other MinIO clusters.

View Software

Hadoop

Apache Software Foundation

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page. Apache Hadoop 3.3.4 incorporates a number of significant enhancements over the previous major release line (hadoop-3.2).

View Software

Apache Spark

Apache Software Foundation

Apache Spark™ is a unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

View Software

Amazon Kinesis

Amazon

Easily collect, process, and analyze video and data streams in real time. Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin. Amazon Kinesis enables you to ingest, buffer, and process streaming data in real-time, so you can derive insights in seconds or minutes instead of hours or days.

View Software

Presto

#1 Contactless Dining Solution. $0 Monthly Fee. With over 100 million monthly active users and 300,000+ shipped systems, we are the largest provider of contactless dining technology across the world. Learn more about our best-selling product. Our Contactless Dining Solution enables your restaurant to provide an end-to-end contactless dining experience to your guests. Guests can view the complete menu, place orders, pay at the table, and much more—without the need for any human contact. Sign up today and be completely contactless in just 3 days. There are no recurring fees (regular payment processing charges apply) and there is no need to change your existing POS system. The solution is available globally but supplies are limited and demand is high, so reserve your spot now. With over 100 million guests already using Presto every month and 300,000+ systems shipped, we are the largest provider of contactless dining technology in the U.S. and Europe.

View Software

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log. In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files at ease. Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.

View Software

MLflow

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components. Record and query experiments: code, data, config, and results. Package data science code in a format to reproduce runs on any platform. Deploy machine learning models in diverse serving environments. Store, annotate, discover, and manage models in a central repository. The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results. MLflow Tracking lets you log and query experiments using Python, REST, R API, and Java API APIs. An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions. In addition, the Projects component includes an API and command-line tools for running projects.

View Software

Apache Flink

Apache Software Foundation

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale. Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream. Apache Flink excels at processing unbounded and bounded data sets. Precise control of time and state enable Flink’s runtime to run any kind of application on unbounded streams. Bounded streams are internally processed by algorithms and data structures that are specifically designed for fixed sized data sets, yielding excellent performance. Flink is designed to work well each of the previously listed resource managers.

View Software

Apache Airflow

The Apache Software Foundation

Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity. Airflow pipelines are defined in Python, allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically. Easily define your own operators and extend libraries to fit the level of abstraction that suits your environment. Airflow pipelines are lean and explicit. Parametrization is built into its core using the powerful Jinja templating engine. No more command-line or XML black-magic! Use standard Python features to create your workflows, including date time formats for scheduling and loops to dynamically generate tasks. This allows you to maintain full flexibility when building your workflows.

View Software

lakeFS Integrations

Treeverse

22 Integrations with lakeFS

Looker

Amazon Web Services (AWS)

Amazon S3

Google Cloud Storage

Jupyter Notebook

Amazon Athena

Apache Hive

Apache Kafka

Amazon SES

Azure Blob Storage

Astro

Databricks Data Intelligence Platform

SimpleKPI

MinIO

Hadoop

Apache Spark

Amazon Kinesis

Presto

Delta Lake

MLflow

Apache Flink

Apache Airflow

lakeFS Integrations

Treeverse

22 Integrations with lakeFS

Looker

Amazon Web Services (AWS)

Amazon S3

Google Cloud Storage

Jupyter Notebook

Amazon Athena

Apache Hive

Apache Kafka

Amazon SES

Azure Blob Storage

Astro

Databricks Data Intelligence Platform

SimpleKPI

MinIO

Hadoop

Apache Spark

Amazon Kinesis

Presto

Delta Lake

MLflow

Apache Flink

Apache Airflow

Related Categories

Related Categories That Integrate With lakeFS