Q&A with Dremio: on Big Data, Open Source Project, and Dremio’s Data-as-a-Service Platform

By Community Team February 5th, 2019

With the advent of big data, today’s organizations must have a seamless solution to curate, accelerate, and share their data, faster and more efficiently. To address these needs and get more value from big data, data engineers and business analysts should leverage cutting-edge Data-as-a-Service (DaaS) solutions like Dremio. Powerful and scalable, Dremio’s DaaS platform helps data analysts and scientists effectively work together and seamlessly shape their data to help their organizations and their consumers successfully achieve digital transformation.

SourceForge spoke with Tomer Shiran, CEO and co-founder of Dremio, to discuss the value of a Data-as-a-Service (DaaS) solution and highlight the features and benefits of Dremio’s DaaS platform. Shiran also offers his insights on the rising trends and technologies that may impact the big data and database industries in the coming years.

Q: Can you share with us a brief overview of Dremio (year founded, industries you serve, current customers, etc.)?

Tomer Shiran, CEO and co-founder of Dremio

A: Dremio was founded in mid-2015 by Tomer Shiran and Jacques Nadeau. At the time of the founding, the team also helped create the Apache Arrow project, which is now downloaded over 1 million times a month and is core to Dremio’s Data-as-a-Service platform. Dremio is deployed by thousands of organizations in over 100 countries that want to get more value from their data, faster. Current customers include Diageo, Microsoft, OVH, Quantium, Royal Caribbean, Standard Chartered, TransUnion, UBS, and Virgin Orbit.

Q: Who are the forces behind Dremio? What are the pain points the company seeks to solve?

A: Today, data is core to many businesses, yet accessing and analyzing data is still highly dependent on core IT functions such as data engineers. Consumers of data spend far too much time waiting for IT to provision the data they need for a given task and, as a result, the scarce skills of data scientists and business intelligence (BI) users are underutilized. While infrastructure and most areas of technology have moved to an “as-a-Service” model, data itself is still managed centrally by a select few, far from any kind of self-service model for the end user.

Dremio was created to fundamentally change the way data consumers discover, curate, share, and analyze data from any source, at any time, and at any scale. First, Dremio works with existing data: rather than consolidating all your data into a new silo, Dremio accesses data where it is currently managed. Second, Dremio is designed for governed self-service, providing an experience like Google Docs for datasets, making it easy for data consumers to work independently and at their own pace while continuing to use their favorite BI and data science tools. Finally, Dremio solves the hard problem of data acceleration with a patented feature called Data Reflections, which can accelerate data from existing systems by up to 1000x in many deployments.

Q: What exactly is Data as a Service (DaaS)? How does this new paradigm empower today’s data engineers and help data consumers achieve digital transformation?

A: Data-as-a-Service is a paradigm for making data easy to discover, curate, share, and analyze no matter where it is being managed, no matter how big it is, and no matter what tool is used for analysis or visualization. Data-as-a-Service integrates several functional areas into a single, scalable, and self-service solution: data acceleration, data federation, data lineage, data curation, and data catalog. By adopting the Data-as-a-Service paradigm, companies can make their data consumers more self-sufficient and independent, while making their data engineers more productive.

Dremio currently provides Data-as-a-Service as an Apache-licensed project that users can run wherever they like.

Q: What are some of the most notable changes in the technology space, particularly in the database and software development industries, in recent years? And what perks and challenges did these bring to today’s software developers and data engineers?

A: The last 5 to 10 years have been a great time for application developers in terms of data management. Today, teams can choose between traditional RDBMS, NoSQL, file systems, object stores, and native cloud services to manage the data for their apps. While this flexibility has allowed teams to develop apps and new features more quickly, it has created massive challenges for companies in terms of analytics.

The ecosystem for data analytics is largely based on the relational paradigm, with millions of data consumers using tools that assume SQL and relational databases for accessing and analyzing data. With the proliferation of data management technologies and microservices architectures, which break data up across logical persistence layers, companies struggle to bring their data together into a high performance relational system that integrates with existing tools and skills.

Data-as-a-Service provides a fundamentally different approach that allows data to be managed in whichever persistence technology is most appropriate for a given use case while providing the functionality, performance, and security controls required for integrating with the traditional tools deployed on millions of desktops.

Q: How are these challenges being tackled by Dremio’s Data-as-a-Service (DaaS) platform? Can you provide us with sample use cases?

A: Dremio’s Data-as-a-Service platform is frequently deployed on top of multiple database, file system, and object store sources, then made available for data consumers to discover and analyze the data themselves. For example, a common pattern is to deploy Dremio on top of a data lake (e.g., Amazon S3, Hadoop, ADLS) and relational databases. A Tableau user can then use Dremio to first search a catalog to discover datasets that are appropriate for their given task, then easily join data between the data lake and a relational database for interactive access on large volumes of data.
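The join-across-sources pattern described above can be illustrated with a small, self-contained sketch. It does not use Dremio or any Dremio API; as an assumption for illustration only, two separate in-memory SQLite databases stand in for a relational database and a data lake, and the table names (`customers`, `events`) and sample rows are hypothetical.

```python
import sqlite3


def build_sources() -> sqlite3.Connection:
    """Open two independent in-memory databases behind one connection.

    Assumption: `main` plays the relational database and `lake` plays the
    data lake. Neither is a real Dremio source.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS lake")
    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE lake.events (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO lake.events VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    conn.commit()
    return conn


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Join the two sources in a single query, without copying either one."""
    return conn.execute(
        "SELECT c.name, SUM(e.amount) "
        "FROM main.customers AS c "
        "JOIN lake.events AS e ON e.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_customer(build_sources()))
```

The point of the sketch is only that the consumer writes one query against both sources; how a real engine plans and accelerates such a query is outside its scope.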

As another example, a data engineer could develop a training dataset for a predictive model using Dremio’s virtual dataset, then share access to this dataset with other members of their team. In the background, Dremio implements the training dataset in a virtual context, so no copies of the data are created unnecessarily. As a result, each user or team can create precisely the dataset they need for a given job without creating endless copies of the data that have to be governed and secured by IT.
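As a loose analogy for a virtual dataset, a plain SQL view behaves similarly: it stores a query definition rather than a copy of the rows. The sketch below uses SQLite rather than Dremio, and the `transactions` table, the `training_set` view, and the feature columns are hypothetical names chosen for illustration.

```python
import sqlite3


def make_training_view() -> sqlite3.Connection:
    """Create a source table and a view that derives training features from it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (user_id INTEGER, amount REAL, is_fraud INTEGER)"
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(1, 20.0, 0), (1, 900.0, 1), (2, 35.0, 0)],
    )
    # The view holds only the query text; no rows are duplicated.
    conn.execute(
        """
        CREATE VIEW training_set AS
        SELECT user_id,
               COUNT(*)      AS n_tx,
               MAX(amount)   AS max_amount,
               MAX(is_fraud) AS label
        FROM transactions
        GROUP BY user_id
        """
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_training_view()
    print(conn.execute("SELECT * FROM training_set ORDER BY user_id").fetchall())
    # New source rows show up in the view immediately, since nothing was copied.
    conn.execute("INSERT INTO transactions VALUES (2, 50.0, 0)")
    print(conn.execute("SELECT * FROM training_set ORDER BY user_id").fetchall())
```

Because the view is evaluated on read, each team can define its own shape of the data while the underlying rows exist only once.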

We have users deploying Dremio on multi-PB datasets and clusters of hundreds of nodes. The scalability of the solution is one of the key features that drive adoption. There’s really not a better way to get the performance and ease of access that Dremio provides. Dremio also provides a data catalog, data curation capabilities, and native data lineage, added capabilities that make life easier for data consumers and data engineers.

Q: Tell us more about Dremio’s DaaS platform. What are the key features and capabilities that make it stand out from other alternatives in the database industry?

A: To build Data-as-a-Service without Dremio, you would need to deploy a number of specialized, niche products and spend a lot of time integrating and operating these technologies together. These include but are not limited to:

Data acceleration. Typically this is addressed through a combination of aggregation tables in data warehouses and data marts; proprietary cubes or extracts that are specific to a BI tool; in-memory data grids; and other physical approaches to performance that produce unnecessary data copies, which in turn create governance and security risks.
Data catalog. Most companies do not have an inventory of their data assets, and if they do, it is typically out of date or unreliable, as well as difficult to access. While some niche products exist in the market, they only solve a small piece of the larger problem.
Data curation. ETL and several data prep tools exist in the market. These are largely IT-driven and difficult to operate. Because these tools generate copies of the data as their output, they are never deployed widely in an organization—companies do not want to open the door to thousands of copies of their data. In addition, these tools also only solve a part of the problem: once the data is transformed, you need to put it somewhere for analysis, typically a data warehouse or data mart.
Data federation. Companies need to be able to query data where it exists today. No company will ever succeed in consolidating all their data assets into a single silo.
Data provenance and lineage. As data is accessed, transformed, blended together, and used for different workloads, it is important to be able to understand the lineage and provenance of the data through all stages. Most tools in this space require explicit registering of these steps with a central service, rather than passively capturing these aspects of use automatically.

Dremio simplifies Data-as-a-Service by integrating these capabilities in a self-service solution that is licensed as open source. Rather than integrating and operating multiple distinct, proprietary technologies, Dremio allows companies to move more quickly. In addition, Dremio allows companies to do things as an integrated solution that would be impractical or impossible with multiple niche products.

For example, in Dremio a user can create a new virtual dataset that is a join of multiple data sources, including calculated measures. At creation time Dremio will automatically 1) capture the lineage and provenance of the data; 2) add the new virtual dataset to Dremio’s searchable catalog; 3) apply any governance and security controls that are appropriate given the physical data source; 4) mask sensitive data as necessary; 5) make the new dataset available to other users as appropriate based on their LDAP group membership; and 6) track the history of all use of the new dataset.
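To make the first two steps concrete (lineage capture and catalog registration happening as a side effect of creating a dataset), here is a minimal toy model. It is not Dremio's implementation or API; the `Catalog` and `VirtualDataset` classes, their methods, and the source names are all hypothetical, and governance, masking, and access control are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VirtualDataset:
    name: str
    sql: str
    parents: tuple[str, ...]  # direct inputs, recorded at creation time


@dataclass
class Catalog:
    physical: set[str] = field(default_factory=set)
    datasets: dict[str, VirtualDataset] = field(default_factory=dict)

    def add_physical(self, name: str) -> None:
        self.physical.add(name)

    def create_virtual(self, name: str, sql: str, parents: list[str]) -> VirtualDataset:
        """Register a dataset; lineage is captured here, with no separate step."""
        for parent in parents:
            if parent not in self.physical and parent not in self.datasets:
                raise KeyError(f"unknown parent dataset: {parent}")
        vds = VirtualDataset(name, sql, tuple(parents))
        self.datasets[name] = vds
        return vds

    def lineage(self, name: str) -> set[str]:
        """Return the physical sources a dataset ultimately depends on."""
        if name in self.physical:
            return {name}
        sources: set[str] = set()
        for parent in self.datasets[name].parents:
            sources |= self.lineage(parent)
        return sources

    def search(self, term: str) -> list[str]:
        """Case-insensitive substring search over virtual dataset names."""
        return sorted(n for n in self.datasets if term.lower() in n.lower())


if __name__ == "__main__":
    cat = Catalog()
    cat.add_physical("s3.orders")
    cat.add_physical("pg.customers")
    cat.create_virtual(
        "sales_by_customer",
        "SELECT c.name, SUM(o.total) FROM pg.customers c JOIN s3.orders o ON o.cid = c.id GROUP BY c.name",
        ["s3.orders", "pg.customers"],
    )
    cat.create_virtual(
        "top_customers",
        "SELECT * FROM sales_by_customer LIMIT 10",
        ["sales_by_customer"],
    )
    print(cat.lineage("top_customers"))
    print(cat.search("customer"))
```

The design choice worth noting is that lineage is a by-product of `create_virtual`, which mirrors the interview's contrast between passive capture and tools that require explicit registration with a central service.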

Q: Dremio is offered in a commercially licensed enterprise edition as well as in an Apache-licensed open source community edition. As advocates of the open source model, why is an open source approach important to developers today? And how has the open source approach benefitted your company?

A: We are big proponents of open source, and have a number of engineers on staff who participate in several Apache projects, including Apache Arrow. We believe building on open source allows us to develop Dremio more quickly and reliably, and we also believe that by making Dremio itself open source, we open the idea of Data-as-a-Service up to everyone. We’ve been very deliberate in making sure that Dremio Community Edition is fully-featured and usable by any company, and this has proven out in the rapid and widespread adoption of the project since we first made it available less than 2 years ago. Our Enterprise Edition provides some advanced features in the areas of security and management that we feel are compelling for large organizations.

Q: Looking into the future, what rising trends, strategies, and technologies do you think will impact the big data and database world? How is Dremio meeting these head-on?

A: Companies are hard at work trying to balance, on the one hand, ease of use and self-service with, on the other hand, governance and security. Data consumers want to be able to work independently and at their own pace, rather than waiting in line for their turn with IT. At the same time, IT is under enormous pressure to ensure data security is carefully and thoroughly enforced. Balancing these two demands with a single source of data would be hard enough, but to try and do so with hundreds and even thousands of data silos is virtually impossible.

In parallel, most companies are replatforming much of their technology stack to the cloud, and for years will need to find ways to be successful in a hybrid model with some datasets on-prem, others in the cloud, and many in the midst of a transition.

Dremio is focused on this reality by giving IT the tools it needs to secure and govern access, no matter the underlying physical source, and by making data more of a service within an organization, delivered so that data consumers can be more independent and self-sufficient.

About Dremio
Dremio is a premier Data-as-a-Service (DaaS) platform vendor that aims to make data engineering teams more productive and data consumers more self-sufficient. Headquartered in Mountain View, California, Dremio offers a new and cutting-edge approach to data analytics to help organizations get more value from their data. Dremio specializes in big data, business intelligence, data analytics, machine learning, SQL, NoSQL, Apache Arrow, MongoDB, Hadoop, Elasticsearch, S3, MapR, Jupyter, Java, and more.

About Tomer Shiran
Tomer Shiran is the co-founder and CEO of Dremio. Previously he was the VP Product at MapR, where he was responsible for product strategy, roadmap, and new feature development. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer holds an MS in Electrical and Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion – Israel Institute of Technology.

Tags: big data, DaaS platform, Data-as-a-Service, database