Q&A with data Artisans on Stream Processing Powered by Apache Flink and data Artisans Streaming Ledger

By Community Team

In the era of big data and the Internet of Things (IoT), companies wanting to stay ahead of the curve must learn how to manage, analyze, and manipulate troves of information in order to boost business efficiency and create more relevant and meaningful customer experiences. Fortunately, modern innovations such as stream processing are now available to help businesses harness the power of big data by evaluating data while it is in motion and turning it into actionable insights for better decision-making.

One powerful, open source stream processing framework that unifies real-time analytics and event-driven applications is Apache Flink. Originally created by data Artisans, Apache Flink works as a cutting-edge stream processing engine that powers large-scale stateful applications, including search and content ranking, real-time analytics, and fraud detection.

In this article, Stephan Ewen, the co-founder and CTO of data Artisans and Project Management Committee member of the Apache Flink project, talks to SourceForge about the advantages of using stream processing powered by Apache Flink. Ewen also shares how businesses can take advantage of data Artisans Streaming Ledger for efficient and scalable transaction processing and application development.

Q: The founders of data Artisans initiated the development of the open source Apache Flink project to enable real-time stream processing of data. Could you offer us a brief overview of Stream Processing in the simplest terms?

Stephan Ewen, the co-founder and CTO of data Artisans and Project Management Committee member of the Apache Flink project

A: Stream processing started as the processing of data in motion: computing on data continuously, directly as it is produced or received. Virtually all data, including user activity on a website, interactions with mobile applications, financial trades, database modifications, application and machine logs, sensor events, and so on, is produced as a continuously generated stream of events. Hence, stream processing is a widely applicable technology.

Before stream processing, data was often stored in a database, a file system, or other forms of mass storage. Applications would periodically query the data or process the data. Stream processing turned this paradigm around: The application logic, analytics, and queries run continuously, as event streams flow through them. Upon receiving an event from the stream, a stream processing application reacts to that event: it may trigger an action, update an aggregate or other statistics, or “remember” that event for future reference.
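To make that pattern concrete, here is a minimal sketch using Apache Flink's DataStream API (the (user, amount) events and field values are illustrative placeholders, not from the interview): each incoming event immediately updates a per-user running total, so the application effectively "remembers" prior events.

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RunningTotals {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder (user, amount) events; a real deployment would read a
        // continuous source such as a message queue instead.
        env.fromElements(
                Tuple2.of("alice", 20L),
                Tuple2.of("bob", 5L),
                Tuple2.of("alice", 7L))
           // Partition the stream by user (tuple field 0) ...
           .keyBy(0)
           // ... and react to each event as it arrives: the reduce "remembers"
           // the previous result per user and updates the running total.
           .reduce((ReduceFunction<Tuple2<String, Long>>) (prev, event) ->
                   Tuple2.of(prev.f0, prev.f1 + event.f1))
           .print();

        env.execute("running totals");
    }
}
```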

As technology evolves, we see stream processing with Flink as the new paradigm and unified way of processing data. Flink allows you to run processing logic ad hoc on data "at rest" and continuously on data "in motion". This is possible by conceptualizing batch processing (querying data from a data lake, file system, or other mass storage) as a special case: a bounded stream of events that has a beginning and an end. With a technology like Flink, companies benefit from a unified data processing framework that can leverage any data asset available to the business, whether it is generated in real time or is historical data stored in a database.
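A hedged illustration of that unification with Flink's DataStream API: the same processing logic is applied once to a bounded stream (a file, data "at rest") and once to an unbounded stream (a socket, data "in motion"). The file path and port here are assumptions for the sketch.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BoundedAndUnbounded {

    // One piece of logic, reusable for data "at rest" and data "in motion".
    static DataStream<Tuple2<String, Integer>> wordCounts(DataStream<String> lines) {
        return lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda
                .keyBy(0)
                .sum(1);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A file is a bounded stream: it has a beginning and an end.
        wordCounts(env.readTextFile("/data/historic-events.txt")).print();

        // A socket is an unbounded stream: events keep arriving indefinitely.
        wordCounts(env.socketTextStream("localhost", 9999)).print();

        env.execute("bounded and unbounded word count");
    }
}
```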

Q: Why should organizations implement Stream Processing?

A: Stream processing enables businesses to respond immediately to events as they are generated and when they are most critical. This means they are better able to react to new opportunities, address customer inquiries in real-time, provide a more personalized, unique experience for each customer, and identify potential problems before they impact the business.

We’ve found that companies realize business ROI faster with real-time data applications powered by stream processing than with traditional analytics or business intelligence technologies that provide retrospective insights. In our experience, enterprise adoption of Flink starts with very specific use cases, and companies realize business ROI quickly because they get live applications up and running as the first step. For example, after one year of Flink in production, Alibaba reported a 30% increase in conversion rate during Singles Day 2016 (Singles Day 2017 saw $25 billion of merchandise sold in a single day).

Q: Can you tell us about the difference between Batch Processing and Stream Processing? Is one better than the other?

A: Stream processing is the processing of data directly at the time it is generated, as real-time streams. This is pivotal for critical use cases such as real-time fraud detection, machine sensor monitoring, or addressing customer needs exactly when they occur. Stream processing is the technology of choice for all cases that need real-time action on events. Classic databases, in comparison, are based on the approach that data must first be stored before companies can retrospectively gain insights from it, through analytics or business intelligence applications that query the data in batches. Typically with batch processing, reports are run at the end of the day to show daily sales figures and other business metrics.
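As a sketch of the streaming alternative to an end-of-day report, the following hypothetical Flink pipeline maintains sales totals continuously, emitting results as each one-hour window closes rather than once a day. The socket source and the "productId,amount" line format are placeholders standing in for a real sales feed.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ContinuousSalesTotals {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // "productId,amount" lines arriving on a socket stand in for a live sales feed.
        env.socketTextStream("localhost", 9999)
           .map(line -> {
               String[] parts = line.split(",");
               return Tuple2.of(parts[0], Long.parseLong(parts[1]));
           })
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(0)                      // one total per product
           .timeWindow(Time.hours(1))     // tumbling one-hour windows
           .sum(1)                        // totals are emitted as each window closes,
           .print();                      // not once at the end of the day

        env.execute("continuous sales totals");
    }
}
```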

Q: What are the advantages of Apache Flink compared to other technologies like Apache Spark or Apache Kafka?

A: Apache Flink is used to analyze and process very high-volume data streams with very low latency. This can be data read from a message queue or a file system. Using Apache Flink and a streaming data architecture, organizations can respond to insights from events within milliseconds, while also addressing the needs of historical computing with a single platform.
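For instance, here is a minimal, hypothetical Flink job that reads from a message queue, in this case Apache Kafka via the 0.11 connector, and reacts to each record as it arrives. The topic name, broker address, consumer group, and the string-match filter are all assumptions for illustration.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class KafkaAlerts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder connection details for a Kafka deployment.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        env.addSource(new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props))
           // Each record is inspected within milliseconds of arriving on the topic;
           // the string match is a stand-in for real detection logic.
           .filter(record -> record.contains("ERROR"))
           .print(); // a stand-in for triggering an alert or downstream action

        env.execute("kafka alerts");
    }
}
```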

Apache Flink is powerful, yet flexible and expressive. At the same time, Flink is agnostic about how it is used, as evidenced by its broad adoption and the wide range of applications it powers in practice.

Compared to Apache Kafka and Apache Spark, Apache Flink was designed and developed from the start as a high-throughput, low-latency stream processing framework with accurate semantics, whereas the other frameworks have their origins in batch processing (Spark) or message storage and distribution (Kafka) and only later added stream processing capabilities.

Apache Flink is today used for the largest stream processing use cases in the world. For example, with Flink, Netflix processes more than 5 trillion events per day (50+ million events per second) on thousands of CPU cores.

Q: For which applications or application scenarios is the use of a stream processing technology like Apache Flink interesting?

A: Apache Flink is one of the fastest-growing open source projects, and its use cases are constantly expanding. Businesses use Apache Flink to run mission-critical applications such as real-time analytics, machine learning, anomaly detection in cloud activities, search, content ranking, and fraud detection. Use cases in the financial services sector include master data management and capital risk management; in e-commerce, real-time recommendations are a popular use case.

Q: data Artisans recently introduced a new technology that significantly increases the range of use cases for stream processing with Apache Flink. Can you tell our readers more about this? How does it work?

data Artisans Platform with Streaming Ledger

A: Yes, the new product is data Artisans Streaming Ledger, a patent-pending technology that enables distributed, serializable ACID transactions for applications based on a streaming architecture. With the introduction of Streaming Ledger as part of data Artisans Platform, stream processing applications can be built that read and update multiple distributed data entries with ACID guarantees. For the first time, the strongest consistency guarantees, the same ones offered by most (but not all) relational databases, are available to stream processing applications. Until now, stream processing frameworks could only work consistently on a single entry at a time.

data Artisans Streaming Ledger processes event streams across multiple shared states/tables with serializable ACID semantics. Similar to serializable ACID transactions in a relational database management system, each transaction modifies all tables in isolation from simultaneous changes. This ensures full data consistency as in today’s best relational databases.
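The following self-contained Java sketch illustrates the kind of per-event transaction logic described here: a single event reads and updates two rows of a shared "accounts" table atomically. All class and method names are hypothetical illustrations of the semantics; they are not the actual data Artisans Streaming Ledger API.

```java
/**
 * A sketch of multi-row transaction semantics in a streaming setting.
 * NOTE: these names are hypothetical, NOT the real Streaming Ledger API.
 */
public class TransferSketch {

    static final class Account {
        long balance;
        Account(long balance) { this.balance = balance; }
    }

    static final class TransferEvent {
        final String from;
        final String to;
        final long amount;
        TransferEvent(String from, String to, long amount) {
            this.from = from; this.to = to; this.amount = amount;
        }
    }

    // The per-event transaction logic: it reads and updates two rows of a
    // shared "accounts" table. Under serializable ACID semantics, either
    // both updates happen or neither does, and no concurrent transaction
    // ever observes a half-applied transfer.
    static void applyTransfer(TransferEvent event, Account from, Account to) {
        if (from.balance >= event.amount) {
            from.balance -= event.amount; // debit one row
            to.balance += event.amount;   // credit the other row
        }
        // insufficient funds: the transaction leaves both rows untouched
    }

    public static void main(String[] args) {
        java.util.Map<String, Account> accounts = new java.util.HashMap<>();
        accounts.put("alice", new Account(100));
        accounts.put("bob", new Account(0));

        applyTransfer(new TransferEvent("alice", "bob", 40),
                accounts.get("alice"), accounts.get("bob"));

        // Prints "alice: 60, bob: 40" -- the transfer was applied atomically.
        System.out.println("alice: " + accounts.get("alice").balance
                + ", bob: " + accounts.get("bob").balance);
    }
}
```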

data Artisans Streaming Ledger provides these guarantees while maintaining the full scale-out capabilities of exactly-once stream processing, without impacting the application’s speed, performance, scalability, or availability. These applications can now take full advantage of all the benefits of data stream processing and blend naturally into a streaming data architecture.

Q: Who is data Artisans Streaming Ledger best for?

A: data Artisans Streaming Ledger is designed for today’s data-driven industries. It offers high throughput, so large-scale applications like inventory management, pricing, billing, supply-demand matching, logistics or position keeping can be efficiently transformed to consistent streaming applications without requiring an underlying relational database. We’re seeing tremendous interest from companies across a wide range of sectors including financial services, retail/e-commerce, media, and telecommunications.

Q: What’s next on the roadmap for Apache Flink?

A: New features and ongoing efforts in Apache Flink development include, for example, strengthening Flink’s SQL interface (and streaming SQL execution) for more use cases, adding support for complex event processing (CEP) to SQL, and adding more resource elasticity and automatic scaling. The Flink community also continues its groundwork to make the development and operation of Flink applications as seamless as possible and to add more connectors and integrations. Finally, the community is adding further optimizations for the processing of “bounded data” (i.e., batch-style processing).
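As a hedged example of the streaming SQL direction mentioned above (using Table API method names from the Flink 1.5/1.6 era; the table, field names, and click events are made up for illustration): the same SQL that would run against a static table is evaluated continuously over a stream, with results updating as events arrive.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class StreamingSql {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // Placeholder click events: (userId, url).
        DataStream<Tuple2<String, String>> clicks = env.fromElements(
                Tuple2.of("alice", "/home"),
                Tuple2.of("bob", "/cart"),
                Tuple2.of("alice", "/checkout"));

        // Expose the stream as a dynamic table for SQL queries.
        tableEnv.registerDataStream("user_clicks", clicks, "user_id, url");

        // Evaluated continuously: the per-user count updates as clicks arrive.
        Table clicksPerUser = tableEnv.sqlQuery(
                "SELECT user_id, COUNT(url) AS clicks FROM user_clicks GROUP BY user_id");

        tableEnv.toRetractStream(clicksPerUser, Row.class).print();

        env.execute("streaming sql");
    }
}
```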

About data Artisans
Founded by the original creators of open source Apache Flink, data Artisans delivers cutting-edge, real-time data applications to businesses small and large. Headquartered in Berlin, Germany, with offices in San Francisco, California, data Artisans delivers turnkey stream processing solutions to global brands such as Alibaba, Netflix, ING, and Uber. The company is backed by Tengelmann Ventures, Intel Capital, and b-to-v Partners.