Will real-time streaming finally take off?
Like commercial fusion reactors, real-time streaming is a tantalizing technology, but one that perpetually seems to need just a few more years (or decades) of R&D. To some in the industry, however, something changed over the past year, and real-time streaming is finally on its way.
“Every year we wait for the year when streaming workloads take off, and I think last year it was,” said Databricks CEO Ali Ghodsi during his keynote address at the Data + AI Summit last week. “We actually saw 2.5x revenue growth for our streaming workloads last year, so I think streaming is finally happening.”
Of course, streaming data, which some call real-time data, is not a new topic; it has been used in various forms for decades. With the first dot-com boom, however, valuable new types of events, such as clickstreams, became available. In the years since, data streams have grown enormously, and new technologies, such as Apache Kafka, have emerged to help manage them. But the means to build operational and analytical applications on top of that streamed data have largely been available only to the biggest organizations.
The people at Databricks indicate that this may be changing. But why?
“I think it’s because people are moving to the right of this data and AI maturity curve,” Ghodsi said during the keynote, “and they have more and more AI use cases that just need to be real-time, like real-time fraud detection.”
In other words, companies are accelerating their move from traditional, backward-looking BI workloads to more advanced, forward-looking AI-driven workloads, a progression Ghodsi calls the AI maturity curve. These AI-driven predictions must be made in ever-shorter time windows, hence the need for real-time technology.
Although we have no insight into how much of Databricks’ revenue comes from real-time streaming, we do have an idea of the investments the company is making in the technology. In 2021, it hired Karthik Ramasamy, a co-creator of Apache Heron and Apache Pulsar, to lead the development of Structured Streaming, Spark’s high-level streaming API.
Ramasamy will be heavily involved in Project Lightspeed, a new initiative Databricks unveiled last week to overhaul Structured Streaming. According to a blog post written by Ramasamy and his Databricks colleagues, the main goals of Project Lightspeed include:
- Improving latency and ensuring its predictability;
- Enhancing functionality for processing data with new operators and APIs;
- Improving ecosystem support for connectors;
- And simplifying deployment, operations, monitoring, and troubleshooting.
In addition, the developers will look to better address technical challenges in real-time streaming, including things like offset management, asynchronous checkpointing, and state checkpointing frequency.
Lightspeed will bring additional functionality useful for handling events and building real-time applications, such as stateful operators, advanced windowing, state management, and asynchronous I/O. It will also add “a powerful yet simple API for storing and manipulating state” in Python, the company says.
Whether or not real-time streaming is truly ready for the next level, Structured Streaming, at least, appears to be getting much better.