
Real-Time Data Pipelines with Spark, Kafka, and Cassandra (on Docker)

In my experience as a Big Data architect and data scientist, I’ve worked with several different companies to build their data platforms. Over the past year, I’ve seen a significant increase in focus on real-time data and real-time insights. It’s clear that real-time analytics provide the opportunity to make faster (and better) decisions and gain competitive advantage.

Immediate insights into real-time data can help you in several ways, including the ability to:

  • Spot potential opportunities and risks before it’s too late;
  • Tap into data that’s always on (e.g. sensors, machine logs, web logs, connected devices);
  • React quickly to changing conditions (e.g. identify health issues, replace faulty machinery).

For example, if you’re launching an online ad campaign, you’ll want to see how the campaign performs by measuring user engagement, ad views, clicks, downloads, purchases, and so on. With real-time data, you can tune the knobs for your campaign while those events are happening. If you’re in the health care industry, you can use real-time monitoring to detect anomalies, provide early warning, and save patients’ lives. And for those in other industries – whether you work for a financial institution, an insurance firm, an airline, or a telephone company – the ability to track and analyze events in real time can help you make proactive decisions, eliminate risks, and provide more competitive offers to your customers.

Real-time is quickly becoming the next phase in the evolution of the Big Data industry.

So why isn’t everyone using real-time data today?

Why aren’t more companies doing real-time analytics to stay one step ahead of their competition? To be clear, some are doing it. Large financial institutions have invested a great deal in their own event-processing frameworks specifically designed for their needs – whether for real-time risk analysis or fraud detection. Big Internet companies have built their own solutions to meet the real-time demands of web-scale businesses dealing with massive amounts of data.

But what about all the other enterprises that would benefit from the ability to gain insights and act immediately on events as they happen? In my opinion, there seem to be four issues holding them back:

  • High cost: Until recently, there were no affordable and easy-to-use frameworks for real-time analysis.
  • Missing requirements: The IT teams in these enterprises may not fully understand the business requirements for real-time analytics unless they are clearly articulated as technology requirements.
  • Limited exposure: The business users often don’t realize what technology is available or what’s possible, especially when it comes to analyzing new data streams, until they see it working in practice.
  • Lack of expertise: It can be hard for IT teams to implement new frameworks because these technologies often have a steep learning curve, and most of their existing staff don’t have the expertise to get started.

The first issue has been partly addressed by the development of several new open source frameworks – including some that were developed by big Internet companies that pioneered the use of real-time analytics.

To address the remaining three issues, what I’ve seen work is to provide a way for business users (i.e. business analysts, data professionals, and developers) to get started, experiment, and iterate rapidly. This allows both the business and IT teams to ramp up over time and move quickly through a learning cycle (data → analysis → feedback → more/different data → different analysis → feedback) in a continuous manner. In doing so, they can overcome their lack of exposure and expertise with these tools – and fill in their missing use case requirements – for real-time analytics.

Building and scaling real-time data pipelines

These issues are particularly challenging because the technology, tools, and mindset for building real-time data pipelines differ from those used for traditional data analysis or the large-scale distributed batch processing made popular by Hadoop.

With real-time, you’re not analyzing data that is stored somewhere after the event; you’re analyzing streams of data that are continuous and always on. Systems built for real-time need to have the ability to collect those data streams, process the data quickly, take immediate action, and store the data for continuous analysis. In parallel, the system needs to evaluate the actions taken and update the model if needed in a very short period of time.

From a technology standpoint, this means you need a system that can do the following:

  • Capture data streams arriving in high volumes and at high velocity. Each message may not be large, but the throughput can be quite substantial (see the producer sketch after this list).
  • Scale to process these streams and run a transformation/aggregation (or any model) very quickly. At times, you may have to access data from other sources while running a model.
  • Provide fast lookups to the processing stage, and persist these messages, at scale.
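To make the capture side concrete, here’s a minimal sketch of a producer pushing small, frequent events onto a Kafka topic with the standard Kafka producer client. The broker address, topic name, and "userId,url" message format are hypothetical placeholders, not part of any specific product:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-node1:9092") // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Each message is tiny ("userId,url"), but a real feed would emit
    // thousands of these per second from sensors, web logs, and so on.
    for (i <- 1 to 1000) {
      producer.send(new ProducerRecord("clicks", s"user$i", s"user$i,/home"))
    }
    producer.close()
  }
}
```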

In my experience working with companies that have tried to build real-time data pipelines, this type of initiative typically starts as someone’s pet project. They may run it on their laptop, on a few virtual machines, or on a public cloud service. I’ve also seen situations where multiple different teams are experimenting with one element of the overall system without a clear view into the end result. The data pipeline may work well for one power user or for a specific use case. But when they try to stitch it all together and build another pipeline, or scale it to serve multiple users and use cases, it can be cumbersome and unwieldy.

In general, this can be a daunting and complex undertaking. The tools and on-premises infrastructure required for this type of system are time-consuming to assemble, and most organizations lack the skills to deploy and wire together the needed components. It’s hard to get started and even more difficult to scale in a repeatable and consistent way, with support for all stages of the application lifecycle (e.g. development, quality assurance, user acceptance testing). The overall complexity of the deployment sooner or later takes attention away from the business problems and use cases they are trying to address.

Spark Streaming, Kafka, and Cassandra

As I mentioned previously, there are several open source frameworks and tools now available for real-time analytics. In particular, the combination of Spark Streaming, Kafka, and Cassandra has emerged as a great fit and a good place to start for building real-time data pipelines.

This new trinity of open source frameworks delivers on the key requirements for real-time analysis: high throughput, low latency, and a stream-processing framework that can scale to meet growing demand. Kafka is a high-throughput, distributed, publish-subscribe messaging system for capturing and publishing streams of data; Spark Streaming is an extension of the core Spark API that lets you ingest and process data in real time from disparate event streams; and Cassandra provides a scalable and resilient operational database for real-time analytics.
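To show how the three fit together, here is a minimal sketch of a Spark Streaming job that consumes a Kafka topic and writes per-batch aggregates to Cassandra, using the Spark 1.x direct-stream Kafka API and the DataStax spark-cassandra-connector. The host names, topic, keyspace, and table are hypothetical, and the word-count-style aggregation is a stand-in for whatever model or transformation your use case needs:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

object ClickstreamPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("clickstream-pipeline")
      .set("spark.cassandra.connection.host", "cassandra-node1") // hypothetical host

    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Capture: subscribe directly to the hypothetical "clicks" topic
    val kafkaParams = Map("metadata.broker.list" -> "kafka-node1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("clicks"))

    // Analyze: parse each "userId,url" message and count clicks per user in the batch.
    // Store: persist the counts to an (assumed pre-created) Cassandra table.
    stream.map(_._2.split(","))
      .map(fields => (fields(0), 1))
      .reduceByKey(_ + _)
      .saveToCassandra("analytics", "click_counts", SomeColumns("user_id", "clicks"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The capture → analyze → store shape stays the same as the job grows; you would submit it with spark-submit and swap the toy aggregation for your own scoring or modeling logic.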

However, as I outlined earlier, many organizations don’t have the expertise in-house to stitch together these frameworks and the infrastructure required. And without practical experience in building real-time data pipelines, it can be difficult for business users, data professionals, and their IT counterparts to deploy the system that they need for initial prototyping, development, and a learning cycle of continuous iteration with these tools.

This is where we can help. BlueData just announced a new Real-time Pipeline Accelerator solution specifically designed to help organizations get started quickly with real-time data pipelines. With BlueData’s EPIC software platform (and help from BlueData experts), you can simplify and accelerate the deployment of an on-premises lab environment for Spark Streaming, Kafka, and Cassandra.

It’s the fastest and easiest way to get up and running with a multi-tenant sandbox for building real-time data pipelines. In my experience, even if you’re already using these technologies for real-time analytics in a production environment, having a sandbox environment for development – with the flexibility to upload new libraries and change configurations – can significantly improve productivity.

This new solution addresses the challenges and complexities of building real-time data pipelines in several ways. But as a data scientist, I’m particularly excited about some of these capabilities:

  • With either BlueData’s web-based user interface or command line API, you can spin up instant clusters for Spark, Kafka, and Cassandra (using pre-configured Docker images) in a matter of minutes.
  • You can scale these clusters if and when your use case demands change. You can easily access each cluster from your laptop using the web-based user interface or SSH for the command line interface.
  • You can also use web-based Zeppelin notebooks for your personal workspace and to collaborate with others.

Real-Time Data Pipeline with BlueData: [diagram]

I’ll explain these particular benefits in a little more detail below.

Focus on your real-time use cases and not the infrastructure

Now you can instantly provision an integrated end-to-end data pipeline to “capture streams → analyze (model/score) → store”. As shown in the screenshot below, you can create new clusters for Spark, Kafka, and Cassandra (running in Docker containers) with just a few mouse clicks – without worrying about the inner workings and infrastructure of these technologies. This helps eliminate the issues around lack of expertise and exposure to these frameworks that I mentioned earlier.

Create a new cluster with BlueData: [screenshot]

The BlueData EPIC software platform takes care of deploying multi-node clusters with the resources, configurations, networking, port access, storage, and other low-level details that might take your IT teams weeks (if not months) to figure out and assemble. Each cluster is created using best practices for that application, and the applications themselves are not modified in any way. And it’s all in a multi-tenant environment that can be easily extended to add new tenants/users or new applications as needed.

Cluster management with BlueData: [screenshot]

Boost productivity with immediate access to developer-friendly tools

The BlueData platform provides out-of-the-box support for web-based Zeppelin notebooks and other JDBC-supported tools to improve productivity. Your developers, data scientists, and business analysts can quickly ramp up and move through a continuous, iterative learning cycle – for rapid prototyping, development, testing, and quality assurance with real-time analytics applications.

As shown below, once the clusters are up and running, power users can get command line access to these environments (running on Docker) and start coding immediately (see the sketch after these screenshots).

Access to Kafka cluster: [screenshot]

Access to Spark cluster: [screenshot]

Access to Cassandra cluster: [screenshot]
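As an example of what “start coding immediately” can look like, a first spark-shell session might simply read back what the pipeline has written. This is a minimal sketch that assumes the spark-cassandra-connector is on the shell’s classpath and reuses the hypothetical analytics.click_counts table from the earlier sketch:

```scala
// Inside spark-shell (sc is the SparkContext the shell provides)
import com.datastax.spark.connector._

// Pull the table in as an RDD of CassandraRow and inspect a few rows
val clicks = sc.cassandraTable("analytics", "click_counts") // hypothetical keyspace/table
clicks.take(10).foreach(println)

// Quick ad hoc aggregate: total clicks recorded so far
val total = clicks.map(_.getInt("clicks")).sum()
println(s"total clicks: $total")
```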

Other users prefer not to work at the command line, so we also provide out-of-the-box integrations with web-based Zeppelin notebooks and other GUI-based applications for development. The BlueData software platform automatically provisions these tools along with the clusters for immediate use.

Access to web-based Zeppelin notebook for Spark developers: [screenshot]

So if you’ve been thinking about real-time analytics – but weren’t sure how to get started – there’s now a way to get up and running quickly that makes it easy for your developers and data professionals as well as your IT teams. BlueData now provides a turnkey on-premises solution for Spark, Kafka, and Cassandra in a ready-to-run sandbox environment for multiple users on shared infrastructure. We even provide sample use cases and data to help you build two end-to-end real-time data pipelines as a starting point.

You’ll have a lab environment that can be used to explore multiple real-time analytics use cases, shared with multiple users to support pipeline development, and easily scaled to grow with your organization’s needs over time. And with the BlueData EPIC software platform, you’ll have a multi-tenant infrastructure platform that can be easily extended to additional Big Data use cases and applications – for both data in motion and data at rest – with support for Spark and Hadoop as well as leading business intelligence, analytics, visualization, and data preparation tools.

To learn more about the new Real-Time Pipeline Accelerator, download the solution brief here. You can also watch our on-demand webinar about building real-time data pipelines with Spark, Kafka, and Cassandra here.