
Big Data and Container Orchestration with Kubernetes (K8s)

Here at BlueData, we’ve been leading the charge on deploying and running Big Data applications like Hadoop and Spark on containers: we first announced our support for Docker more than two years ago.

At that time, there was no clear choice for container orchestration. More importantly, the existing tools for container orchestration simply didn’t meet our enterprise customers’ requirements for production deployments of large-scale distributed systems. So we developed our own purpose-built container orchestration for Big Data and included it in our BlueData EPIC software platform.

Since then, we’ve been following this space very closely. We’ve paid particularly close attention to the container orchestration wars (e.g. Kubernetes vs. Docker Swarm vs. Mesos Marathon). We’ve also actively tracked the maturity and evolution of the ecosystem around Kubernetes. And with major technology vendors from AWS to Docker to Microsoft jumping on the bandwagon along with Google and Red Hat, the momentum behind Kubernetes is undeniable.

It is now safe to say that Kubernetes won the container orchestration wars; it’s clearly the de facto standard for stateless applications and microservices. But what about Big Data and stateful applications?

Kubernetes and Big Data

Hadoop and Spark on Kubernetes?

More and more of our enterprise customers are asking us about Kubernetes. Naturally, enterprises that embrace Kubernetes for container orchestration want to evaluate whether they can run every application workload in Kubernetes. While this may be a possibility at some point in the future, Kubernetes is not yet ready to meet this goal. It’s an exciting time and the Kubernetes ecosystem is evolving rapidly – but there is still a lot of work to be done.

The biggest challenge is the lack of a mechanism for Kubernetes to run truly stateful applications: including the broad Big Data ecosystem of distributed systems like Hadoop, Spark, Kafka, analytics, business intelligence, data ingestion, visualization, data science, machine learning, and deep learning.

Some Big Data applications, such as Spark, are being re-architected to better conform to the stateless, microservices architecture supported by Kubernetes. However, most Big Data applications are not. It would require a significant investment from the open source and vendor communities to rewrite them as microservices.

In addition, the majority of these multi-services Big Data applications require HDFS or some form of persistent root storage for storing and analyzing large amounts of data. Unfortunately, persistent root storage and HDFS on Kubernetes are not supported at this time. For example, with K8s, modifying the storage resources used within a cluster node requires the node to be temporarily removed from the cluster; this is not ideal for Big Data applications.

“Persistent” storage refers to storage that exists beyond the life span of a single container. In K8s, there is the Persistent Volumes feature. This feature allows persistent storage to be attached, or “mounted”, into a container. The Persistent Volumes feature does address part of the requirement that Big Data applications have for persistent storage.

However, many Big Data applications store configuration information in “local” or “root” storage within a container. This local, or root, storage cannot be stored on Persistent Volumes. In order to successfully run many Big Data applications on K8s, a solution for persisting the root storage will need to be found.
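To make the distinction concrete, here is a minimal sketch in Python. It builds a Pod manifest that mounts a PersistentVolumeClaim, then checks which container paths fall under that mount. The pod name, image, claim name, and paths are all hypothetical, and the check is a simplification: it only asks whether a path sits under a claim-backed mount, which is the part of the filesystem that outlives the container.

```python
import json
from pathlib import PurePosixPath


def build_pod(name, image, claim_name, mount_path):
    """Return a Pod manifest that mounts an existing PersistentVolumeClaim.

    All argument values used below are hypothetical examples.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "volumeMounts": [{"name": "data", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


def is_on_persistent_volume(path, pod):
    """True if `path` lies under a mount backed by a PersistentVolumeClaim."""
    # Names of the volumes in this pod that are backed by a claim.
    claim_backed = {
        volume["name"]
        for volume in pod["spec"].get("volumes", [])
        if "persistentVolumeClaim" in volume
    }
    target = PurePosixPath(path)
    for container in pod["spec"]["containers"]:
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in claim_backed:
                continue
            mount_path = PurePosixPath(mount["mountPath"])
            if target == mount_path or mount_path in target.parents:
                return True
    return False


pod = build_pod("hadoop-worker", "example/hadoop:latest", "hadoop-data", "/data")
print(json.dumps(pod, indent=2))

# Data written under the mount outlives the container ...
print(is_on_persistent_volume("/data/hdfs/dn/current", pod))
# ... but configuration written to the container's root filesystem does not.
print(is_on_persistent_volume("/etc/hadoop/conf/core-site.xml", pod))
```

In this sketch the data directory is covered by the volume, while the configuration file under /etc is not; that second case is the root storage problem described above.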

As noted earlier, Kubernetes was not available when BlueData began using containers nearly three years ago. And only recently has Kubernetes added functionality for pluggable network architectures and persistent storage volumes – both of which are absolutely critical for running Big Data applications in an enterprise environment. More recently still, K8s has added experimental support for StatefulSet pods, which are also required for running complex stateful applications.

There is a lot of interesting activity in the Kubernetes community in this area, including projects for Spark on Kubernetes and HDFS on Kubernetes. But most of these initiatives are still in beta or at an experimental stage. And while K8s has been adding support for features that are required for running Big Data applications in containers, it is still missing functionality in several areas (including the persistent root storage feature described earlier).

Organizations that want to utilize K8s for Big Data applications will need to wait until the Kubernetes community catches up; it could easily take 12-24 months for an enterprise-ready, stable version that can be used for production environments.

Our Point of View on Kubernetes

At BlueData, we believe that Kubernetes is a great container orchestration tool for stateless applications such as web servers and microservices.

However, K8s still has significant challenges for long-running, distributed, multi-services Big Data applications, including persistent storage, security, performance, and several other operational requirements. Big Data applications break the typical assumptions for container orchestration; the blind placement of individual services into containers will lead to all sorts of problems. And based on our experience, enterprises that try to make this work will likely fail.

My colleague Tom Phelan wrote about some of these challenges (and in general, the pitfalls of trying to build an in-house solution for running Hadoop and Spark on containers) in a recent blog post here. We also held a webinar on this topic, highlighting the requirements for running Big Data in containers, as illustrated below and in the replay here:

All that having been said, we’re continuing to watch the evolution of Kubernetes very closely. In fact, we’ve been investigating and exploring how we could integrate the BlueData EPIC software platform with Kubernetes for container orchestration.

Turnkey Container-Based Solution for Big Data 

BlueData EPIC is a turnkey platform that enables our customers to deploy and run Big Data applications on secure, fault-tolerant, and fully managed Docker containers. These containers serve as the nodes for Hadoop or Spark clusters; other Big Data frameworks like Kafka and Cassandra; as well as for analytics, business intelligence, ETL, visualization, and data science tools. Our solution was specifically designed from the ground up for running Big Data applications on containers – either on-premises, in the public cloud, or in a hybrid architecture.

As noted above, the BlueData EPIC platform includes our own purpose-built container orchestration layer to schedule and deploy distributed applications using Docker containers. BlueData EPIC also offers a wide array of purpose-built features and capabilities for Big Data applications including lifecycle management, multi-tenancy with secure network isolation, support for Kerberos and encrypted HDFS, our IOBoost technology for performance optimization, our DataTap functionality for compute / storage separation, and more.

In short, BlueData EPIC was designed to meet all of the requirements for containerized Big Data applications in production for large-scale enterprise deployments – container orchestration is just one of many features.

Our Work with Kubernetes

The BlueData engineering team has been experimenting with Kubernetes, and we have developed a prototype that shows how BlueData EPIC can run on K8s and launch Big Data clusters using the K8s container orchestrator.

Without BlueData EPIC, running distributed Big Data applications on K8s is a major challenge: administrators will have to manage multiple YAML files, support multiple enterprise user groups, manage multiple Docker images, manage application-specific service dependencies and cluster-level monitoring, and more. But we think Big Data administrators who want to use K8s shouldn’t have to worry about low-level Kubernetes functionality (e.g. YAML files, connectivity to Docker registries, exhaustive CLI-based kubectl commands).

Here are a few of the specific features and benefits that BlueData EPIC could bring to Kubernetes:

  • BlueData EPIC provides a simple web-based UI or REST API that could natively run on K8s as a pod to orchestrate and manage the lifecycle of Big Data clusters (such as Hadoop, Spark, Kafka, etc.), without needing to learn new Kubernetes-specific concepts and toolsets.
  • BlueData EPIC enables secure multi-tenancy with enterprise LDAP/AD integration for user authentication and authorization. Any cluster that is launched using BlueData EPIC is pre-configured with enterprise credentials to securely manage access policies to the individual Big Data services. 
  • BlueData EPIC comes with ready-to-run Docker images that can either be managed natively or managed externally in a standard Docker trusted registry with the necessary security credentials. BlueData EPIC can deploy a wide range of open source Big Data frameworks unmodified (e.g. Hadoop, Spark, Kafka, Cassandra), so you could get up and running with your first set of Big Data clusters on Kubernetes in minutes.
  • BlueData EPIC offers an App Workbench that enables easy creation and deployment of reusable, distributed application images using simple scripting constructs. Enterprise customers can deploy pretty much any distributed framework as well as non-containerized Big Data applications (e.g. BI and ETL tools) using the App Workbench. This eliminates the need for enterprise administrators to create YAML deployment files and manage multiple Docker images for a distributed Big Data application.
  • BlueData EPIC uniquely offers secure remote connectivity (i.e. our DataTap functionality) to existing HDFS-based data repositories, without degradation in performance and resilience when accessing remote shared datastores. It also handles multiple HDFS protocol versions.

BlueData and Kubernetes

The following figure shows a high-level experimental architecture integration between BlueData EPIC and Kubernetes, in this case for a Cloudera Hadoop (CDH) cluster and a standalone Spark cluster:

As shown in the diagram above, the BlueData EPIC controller can run on Kubernetes; the controller is deployed as a StatefulSet pod (with its own public IP address), using either the Kubernetes Web UI or the kubectl command line, with a BlueData-supplied YAML file. Once the pod is up and running, system administrators can log in to the BlueData EPIC administrator UI as they would do in a standard deployment of BlueData EPIC. From that point onwards, the user experience of deploying and managing clusters is identical to that of a bare-metal cluster on-premises. Administrators and users of the containerized clusters would have a seamless experience of managing tenants, clusters, and images as though they were using bare-metal servers.
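For readers who have not seen a StatefulSet definition, the sketch below generates a minimal one in Python and prints it as JSON, which kubectl accepts alongside YAML. This is not the BlueData-supplied file: the name "epic-controller", the image reference, and the port are hypothetical placeholders. The manifest uses the current apps/v1 API, and exposing the pod on a public IP address would need an additional Service that is not shown here.

```python
import json


def controller_statefulset(name, image, port):
    """Return a single-replica StatefulSet manifest for one container.

    The values passed in below are hypothetical placeholders, not the
    contents of any vendor-supplied manifest.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            # Name of the governing Service that gives the pod a stable identity.
            "serviceName": name,
            "replicas": 1,
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }],
                },
            },
        },
    }


manifest = controller_statefulset(
    "epic-controller", "registry.example.com/epic-controller:demo", 8080
)
print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file such as controller.json (a hypothetical filename) and running `kubectl apply -f controller.json` would submit it to a cluster.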

This video demonstrates a prototype of how you can deploy BlueData EPIC and create a CDH cluster on Kubernetes:

We’re excited about the possibilities this could offer to our customers in the future – providing a potential plug-in to K8s that makes running containerized Big Data clusters simple and transparent in a secure, multi-tenant environment.

Kubernetes is rapidly maturing as the de facto standard for container orchestration. But K8s still has a long way to go to support stateful Big Data applications (e.g. persistent volume support, automated multi-tenancy support, enterprise-grade security). Stay tuned for additional updates from our team as we experiment with using Kubernetes orchestration to deploy and run Big Data applications on the BlueData EPIC software platform.