The juggernaut that is Kubernetes has been underway and gaining momentum for some time now. It provides an extensible container orchestration framework for automating the deployment, scaling, and management of any containerized application. It has a rich ecosystem of plugins for handling everything from storage to security. And while it was originally designed for running stateless applications, it has been gaining ground in the support of stateful applications as well.
However, there hasn’t been any easy way to deploy and manage distributed stateful applications consisting of a multitude of co-operating services (e.g. for Big Data and AI use cases) with Kubernetes. Spoiler alert: that’s the problem we’re now tackling head on.
As one of the founders of BlueData, I’ve been working in this space for the past several years. I’ve been fortunate to work with dozens of teams at enterprises across multiple industries and geographies to successfully deploy large-scale distributed stateful services such as Hadoop, Spark, Kafka, Cassandra, TensorFlow, and other analytics, data science, machine learning, and deep learning tools in containerized environments.
My colleagues and I have built up considerable expertise in what it takes to run such containerized applications securely and reliably in the enterprise. That knowledge comes from close collaboration with Fortune 500 and Global 2000 organizations that run a wide range of distributed stateful workloads for Big Data analytics, data science, machine learning, and more, and that need to meet stringent enterprise-grade security, performance, and lifecycle management requirements.
We see Kubernetes as the de facto standard in the industry for container orchestration; as such, we want to ensure that complex distributed stateful services are well supported in the Kubernetes community. So our engineering team has been busy working with Kubernetes, developing prototypes with Kubernetes in our labs, and collaborating with multiple enterprise organizations to evaluate the opportunities (and gaps) in using Kubernetes for these complex distributed stateful applications.
To this end, we’re introducing a new Kubernetes open source initiative we call BlueK8s. The BlueK8s initiative will include a number of projects to help bring enterprise-level capabilities for distributed stateful applications to Kubernetes. The first open source project in this initiative, dubbed Kubernetes Director or KubeDirector for short, is a custom controller which simplifies and streamlines the packaging, deployment, and management of complex distributed stateful applications for Big Data and AI use cases.
There are many existing open source projects in the Kubernetes community that address various requirements for stateless as well as stateful applications. For example, the Kubernetes Operator framework provides a great toolkit for building and deploying application-specific Operators to manage the lifecycle of a particular application. An Operator does this by implementing a simple control loop, commonly known as a reconciliation loop:
- Observe: Determine the current state of the application
- Analyze: Compare the current state of the application with the expected state of the application
- Act: Take the necessary steps to make the running state of the application match its expected state
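The three steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not code from any Operator: `FakeCluster`, `AppState`, and the replica-count notion of "state" are hypothetical stand-ins invented for the example, and no real Kubernetes API is called.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical application state: just a replica count."""
    replicas: int


class FakeCluster:
    """In-memory stand-in for a real cluster (an assumption for this sketch)."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas

    def observe(self) -> AppState:
        # Observe: determine the current state of the application.
        return AppState(replicas=self.replicas)

    def add_replica(self) -> None:
        self.replicas += 1

    def remove_replica(self) -> None:
        self.replicas -= 1


def analyze(current: AppState, expected: AppState) -> int:
    # Analyze: compare the current state with the expected state.
    # Positive means replicas are missing; negative means there are too many.
    return expected.replicas - current.replicas


def reconcile(cluster: FakeCluster, expected: AppState, max_passes: int = 100) -> int:
    """Repeat observe/analyze/act until the states match.

    Returns the number of passes in which an action was taken.
    """
    actions = 0
    for _ in range(max_passes):
        diff = analyze(cluster.observe(), expected)
        if diff == 0:
            break
        # Act: take one step toward the expected state, then observe again.
        if diff > 0:
            cluster.add_replica()
        else:
            cluster.remove_replica()
        actions += 1
    return actions


if __name__ == "__main__":
    cluster = FakeCluster(replicas=1)
    steps = reconcile(cluster, AppState(replicas=3))
    print(f"converged to {cluster.replicas} replicas after {steps} actions")
```

Taking one small step per pass and then observing again, rather than computing a full plan up front, is what lets a loop like this recover if the cluster changes underneath it between passes.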
While implementing a Kubernetes Operator for a cloud native stateless application is fairly straightforward, that is not the case for all applications. Most applications for Big Data analytics, data science, and AI / ML / DL are not implemented in a cloud native architecture, and many of them are stateful. They are often a jumble of tightly integrated processes, with interdependencies that are not well understood and state that is distributed across multiple configuration files. This means that they cannot be decomposed into self-sufficient, easily containerizable microservices without a lot of rework. In addition, a distributed data pipeline typically consists of multiple different applications, each with its own unique attributes, and these applications vary widely depending on the use case, so you cannot easily create an application-specific Operator for each possible permutation.
This is the gap that we intend to address with the new Kubernetes Director (aka KubeDirector). KubeDirector is built upon the Kubernetes custom resource definition (CRD) framework and achieves the following:
- It leverages the native Kubernetes API extensions, design philosophy, and authentication
- There is a minimal learning curve for those developers familiar with Kubernetes
- It is not necessary to decompose an existing application to fit microservices patterns
- It provides native support for preserving application configuration and state
- It utilizes an application-agnostic deployment pattern, minimizing the time to onboard stateful applications to Kubernetes
- It is application-neutral: it supports many applications simultaneously via application-specific instructions specified in YAML configuration files
- It supports the management of distributed data pipelines consisting of multiple applications such as Spark, Kafka, Hadoop, Cassandra, and TensorFlow, including a variety of related tools for data science, machine learning, business intelligence, ETL, analytics, and visualization
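To make the application-neutral idea in the list above more concrete, here is a minimal sketch of one generic planning function that handles several applications purely from declarative definitions. Everything in it is an assumption made for illustration: the field names (`roles`, `image`, `persist`), the image names, and the `plan_members` function are hypothetical and are not KubeDirector's actual schema or API. The definitions are written as Python dicts for brevity; in practice they would live in YAML files.

```python
from typing import Dict, List

# Hypothetical application definitions. Field names and image names are
# invented for this sketch and are NOT KubeDirector's real schema.
APP_DEFINITIONS: Dict[str, dict] = {
    "spark": {
        "roles": {
            "master": {"image": "example/spark-master", "persist": ["/etc/spark"]},
            "worker": {"image": "example/spark-worker", "persist": ["/etc/spark"]},
        }
    },
    "cassandra": {
        "roles": {
            "node": {"image": "example/cassandra", "persist": ["/var/lib/cassandra"]},
        }
    },
}


def plan_members(app: str, cluster_name: str, counts: Dict[str, int]) -> List[dict]:
    """Expand a cluster request into member specs using only the app definition.

    The function contains no application-specific logic; supporting a new
    application means adding an entry to APP_DEFINITIONS, not writing code.
    """
    definition = APP_DEFINITIONS.get(app)
    if definition is None:
        raise KeyError(f"no definition registered for application {app!r}")

    members: List[dict] = []
    for role, count in counts.items():
        if role not in definition["roles"]:
            raise ValueError(f"application {app!r} has no role {role!r}")
        role_def = definition["roles"][role]
        for index in range(count):
            members.append({
                "name": f"{cluster_name}-{role}-{index}",
                "image": role_def["image"],
                # Directories whose contents should survive a restart.
                "persist": list(role_def["persist"]),
            })
    return members


if __name__ == "__main__":
    for member in plan_members("spark", "demo", {"master": 1, "worker": 2}):
        print(member["name"], member["image"])
```

The design point is that the same code path serves Spark and Cassandra; only the data differs, which is the property that keeps onboarding time low for each additional application.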
With KubeDirector, it is not necessary to build and implement an application-specific Kubernetes Operator in order to manage the cluster for a stateful application. KubeDirector will manage the cluster for you. All communication with KubeDirector is performed via kubectl commands. The expected state of a cluster is submitted as a request to the API server and stored in the Kubernetes etcd database. KubeDirector then applies the necessary application-specific workflows to transform the current state of the cluster into the expected state. Different workflows can be specified for each application type. The diagram below shows a simple example, in which KubeDirector deploys and manages containerized Hadoop and Spark application clusters.
For a more detailed technical description of KubeDirector, including an architecture overview, you can check out the GitHub project wiki. The first pre-alpha version of KubeDirector is in development now and will be available soon at https://github.com/bluek8s/kubedirector.