
Containerization for Big Data and Machine Learning

Today we announced the latest release of the BlueData EPIC software platform, with several exciting new innovations for the containerization of Big Data and machine learning workloads.

This new ‘summer release’ for BlueData EPIC represents dozens of new features developed by our software engineering team over the past few months. In large part, this new functionality was based on input and collaboration with our rapidly growing roster of enterprise customers. These innovations will deliver even more agility and infrastructure cost savings for our customers’ Big Data deployments, along with new capabilities to accelerate their AI and machine learning initiatives – whether on-premises, in the public cloud, or in a hybrid approach.

Powering Digital Transformation in the Enterprise

Enterprises in all industries and across all geographies are embarking on digital transformation. We’ve long seen data (i.e. huge and growing volumes of structured and unstructured data) as the fuel for this transformation, and it’s now becoming increasingly clear that AI is the engine. But the implementation of Big Data and AI technologies for enterprise deployments is complex, and this transformation doesn’t happen overnight. It’s a journey: many enterprise organizations first embarked on this journey with Hadoop for large-scale data processing, and then later with Spark and real-time streaming data, and now more recently with machine learning (ML) and deep learning (DL) technologies such as TensorFlow.

Many enterprises were early adopters of first-generation Big Data architectures, leveraging a variety of tools available in the Hadoop ecosystem. No doubt, there are success stories and these organizations are seeing significant value from their investments in Big Data analytics and data lakes (e.g. with HDFS). But these organizations have also faced mounting pressure to do more, faster and cheaper, with the latest and greatest analytics and data science tools. The honeymoon period for these initial Big Data deployments is over. Business leaders want to capitalize on their data with cloud-like OpEx consumption, while minimizing their CapEx investment in new Big Data technology infrastructure. In fact, they are demanding on-demand agility and pay-as-you-go models from their internal Big Data architecture teams.

So these enterprises need to provide this cloud-like agility and elasticity (e.g. leveraging containerization) while extending to new workloads like distributed ML and DL for AI use cases – and at the same time, optimize and derive more value from their existing infrastructure investments in large-scale Hadoop clusters and HDFS storage. To meet this challenge, our customers have identified a few key levers to achieve these goals for their Big Data and AI initiatives:

  • Container-based automation: Containers are now widely recognized as a fundamental building block to simplify and automate deployments of complex application environments, with portability across on-premises infrastructure and public cloud services.
  • Hybrid cloud and multi-cloud architectures: With on-demand compute available from multiple public cloud services, enterprises can extend their on-premises infrastructure and overcome delays in procuring and installing new physical servers.
  • Decoupling of compute and storage resources: With the separation of compute from storage, these organizations can reduce costs by scaling these infrastructure resources independently. And they can leverage their existing storage investments in file, block, and object storage to extend beyond their petabyte-scale HDFS clusters (see the sketch after this list).
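To make the compute/storage decoupling concrete, here is a minimal sketch (not BlueData-specific) of a compute-only Spark job in PySpark that reads directly from shared external storage; the bucket and paths are hypothetical placeholders, and any s3a:// object store or remote hdfs:// endpoint could stand in.

```python
from pyspark.sql import SparkSession

# Minimal sketch: a compute-only Spark job that reads from external storage.
# The bucket/path below are hypothetical placeholders; in practice this could be
# any s3a:// object store or a remote hdfs://namenode:8020/ path.
spark = (
    SparkSession.builder
    .appName("compute-only-example")
    .getOrCreate()
)

# Read directly from shared, external storage -- no local HDFS data nodes are
# required on the compute cluster itself.
events = spark.read.parquet("s3a://example-data-lake/events/2018/06/")

# A simple aggregation; the compute cluster can be scaled up, down, or torn down
# independently of the storage tier that holds the data.
events.groupBy("event_type").count().show()

spark.stop()
```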

With these key concepts in mind, I’d like to highlight and drill down into two specific new features in the BlueData EPIC summer release – each of which was the result of ongoing collaboration with our enterprise customers to address the goals outlined above.

Hadoop “Edge Node” Hell

If your organization is using Hadoop at scale with a few hundred nodes or more, you’ve probably already guessed what I’m talking about. Edge nodes (also known as gateway nodes) play a very important role in the architecture for Hadoop clusters:

  • These nodes host the client applications (e.g. BI, ETL, ML / DL, data science) and cluster administration tools (e.g. Ambari, Cloudera Manager), and they serve as the staging area for ingesting data into HDFS (see the sketch after this list).
  • Another common use case is to extend the existing Hadoop cluster to deploy new compute-only services (e.g. Spark, Kafka). One of the benefits of this approach is the ability to onboard new YARN-managed workloads that are administered through the same cluster administration tool (Ambari or Cloudera Manager), with secure access to HDFS (data nodes) running in the same cluster.
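As a rough, hedged illustration of the staging role described above (not BlueData-specific), the sketch below uses the Python hdfs package's WebHDFS client to push a locally staged file from an edge node into HDFS; the NameNode URL, user, and paths are hypothetical, and a Kerberized cluster would require a secured client instead of InsecureClient.

```python
from hdfs import InsecureClient

# Minimal sketch of an edge-node-style ingest: stage a local file into HDFS via WebHDFS.
# The NameNode URL, user, and paths are hypothetical placeholders; a Kerberized cluster
# would use a secured client rather than InsecureClient.
client = InsecureClient("http://namenode.example.com:50070", user="etl_user")

# Copy a locally staged extract from the edge node into the HDFS landing zone.
client.upload(
    "/data/landing/sales/2018-06-19.csv",   # destination path in HDFS
    "/staging/sales_extract.csv",           # local file on the edge node
    overwrite=True,
)

# Confirm the file arrived.
print(client.status("/data/landing/sales/2018-06-19.csv"))
```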

In a large-scale enterprise deployment, one of the unintended outcomes of the sprawl of Hadoop clusters (which in itself has proven to be an administrative challenge) is the sprawl of these edge/gateway nodes. The dynamics here are driven by different users or groups that require a specific client application (i.e. their preferred BI, ETL, ML / DL, or data science tools) and the separation of staging data for each individual use case or workload.

As a result, it is not uncommon to have literally hundreds of edge nodes in a large-scale Hadoop deployment. While each Hadoop data node could be justified by the economic value of the data storage, the proliferation of compute-only edge nodes (which are also deployed as physical servers) poses a significant cost and administrative challenge.

Introducing “External Hosts”

With the new BlueData EPIC summer release, we’re delivering a game-changing innovation that allows our customers to simplify and streamline their management of Hadoop edge/gateway nodes. With the ability to consolidate all their edge nodes as containers (virtual nodes) on BlueData EPIC, they can reduce their physical infrastructure by up to 80% – and gain a new level of flexibility that enables them to onboard new users and groups in hours (versus weeks or even months with the traditional physical Hadoop deployment process).

BlueData provides a lightweight agent that can be seamlessly installed on the relevant Hadoop nodes, also known as “External Hosts”. A simple approval workflow and management interface then enable the External Hosts to be securely networked to private, non-routable containers (virtual nodes) running on BlueData EPIC, allowing them to be used as on-demand, elastic edge/gateway nodes. Hadoop administrators can then target these virtual nodes using their cluster administration and management tool of choice (e.g. Cloudera Manager, Ambari).

See below for a screenshot of the user interface for the External Hosts feature:

Container Migration (with Optional, External Storage)

We are also thrilled to deliver what may be one of the most anticipated features for many of our enterprise customers in this new EPIC summer release: the migration of stateful containers between hosts to support maintenance and disaster recovery scenarios.

This new feature provides the ability to selectively use existing external block storage (e.g. Ceph RBD, ScaleIO, EBS) or file-based storage (e.g. GlusterFS, NFS) as the volumes for containers managed by BlueData. Block storage is the recommended approach in this release.
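BlueData configures this through its own platform, but as a generic, hedged sketch of the underlying pattern (a container backed by an externally provisioned named volume), here is an example using the Docker SDK for Python; the volume driver, its options, and the image are assumptions for illustration only, not BlueData's actual mechanism.

```python
import docker

# Hedged sketch of the general pattern only -- not BlueData's implementation.
# The volume driver ("rbd") and its options are assumptions for illustration;
# the actual driver and settings depend on the storage backend in use.
client = docker.from_env()

# Create a named volume backed by an external block-storage driver.
volume = client.volumes.create(
    name="hadoop-edge-node-vol",
    driver="rbd",                     # hypothetical external block-storage driver
    driver_opts={"size": "20G"},      # hypothetical driver options
)

# Run a container with that volume mounted; because the volume lives on external
# storage, the container's state can survive a move to another host.
container = client.containers.run(
    "centos:7",
    command="sleep infinity",
    volumes={volume.name: {"bind": "/data", "mode": "rw"}},
    detach=True,
)
print(container.id)
```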

This new BlueData EPIC functionality was carefully designed and architected as an optional add-on feature for our customers (thereby minimizing upgrade impact). But it enables some very compelling use cases for the migration of containerized workloads in both the short term and the long term.

In the short-term, these use cases include:

  • Container migration between hosts (i.e. moving designated containers leveraging external storage volumes from one host to another)
  • Disaster recovery (i.e. catastrophic failures of BlueData EPIC server hosts)
  • Database workloads with high availability (e.g. MySQL, Postgres)

And for the longer-term, use cases may include:

  • Production-grade relational databases (e.g. Oracle)
  • The ability to backup and restore fully configured clusters

Operational Requirements for Container Migration

While Hadoop and Spark offer robust high availability capabilities, some enterprise customers have operational requirements that demand the ability to move containers (i.e. virtual nodes) from one host to another. Some examples are outlined below:

  • The BlueData EPIC server host needs to be replaced for maintenance and/or as part of a server refresh cycle. In this scenario, all of the infrastructure and stateful containers (including those running master services, cluster administration tools, Zookeeper, etc.) will need to be seamlessly moved with minimal to no downtime.
  • The SLA for the containerized clusters (e.g. Spark or Hadoop) is not being met due to the performance of the underlying BlueData EPIC server host, and there is a need to re-balance the virtual nodes onto physical host machines with more memory and/or faster disks.
  • Migrate all containers from rack A to rack B.
  • Migrate all containers from data center location A to data center location B.
  • Leverage EBS-backed spot instances on AWS to deploy BlueData EPIC and containers, and restart these containers on a separate instance when the spot instance is terminated or lost (a hedged sketch of detecting a spot interruption follows this list).
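As a hedged illustration of that last scenario, the sketch below polls the EC2 instance metadata endpoint that announces a spot interruption; the trigger_vacate() handler is a hypothetical placeholder for whatever deployment-specific step restarts the affected containers elsewhere.

```python
import time
import requests

# EC2 exposes a spot interruption notice via instance metadata; the endpoint returns
# 404 until an interruption is scheduled. This is a minimal polling sketch only.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def trigger_vacate():
    # Hypothetical placeholder: in practice this would kick off whatever process
    # moves or restarts the affected containers on another instance.
    print("Spot interruption detected -- starting container migration/vacate.")


def watch_for_spot_interruption(poll_seconds=5):
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # The response includes the action (e.g. "terminate") and its scheduled time.
                print("Interruption notice:", resp.json())
                trigger_vacate()
                return
        except requests.RequestException:
            pass  # metadata service unreachable; try again
        time.sleep(poll_seconds)


if __name__ == "__main__":
    watch_for_spot_interruption()
```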

With this exciting new feature, BlueData addresses these operational requirements through the combination of compute/storage separation and containerization. By leveraging existing storage investments and/or open source software-defined storage products, our customers can further optimize their containerized infrastructure for Big Data analytics and machine learning workloads. BlueData’s existing capability for using direct-attached storage (also referred to as node storage – as defined in our storage white paper here) is also available and continues to be a great fit for worker nodes (e.g. node managers, Spark workers).

In addition, the BlueData EPIC summer release delivers an intuitive user interface and one-click task-based operations (e.g. to vacate hosts) for simplified configuration and operations with container migration. BlueData also uniquely preserves specific root directories, allowing container migrations to retain critical state (e.g. YUM installs performed after cluster creation) that would otherwise be lost and render the cluster unusable after migration.

The screenshot below shows the user interface to set up and test connectivity to an optional external storage system for this new functionality. In the summer release, block storage (e.g. Ceph RBD, ScaleIO, EBS) will be production-ready and recommended – while other storage systems will be for “Dev/Test” only.

The screenshots below illustrate a simple workflow where a container is migrated or vacated from one specific host to another host with one click:

The new summer release for BlueData EPIC marks a significant milestone for our platform, developed in close collaboration with our enterprise customers. With this release, BlueData is delivering innovation that far outpaces any other container-based software platform available in the market – with purpose-built functionality designed for large-scale analytics and machine learning workloads. By combining the power of containerization with new functionality for automation and compute / storage separation, BlueData is helping our customers to deliver greater agility, faster time-to-insights, and lower overall TCO for their Big Data and AI initiatives.

Learn More at DataWorks Summit in San Jose

If you’ll be at the DataWorks Summit in San Jose this week (June 19-21), visit the BlueData booth (#S5) in the expo hall to see a demo of the new BlueData EPIC summer release. And make sure you attend my session on “What’s the Hadoop-la about Kubernetes?” at 4:50pm today (Tuesday, June 19) to learn about running stateful Big Data applications on containers.