
A Hybrid and Multi-Cloud Playbook for AI and Big Data Workloads

Today, on behalf of the entire product team at BlueData, I’m thrilled to announce the general availability of BlueData EPIC on Microsoft Azure and Google Cloud Platform (GCP). This includes support for multi-cloud deployments as well as support for each of these public cloud services in a hybrid architecture.

Here at BlueData, we’ve been continuing to drive innovation and new capabilities for our container-based software platform: BlueData EPIC. Our goal is to provide a turnkey solution for Big-Data-as-a-Service and AI-as-a-Service in the enterprise – running on any data center infrastructure, on any public cloud, in a hybrid cloud model, or in a multi-cloud deployment.

Last spring, we introduced our support for hybrid deployments spanning on-premises and AWS environments. And last fall, we announced initial directed availability for BlueData EPIC on GCP and Azure. Since then, we’ve worked closely with many of our customers on the cloud strategy (public cloud, hybrid cloud, and/or multi-cloud) for their AI and Big Data initiatives. And we’ve incorporated those learnings to extend BlueData EPIC’s multi-cloud and hybrid cloud functionality, now with GA support for all three major public cloud services.

Some of our customers are “all-in” on cloud and have been early adopters of the public cloud for their analytics and machine learning workloads. But other customers have deployments that are primarily on-premises (e.g. Fortune 500 enterprise organizations that were not born in the cloud); the statements and questions we hear from them can be summarized as follows:

  • Our Big Data deployment is on-premises today, and we have a broad range of highly customized applications using multiple analytics, data science, and machine learning tools. These tools went through extensive testing and validation to meet our unique requirements.
  • We’re evaluating moving these environments to the cloud, but the data is on-premises (e.g. due to security or regulatory considerations). How should we approach this?
  • Can we migrate all or a subset of these existing Big Data application environments to the public cloud, with minimal to no impact on our data science and developer user community?

My team and I are very fortunate to have a front-row seat (and sometimes the hot seat!) with enterprise customers who are attacking these challenges head-on. One aspect that is often underappreciated is the extensive engineering, training, and operational processes that are in place around these existing on-premises Big Data environments. And the data volumes can range from hundreds of terabytes to several petabytes. Another important issue is that the end users (i.e. data analysts, data scientists, and developers) have relatively little tolerance for change – whether in user interface, syntax, or tools – and they have even less tolerance for any delay in getting the analytics and machine learning environments they need to drive their business initiatives.

In other words, developing a cloud strategy in this area involves a wide range of considerations that go well beyond the typical options we often hear about – such as moving all of your on-premises data to cloud storage (e.g. Amazon S3, Azure Storage, Google Cloud Storage); and/or using the standard Big Data services offered by public cloud providers (e.g. Amazon EMR, Azure HDInsight, Google Cloud Dataproc); and/or re-deploying existing Big Data analytics and machine learning tools on public cloud infrastructure.

Ideally, a comprehensive cloud strategy would take these considerations into account – providing simplification and automation; ensuring minimal to no additional skills overhead; reusing existing Big Data products, versions, and configurations (e.g. without requiring additional extensive testing and validation) as well as existing operational and security models. All this, while making it invisible and seamless to the end user community of data scientists, analysts, and developers.

This is where BlueData fits in …

By leveraging containers to run Big Data and AI workloads, the BlueData EPIC software platform provides an abstraction layer to deploy these tools (e.g. Hadoop, Spark, Kafka, TensorFlow, and more) identically whether on-premises or in the cloud – with common governance and security for users, software bits, and data. And by enabling compute / storage separation, BlueData also enables these tools to access data wherever it resides, with control over who can access that data.

Here is a glimpse of a playbook being used by our customers for their cloud strategy – to move from on-premises to a hybrid or multi-cloud approach, all the while making it transparent to their end users, eliminating data duplication, and reusing their existing Big Data and AI tools.

Step 1: Add BlueData as an abstraction layer to enable compute / storage separation

  • Deploy BlueData EPIC standalone on a specific public cloud (AWS, Azure, or GCP)
  • Or deploy BlueData EPIC in a hybrid model that spans on-premises infrastructure and one or more public clouds
  • Enable BlueData’s DataTap connectivity to an existing on-premises data lake (e.g. HDFS)
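
To make this step a bit more concrete, here is a minimal sketch of what data access can look like from inside a containerized cluster once a DataTap has been configured against the on-premises HDFS data lake. The DataTap name, paths, and job class below are purely illustrative:

    # List files in the on-premises data lake through the DataTap (dtap://) endpoint
    hadoop fs -ls dtap://TenantStorage/data/clickstream/

    # Run an existing Spark job unchanged, pointing its input and output at dtap:// paths
    spark-submit --class com.example.SessionizeJob \
      sessionize.jar \
      dtap://TenantStorage/data/clickstream/ \
      dtap://TenantStorage/output/sessions/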

Step 2: Containerize the ‘compute’ clusters (including analytics, BI, ETL, and machine learning tools) and automate their deployment on BlueData EPIC

  • Run the ‘compute services’ (e.g. YARN, Spark, Hive, Impala) and ‘user interface/apps’ (e.g. Hue, BI / ETL tools) as unmodified ‘containerized’ clusters on BlueData EPIC
  • Selectively place containerized clusters on specific on-premises or public cloud hosts using BlueData’s ‘Host Tags’ feature
  • Transparently route the end users to these ‘containerized’ compute clusters
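
One way to picture that transparent routing: end users keep connecting to the same stable endpoint, and that endpoint now resolves to the gateway of the containerized cluster. A minimal sketch, with a purely hypothetical hostname:

    # The analyst's existing connection string is unchanged; behind the scenes,
    # DNS (or a load balancer) now points at the containerized Hive cluster's gateway
    beeline -u "jdbc:hive2://hive.analytics.example.com:10000/default"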

Step 3: Selectively migrate containerized clusters between on-premises and public cloud hosts

  • Leverage ‘Host Tags’ in BlueData EPIC to control placement of containerized Big Data clusters
  • Use DataTap to access on-premises data from the containerized clusters running on the public cloud
  • Use external persistent storage volumes for the containerized Big Data clusters to migrate ‘stateful’ containers between on-premises and cloud environments

Some of the key benefits of this approach include:

  • Common self-service user interface: Data science teams can have the same user experience to spin up on-demand environments for AI and Big Data workloads, regardless of the infrastructure.
  • Common governance, security, and control: Administrators and IT teams can ensure consistent authentication, access, and enterprise-grade security in a hybrid or multi-cloud deployment.
  • In-place analytics with compute / storage separation: Customers can tap into data wherever it’s stored (on-premises or cloud), resulting in less data duplication and reduced data transfer costs.
  • Flexibility and infrastructure portability: The same Docker application images can be used with any public cloud or on-premises, avoiding cloud lock-in and future-proofing the deployment.

So how do we do it? Here’s a quick drill-down into some of the features and functionality that BlueData EPIC provides to enable multi-cloud and hybrid deployments:

  • Enterprise-grade security and compliance: Cloud-specific credentials and identifiers – such as IAM profiles, subnets, and security groups – are not stored in BlueData software. Our customers have complete control over their cloud instances, including the ability to use their own certified OS images/versions, tagging rules, and placement in different subnets/availability zones. Meanwhile, the end users of these Big Data applications can transparently use the same security authentication and authorization (e.g. AD/LDAP, SSO) – and they aren’t necessarily even aware of whether the BlueData EPIC deployment is running on-premises or in the public cloud.
  • Higher utilization of cloud instances: With its CPU over-provisioning feature, BlueData EPIC allows for denser packing of containers in each cloud instance, thereby increasing utilization and reducing costs for public cloud deployments.
  • Multi-tenancy and Isolation: BlueData EPIC provides secure multi-tenancy and strict network isolation between tenants, without interfering with any cloud-specific networking constructs. It includes an embedded, cloud-agnostic private container network that enables this isolation irrespective of how the cloud instances running BlueData EPIC are organized (whether on a single subnet or across multiple subnets).
  • Flexible Cloud Storage Access: With BlueData, containerized clusters can run jobs against cloud storage (e.g. Amazon S3, Azure Blob Storage, Azure Data Lake Storage, Google Cloud Storage); a brief sketch of this appears after this list. Configuration of these storage connections can be automated via the BlueData Action Scripts feature. For use cases where data needs to be accessed or shared locally in the container, cloud storage can be mounted as a file system inside the container using the BlueData FS Mounts feature. In addition to flexible data access, FS Mounts allows access to different buckets to be controlled per tenant.
  • Hybrid data access: BlueData provides the unique and differentiated ability to tap into on-premises storage (e.g. an HDFS data lake) from compute clusters running in AWS, Azure, or GCP – assuming network connectivity between the cloud instances and the on-premises clusters. BlueData’s DataTap is a patented compute/storage separation technology that allows for this secure, optimized HDFS connectivity.
  • Pre-built automation templates: A new automation toolkit can be used for both on-premises and cloud deployments. It includes a command-line interface (CLI) tool written in Golang that can run from a Windows, Mac, or Linux machine. The tool takes the required credentials as input via environment variables (or a YAML file) and creates and manages the deployment without storing any credentials; a hypothetical sketch of this pattern also appears after this list. BlueData EPIC also supports the native AWS, Azure, and GCP deployment tools. For example, BlueData now supports the GCP ‘Deployment’ construct, which can be deployed using the gcloud command line tool, as well as the Azure ‘Resource Manager Template’ for a fully automated deployment of BlueData EPIC.
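
As a rough sketch of the cloud storage access described above, a containerized cluster can address object storage through the standard Hadoop connectors, assuming the cluster image includes those connectors and credentials have been configured. The bucket and account names below are hypothetical:

    # Amazon S3 via the s3a:// connector
    hadoop fs -ls s3a://example-datalake/raw/

    # Google Cloud Storage via the gs:// connector
    hadoop fs -ls gs://example-datalake/raw/

    # Azure Blob Storage via the wasbs:// connector
    hadoop fs -ls wasbs://raw@exampledatalake.blob.core.windows.net/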
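
And for the automation CLI, the pattern is to supply credentials through the environment and drive the deployment from a YAML spec. The binary name, variable names, and flags below are entirely hypothetical – the toolkit documentation has the exact syntax:

    # Credentials are supplied via environment variables (names are illustrative)
    # and are not stored by the tool
    export CLOUD_ACCESS_KEY_ID="..."
    export CLOUD_SECRET_ACCESS_KEY="..."

    # Hypothetical invocation: create a BlueData EPIC deployment from a YAML spec
    ./epic-deploy create --config epic-deployment.yaml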

The screenshots below show how this works with Azure and GCP:

Microsoft Azure Deployment

Azure administrators can use a pre-built Resource Manager Template to launch a BlueData EPIC deployment in their own account, with a custom name (e.g. bd-epic) and parameters such as instance types and the size of the deployment:
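
An ARM template like this can also be launched from the Azure CLI, a generic Azure mechanism shown here only as a sketch; the resource group and file names are hypothetical:

    # Create a resource group and deploy the BlueData EPIC template into it
    az group create --name bd-epic-rg --location eastus
    az group deployment create \
      --resource-group bd-epic-rg \
      --name bd-epic \
      --template-file bluedata-epic-template.json \
      --parameters @bluedata-epic-parameters.json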

Google Cloud Platform (GCP) Deployment

GCP administrators can launch a deployment in their own account by using the gcloud command line tool. This CLI offers the flexibility to specify parameters in a YAML file:
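
For instance, with Deployment Manager the parameters live in a YAML configuration file and the deployment is created with a single command; the config file name here is illustrative:

    # Create a BlueData EPIC deployment from a YAML config via Deployment Manager
    gcloud deployment-manager deployments create bd-epic \
      --config bluedata-epic.yaml

    # Inspect the deployment and its resources
    gcloud deployment-manager deployments describe bd-epic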

“Great things are done by a series of small things brought together.”  – Vincent Van Gogh

The general availability of BlueData EPIC on Azure and GCP – along with innovations enabling hybrid and multi-cloud deployments from our other recent product releases – provides a strong foundation for our customers as they continue their cloud journey, while ensuring flexibility and minimal disruption to their existing environments. By combining the power of containerization with new functionality for automation and compute / storage separation, BlueData is helping our customers to deliver greater agility, faster time-to-insights, and lower overall TCO for their Big Data and AI initiatives.