
Big Data Analytics in a Hybrid Cloud Environment

Here in Silicon Valley, it’s clear that spring is just around the corner and the hills are a vibrant green. After a wet winter, the rains are turning into spring showers and the sky is full of clouds.

What does this have to do with Big Data? Well, springtime in North America may not arrive officially for a couple more weeks … but the spring release for the BlueData EPIC software platform is already here. Today we announced this exciting new version of our Big-Data-as-a-Service platform. And much like our local weather, “cloud” is here today and very much in the forecast.

This new release delivers the industry’s first truly hybrid architecture for Big Data, with a unified solution for Big-Data-as-a-Service (BDaaS) across on-premises and public cloud environments. Now, our customers can easily use the power of Amazon Web Services (and other public clouds in the future) as an extension to their own Big Data infrastructure. Conversely, our customers can tap into on-premises data from their Big Data deployments on the AWS public cloud. They can provide self-service, elastic, and secure environments for Big Data analytics whether in an enterprise data center, in the public cloud, or some combination of the two, all from the same interface and with the same user experience regardless of the underlying infrastructure.

Last June, I wrote about our vision for BDaaS: extending across hybrid architectures and multiple public clouds. And in December we announced general availability of BlueData EPIC on AWS, our first public cloud. Our customers have been very receptive to our vision for BDaaS, and we’ve been implementing it with a multi-phase plan. This new release delivers the next milestone in this vision, helping enterprises with an on-premises Big Data deployment to seamlessly leverage public cloud infrastructure for greater agility, flexibility, and cost savings.

The BlueData EPIC platform uses the inherent flexibility and portability of container technology – with embedded and fully-managed Docker containers – to enable BDaaS on any infrastructure. In December, we introduced BlueData EPIC on AWS, with the ability to tap into both Amazon S3 and on-premises storage. Now, with this release, BlueData EPIC is the first and only BDaaS solution to support the full set of deployment scenarios (on-premises, on AWS, and in a hybrid model) outlined below:

Let’s dig into some of the key features and benefits of BlueData EPIC’s new spring release.

“Single Pane of Glass” for Big Data On-Premises and on AWS

Enterprises with an on-premises BlueData EPIC deployment can now also deploy and manage Big Data clusters on AWS. From one common interface, they can selectively deploy “containerized” clusters – with Docker-based application images for their preferred Hadoop distributions as well as Spark standalone, Kafka, and other Big Data frameworks and tools of choice – on either their own data center infrastructure (on physical servers or virtual machines) or on AWS (and eventually other public clouds).

BlueData EPIC system administrators can simply extend their on-premises deployments to AWS in a few mouse clicks, without using multiple different bolt-on products for different data platforms. They don’t need the help of an AWS DevOps specialist, nor do they need to undergo the learning curve associated with the intricacies of AWS-specific concepts. Onboarding a new “tenant” for AWS is simply a matter of inputting a few parameters (as shown below):

  • IAM Instance Profile: This AWS-recommended approach allows administrators to leverage AWS Identity and Access Management (IAM) to ensure secure access to the appropriate Amazon cloud resources and services – so that users can only access the services they need (e.g. a specific set of Amazon S3 buckets).
  • VPC Subnet ID: This specifies the Amazon VPC subnet that will host the Amazon EC2 instances spun up within BlueData EPIC.
  • Region: This indicates the AWS region (e.g. us-east-2) for the above VPC subnet.
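As a rough illustration of the three parameters above, here is a minimal Python sketch that checks whether each value is well-formed before a tenant is onboarded. The `AwsTenantSettings` class and `validate` function are hypothetical names invented for this example and are not part of BlueData EPIC; the sketch also assumes the instance profile is supplied as a full ARN. The checks are purely syntactic and do not confirm that the resources actually exist in AWS.

```python
import re
from dataclasses import dataclass


# Hypothetical container for the three tenant parameters listed above.
# The field names are illustrative, not BlueData EPIC's actual API.
@dataclass(frozen=True)
class AwsTenantSettings:
    iam_instance_profile_arn: str  # assumption: supplied as a full ARN
    vpc_subnet_id: str
    region: str


# Instance profile ARNs carry a 12-digit account ID.
_PROFILE_ARN = re.compile(
    r"^arn:aws[a-z-]*:iam::\d{12}:instance-profile/[\w+=,.@/-]+$"
)
# Subnet IDs are "subnet-" followed by 8 or 17 hexadecimal characters.
_SUBNET_ID = re.compile(r"^subnet-(?:[0-9a-f]{8}|[0-9a-f]{17})$")
# Region codes look like "us-east-2" or "ap-southeast-1".
_REGION = re.compile(r"^[a-z]{2}(?:-[a-z]+)+-\d$")


def validate(settings: AwsTenantSettings) -> list:
    """Return a list of problems; an empty list means all values look well-formed."""
    problems = []
    if not _PROFILE_ARN.match(settings.iam_instance_profile_arn):
        problems.append("IAM instance profile ARN is malformed")
    if not _SUBNET_ID.match(settings.vpc_subnet_id):
        problems.append("VPC subnet ID is malformed")
    if not _REGION.match(settings.region):
        problems.append("AWS region code is malformed")
    return problems


if __name__ == "__main__":
    tenant = AwsTenantSettings(
        iam_instance_profile_arn="arn:aws:iam::123456789012:instance-profile/epic-tenant",
        vpc_subnet_id="subnet-0abc1234def567890",
        region="us-east-2",
    )
    print(validate(tenant) or "all parameters look well-formed")
```

Catching a mistyped subnet ID or region at input time is cheaper than discovering it when the EC2 instances fail to launch.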

From there, the BlueData EPIC software takes care of the rest. BlueData leverages these credentials to spin up the necessary Amazon EC2 instances, inject the appropriate Docker image, and configure the Big Data cluster with all the appropriate security controls (including Active Directory and LDAP authentication as described below) as well as data access controls – whether to Amazon S3 buckets (based on the IAM Instance Profile) and/or to on-premises storage (via BlueData’s DataTap technology).

And once the cluster has been configured and spun up, BlueData EPIC provides a simplified user interface to scale the resources on AWS – thereby eliminating the long lead times and costs associated with adding on-premises infrastructure.

Multi-Tenant Platform for Hybrid Architecture with Security and Control

One of the key advantages of deploying BlueData EPIC in a hybrid architecture is the ability to use a common security model (with AD/LDAP integration) – along with the same application access controls, data access controls, QoS, resource quotas, and other policy controls – across both on-premises and AWS environments. This unified approach to security and control lowers the risk for multi-tenant enterprise deployments; it also reduces the administrative overhead, while enabling faster onboarding of new user groups (i.e. tenants) and use cases for Big Data.

Administrators can ensure secure login and authentication regardless of whether the infrastructure is on a physical server or in the public cloud. They can use the same security policies like AD/LDAP as well as Kerberos authentication, with seamless access to the BlueData EPIC UI or the cluster nodes (as shown in the screenshots below).

Unified AD/LDAP Security Model for On-Premises and AWS

Furthermore, BlueData EPIC provides the ability to control which tenants (e.g. specific user groups or projects) have access to which specific Docker-based application images. Administrators and operations teams can govern which Big Data, data science, and system environments are available to which users or groups of users, and when those images should be made available.

Control over Docker-Based Application Images, by Tenant
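To make the idea of per-tenant image control concrete, the following is a minimal Python sketch of a catalog that records which tenant may deploy which image, and from what date. The `ImageCatalog` class, its methods, and the tenant and image names are all hypothetical, chosen only for illustration; this is not how BlueData EPIC stores or enforces these policies.

```python
from datetime import date


class ImageCatalog:
    """Hypothetical per-tenant allow-list of application images."""

    def __init__(self):
        # tenant name -> {image name: first date the image may be deployed}
        self._grants = {}

    def grant(self, tenant, image, available_from=date.min):
        """Allow a tenant to deploy an image, optionally only from a given date."""
        self._grants.setdefault(tenant, {})[image] = available_from

    def revoke(self, tenant, image):
        """Remove a tenant's access to an image (no error if it was never granted)."""
        self._grants.get(tenant, {}).pop(image, None)

    def can_deploy(self, tenant, image, on):
        """True if the tenant holds a grant for the image that is active on the date."""
        start = self._grants.get(tenant, {}).get(image)
        return start is not None and on >= start

    def images_for(self, tenant, on):
        """Sorted names of every image the tenant may deploy on the given date."""
        grants = self._grants.get(tenant, {})
        return sorted(name for name, start in grants.items() if on >= start)


if __name__ == "__main__":
    catalog = ImageCatalog()
    catalog.grant("data-science", "spark-standalone")
    catalog.grant("data-science", "kafka", available_from=date(2030, 1, 1))
    print(catalog.images_for("data-science", on=date.today()))
```

Keeping the availability date alongside each grant lets an administrator stage a new image for a tenant ahead of time without exposing it early.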

Workload Portability and Reproducibility across On-Premises and AWS

The BlueData EPIC platform includes an App Store that allows customers to install pre-integrated, pre-tested Docker-based application images for common Big Data frameworks and tools. They can also personalize their own App Store by creating new Docker-based application images using our App Workbench – adding their preferred Big Data applications, data processing frameworks, and data science tools.

With this new release, these same application images can be used for deploying containerized clusters either on-premises or on AWS.

Use the Same Docker-Based Application Images on Any Infrastructure

By leveraging the same images – along with a common orchestration approach using BlueData EPIC – our customers can significantly reduce the time, effort, and cost required to spin up clusters on AWS that are 100% identical to their on-premises counterparts (or vice-versa).

This is essential to ensure greater agility and flexibility for data science teams – they need to constantly recreate environments, run parallel environments, and iterate quickly and often with their models in different ways (e.g. to compare and contrast techniques, or to try out a new tool or the latest version). It can also provide cost savings and efficiency for Big Data developers and QA teams throughout the application lifecycle (e.g. to spin up AWS clusters for dev/test/QA and the same cluster on-premises for production). And it can help ensure security, minimize risk, and optimize the usage of existing infrastructure by creating identical environments in the public cloud for backup and disaster recovery.

Without BlueData, reproducing environments like this (i.e. creating the identical environment on-premises and on AWS) can be a time-consuming and error-prone exercise. Not only do you need to recreate the exact same on-premises OS image on an Amazon Machine Image (AMI); you also need to add all the necessary software bits, custom code, patches, etc. in order to achieve the same end state. With BlueData, it’s simply a matter of a few clicks to reproduce the identical environment on AWS and on-premises.

On-Premises Spark Cluster with BlueData EPIC

Identical Spark Cluster on AWS with BlueData EPIC

Additional Platform Improvements 

The spring release also includes some new platform enhancements to further simplify installation and improve performance – and more.

For example, we’ve introduced a new agent-based install option that eliminates the manual steps and customization scripts needed when key-based SSH access (i.e. passwordless SSH) is not available.

In addition, this release incorporates new performance enhancements and tunings for Hadoop workloads such as MapReduce and Hive. Stay tuned for details on the results of our recent performance testing in an upcoming blog post.

Learn More at Strata + Hadoop World in San Jose

If you’ll be at the Strata + Hadoop World event in San Jose next week (March 14-16), you can see all of this in action: stop by our booth to see a demo of the new spring release. Or you can try BlueData EPIC on AWS – just apply here for a free two-week trial.