Big-Data-as-a-Service. On-Prem or in the Cloud. It’s BDaaS

Today, we announced the directed availability of BlueData EPIC Enterprise for AWS – as part of our broader strategy to provide a flexible Big-Data-as-a-Service (BDaaS) platform for both on-premises and public cloud deployments.

Will the real BDaaS please stand up …

Here at BlueData, we are maniacally focused on simplifying and streamlining Big Data infrastructure for deploying the myriad applications and data platforms (both open source and commercial) in this dynamic ecosystem. Our software is BDaaS. And yes, that rhymes with SaaS.

But before I go into what’s new with our BDaaS software, I’d like to share three observations that are key to understanding the Big Data journey for enterprises today:

  • Choice and flexibility are imperative
  • Cloud is not a panacea
  • Data has gravity

Choice and flexibility are imperative

If you follow the trajectory of Spark and Kafka, to name just two projects on the data processing side (there are several others in every category, including NoSQL, data integration, machine learning, etc.), it’s clear that enterprises want to quickly adopt and implement the latest Big Data innovations in order to gain competitive advantage.

Data scientists and developers in these organizations want to use the best possible data platform and/or tool for the job at hand. And when appropriate, they will swap out their current platform or tool when a better option becomes available – as is happening today where Spark is replacing Hadoop MapReduce, and new technologies like Kafka and Flink are gaining in popularity.

To date, Hadoop distributions have carried the torch of packaging core services such as HDFS storage, the YARN resource scheduler, and a collection of compute services focused on batch processing, SQL, etc. While they serve the key purpose of providing cost-effective storage in the form of a data hub or data lake, most enterprises want compute services independent of what is available in their Hadoop distribution.

For example, at the recent Spark Summit in San Francisco, a session from Nielsen highlighted several pain points around on-premises Hadoop environment / cluster sharing, job workflow orchestration, logging / debugging, development tooling, and more. At the end of the day, they created a Docker-based deployment including Spark (data processing), Luigi (workflow orchestration), and Graylog (logging/monitoring).

This is just one example – there are many others. What’s clear is that whether on-premises or in the public cloud, the data science and data analyst teams at most enterprises want to leverage a unique set of data platforms, their own preferred tools, and the latest versions and technologies for their Big Data environments.

BlueData is the first and only software solution that allows enterprises to create their own Big-Data-as-a-Service environments with their data frameworks, applications, and tools of choice. In a separate announcement, we introduced the new summer release for our BlueData EPIC software platform – with new functionality to help organizations quickly deploy the latest versions of their preferred Big Data distributions, applications, and tools.

In doing so, we offer the ultimate in flexibility and choice for our customers. No other Hadoop-as-a-Service or Spark-as-a-Service solution (whether on-premises or in the cloud) provides that kind of flexibility.

Cloud is not a panacea

The public cloud offers significant potential benefits for Big Data deployments – including self-service, agility, elasticity, and a pay-as-you-go model. The CapEx-intensive model for most bare-metal on-premises Hadoop deployments takes weeks or months to get approvals, procure servers, rack and stack them, deploy compliant operating systems, configure storage and networking, etc. By comparison, the on-demand elasticity and OpEx model for Hadoop and Spark in the public cloud can be very attractive.

But what about deploying the data processing frameworks, tools, and analytical applications that your users want? Do the cloud providers offer this flexibility and choice? This is where the challenges begin.

In most cases, the complexity and effort of deploying an enterprise’s preferred Big Data environment (Hadoop distribution, applications, etc.) is much the same on public cloud infrastructure as on bare-metal servers. For example, this blog post describes the steps required just for deploying Hadoop on AWS EC2.

While the public cloud provides easy access to infrastructure (EC2 instances or VMs with storage and networking), it still takes hundreds of clicks and complex configuration choices to deploy Hadoop. Multiply this effort by each of the other tools in a typical Big Data environment and you’ll need a dedicated DevOps team to get it all to work.

The public cloud is also a different operating environment, which makes it particularly challenging for enterprises to manage it in the same way that they manage their own data centers. If they’re not careful, it’s easy for the OpEx bills to add up. It’s also easy to get locked into one public cloud. So naturally, CIOs at most enterprises want the flexibility to leverage multiple cloud services from different vendors – while also ensuring there is governance and visibility for the ongoing costs over time.

Bottom line, the cloud operating model can be very attractive – but it’s by no means a panacea for enterprise Big Data deployments and there are still some significant limitations.

Data has gravity

Perhaps the biggest single limitation is where the data lives. Enterprises have decades of data stored on-premises. Using the public cloud for analytical processing enables greater agility – but how do you manage what data gets migrated and copied to the cloud, what stays, and how it all integrates with existing data workflows? And of course, there may be security, data privacy, data governance, performance, or regulatory reasons for keeping the data on-premises.

Copying terabytes or petabytes of data from one storage system to another storage system – let alone to a cloud storage system – is a potential data management nightmare. It’s slow, complex, and costly. Often the preference is to keep the ‘hot’ data (which could be multiple years’ worth of data in many cases) where it is generated, whether that is on-premises or in the cloud. The challenge is then to figure out how best to utilize these data sources with minimal to no data duplication and movement.
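To put “slow” in perspective, here is a rough back-of-the-envelope sketch in Python. The link speeds and the `efficiency` factor are illustrative assumptions, not measurements from any particular environment, and the function name `transfer_days` is made up for this example.

```python
TB = 1e12  # bytes in a terabyte (decimal)
PB = 1e15  # bytes in a petabyte (decimal)


def transfer_days(data_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to copy data_bytes over a network link.

    link_gbps is the nominal link speed in gigabits per second.
    efficiency (assumed, 0 < efficiency <= 1) is the fraction of that
    speed actually achieved after protocol overhead and contention.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for label, size in (("10 TB", 10 * TB), ("1 PB", PB)):
        for gbps in (1, 10):
            days = transfer_days(size, gbps)
            print(f"{label} over a {gbps} Gbps link: {days:.1f} days")
```

Even with a dedicated 10 Gbps link running at full speed, a single petabyte takes more than nine days to copy, and real-world throughput is usually lower.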

Today, public cloud providers require data to be migrated en masse to the cloud. And in some cases, there are requirements to move data out of cloud storage (e.g. data generated by social apps) to on-prem for integration with other on-prem data. Either way, the options today are less than optimal. So data gravity is perhaps the single greatest contributor to the inertia against adopting the public cloud for enterprise Big Data deployments.

Enter Big-Data-as-a-Service (BDaaS)

The majority of Big Data deployments today are on-premises, but public cloud adoption is accelerating. A new report by Wikibon (“Big Data in the Public Cloud Forecast, 2016-2026”) estimates that worldwide Big Data revenue in the public cloud was $1.1B in 2015 and will grow to $21.8B by 2026 – or from 5% of all Big Data revenue in 2015 to 24% of all Big Data spending by 2026. However, the report highlights ongoing regulatory concerns as well as the structural impediment of moving large amounts of data offsite as inhibitors to mass adoption of Big Data deployments in the public cloud.
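For a sense of what those figures imply, the short Python sketch below derives the compound annual growth rate and the overall market size from the numbers quoted above. These derived values are simple arithmetic on the rounded figures, not numbers taken from the Wikibon report, so treat them as approximate.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values, as a fraction."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted from the report, in billions of US dollars.
cloud_2015, cloud_2026 = 1.1, 21.8
share_2015, share_2026 = 0.05, 0.24

growth = cagr(cloud_2015, cloud_2026, 2026 - 2015)

# Implied size of the overall Big Data market (cloud revenue / cloud share).
# The shares are rounded percentages, so these totals are rough.
total_2015 = cloud_2015 / share_2015
total_2026 = cloud_2026 / share_2026

if __name__ == "__main__":
    print(f"Implied public cloud CAGR: {growth:.1%}")
    print(f"Implied total market, 2015: ${total_2015:.1f}B")
    print(f"Implied total market, 2026: ${total_2026:.1f}B")
```

In other words, the forecast implies public cloud Big Data revenue growing at a little over 30% per year, with roughly three quarters of spending still outside the public cloud in 2026.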

Here at BlueData, we’ve been watching this closely and we knew where the puck was going from the get-go. Our thesis has been that enterprises would eventually embrace public cloud for Big Data, and the fact that we’ve built our platform on Docker containers provides inherent flexibility and portability across on-premises and cloud environments. But we also see the need for enterprises to control and manage the transition of their Big Data workloads (and the data itself) to the public cloud – with a ‘hybrid’ and/or ‘multi-cloud’ approach.

More specifically, we’ve seen use cases for multiple deployment options:

  • Compute and data on-premises
  • All compute and data on a specific public cloud
  • Compute in the cloud with some data on-premises (e.g. PII) and some data in the cloud (e.g. social)
  • All of the above, with the use of multiple public clouds and/or managed data centers (e.g. compute on-premises + compute and data for a specific line of business in cloud #1 + compute and data for a different line of business in cloud #2)

The diagram below illustrates our vision for the BlueData EPIC software platform as the “single pane of glass” for Big Data deployments whether on-premises or in the public cloud:

BDaaS architecture diagram

With the BlueData EPIC software platform, we can help organizations and enterprises across multiple industries to simplify and streamline their Big Data deployments – whether on-premises or in the cloud. Some organizations may adopt public cloud services for the majority of their Big Data analytics. Others will continue to maintain their Big Data deployments on-premises due to the considerations outlined above – including performance requirements, industry regulations, or data gravity issues. And others will likely have a mix of on-premises and cloud environments (from multiple cloud providers) depending upon the user requirements, use cases, and phases in the development lifecycle for their Big Data applications.

There are other providers focused on offering Hadoop-as-a-Service and/or Spark-as-a-Service in the public cloud or as a hosted service. But only BlueData will offer a Big-Data-as-a-Service (BDaaS) software platform that can deliver any Big Data distribution and application on any infrastructure, whether on-premises or in the public cloud. By using Docker containers (secure, embedded, and fully managed), we can be agnostic about the infrastructure – whether physical server, virtual machine, or now cloud at scale. And we can offer the same user experience – through one single pane of glass – with the same enterprise-grade security, isolation, resource controls, and storage connectors (i.e. BlueData’s DataTap technology).

Introducing BlueData EPIC Enterprise on AWS

The first step in the evolution of this BDaaS strategy spanning both on-premises and public cloud is the initial introduction of BlueData EPIC Enterprise running on AWS – in a directed availability program. BlueData has offered a free community edition of BlueData EPIC running on AWS since last year; but until now, BlueData’s enterprise edition was available only for on-premises deployments.

Over time, we will be introducing general availability for AWS, support for other public clouds (e.g. Microsoft Azure, Google Cloud Platform), and the ability to leverage on-prem data with compute in the public cloud using DataTap.

Our goal with the directed availability program is to ensure that the new AWS offer meets customer expectations. After this initial roll out to a select group of customers, the software will be made generally available. This brief technical video provides an overview of how EPIC Enterprise on AWS works:

It’s a relatively simple process. It all starts with provisioning the BlueData EPIC Controller instance via a BlueData-provided AMI (which will be posted in the AWS Marketplace for one-click install at the time of general availability). Until then, we will provide an AMI with a CloudFormation script to launch the BlueData EPIC Controller. With the usual SOCKS proxy setup that is used by most web applications running on EC2 instances, you now have your own ready-to-use BDaaS environment with enterprise-ready features.

As the administrator of this BDaaS environment, you can leverage all the administrative, security, and resource controls of the BlueData EPIC platform:

  • The ability to create tenants (which map to AWS IAM user keys), associate/onboard users with specific roles (e.g. Tenant Admin or Tenant User), and apply quotas for CPU (VM cores), memory, and storage for each tenant. You’ll have the peace of mind of knowing that no tenant can inadvertently rack up thousands of dollars in AWS bills.
  • The BlueData EPIC App Store with a pre-configured set of Docker images including multiple distributions and versions of Hadoop, Spark, Kafka, Cassandra, etc. You can use our App Workbench to create Docker images with your preferred versions of Hadoop or Spark, and your own other tools of choice, and register them in your own App Store.
  • A set of specific Amazon EC2 flavors (instance types), so that you can not only control costs but also ensure that the right flavors (i.e. the right mix of compute and memory) are used.
  • Full visibility into the usage of resources by every tenant, including the type and number of EC2 instances being used. Best of all, the ability to easily distinguish the instances by tenant or cluster and then stop or terminate those instances as necessary.
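To make the quota and visibility ideas above concrete, here is a minimal Python sketch of how a per-tenant quota check and a per-tenant usage summary could work. It is an illustration only and not BlueData EPIC code: the class and function names (`Flavor`, `TenantQuota`, `check_cluster_request`, `usage_by_tenant`) are hypothetical, and storage quotas are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    """An EC2 instance type with the resources that count against a quota."""
    name: str
    cores: int
    memory_gb: int


@dataclass(frozen=True)
class TenantQuota:
    """Hypothetical per-tenant limits set by the site administrator."""
    cores: int
    memory_gb: int
    allowed_flavors: frozenset


def check_cluster_request(quota, running, flavor, node_count):
    """Return a list of quota violations; an empty list means the request fits.

    running is the list of Flavor objects for the tenant's existing instances.
    """
    problems = []
    if flavor.name not in quota.allowed_flavors:
        problems.append(f"flavor {flavor.name} is not allowed for this tenant")
    used_cores = sum(f.cores for f in running)
    used_memory = sum(f.memory_gb for f in running)
    if used_cores + flavor.cores * node_count > quota.cores:
        problems.append("CPU core quota exceeded")
    if used_memory + flavor.memory_gb * node_count > quota.memory_gb:
        problems.append("memory quota exceeded")
    return problems


def usage_by_tenant(instances):
    """Count running instances per tenant and flavor.

    instances is an iterable of (tenant_name, flavor_name) pairs.
    """
    usage = {}
    for tenant, flavor_name in instances:
        usage.setdefault(tenant, Counter())[flavor_name] += 1
    return usage


if __name__ == "__main__":
    m4_xlarge = Flavor("m4.xlarge", cores=4, memory_gb=16)
    quota = TenantQuota(cores=16, memory_gb=64,
                        allowed_flavors=frozenset({"m4.xlarge"}))
    print(check_cluster_request(quota, [], m4_xlarge, node_count=4))
    print(check_cluster_request(quota, [m4_xlarge], m4_xlarge, node_count=4))
```

The point of checking before launching anything is that a request that would exceed the tenant’s quota never reaches AWS, so it never shows up on the bill.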

Tenant administrators and users can use the simple Web-based BlueData EPIC interface to spin up clusters of their choice, but within the limitations of the resource quotas assigned by the site administrator. An existing user of BlueData EPIC would see no difference in the user experience on AWS versus on-premises. And a new user would likely find it much more intuitive and easier to use than Amazon EMR or any other similar service.

Create New Cluster

When a user requests a cluster with N nodes and clicks ‘Submit’, the BlueData EPIC management layer transparently uses the AWS APIs to spin up EC2 instances of the specified flavor, injects the relevant BlueData Docker application image, and orchestrates the entire cluster. As such, EC2 instances (and eventually virtual instances of any public cloud) are utilized as ‘dumb’ Linux hosts to run the BlueData management service. So the end result is a set of EC2 instances running lightweight, embedded BlueData-managed Docker containers.

Any subsequent operation, such as adding or removing nodes from the cluster, results in adding or removing EC2 instances as well as the embedded BlueData-managed Docker containers.
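As a rough illustration of the flow described above, the Python sketch below builds the kind of request a management layer might pass to the EC2 API for an N-node cluster, and works out what has to change when a cluster is resized. This is not BlueData’s implementation. The dictionary keys match the parameters of the EC2 `RunInstances` call (as exposed by boto3’s `run_instances`), but the function names, the `tenant` and `cluster` tag keys, and the AMI and instance IDs are placeholders invented for the example, and no AWS call is actually made.

```python
def build_launch_request(ami_id, flavor, node_count, tenant, cluster):
    """Build keyword arguments for an EC2 RunInstances call.

    The tags let an administrator later tell instances apart by tenant
    or cluster. The tag keys used here are assumptions for illustration.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "ImageId": ami_id,
        "InstanceType": flavor,
        "MinCount": node_count,  # all-or-nothing: require exactly N instances
        "MaxCount": node_count,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [
                {"Key": "tenant", "Value": tenant},
                {"Key": "cluster", "Value": cluster},
            ],
        }],
    }


def plan_resize(current_instance_ids, desired_nodes):
    """Return (number_to_launch, instance_ids_to_terminate) for a resize."""
    if desired_nodes < 0:
        raise ValueError("desired_nodes cannot be negative")
    current = list(current_instance_ids)
    if desired_nodes >= len(current):
        return desired_nodes - len(current), []
    # Shrinking: keep the first desired_nodes instances, terminate the rest.
    return 0, current[desired_nodes:]


if __name__ == "__main__":
    request = build_launch_request(
        ami_id="ami-EXAMPLE",  # placeholder, not a real AMI
        flavor="m4.xlarge",
        node_count=3,
        tenant="marketing",
        cluster="spark-dev",
    )
    print(request)
    print(plan_resize(["i-a", "i-b", "i-c"], desired_nodes=5))
    print(plan_resize(["i-a", "i-b", "i-c"], desired_nodes=1))
```

With boto3 installed and credentials configured, a request like this could be passed along as `ec2_client.run_instances(**request)`; starting the Docker containers on each new host would be a separate step that this sketch does not cover.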

EPIC on AWS - How it Works

You can apply for the directed availability program for BlueData EPIC Enterprise on AWS at this link: And stay tuned for more from BlueData in the coming months as we move to general availability for AWS and more. It’s gonna be BDaaS.