On behalf of the entire BlueData team, I’m thrilled to announce the general availability (GA) of BlueData EPIC on AWS. With this announcement, BlueData is executing on our goal of offering the first and only Big-Data-as-a-Service software platform to support:
- Choice and flexibility: Provides data science teams with the ability to create their own unique Big Data environments using their data frameworks, applications, and tools of choice. With BlueData, they will be able to deploy these environments on-premises, on any public cloud (starting with AWS), or on a combination of the two.
- Security, visibility and control: Enables multi-tenancy with flexible resource controls, in-depth cost reporting, and a unified security model leveraging enterprise authentication (e.g. Active Directory / LDAP). With BlueData, enterprise IT teams can safely control and manage the transition of their Big Data workloads to or from the public cloud.
- Hybrid architectures: Allows the use of in-place analytics with Hadoop or Spark compute running in the cloud (e.g. on Amazon EC2) against on-premises storage (such as your HDFS data lake, network connectivity permitting) and/or cloud storage (e.g. Amazon S3). This is particularly useful for scenarios where there may be large volumes of sensitive data on-premises in addition to data natively generated in the public cloud.
This past June, I wrote a blog post about BlueData’s vision for Big-Data-as-a-Service (BDaaS) and we introduced the directed availability of BlueData EPIC on AWS. Our vision for a “hybrid” and/or “multi-cloud” approach resonated with enterprises of all sizes that want to control and manage the transition of their Big Data workloads (and the data itself) to the public cloud. And the customer response to the BlueData EPIC on AWS directed availability program was overwhelming, so we had to apply some strict selection criteria to ensure that there was a good fit. Here are some of the criteria we applied, with specific customer examples:
- Clear, concrete use cases on AWS: A financial services company wanted to accelerate the time-to-market for a data-driven application on AWS using a combination of Big Data tools preferred by their developers and data scientists: HDFS (Hadoop cluster), Spark 2.0.1 with Jupyter notebook, and Cassandra 3.9. Their existing IT staff didn’t have the AWS DevOps skills to pull this off on their own; and with BlueData EPIC, their team didn’t need these skills.
- Hybrid strategy for their Big Data environments: One of our existing on-premises customers wanted to reduce their infrastructure costs (servers and storage) by offloading some of their existing Big Data workloads to AWS, while using the exact same code and identical Docker application images for Hadoop and Spark as were being used in their on-premises BlueData EPIC implementation. They wanted to avoid manual deployment on EC2 or creating custom AWS AMIs; BlueData EPIC provides this automation and supports their need for a hybrid architecture.
- Onboard and update their own preferred Big Data applications: The QA team at a leading data integration software vendor wanted to test their new product on AWS with different commercial versions of Hadoop (such as Cloudera CDH, Hortonworks HDP, and MapR). With BlueData EPIC, they could quickly create Docker application images with their own code and other Hadoop artifacts; this helped to significantly reduce their QA cycle times and improve team productivity.
The directed availability program for BlueData EPIC on AWS delivered what we wanted and more. While the feedback from customers validated many of our assumptions, we also uncovered many use cases and functional requirements that we didn’t anticipate. And our customers highlighted several value propositions that resonated even more than we expected. I’ll recap a few of those benefits here.
Multi-Tenancy and Resource Quotas on AWS
Multi-tenancy is a foundational capability of the BlueData EPIC software platform and we’ve had this in the on-premises version of our software from the beginning. With BlueData EPIC on AWS, administrators can set quotas for CPU, memory, and storage as well as other policies such as AD/LDAP group mappings, AWS IAM user mappings, and Hadoop Kerberos settings.
Our customers in the directed availability program onboarded multiple teams of developers, QA engineers, and data scientists, and restricted each team to specific resource quotas in order to control costs. We realized that AWS has no mechanism to limit the number of EC2 instances on a per-IAM-user basis (IAM being the user management capability in AWS); instance limits are enforced only on a per-account basis.
These customers found tremendous value in BlueData EPIC’s “tenant” policies on AWS, since they didn’t want to have to create distinct AWS accounts for each team. By deploying BlueData EPIC in a specific AWS account (e.g. their company account or an account dedicated to their line of business), they were able to create multiple “logical” tenants for different teams with specific quotas for CPU (e.g. 100 cores), memory (e.g. 200 GB), and storage (e.g. 1 TB). BlueData EPIC ensures that the EC2 instances spun up by each tenant do not exceed the quotas assigned to that tenant.
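To make the idea concrete, here is a minimal sketch of per-tenant quota enforcement in Python. This is purely illustrative of the concept described above, not BlueData’s actual implementation; the class name, instance sizes, and quota figures are hypothetical.

```python
# Hypothetical sketch of per-tenant resource quota enforcement.
# Illustrates the idea of capping EC2 usage per "logical" tenant,
# not BlueData EPIC's actual internals.

class TenantQuota:
    def __init__(self, cpu_cores, memory_gb, storage_gb):
        self.limits = {"cpu": cpu_cores, "memory": memory_gb, "storage": storage_gb}
        self.used = {"cpu": 0, "memory": 0, "storage": 0}

    def can_launch(self, cpu, memory, storage):
        """Return True if the requested instance fits within the remaining quota."""
        req = {"cpu": cpu, "memory": memory, "storage": storage}
        return all(self.used[k] + req[k] <= self.limits[k] for k in req)

    def launch(self, cpu, memory, storage):
        """Record a launch, refusing any request that would exceed the quota."""
        if not self.can_launch(cpu, memory, storage):
            raise RuntimeError("tenant quota exceeded")
        self.used["cpu"] += cpu
        self.used["memory"] += memory
        self.used["storage"] += storage


# Example: a tenant capped at 100 cores / 200 GB RAM / 1000 GB storage,
# launching instances sized at 8 vCPUs, 32 GB RAM, and 100 GB disk each.
tenant = TenantQuota(cpu_cores=100, memory_gb=200, storage_gb=1000)
for _ in range(6):
    tenant.launch(cpu=8, memory=32, storage=100)   # six instances fit

print(tenant.can_launch(8, 32, 100))  # a seventh would exceed the 200 GB memory quota
```

A check like this, applied before any instance is provisioned, is what lets multiple teams share a single AWS account without one team consuming the other’s budget.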
Big Data DevOps: Abstracting Users from AWS
While AWS is awesome in terms of the breadth of infrastructure and services offered, it can also be overwhelming for inexperienced users. That’s why you often find DevOps teams writing tool chains and wrapper interfaces on AWS so that the average developer can focus on the task at hand. And if you ever decided to use a different cloud provider, you would need to leverage that cloud provider’s specific APIs to recreate those wrapper interfaces.
A key insight we had during the directed availability program was that Big Data application developers and data scientists were not particularly interested in learning the AWS tool chain. And the DevOps engineers at these companies were overwhelmed by the requirements of deploying and managing clustered products like Hadoop, Spark, Cassandra, Kafka, and <name your next Big Data product> on AWS. For example, orchestrating a Hadoop cluster with EC2 instances that could be stopped for a period of time (to save money) and restarted without losing state (including the IP address/FQDN and any metadata) was quite a challenge, to say nothing of scaling the cluster up and down or integrating with Amazon S3.
So our directed availability customers loved the fact that BlueData EPIC provides a “DevOps in a Box” value proposition that significantly reduced the DevOps time and effort required. Not only does BlueData EPIC orchestrate clusters of different Big Data products in a simple and consistent manner, but it also provides them with simple hooks and guardrails (e.g. the BlueData App Store and App Workbench) to help them do their own thing. In other words, with BlueData EPIC they could add new versions of different Big Data tools (whether Hadoop, Spark, NoSQL, or practically any distributed data platform) with the peace of mind of not having to worry about orchestration, resource management, security, and elasticity (scale up/down) on AWS. And perhaps best of all, they’ll be able to use these same hooks with any cloud provider (i.e. reducing their AWS lock-in).
Cost Controls and Visibility
For Big Data workloads, the public cloud is often synonymous with ephemeral jobs: spin up a cluster, ingest data, run the job, perhaps auto-scale to finish the job within a certain period of time, and then terminate the cluster. While BlueData EPIC supports this job-centric use case, we’ve also supported many use cases that require carefully crafted, long-running clusters, whether for data pipelines combining Spark, Kafka, and Cassandra or for tailored data science environments with R, Python, and Jupyter Notebook.
As expected, one of the key concerns that our directed availability customers had with AWS was the ongoing pay-per-use cost of their EC2 instances for Big Data workloads. Ongoing cost visibility was also an equally important administrative concern in the context of a multi-tenant environment. Our directed availability customers really liked some key features of BlueData EPIC on AWS in this area:
- The ability to stop an entire Big Data cluster in one click so that you can save on EC2 compute costs. And the ability to restart the cluster with another click and have all the data, metadata, and networking configurations restored in a matter of minutes.
- The ability for administrators to view all the clusters across all the tenants so they can stop clusters and manage resources as needed.
- The auto-tagging of all BlueData-managed EC2 instances with tenant name and tenant ID to permit cost analysis based on tenant usage.
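The auto-tagging feature above is what makes tenant-level cost reporting possible. As a rough illustration of the idea (with hypothetical tag keys, instance records, and hourly rates, not actual BlueData or AWS billing output), tagged instance usage can be rolled up per tenant like this:

```python
# Illustrative sketch of tenant-level cost rollup from tagged instances.
# The tag keys ("TenantName", "TenantId"), instance data, and rates are
# hypothetical; a real report would draw on AWS billing/usage data.

from collections import defaultdict

instances = [
    {"id": "i-0a1", "tags": {"TenantName": "data-science", "TenantId": "3"},
     "hours": 120, "rate_per_hour": 0.40},
    {"id": "i-0b2", "tags": {"TenantName": "data-science", "TenantId": "3"},
     "hours": 120, "rate_per_hour": 0.40},
    {"id": "i-0c3", "tags": {"TenantName": "qa", "TenantId": "7"},
     "hours": 40, "rate_per_hour": 0.40},
]

# Sum cost per tenant by grouping on the tenant tag.
costs = defaultdict(float)
for inst in instances:
    costs[inst["tags"]["TenantName"]] += inst["hours"] * inst["rate_per_hour"]

for tenant, total in sorted(costs.items()):
    print(f"{tenant}: ${total:.2f}")
# → data-science: $96.00
# → qa: $16.00
```

Because every BlueData-managed instance carries its tenant identity, an administrator can attribute spend to teams without maintaining separate AWS accounts.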
Hybrid Architectures and On-Premises Data Connectivity
Another key takeaway from the directed availability program was that many enterprises often view AWS as an extension to their own data center, much like a co-location facility. As such, we found that the use of Amazon VPC (Virtual Private Cloud) combined with site-to-site VPN and AWS Direct Connect was much more commonplace than we’d initially imagined.
When BlueData EPIC spins up Big Data clusters in this configuration (with Amazon VPC, site-to-site VPN and AWS Direct Connect), the user experience is so seamless that end users can’t tell whether the cluster is on AWS or running on-premises. Security policies like AD/LDAP as well as Kerberos authentication are applied seamlessly to support secure login to the BlueData EPIC UI or the cluster nodes, as shown in the screenshots below.
Centralized AD/LDAP Configuration
AD/LDAP Integration with Cluster Nodes
Perhaps most importantly, with BlueData’s DataTap technology, Hadoop and Spark clusters on AWS can run in-place analytics against on-premises data – specifically against data stored in existing HDFS storage. BlueData also provides the ability to run analysis that combines data on-premises and in the cloud (from Amazon S3) as shown in the screenshots below.
Job Against S3 Cloud Storage
Job Against On-Premises Storage
This “hybrid” data connectivity enables use cases where there may be large volumes of secure on-premises data in an HDFS data lake; this is particularly common in industries such as financial services, healthcare, and retail. With BlueData’s DataTap capability combined with site-to-site connectivity and AWS Direct Connect, our directed availability customers realized that they could reduce capital expenditures on server infrastructure by running their Hadoop and Spark clusters on lower-cost cloud compute. This approach can also mitigate the security and governance concerns of copying sensitive on-premises data to public cloud storage such as S3. With BlueData EPIC on AWS, they can use their secure site-to-site connection to allow Hadoop and Spark clusters running on EC2 to process data in their on-premises HDFS storage: data is streamed securely to the compute cluster for processing, and the results are written back to on-premises storage.
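Conceptually, this works because jobs address storage through a URI scheme rather than a hard-wired location. A minimal sketch of that idea, with hypothetical URIs and scheme-to-backend mapping (not DataTap’s actual implementation), might look like this:

```python
# Rough illustration of the hybrid data-path idea: a job refers to data by
# URI, and the scheme determines which backend serves it. The "dtap" and
# "s3a" mappings and all paths here are hypothetical examples.

from urllib.parse import urlparse

def resolve_input(uri):
    """Map a storage URI to an illustrative (backend, location) pair."""
    parsed = urlparse(uri)
    if parsed.scheme == "dtap":
        return ("on-prem-hdfs", parsed.netloc + parsed.path)
    if parsed.scheme == "s3a":
        return ("amazon-s3", parsed.netloc + parsed.path)
    raise ValueError(f"unsupported scheme: {parsed.scheme}")

print(resolve_input("dtap://datalake/warehouse/events"))
print(resolve_input("s3a://my-bucket/landing/events"))
```

The point of the indirection is that the same compute cluster on EC2 can be pointed at on-premises HDFS or at cloud storage simply by changing the input URI, with no changes to the job itself.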
Now Generally Available: BlueData EPIC on AWS
Today’s GA announcement (the press release is posted here) highlights some of the key benefits we uncovered during the directed availability program for BlueData EPIC on AWS:
- Simplified user experience for both administrators and data science teams, abstracting the AWS-specific infrastructure so they can focus on their Big Data needs.
- Faster AWS onboarding for multiple teams and Big Data workloads, eliminating the need for DevOps expertise and reducing the cost and time involved.
- Greater agility and flexibility, with self-service clusters pre-configured on Amazon EC2 for Spark, Hadoop, Kafka, Cassandra, and other Big Data applications.
- Reduced AWS costs through the use of fine-grained resource quotas, start/stop controls, and cost reporting in a multi-tenant environment.
- Faster time to insights with pre-built cluster integrations to Amazon S3 and in-place analytics against on-premises data.
- Improved data governance with integrations to Amazon VPC (including site-to-site VPN), Active Directory, and Kerberos for authentication.
You can read some of the technical details about EPIC on AWS in the spec sheet here or check out this three-minute video for a demo of how it works:
Whether you’re developing innovative data-driven products, or you’re an analyst using business intelligence tools to uncover insights from Big Data, or you’re a data scientist working on creating a new machine learning algorithm, you need all the right Big Data systems and infrastructure in place to do your job.
But as many organizations have found, the systems and infrastructure required for Big Data analytics can be time-consuming and expensive to implement in an enterprise environment – whether in your own data center or on AWS, it’s exceedingly complex. With BlueData EPIC on AWS, we’re continuing to help simplify and accelerate Big Data deployments for our customers – to ensure faster time-to-insights, lower costs, and faster time-to-value. It’s EPIC.
And now that BlueData EPIC on AWS is generally available, you can apply for a free trial at www.bluedata.com/AWS.