Today we announced version 3.0 of the leading software platform for Big-Data-as-a-Service in the enterprise: BlueData EPIC. This release incorporates powerful new innovations and functionality based on the feedback and input from our customers – delivering even greater scalability, security, performance, and flexibility for their Big Data deployments.
But before I go into detail on what’s new, I want to take a step back and provide some perspective on why I think it’s truly EPIC …
Prior to joining BlueData recently, I worked at Cloudera (through the acquisition of Xplain.io) — and prior to that at Pivotal (via EMC / Greenplum) where I led product management for their Hadoop software business. I still remember my first demo and presentation about Hadoop to a Fortune 500 company back in 2012: I covered the basics of Hadoop and the type of problems it can solve for enterprises. A few years later, I was doing similar presentations focused on increasing Hadoop adoption at other large enterprises. There was less of a need to teach the basics of Hadoop, but the time-consuming complexity of the traditional bare-metal deployment model for Hadoop often stalled adoption. And there were significant challenges in meeting the enterprise-grade security and operational requirements typically taken for granted in traditional data warehousing and database systems.
Fast forward to 2017: I joined the BlueData team and over the past several months I’ve been fortunate to work with some of the largest and most well-respected companies in the world on their Big Data deployments. I immediately realized the power of the BlueData EPIC software platform when, during my very first presentation and demo to a large European financial services company, a seasoned Big Data architect remarked: “I’m getting goosebumps. This is exactly what I’ve been looking for.” Since then, I’ve found that EPIC’s ability to spin up secure and massively scalable Hadoop and Spark clusters in minutes – running in Docker containers – often elicits this type of reaction.
Now, with version 3.0, BlueData has raised the bar yet again – incorporating the feedback and input from dozens of our customers. This new release provides the security, performance, and operational requirements that these large enterprises demand for large-scale Big Data deployments in production. And at the same time, it provides the agility, flexibility, and cost-efficiency of Docker containers – with portability across on-premises, public cloud, and hybrid environments.
And with that context, I’ll dive right into some of the EPIC new features in version 3.0.
Scalable and Flexible Networking with Docker Containers
BlueData EPIC provides cloud-like agility and efficiency for on-premises Big Data deployments, with virtual clusters running on embedded and fully-managed Docker containers. The EPIC platform in effect uses Docker containers as lightweight virtual machines, so until now each container typically needed its own network IP address to communicate with the external world and to interoperate with other containers in the virtual cluster. In this mode, a single physical server can run dozens of lightweight containers and achieve higher resource utilization to improve efficiency. While this is very attractive from a TCO perspective, it also imposes an operational constraint: routable IP addresses must be acquired from the networking team in the enterprise IT department. As a result, some of our large enterprise customers experienced delays in securing the required set of routable IP addresses for their containerized Big Data deployments.
To address this challenge, we’re happy to announce innovative new functionality called the BlueData EPIC Gateway: an optional software feature that completely eliminates the need for a routable IP network address for each container. By removing the requirement for routable IP addresses, we now provide greater flexibility to configure the container network for our customers’ Big Data deployments. And without this network address limitation, virtual clusters on the BlueData EPIC platform can scale to hundreds of virtual nodes and thousands of containers – without impacting the performance or user experience for analysts and data science teams.
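To make the gateway idea concrete, here is a minimal sketch of how a single gateway host can expose many per-container service endpoints without giving each container a routable IP. The hostname, port range, and mapping function below are hypothetical illustrations, not BlueData's implementation:

```python
# Illustrative sketch (hypothetical names and ports): one gateway host
# fronts every service in the virtual cluster, so only the gateway needs
# a routable address.

GATEWAY_HOST = "epic-gateway.example.com"  # hypothetical gateway address
BASE_PORT = 10000                          # hypothetical start of the mapped port range

def map_services(endpoints):
    """Assign each (container, service, port) tuple a unique gateway port.

    Users connect to gateway_host:mapped_port; the gateway forwards traffic
    to the container's private, non-routable address.
    """
    mapping = {}
    for i, endpoint in enumerate(sorted(endpoints)):
        mapping[endpoint] = (GATEWAY_HOST, BASE_PORT + i)
    return mapping

endpoints = [
    ("cdh-node-1", "namenode-ui", 50070),
    ("cdh-node-1", "resourcemanager-ui", 8088),
    ("hdp-node-2", "ambari", 8080),
]
mapping = map_services(endpoints)
```

From the user's point of view, only the gateway address matters; the private container addresses behind it never need to be routable.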
The image below presents a logical view of a virtual cluster running on BlueData EPIC, with both Hortonworks (HDP) and Cloudera (CDH) distributions:
As illustrated in this graphic, the data scientists and analysts accessing the CDH or HDP services running in BlueData EPIC don’t need to know that there is a gateway serving their request – in fact, they don’t even need to know that their cluster is running on Docker containers. It is completely transparent to the user for their day-to-day operations; the cluster behaves and performs exactly as it would in a bare-metal environment, without any modification to the Hadoop distribution. However, from the perspective of enterprise IT networking, the use of the BlueData EPIC Gateway greatly simplifies the infrastructure and security requirements for running Big Data applications in Docker containers.
Advanced Monitoring with Elasticsearch, Metricbeat, and Kibana
All enterprises require monitoring and management for their Big Data environments. Most of our customers integrate BlueData EPIC’s monitoring dashboard, system-level performance data, and service status data into their Network Operations Center (NOC) monitoring systems to get a centralized view of their cluster and service health status. Many of them also make use of specialized Big Data tools like Cloudera Manager or Ambari; within BlueData EPIC, those tools are pre-integrated with CDH and HDP respectively. But our customers have also asked us for the flexibility to add deeper monitoring capabilities, all the way down to the individual container or virtual node.
So in the spirit of constant innovation, in this release we introduced BlueData EPIC Monitoring – based on the widely used Elasticsearch, Metricbeat and Kibana (EMK) framework. This new functionality provides fine-grained monitoring of system-level resources (e.g. CPU utilization, memory utilization, storage) as well as container-level monitoring. As with earlier releases, our customers can easily integrate EPIC monitoring into their existing monitoring systems and pull the data using REST APIs. And, thanks to integration with Elasticsearch, the new EPIC Monitoring feature also provides advanced searching capability. The chart to the right shows an example of monitoring CPU utilization for a containerized Spark cluster running on BlueData EPIC.
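Because the metrics land in Elasticsearch, they can be pulled programmatically as well. The sketch below builds an aggregation query for a container's average CPU and parses the result; the index, field names, and response shape follow common Metricbeat conventions but are assumptions – check your own deployment for the exact schema:

```python
# A minimal sketch, assuming Metricbeat-style field names in Elasticsearch.
import json

def cpu_query(container_name, minutes=15):
    """Build an Elasticsearch query for average CPU over a recent window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"docker.container.name": container_name}},
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}},
                ]
            }
        },
        "aggs": {"avg_cpu": {"avg": {"field": "docker.cpu.total.pct"}}},
        "size": 0,  # we only want the aggregation, not raw documents
    }

def extract_avg_cpu(response):
    """Pull the aggregated value out of an Elasticsearch response body."""
    return response["aggregations"]["avg_cpu"]["value"]

# Example with a canned response; a live system would POST the query to
# an endpoint such as http://<elasticsearch-host>:9200/metricbeat-*/_search
body = json.dumps(cpu_query("spark-worker-3"))
canned = {"aggregations": {"avg_cpu": {"value": 0.42}}}
average = extract_avg_cpu(canned)
```

The same pattern extends to memory or storage fields, and the query body can be sent from any existing monitoring system via a REST call.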
In addition, system administrators for BlueData EPIC can use Kibana to enable advanced customizations and visualization for their monitoring data. Since Kibana puts significant demand on CPU and memory, we advise our customers to activate it only when needed. However, we certainly expect that some of our advanced users will find it very compelling to use the flexibility and control that Kibana provides to monitor and debug their containerized clusters.
Enhanced Multi-Tenancy on AWS and Hybrid Architectures
We’ve continued to see more enterprises embrace the public cloud for some of their Big Data workloads; and the hybrid model is very attractive for those that have significant on-premises environments. However, a virtual cluster created on AWS requires direct Internet access in order to reach RPM repositories or integrate with other cloud-native services. While the Amazon cloud makes every attempt to secure the nodes, our customers recognized that this imposes extra cost to procure public IP addresses for the instances, and also introduces potential risk due to direct Internet exposure.
So in this release, we’ve introduced a new feature that eliminates the need to assign public IP addresses to virtual clusters in Amazon EC2. Instead, the virtual clusters can be configured with a proxy service that provides public Internet access to the virtual nodes. When running BlueData EPIC on AWS, tenants and instances can now be isolated across different Amazon subnets, security groups, regions, and virtual private cloud (VPC) networks. And the whole process can be automated using AWS CloudFormation templates. With this new feature, we’re providing additional support to ensure highly secure and highly available multi-tenant environments on AWS.
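The proxy pattern itself is simple: nodes without public IPs send their outbound HTTP(S) traffic through one shared endpoint. The sketch below shows the idea; the proxy host, port, and environment-variable approach are illustrative assumptions, not the EPIC configuration:

```python
# A hedged sketch of the proxy idea (hypothetical host and port): many
# private instances reach the Internet through a single proxy endpoint,
# so no instance needs its own public IP.

def proxy_env(proxy_host="proxy.internal.example.com", proxy_port=3128):
    """Environment settings that route a node's outbound HTTP(S) traffic
    through the proxy instead of exposing the instance directly."""
    url = "http://%s:%d" % (proxy_host, proxy_port)
    return {
        "http_proxy": url,
        "https_proxy": url,
        # Keep local and AWS instance-metadata traffic off the proxy.
        "no_proxy": "localhost,169.254.169.254",
    }

env = proxy_env()
```

Tools like yum or pip honor these variables, so RPM repositories remain reachable while the instances stay on private addresses.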
Performance Optimizations for Big Data Workloads on Docker Containers
A couple of months ago, we announced the results of Intel’s performance benchmark tests comparing Hadoop running on bare-metal versus a containerized environment on BlueData EPIC. We were very pleased with the outcome: as outlined in Intel’s accompanying white paper, the tests showed comparable, and in some cases higher, performance for the BlueData environment (e.g. BlueData EPIC outperformed bare-metal by approximately 2% for a 50-node Hadoop compute configuration with 10 terabytes of data in HDFS).
As part of this benchmarking effort, our team worked closely with Intel to investigate, test, and enhance performance for Big Data deployments on the BlueData EPIC platform. To this end, we asked Intel to help identify specific areas that could be improved or optimized. The goal was to increase the performance for Hadoop and other real-world Big Data workloads in a container-based environment. The Intel research team investigated potential bottlenecks and identified some potential performance improvements. And now we’ve implemented these enhancements to the EPIC platform, including the following:
- By reconfiguring the network maximum transmission unit (MTU) size from 1,500 bytes to 9,000 bytes and enabling jumbo frames, we’ve seen an increase in performance.
- By optimizing the transfer of storage I/O requests from the BlueData implementation of the HDFS Java client to the BlueData caching node service (cnode), we reduced the latency of transferring storage I/O requests.
- By optimizing the calls to the remote HDFS to determine data block locality, BlueData minimized latency in how quickly YARN can launch jobs in the virtual cluster. This also yielded a significant improvement in performance.
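The intuition behind the MTU change above can be shown with back-of-the-envelope arithmetic: larger frames mean far fewer packets, and thus fewer per-packet overheads, for the same shuffle or HDFS transfer. The figures below are illustrative only, not benchmark results:

```python
# Rough frame-count arithmetic for standard vs. jumbo MTU, assuming
# roughly 40 bytes of IP/TCP headers per frame (an approximation).
import math

def frames_needed(transfer_bytes, mtu_bytes, header_bytes=40):
    """Frames required to move a payload at a given MTU."""
    payload_per_frame = mtu_bytes - header_bytes
    return math.ceil(transfer_bytes / payload_per_frame)

gigabyte = 10**9
standard = frames_needed(10 * gigabyte, 1500)  # standard Ethernet MTU
jumbo = frames_needed(10 * gigabyte, 9000)     # jumbo frames
```

Moving 10 GB takes roughly six times fewer frames with jumbo frames, which is why the per-packet processing savings add up for network-heavy Big Data workloads.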
Kerberos Passthrough for Enterprise-Grade Security with a Hadoop Data Lake
Many large enterprises have adopted the idea of a Hadoop / HDFS data lake to reduce their storage costs and to eliminate data silos with a single shared repository of data. However, in many cases this vision has been hindered by the fact that there was no sure-fire way to segregate and restrict access to the data in the data lake. If a user had access to the data lake, then he or she could access all of the data it contains. To address this challenge, many of our customers have used BlueData EPIC’s DataTap capability to provide secure access controls and ensure that tenants and users can only use a subset of the data stored in an HDFS data lake. DataTap can be configured on a per-tenant basis, with a Kerberos “proxy” user, and integrated with AD/LDAP credentials.
In EPIC 3.0, we’ve introduced a new optional security feature that passes Kerberos credentials from the compute Hadoop cluster to the remote HDFS cluster for authentication. This new feature, called Kerberos Credential Passthrough, is in addition to our existing Kerberos proxy configuration. With the new passthrough functionality, our customers gain even greater security and data governance for their Hadoop data lake. Now BlueData EPIC can replicate the Kerberos interaction between compute and HDFS services in a co-located bare-metal Hadoop deployment – while delivering all the benefits of a containerized environment, with decoupled compute and storage. For example, our customers can mix and match different versions and distributions between their Hadoop compute cluster and their remote HDFS.
The screenshot shown here illustrates the creation of a DataTap utilizing the new Kerberos Credential Passthrough functionality. When configured, this service will automatically generate service tickets and HDFS delegation tokens for the current user and authenticate access rights against the common Kerberos KDC (Key Distribution Center) server. This provides greater control over access to remote HDFS-based data lakes without compromising security and auditing capabilities.
Action Scripts to Automate the Deployment of Complex Clusters
BlueData EPIC makes it easy to deploy and manage complex virtual clusters (for Hadoop, Spark, and other Big Data frameworks) with just a few mouse clicks. However, some of our more advanced customers wanted the ability to run custom scripts in the virtual clusters to automate certain deployment-specific tasks (e.g. installing custom packages, monitoring logs). For example, their users may want the ability to add their own specific version of a data science tool (e.g. a Zeppelin or Jupyter notebook) without rebuilding the Docker-based application image or stopping the virtual cluster in BlueData EPIC.
In this release, we’ve introduced an exciting new Action Script feature. With an Action Script, users and tenant administrators can easily modify specific cluster parameters, review logs, and install required packages after a virtual cluster is launched. This new feature simplifies operational tasks – like adding a custom RPM package to a virtual cluster – and eliminates the need to build an entirely new Docker-based application image or launch a new cluster for these relatively minor changes. Action Scripts can also be used for advanced custom operations to automate certain routine tasks.
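To illustrate what an Action Script run might look like, here is a hypothetical sketch: the same post-launch steps are planned for every virtual node in a cluster. The node names, script contents, and dry-run approach are illustrative assumptions, not the actual BlueData API:

```python
# A hypothetical Action Script plan (illustrative only): run identical
# post-launch steps on each virtual node without rebuilding the image.

ACTION_SCRIPT = [
    "yum install -y my-custom-package",  # hypothetical RPM added after launch
    "pip install notebook",              # hypothetical extra data science tool
]

def plan_actions(nodes, script=ACTION_SCRIPT):
    """Return a per-node command plan (a dry run; a real runner would
    execute these commands inside each container)."""
    return {node: list(script) for node in nodes}

plan = plan_actions(["worker-1", "worker-2", "worker-3"])
```

The key point is that these steps happen on the running cluster, so a minor change no longer forces a new application image or a cluster relaunch.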
Quick Launch Templates For Improved User Productivity
With many of our customers, we’ve noticed that some users of BlueData EPIC often launch the same-sized cluster with certain common parameters and then tear that cluster down after the job is complete. So in this release we’ve introduced a new Named Templates feature; this feature is targeted at those users that want to have a cluster template handy for rapid experimentation without selecting the same cluster parameters every time. This improves user productivity and helps define frequently-used cluster parameters to enable even faster and easier deployment for new virtual clusters.
As indicated in this screenshot, users are provided with a set of Quick Launch Templates – these templates can be easily configured for specific cluster types, flavor sizes, and the number of workers. Users also have the ability to change the order in which the templates are displayed. The new Quick Launch functionality in EPIC 3.0 allows data scientists and other users to set up their preferred templates with commonly-used cluster parameters, and then use those templates to instantly launch those virtual clusters.
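The idea behind these templates can be sketched in a few lines: capture frequently-used cluster parameters once, then merge in any one-off overrides at launch time. The template fields and launch function below are hypothetical illustrations, not the EPIC API:

```python
# A sketch of the Quick Launch Template idea (hypothetical fields): a saved
# template supplies the common parameters; overrides handle the exceptions.

TEMPLATES = {
    "small-spark": {"type": "Spark 2.1.1", "flavor": "small", "workers": 3},
    "large-cdh": {"type": "CDH", "flavor": "large", "workers": 10},
}

def launch_spec(name, overrides=None):
    """Merge a saved template with any one-off overrides into a launch spec."""
    spec = dict(TEMPLATES[name])
    spec.update(overrides or {})
    return spec

# Reuse the template, changing only the worker count for this run.
spec = launch_spec("small-spark", {"workers": 5})
```

This is why the feature saves time for repeated experimentation: the user picks a template and, at most, tweaks one or two parameters instead of re-entering them all.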
New Docker-Based Application Images for Distributed Data Science
The BlueData EPIC software platform provides data science teams with the ability to quickly create new environments with their preferred tools – with open-source Big Data frameworks like Hadoop, Spark, Kafka, and Cassandra as well as notebooks and IDEs like Zeppelin, Jupyter, and RStudio. The choice of tools continues to expand – and the pace of innovation in the data science ecosystem is changing rapidly – so we’re constantly expanding and extending the Docker-based application images that we provide out-of-the-box with the BlueData EPIC platform. In this release, we’ve extended the collection of data science tools in our App Store with several new images including:
- Spark 2.1.1 with Zeppelin, Jupyter, and RStudio
- Spark on Apache Mesos (using Mesos as the resource scheduler)
- R libraries pre-installed on all nodes (sparklyr, devtools, knitr, tidyr, ggplot2, shiny)
- R Hadoop client for accessing HDFS from R
BlueData EPIC also supports a “bring your own app” model that allows our customers to create their own Docker-based application images using BlueData templates, in order to add their preferred Big Data applications and data processing frameworks to the App Store. To learn more about how BlueData EPIC 3.0 helps with large-scale distributed data science operations, check out my colleague Nanda’s new blog post here.
Learn More at DataWorks Summit / Hadoop Summit in San Jose
If you’ll be at the DataWorks Summit / Hadoop Summit in San Jose this week (June 13-15), you can see EPIC 3.0 for yourself: just visit the BlueData booth in the community showcase for a demo of the new release. You can also try BlueData EPIC on AWS – just apply here for a free two-week trial.