In my last blog post, I wrote about a major new release of the BlueData EPIC software platform: version 3.0. Today, we announced a new fall release building on the success of version 3.0 and introducing some of the most sought-after features that our rapidly expanding roster of enterprise customers have asked for.
There are a lot of new features packed into this release, so I’ll just touch on some of the highlights, including container placement using host tags, support for GPU acceleration, deep learning frameworks, new performance enhancements, new monitoring capabilities, and additional security functionality. We’re continuing to innovate and extend our lead as the solution of choice for enterprises that want to run Big Data workloads on Docker containers – delivering greater agility, faster time-to-value, and all the enterprise-grade functionality they expect when running on bare-metal.
I’m also excited to announce the directed availability of BlueData EPIC on Google Cloud Platform (GCP) and Microsoft Azure – as part of our broader strategy to provide a flexible Big-Data-as-a-Service (BDaaS) platform for deployments on-premises, on any public cloud, or in hybrid and multi-cloud architectures.
First, let’s talk about what’s new in the BlueData EPIC fall release (version 3.1) …
Container Placement: Location, Location, Location …
You hear it all the time in the real estate market: it’s all about location. Houses are valued differently and appreciate in value differently depending on their location and environment. Distributed systems are similar in some respects, particularly when deploying Big Data workloads that need to make use of the system resources in the underlying infrastructure.
In an ideal situation, distributed systems like Hadoop can be location-agnostic when deploying various Big Data services. In practice, however, very few distributed systems are homogeneous. When deployed on-premises, the physical infrastructure for these distributed systems is typically built from a heterogeneous mix of physical hosts. Therefore, when running a particular Big Data workload on Docker containers, you need to match the workload to the available underlying physical servers to ensure the appropriate utilization and performance. For example, a memory-intensive workload like Spark would perform better if deployed on servers with a large amount of memory and suitable performance characteristics (e.g. a local SSD for fast storage access). The same principles apply to Big Data deployments in the public cloud or hybrid architectures.
So with this new fall release of BlueData EPIC, we’re introducing an exciting feature built specifically to enable targeted deployment of containerized Big Data workloads on the appropriate physical hosts or cloud instances. This new feature provides “host tags” that the system administrator can define; these host tags enable precise placement of containers on the desired hosts based on various criteria. Deploying specialized Big Data workloads and services on the hosts where they can run most efficiently helps to improve infrastructure utilization and ensure optimal performance.
The following screenshot shows how to create and define tags using the web-based interface for BlueData EPIC systems administrators. Typically, the system administrator will create these tags to define various characteristics of the physical machines that will be part of the cluster. For example, the underlying physical cluster could span two racks and some machines could contain GPUs for deep learning. The system administrator can simply create tags for “Rack = 1, 2” and “GPU Enabled = Yes”.
Once these tags are created, they can be assigned to specific physical hosts (or cloud instances). The following screenshot shows some of the tags applied to specific machines with certain characteristics (e.g. MemoryType = High or Medium, Rack = 1 or 2, GPU Enabled = Yes).
When creating new virtual clusters in BlueData EPIC, these tags can be assigned as “placement constraints” to target specific hosts. For example, in the following screenshot, a worker container for CDH Hadoop is configured to be placed on a physical host tagged as “GPU Enabled = Yes” and “Memory = High”. When the virtual cluster is created, the BlueData EPIC platform will automatically deploy the CDH worker container on a physical host that satisfies those particular placement constraints.
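To make the matching idea concrete, here is a minimal sketch in Python of how placement constraints could be checked against host tags. It is only an illustration of the concept, not BlueData EPIC's implementation or API: the host names, the `HOSTS` dictionary, and the `eligible_hosts` function are all hypothetical, and the tag names simply mirror the examples above.

```python
from typing import Dict, List

# Hypothetical inventory: host name -> tags assigned by the administrator.
# Host names are made up; tag names mirror the examples in this post.
HOSTS: Dict[str, Dict[str, str]] = {
    "host-a": {"Rack": "1", "GPU Enabled": "Yes", "Memory": "High"},
    "host-b": {"Rack": "1", "GPU Enabled": "No", "Memory": "High"},
    "host-c": {"Rack": "2", "GPU Enabled": "Yes", "Memory": "Medium"},
}


def eligible_hosts(
    hosts: Dict[str, Dict[str, str]], constraints: Dict[str, str]
) -> List[str]:
    """Return the sorted names of hosts whose tags satisfy every constraint."""
    return sorted(
        name
        for name, tags in hosts.items()
        if all(tags.get(key) == value for key, value in constraints.items())
    )


# Placement constraints for the CDH worker described above.
cdh_worker_constraints = {"GPU Enabled": "Yes", "Memory": "High"}
print(eligible_hosts(HOSTS, cdh_worker_constraints))  # ['host-a']
```

In this sketch a host qualifies only if it carries every requested tag with the requested value, so adding a constraint can only narrow the set of candidate hosts.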
In this new release, we’ve also introduced the concept of “service roles” that can be defined for a given cluster application image. Combined with host tags, these new service roles allow for precise placement of containers on the right underlying host. For example, containers with a Spark worker role can be placed on servers or instances containing a large amount of memory or a local SSD for fast storage access.
GPU Acceleration and Deep Learning
Over the last several months, I’ve seen increased interest in advanced machine learning and deep learning at many of our enterprise customers. As in Hadoop’s early days when we witnessed individual teams deploying their own Hadoop clusters on dedicated bare-metal servers, the data scientists at some of these enterprises have started to deploy deep learning frameworks (such as TensorFlow) in their own bare-metal environment or on a public cloud service. But deploying and configuring these workloads on GPU-enabled servers or cloud instances is not exactly something a typical data scientist is equipped to do.
Our customers have realized that BlueData EPIC (powered by Docker containers) provides an ideal platform for these use cases. In particular, I’ve been closely working with some of our customers in the financial services industry to enable them to create shared pools of GPU-enabled servers so that their data scientists can easily share GPUs across their workloads and create deep learning clusters on-demand – without needing to wait weeks for the availability of dedicated, specialized GPU-enabled hardware.
BlueData EPIC can now support clusters accelerated with GPUs, and run TensorFlow on GPUs or on Intel architecture CPUs. Moreover, instead of dedicating GPU-enabled servers to a particular workload as in a typical bare-metal environment, BlueData customers can now use the new features in this fall release to allow multiple workloads and tenants to share these GPUs in a containerized environment. As described earlier, their system administrators can use host tags and service roles to specify placement of Docker containers running TensorFlow on the right infrastructure – whether on machines configured with GPUs or CPUs, and in the public cloud or on-premises.
This release also adds new ready-to-run Docker-based application images for deep learning workloads (including both TensorFlow and Intel’s BigDL framework), further expanding our support for a wide range of data science and machine learning tools. For more on this topic, you can refer to a couple of new blog posts by my colleague Nanda on using TensorFlow (here) and BigDL (here) with the BlueData EPIC platform.
Enterprise-Grade Security at Every Step
I remember that back when the first version of Hadoop was introduced, the open source community wasn’t ready for enterprise-grade security capabilities such as encryption, authorization, and AD/LDAP integration. As a community, we’ve come a long way since then to instill confidence in storing and processing sensitive enterprise data in Hadoop. With the recent onslaught of data breaches, financial corporations and other enterprises are becoming increasingly sensitive to data security. The open source community has been rising to the challenge to meet the stringent demands for user authentication, authorization, data-at-rest encryption, and data-in-transit encryption.
While a Hadoop cluster can be made to comply with most enterprise-grade security and governance policies, it is still very difficult to configure and manage these security aspects. It’s challenging in a bare-metal environment, and it can be even more complicated for on-demand containerized clusters created for different lines of business for specific use cases in a multi-tenant architecture.
One of the most sought-after features for data security is Transparent Data Encryption (TDE), which delegates encryption/decryption to the compute cluster when accessing encrypted data in a remote HDFS cluster. In distributed platforms such as Hadoop, multiple users could be accessing that data; encrypting data using key-based encryption schemes such as TDE prevents one user from accessing another user’s data.
With this new fall release, BlueData EPIC extends its enterprise-grade security to provide TDE support. Now it’s easy to connect virtual compute clusters to read/write data in an encrypted remote HDFS cluster (as illustrated in the example below, for CDH running on a Docker container and using BlueData’s DataTap technology to tap into remote HDFS).
Along with TDE, this new release introduces cross-realm support for Kerberos KDCs (Key Distribution Centers). In my blog post about EPIC 3.0, I wrote about our support for Kerberos Credential Passthrough functionality. Now we’ve taken it a step further, to support customers that have created multiple Hadoop clusters on BlueData secured using a single corporate-managed KDC. A typical Hadoop cluster requires ten or more service principals defined in the KDC. This could quickly overwhelm a systems administrator managing multiple clusters in an enterprise, which may end up with thousands of service principals.
With cross-realm KDC support, administrators can create a trust relationship between a local KDC and the corporate KDC. This ensures that the service principals can be created in the local KDC without polluting the corporate KDC. Users will still be authenticated against the corporate KDC, ensuring that only authorized users can access secured Big Data workloads and storage.
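The split between the two KDCs can be sketched with a few lines of Python. In standard Kerberos, a trust that lets users from one realm reach services in another is represented by a `krbtgt/SERVICE.REALM@CLIENT.REALM` principal defined with the same key in both KDCs, while the service principals themselves follow the `service/host@REALM` form. The realm names, host names, and service list below are hypothetical, and the sketch only builds principal names; it says nothing about how BlueData EPIC configures the trust.

```python
from typing import List

# Hypothetical realm names, used only for illustration.
CORPORATE_REALM = "CORP.EXAMPLE.COM"
LOCAL_REALM = "BIGDATA.EXAMPLE.COM"


def cross_realm_tgt_principal(client_realm: str, service_realm: str) -> str:
    """Name of the principal that lets clients authenticated in client_realm
    obtain service tickets in service_realm (standard Kerberos naming)."""
    return f"krbtgt/{service_realm}@{client_realm}"


def service_principals(services: List[str], hosts: List[str], realm: str) -> List[str]:
    """Build one service/host@REALM principal per (service, host) pair."""
    return [f"{svc}/{host}@{realm}" for svc in services for host in hosts]


# Hypothetical cluster: three services on two nodes.
services = ["hdfs", "yarn", "HTTP"]
hosts = ["node1.example.com", "node2.example.com"]

# These principals live in the local KDC, not the corporate one.
local = service_principals(services, hosts, LOCAL_REALM)
print(len(local))  # 6

# Only the trust principal has to be shared with the corporate KDC.
print(cross_realm_tgt_principal(CORPORATE_REALM, LOCAL_REALM))
# krbtgt/BIGDATA.EXAMPLE.COM@CORP.EXAMPLE.COM
```

The point of the sketch is the bookkeeping: the number of service principals grows with every service and host added, but all of them stay in the local realm, and the corporate KDC only needs the single trust principal.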
Increased Performance with Intel Optane SSDs and Intelligent Caching
As part of our business collaboration agreement with Intel, the Intel and BlueData engineering teams have been working together to deliver compute, I/O performance, and storage optimizations that allow BlueData customers to benefit from the latest advancements in Intel platforms. Recently, as a result of this collaboration, Intel published a white paper with groundbreaking results that showed the same or higher performance for Big Data workloads like Hadoop running on BlueData EPIC versus the same workload running on equivalent bare-metal servers.
In this release, we are upping the performance game with support for optional Intel Cache Acceleration Software (CAS) and Intel Optane Solid State Drives (SSDs) to deliver optimal performance for Big Data workloads. Intel CAS delivers selective optimized caching by intelligently identifying frequently accessed files and putting them in the SSD cache – enabling fewer random accesses, increased I/O bandwidth, and increased performance. The combination of BlueData EPIC and optional Intel CAS software can provide our customers with increased performance, reduced I/O bottlenecks, and lower TCO in operating a high-performance Big Data cluster.
Disk I/O and Network Monitoring at the Container Level
In our last release, we integrated the widely used open source Elasticsearch, Metricbeat, and Kibana (EMK) framework to provide fine-grained monitoring of system-level resources and virtual clusters in BlueData EPIC: including CPU, memory, and other key metrics. By using Elasticsearch and Metricbeat, our customers can monitor this data for their clusters and also easily integrate the datasets into their existing monitoring tools. Many of our customers also wanted the ability to drill down from the physical cluster to the virtual cluster and all the way down to individual Docker containers to understand their performance characteristics.
In this new release, we’ve further enhanced this container-level monitoring functionality to enable fine-grained monitoring of virtual clusters and containers for container-level disk I/O and network throughput. System administrators and tenant administrators can precisely understand and monitor individual virtual clusters and containers for better management of resources. For example, a tenant administrator can monitor a virtual cluster to identify a potential performance bottleneck and then dynamically increase or decrease the cluster size to meet the SLA for that particular workload.
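As a rough illustration of how such container-level data could be pulled into an existing tool, the Python sketch below builds an Elasticsearch search body that averages one numeric metric for one container over a recent time window. The query structure is standard Elasticsearch Query DSL, but everything specific is an assumption: the field names are borrowed from Metricbeat's Docker module and may not match what BlueData EPIC actually indexes, and the container name and `container_metric_query` helper are hypothetical.

```python
import json
from typing import Any, Dict


def container_metric_query(
    container_name: str,
    metric_field: str,
    window: str = "15m",
    name_field: str = "docker.container.name",  # assumed field name
) -> Dict[str, Any]:
    """Build a search body that averages metric_field for one container
    over the most recent time window."""
    return {
        "size": 0,  # return only the aggregation, not individual documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {name_field: container_name}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
        "aggs": {"avg_value": {"avg": {"field": metric_field}}},
    }


# Hypothetical container name; the metric field is an assumed Metricbeat field.
body = container_metric_query("cdh-worker-1", "docker.network.in.bytes")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized to JSON and sent to an Elasticsearch `_search` endpoint with whatever client the existing monitoring stack already uses.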
Directed Availability for Google Cloud Platform and Microsoft Azure
Amazon Web Services (AWS) is the clear leader in the public cloud market, and AWS was the natural choice when we first announced our support for running BlueData EPIC in the public cloud: starting with a directed availability program last summer and then general availability last December. Since then, we also added support this past spring for deployments in a hybrid architecture with AWS – with Big Data workloads running on on-premises infrastructure, in the AWS cloud, or both, and with the ability to leverage the elastic compute of the public cloud while keeping data on-premises.
But our vision has always been to run the BlueData EPIC platform on any infrastructure, including any public cloud service. Many of the enterprise customers I’ve been working with don’t want to tie themselves to a single cloud provider for fear of lock-in and price rigidity. However, moving workloads (especially Big Data workloads with very large datasets) from one cloud provider to another is a potentially expensive and laborious process, especially if it involves copying data across cloud providers. In fact, some of the customers I’ve worked with want to keep their data in their own enterprise data centers for governance and compliance reasons – while using the public cloud for dev/test, for off-loading compute processing, and/or for particular use cases where it makes sense based on agility, cost savings, and data gravity considerations.
With BlueData EPIC, we want our customers to be able to provide the same Big-Data-as-a-Service (BDaaS) experience to their users regardless of where the underlying compute and storage infrastructure is – whether on- or off-premises, and irrespective of the cloud service provider. We want to provide the ability to deploy and manage Big Data workloads across multiple cloud environments, and keep their data on-premises in secure private data centers if desired. Leveraging the inherent workload portability provided by Docker containers, they’d be able to move compute workloads from one cloud provider to another in order to take advantage of pricing or other considerations, without needing to move all of their data as well.
The DataTap functionality in the BlueData EPIC platform is a perfect way to enable this hybrid and cloud-agnostic architecture. By decoupling compute from storage, DataTap allows compute clusters running in the public cloud to access data stored in a private on-premises data center in an efficient manner – providing an ideal situation for security-conscious enterprise organizations. And by leveraging the host tags offered in our new release, these organizations could deploy their containerized Big Data clusters on the cloud provider of their choice for that particular workload or use case. The data science teams and analysts accessing and using those virtual clusters wouldn’t necessarily know which cloud provider they’re running on – or even whether they’re running on-prem or in the cloud at all.
With this vision in mind, our intention from the beginning has been to support multiple cloud providers; and our enterprise customers are increasingly adopting both hybrid and multi-cloud strategies for their Big Data initiatives. At the same time, we certainly recognize that both Microsoft and Google have emerged as major players in the public cloud market, so it was a natural decision to choose Azure and GCP as the next cloud services to support.
Effective immediately, we’re extending our public cloud support to Azure and GCP in a directed availability (DA) program – similar to what we did with AWS.
Our goal with this directed availability program is to ensure that we meet customer expectations for running BlueData EPIC on Azure and GCP. After this initial rollout to a select group of customers, the software will be made generally available (targeted for early 2018). Organizations interested in trying BlueData EPIC on Azure or GCP for free during the initial directed availability period can apply here.
Learn More at the Strata Data Conference in New York
If you’ll be at the Strata Data Conference in New York this week (September 26-28), you can see a demo of the new BlueData EPIC fall release as well as support for Azure and GCP; just visit the BlueData booth (#433) in the expo hall.
And make sure you attend the case study session with Barclays UK on “Enabling data science self-service with the Elastic Data Platform” at 1:15pm on Wednesday, September 27th. You can hear how Barclays UK uses BlueData EPIC to deploy Cloudera and Spark on Docker containers, running on Dell EMC infrastructure while preparing for a hybrid cloud deployment model.