Today we announced the new spring release of our BlueData EPIC software platform: a purpose-built solution designed to deliver Big-Data-as-a-Service and accelerate Hadoop and Spark deployments on-premises using Docker containers. With over two dozen new features and enhancements, this release is packed with exciting functionality. But before I dive into what’s new, I want to take a brief look back …
This past fall, I wrote a blog post about version 2.0 of the BlueData EPIC platform. It was an equally exciting and feature-packed release – and it’s been great to see the response and adoption from customers in a wide range of industries across financial services, pharmaceuticals, healthcare, technology, telecommunications, energy, government, education, and more. We struck a chord with these enterprise organizations because they found Hadoop and Spark cumbersome to install and manage – and because they’ve struggled to keep up with the ever-evolving menagerie of Big Data frameworks and tools.
In particular, one of the reasons that more and more customers have adopted BlueData EPIC in the past several months is the need for a *true* multi-tenant Big Data infrastructure. These organizations realized that balancing resources across multiple jobs on a single Hadoop cluster does *not* solve the fundamental requirements of a multi-tenant deployment or help them keep up with the rapidly evolving Big Data ecosystem – which now extends well beyond the original Hadoop framework to include Spark, Kafka, Flink, Cassandra, MemSQL, Splunk, and much more.
The illustration below depicts the typical Hadoop deployment before BlueData: with bare-metal servers and growing cluster sprawl; low utilization leading to high costs; duplication of data across different clusters; and limited ability to take advantage of changes in the Big Data ecosystem due to a relatively rigid and inflexible cluster-by-cluster infrastructure. It also shows what it looks like after their BlueData implementation: with a simple, agile, and container-based deployment; higher utilization and lower costs; elimination of data duplication; and a comprehensive multi-tenant infrastructure that provides the flexibility to support the Big Data ecosystem’s rapid pace of innovation.
From a product standpoint, it’s been very rewarding to see our customers leverage the full breadth and depth of the BlueData EPIC software platform – for use cases ranging from traditional data processing with Hadoop to real-time analytics with newer technologies in the Big Data ecosystem such as Spark, Kafka, and Cassandra. With this usage and feedback from our customers, we’ve come up with some really cool ideas and innovative requirements to further accelerate enterprise-wide adoption in a true multi-tenant architecture – and deliver new value for Big Data deployments.
So on behalf of our entire R&D team, I’m excited to introduce you to the new spring release of the BlueData EPIC software platform. We focused on five key areas in this release, with an emphasis on providing powerful enterprise-grade controls along with simple, easy-to-manage interfaces.
Quality of Service (QoS) controls at a “tenant” level
The traditional concept of Hadoop “multi-tenancy” is limited to users sharing a single Hadoop cluster; it’s based on complex YARN capacity scheduling and associated queues with resource minimums. And it’s not really multi-tenancy.
With the BlueData EPIC platform, we’ve built a true multi-tenant Big Data solution from the ground up. The concept of a “tenant” is similar to that of a tenant in a typical cloud infrastructure. The resources (CPU, memory, and storage) allocated to a tenant define the size of its virtual clusters – whether they are Hadoop or Spark (including different distributions or versions), or other data platforms such as Cassandra or Elasticsearch that may not be governed by YARN.
One of the things that our customers asked for in this multi-tenant model was a mechanism to prioritize resources for a specific tenant (e.g. “Production”, which may have multiple virtual clusters running mission-critical data processing applications) over other tenants (e.g. “Ad-Hoc Analytics”, where data scientists and developers might be experimenting with the latest innovations and tools from the ecosystem).
In this new release, BlueData EPIC now allows administrators to provide a QoS (Quality of Service) multiplier for each tenant. This multiplier specifies a relative prioritization – and hence a prioritized allocation of the resources from a pool of shared infrastructure. This ensures that clusters or jobs in this tenant are more likely to meet their service level agreements (SLAs), even if that means slowing down the jobs in other tenants.
The elegance of this approach is that customers can still use a resource scheduler (such as YARN) to further optimize the resource usage in a specific application that might be executed on a Hadoop cluster in the tenant. But now they get all the benefits of tenant-level controls for QoS.
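To make the multiplier concept concrete, here’s a minimal Python sketch of weighted fair sharing – the tenant names, multiplier values, and pool size are hypothetical, and this is only an illustration of the general idea, not BlueData’s actual scheduling algorithm:

```python
def weighted_shares(pool_size, qos_multipliers):
    """Divide a shared resource pool among tenants in proportion to
    their QoS multipliers (simple weighted fair sharing)."""
    total = sum(qos_multipliers.values())
    return {tenant: pool_size * m / total
            for tenant, m in qos_multipliers.items()}

# Hypothetical tenants: "Production" is prioritized 3:1 over "Ad-Hoc Analytics".
shares = weighted_shares(96, {"Production": 3, "Ad-Hoc Analytics": 1})
# Production receives 72 vCPUs; Ad-Hoc Analytics receives 24.
```

In this toy model, raising a tenant’s multiplier shifts a proportionally larger slice of the shared pool to its virtual clusters, which is the behavior the QoS control is designed to deliver.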
My colleague Tom Phelan recently wrote a blog post about QoS in a multi-tenant Big Data environment using Docker containers. With this release, we’ve addressed the challenge he identified in prioritizing resources for different tenants sharing the same underlying Big Data infrastructure. It’s epic.
Enterprise-grade security and data governance
Another building block for a multi-tenant Big Data infrastructure is the separation of compute and storage. This has been a key area of focus for us here at BlueData and one of our software innovations is a technology called DataTap.
DataTap enables multiple virtual clusters to leverage one or more shared storage platforms, whether it’s a traditional HDFS cluster with DAS (Direct Attached Storage), or an enterprise storage system such as EMC Isilon, or any combination thereof. With DataTap, our customers can accelerate their time-to-insights by enabling analytics teams to access specific data sets in a production data lake without the delays in duplicating data to different clusters – and without the associated challenges of data movement and governance.
On top of our existing Big Data security features, including support for Active Directory and Kerberos, this new spring release of the BlueData EPIC platform delivers additional functionality for even more comprehensive data access controls, auditing, and governance. These capabilities apply in a model with separated compute and storage using DataTap, where multiple virtual clusters can access multiple HDFS-based remote storage systems. Here are some of the key new features:
- Data access policy: This is the ability to provide ‘read-only’ access to a remote storage system in order to prevent pollution of the data lake with new user-specific directories, files, etc.
- Usage auditing: This enables tracking of the specific user who initiated a given job/query on a virtual compute cluster, all the way through to the remote HDFS. The user name and the IP address of the virtual cluster node that initiated the job are logged in the remote HDFS to support auditing. Another key benefit is that the remote HDFS can enforce directory/file-level access controls.
- Interoperability with security and auditing products: With this release, BlueData EPIC provides seamless interoperability with Cloudera Navigator for lineage and auditing of jobs executed using CDH 5.4 or higher. Similarly, we provide interoperability with Apache Ranger when the remote storage is HDP 2.2 or higher.
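As a rough illustration of the ‘read-only’ policy and usage auditing described above, the enforcement logic amounts to rejecting mutating operations on a protected storage system while logging who asked for what. This Python sketch uses hypothetical operation and function names – it is not BlueData’s actual implementation:

```python
# Operations that would modify the remote data lake (hypothetical set).
MUTATING_OPS = {"write", "append", "mkdir", "delete", "rename"}

def is_allowed(datatap_policy, operation, user, logger=print):
    """Enforce a per-DataTap access policy, and log the user and
    operation so the request can be audited downstream."""
    allowed = not (datatap_policy == "read-only" and operation in MUTATING_OPS)
    logger(f"user={user} op={operation} allowed={allowed}")
    return allowed

is_allowed("read-only", "read", "alice")    # permitted: reads don't pollute the lake
is_allowed("read-only", "mkdir", "bob")     # rejected: would create new directories
```

The key design point is that reads pass through untouched, so analytics teams keep full-speed access to the production data lake, while anything that would pollute it is blocked and every request leaves an audit trail.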
Fine-grained storage controls
In addition to connecting to remote storage using DataTap, the BlueData EPIC platform can leverage a portion of the local storage (i.e. disks) to configure a shared, persistent, multi-tenant HDFS. The remainder of the local storage is reserved for system use. This tenant HDFS storage is a popular option and our customers have found many uses for it – ranging from a staging area for sample data sets shared by multiple virtual clusters, to persistent storage for result sets produced from analytics against ‘read-only’ remote DataTaps.
With the new BlueData EPIC spring release, we’ve delivered new features and enhancements that enable platform administrators to better utilize the local storage (including any SSDs) that might be available on the hosts. For example, we now provide the following capabilities:
- Install-time controls over which disks (and how many) are used for persistent, multi-tenant HDFS and for Docker container storage. Available SSDs are commonly assigned to Docker container storage to speed up intermediate processing steps, since that’s where shuffle data is stored.
- Granular tenant-level quotas for HDFS storage as well as Docker container storage. This latter storage is also referred to as “node storage” in BlueData EPIC.
- Placement of the shared HDFS services in their own cgroups, along with a variety of related performance enhancements.
Enhanced App Workbench to “bring your own app”
From its inception, BlueData’s focus has been to accelerate the adoption of Big Data in the enterprise. We’ve included pre-integrated images of open source Hadoop distributions such as Cloudera and Hortonworks, as well as Spark standalone, with our platform. We introduced an App Store model for a variety of popular Big Data applications – including business intelligence, ETL, search, and other analytical tools. And with BlueData’s self-service interface, users can easily spin up multi-node clusters for any of these applications within minutes.
Last year, we first introduced “bring your own app” support and it’s definitely caught on. Our customers have been augmenting the BlueData-provided distributions and applications with Docker images of their own preferred tools – ranging from custom Python libraries (typically for data science) such as NumPy, SciPy, and Pandas; to the latest version of Spark (e.g. Spark 1.6) or different versions and distributions of Hadoop; to their applications of choice for Big Data analytics and business intelligence.
This usage model inspired us to add several enhancements to our App Workbench so that our customers can be self-sufficient and get started with the latest innovations and releases in the Big Data ecosystem – without necessarily waiting on our release cycle or relying on BlueData’s engineering and services teams.
The most significant of these enhancements is that we now provide easy-to-use templates for the most common “bring your own app” use cases. Each image template consists of a Dockerfile and an associated BlueData configuration file. For example, in the BlueData EPIC spring release, we’ve provided templates that allow our customers to:
- Create and register Docker images of CDH 5.x versions in their own App Store. By specifying a few simple configuration parameters, such as the location of the relevant Cloudera parcels, this template packages all the necessary software and allows administrators to make new versions of CDH 5.x immediately available to users.
- Create Docker images of new versions of Spark 1.x.
- Create Docker images of BI/ETL tools such as Platfora, Splunk for Hadoop (Hunk), or Trifacta as edge nodes to a Hadoop cluster.
Support for new distributions and data platforms
We’ve also significantly expanded the pre-integrated images provided in our standard App Store – including support for real-time data pipelines. For example, the BlueData EPIC platform now includes these additional applications and frameworks out-of-the-box:
- Pivotal HD with Ambari, Apache HAWQ, and SpringXD
- Apache Geode / Pivotal GemFire
- Apache Kafka
- Apache Cassandra (DataStax)
While it’s exciting that our App Store includes these additional frameworks and applications, we expect that our customers will continue to add many more – to leverage BlueData EPIC’s resource pooling, security, ease of deployment, and multi-tenant capabilities for their own Big Data tools of choice.
As I mentioned at the outset of this blog post, the ever-evolving and expanding ecosystem of Big Data tools demands a highly flexible approach. Our goal is to make it as fast, simple, and cost-effective as possible for our customers to deploy their preferred Big Data applications in a true multi-tenant environment – and to enable rapid prototyping, testing, and development of emerging tools and frameworks as they become available. That’s our vision for Big-Data-as-a-Service: a self-service, elastic, and secure platform for our customers to run the Big Data tools they need, when they need them, to accelerate time-to-insights.
If you want to see all of this in action and you’ll be at Strata + Hadoop World next week (March 29-31) in San Jose, stop by the BlueData booth, where we’ll be featuring demos of our new spring release. And to learn more about Big-Data-as-a-Service, mark your calendar for a session at the event with my colleagues Tom Phelan and Joel Baxter on “Hadoop in the cloud: Good fit or round peg in a square hole?” at 1:50pm on Wednesday, March 30.