Today we announced a major new release of the leading Big Data infrastructure software platform: BlueData EPIC 2.0. Version 2.0 brings game-changing innovations that make Big Data deployments on-premises even easier, faster, and more powerful.
For a quick overview of BlueData EPIC and what’s new, you can check out the demo below. And then read on for more detail on what we’ve done and why it’s truly EPIC.
The Journey to BlueData EPIC 2.0
Just a year ago, we launched the BlueData EPIC software platform at Strata + Hadoop World in New York and our version 1.0 product won the Startup Showcase award. Our unique value proposition of agility, security, and cost savings by virtualizing Big Data infrastructure was an immediate hit: we tapped into the need for instant Hadoop and Spark clusters with a cloud-like experience à la Amazon EMR (Elastic MapReduce), delivered in an on-premises deployment.
Since then, I’ve been fortunate to work with many customers who have implemented our software for a wide range of Big Data use cases. They provided valuable initial feedback that was incorporated into our summer release: EPIC version 1.5. This customer input resulted in enhancements such as auto-provisioning of management consoles including Cloudera Manager and Apache Ambari for easier troubleshooting of Hadoop clusters, as well as comprehensive Kerberos authentication for Hadoop services on top of the LDAP and Active Directory security we delivered in version 1.0. And, at the same time, we introduced our initial support for running Hadoop and Spark on Docker containers (more on this below).
As we’ve added new deployments and I’ve met with more and more customers, I’ve continued to encounter three key themes:
Getting physical servers for a Big Data project is a barrier to adoption. Securing multiple physical servers for on-premises Hadoop or Spark deployments is a significant impediment for even the largest organizations. This is especially true for enterprises that are just getting started with Big Data; they often evaluate different tools across a range of different use cases before they zero in on a well-defined production use case. So they need flexibility in their initial deployment.
We’ve seen some smaller companies use cloud services like Amazon EMR to avoid the cost and complexity of this challenge. But for most mid-size to large enterprises, putting their data and systems in the public cloud is simply not an option for their Big Data initiative. So (somewhat ironically for a vendor introducing virtualization for Big Data infrastructure) we were often asked if we could install the BlueData EPIC platform on a virtual machine; otherwise, they needed to wait for hardware. For example, a large organization in the financial services industry wanted to leverage their existing scale-out dynamic compute infrastructure where the lowest unit of infrastructure was a VM.
The initial releases of the EPIC platform provided a turnkey solution to virtualize the infrastructure for Big Data distributions, but it needed to be installed on physical servers with a Linux operating system. Version 1.x of our EPIC software created virtual machines running on that physical infrastructure and spun up virtual Big Data clusters. Installing EPIC software on a set of VMs (as opposed to physical servers) would have meant running VMs within VMs, which is neither recommended nor supported. Once customers have the hardware to install our software, they can spin up virtual Hadoop or Spark clusters on demand within minutes. But for customers that couldn’t readily procure a physical server to get started, this posed a challenge even for us.
The rapid growth and adoption of Apache Spark in the enterprise. Spark continues to generate more and more interest and adoption is accelerating – with an emphasis on using Spark for faster, near real-time data processing. For many organizations, Spark has become the catalyst for their Big Data initiative. BlueData has offered Spark with our EPIC platform since version 1.0 – providing the ability to quickly create Spark clusters with Hadoop/YARN (i.e. as part of the open source Hadoop distributions we package) or in standalone mode. The majority of our customers prefer the Spark standalone mode, and they love BlueData’s multi-tenant resource management model, which allows users to spin up their own instant Spark clusters via self-service.
But interest in Spark is quickly expanding beyond a core cadre of data scientists in these organizations. As just one example, a large healthcare company cited Spark-as-a-Service and told me they wanted those same capabilities on-premises for their business analysts. These and many other similar interactions made it clear to me that enterprises don’t just want Spark clusters accessible by a handful of data scientists or developers using command line and programming tools. They need a comprehensive solution for Spark cluster creation with web-based user interfaces to write their programs and visualize data, extending Spark adoption beyond data scientists to the broader business analyst community.
It’s all about the Apps. Everyone in this field loves products that help extract value from their data; they want tools to help them visualize, search, and slice & dice their data. As a matter of fact, most business analysts don’t really care all that much about the intricacies of Hadoop, Spark, YARN, Spark RDDs, and so on. What they really care about is time to insight, and that means having the right analytics tool for the right job. The underlying data platforms and infrastructure are a means to that end. They want to go quickly from discovering data using search or other easy-to-use interactive approaches, to profiling, wrangling, analyzing, and visualizing the relevant data sets with their preferred business intelligence or analytics tools, and then to creating the business metrics and KPIs that help drive business decisions.
Over the past several months, I’ve worked with many IT organizations focused on deploying and scaling their Hadoop or Spark infrastructure. But in reality, their business users want an app-first focus where Big Data analytics applications are the priority; they want the Hadoop and Spark compute services to be created or managed transparently, deployed behind-the-scenes to support their data analysis flow from discovery to actionable metrics.
I took these and other data points from various customer interactions to our engineering team and we went to the whiteboard to map out our path to EPIC 2.0 and beyond. We decided to be bold, and go big.
“Whatever you can do, or dream you can…begin it; boldness has genius, power and magic in it” – Johann Wolfgang von Goethe
So What’s New in BlueData EPIC 2.0
Thanks to BlueData’s rock-star engineering team, we’ve packed a lot of exciting new functionality into BlueData EPIC version 2.0. These new capabilities align well with each of the three themes I outlined above.
Docker containers for enterprise deployments, with ultimate flexibility. In BlueData EPIC 2.0, we’re introducing support for Docker containers in our enterprise edition – enabling our customers to run Hadoop and Spark on containers in a production deployment. Here’s what Nick Stinemates, head of business development and technical alliances for Docker, has to say about what we’re doing:
“BlueData continues its innovation in Big Data infrastructure by leveraging the power of Docker to improve performance and accelerate the development lifecycle. BlueData’s integration with Docker provides an enterprise-grade solution for running Hadoop and Spark on containers, enabling organizations to scale linearly and improve operational efficiencies.”
Timing is everything, and we are fortunate to be in the right place at the right time, as the Docker open source community is quickly maturing the container APIs for adoption by software vendors like us. That having been said, it was no mean feat to make this happen. So I’m very proud of our engineering team, which took the learning from our prior releases to build the infrastructure capabilities necessary for the entire lifecycle spanning container provisioning, orchestration, load balancing, storage, networking (e.g. IP management) and security (e.g. tenant isolation with VLANs). It’s all packed in a turnkey software platform tailored for Big Data, with an intuitive and simple user experience for deploying Hadoop and Spark environments.
Our customers won’t even be able to tell that our software platform uses Docker containers under the hood. But Docker containers provide us with the building blocks to transform Big Data infrastructure. Containers are super lightweight, so they will help ensure comparable or even better performance when compared to bare-metal (especially for CPU-intensive jobs) and significantly better server utilization. Containers use the host OS kernel so there isn’t the overhead of a guest OS (like you have in VMs); this means streamlined OS administration, especially around ongoing security patches. More importantly, the time to spin up a cluster is 5X faster than with virtual machines. Packaging and deploying Big Data applications (e.g. Hadoop distributions, business intelligence and analytics tools, or custom Big Data apps) will be simple, straightforward, and streamlined; in fact, application deployment is why Docker containers became popular in the first place.
And since our EPIC platform will now use Docker containers for Big Data cluster nodes, enterprises can now deploy our software on virtual machines (in addition to physical servers). This means ultimate infrastructure flexibility for deploying BlueData EPIC software. And yes, that means running a Hadoop or Spark cluster on a container, on a virtual machine, on a physical server. It’s a bit mind-boggling.
And this is just the beginning. I’ll leave it to your imagination where we are headed next with containers (stay tuned).
Extending Apache Spark to business analysts. BlueData EPIC 2.0 also provides greater integration and support for new Spark innovations, including Apache Zeppelin for data exploration and visualization. By enabling self-service Spark clusters pre-integrated with web-based Zeppelin notebooks, EPIC can accelerate Spark data analysis for business analysts that may have less technical expertise than traditional data scientists. In addition, EPIC 2.0 introduces support for SparkR, Spark Streaming, MLlib, Spark Streaming-SQL, and a Hive metastore to store/reuse your table definitions. With BlueData EPIC, enterprises can simplify and accelerate their deployment of Spark on-premises – either with Hadoop or in standalone mode, independent of Hadoop.
With EPIC version 2.0, BlueData delivers a comprehensive Spark solution for business analysts that don’t have the command line expertise of most Spark developers and data scientists. Our integration with Zeppelin notebooks provides a simple, out-of-the-box user interface to learn Spark and visualize Spark results. You can learn more in a new blog post on Spark and EPIC 2.0 from my colleague, Nanda, here.
App Store for one-click deployment of Big Data tools. With EPIC 2.0, we now offer an enhanced “App Store” for common Big Data applications. In addition to Spark and open source Hadoop distributions from Cloudera and Hortonworks, software applications from partners such as AtScale, Arcadia Data, Platfora, and Splunk are now included in the App Store and available via one-click deployment. Docker images of these products will be available in the App Store for fully automated self-service deployment – allowing our customers to move quickly beyond clusters to business analytics.
BlueData has partnered with industry leaders in multiple application categories including search, business intelligence, visualization and OLAP to accelerate time to business value with Big Data analytics. With EPIC 2.0, our customers will benefit from one-click deployment of these tools with a dedicated Hadoop or Spark cluster under the hood. The EPIC platform not only provisions the Docker container with the appropriate software, it also automates various configurations so that the end user can launch the application and get started.
Some of our partner products offer a free trial period in our App Store while others require a trial license key to be input into the web interface (once the application has been provisioned). You can reach out to us or contact the software vendor directly to get started. This release also expands the scope of BlueData’s “bring your own app” support by allowing users to quickly add images for other Big Data applications or data processing platforms (such as Kafka or Samza) to the App Store.
These are just the highlights of what’s new in EPIC 2.0 – there are dozens of other enhancements in this release, ranging from user interface updates to security and auditing improvements. For a brief recap of the new functionality in EPIC 2.0, you can refer to the following presentation:
If you’re going to Strata + Hadoop World in New York City, make sure you stop by our booth in the expo hall to see a demo and learn more about BlueData EPIC 2.0. And don’t miss my session on “Requirements for secure, multi-tenant Hadoop” on Wednesday, September 30th.