
The Proof is in the Pudding (I Mean, in the Benchmarking)

Back in the fall of 2012, we started BlueData with the firm belief that Big Data workloads would run in a virtual environment – and achieve the inherent virtualization benefits of flexibility, agility, and cost-efficiency – without paying a performance penalty.

At the time, we were in a fairly small fringe group of believers in this premise. But we persisted. Early on we transitioned from a Big Data virtualization platform based on type 2 hypervisors (virtual machines) to one based on embedded Docker containers, because the use of containers resulted in less CPU and memory overhead.

To boost the performance of Big Data workloads running on our BlueData EPIC software platform, we developed IOBoost, which provides application-aware data caching. And we implemented our DataTap technology to accelerate the access of Big Data workloads to remote persistent storage. This was key as it allowed us to offer independent compute and storage scalability – a primary value proposition of virtualization / containerization that permits flexibility, agility, and scalability of workloads.

We ran our first benchmarks a couple of years ago. We wanted actual proof that the BlueData EPIC platform could run Big Data workloads like Hadoop and Spark without a performance penalty. The goal of this benchmarking was to compare the performance of the same workload on the same physical cluster of servers, running the same Hadoop distribution, on bare-metal versus on BlueData EPIC.

We based our initial benchmarking on HiBench, a benchmark suite driven by a set of shell scripts and distributed under the Apache license. HiBench is designed to stress-test Hadoop clusters while measuring speed and performance. For our benchmarking we ran tests using Enhanced DFSIO, TeraGen, and TeraSort.

We found that the Big Data jobs ran as fast on the EPIC platform as on bare-metal, and for some of these micro-benchmarks even faster. The benchmarking results (using 1 terabyte of data) are summarized in the chart below. We used a single virtual node per physical host to demonstrate that there is minimal to no overhead with the BlueData EPIC platform.

The execution times for bare-metal were used as the baseline (i.e., 100%); the bars in the chart above show the execution time for the same test on BlueData EPIC as a percentage of the bare-metal execution time (lower is better). BlueData EPIC was significantly faster than bare-metal for the TeraGen and DFSIO Write tests. BlueData EPIC was just slightly slower than bare-metal for the TeraSort and DFSIO Read tests.
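The normalization described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the timings are hypothetical placeholders, not our measured results.

```python
def relative_execution_time(baseline_seconds, candidate_seconds):
    """Return the candidate run time as a percentage of the baseline.

    100.0 means identical run times; below 100.0 means the candidate
    finished faster than the baseline (lower is better).
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * candidate_seconds / baseline_seconds


# Hypothetical (bare-metal, EPIC) timings in seconds. These are made-up
# placeholder values for illustration, NOT benchmark results.
example_timings = {
    "test_a": (200.0, 180.0),
    "test_b": (400.0, 410.0),
}

relative = {
    name: relative_execution_time(bare_metal, epic)
    for name, (bare_metal, epic) in example_timings.items()
}

for name, percent in relative.items():
    print(f"{name}: {percent:.1f}% of bare-metal execution time")
```

With these placeholder numbers, the first test would plot below the 100% baseline (faster than bare-metal) and the second slightly above it (slower).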

But we were not satisfied with this. Micro-benchmarks are good, but they are not always comparable to real-world Big Data workloads. We wanted to prove that BlueData EPIC ran “real world” Big Data workloads without sacrificing performance. To that end, in 2015 we entered into a strategic technology and business collaboration with Intel and embarked on a very detailed performance analysis study.

In this study, we selected the BigBench benchmark. BigBench was developed specifically to address real-world use cases. Together, we spent a lot of time and went through many benchmark iterations and bottleneck analyses. What came out of this effort is nothing short of breakthrough.

Again we started with an apples-to-apples comparison for environments running on bare-metal and with BlueData EPIC, using the same Hadoop software and hardware configuration.

We ran the benchmarks on 10-, 20-, and 50-node configurations in order to be sure that horizontal scalability was not a gating factor. The results showed that Big Data jobs ran as fast on the BlueData EPIC platform as on bare-metal, and in some cases slightly faster. For example, benchmark tests showed that the BlueData EPIC platform demonstrated an average 2.33% performance gain over bare-metal (for a configuration with 50 Hadoop compute nodes and 10 terabytes of data).

The chart below shows the overall performance of BlueData EPIC compared to bare-metal on 10-, 20-, and 50-host clusters (in this case, higher is better):

This chart shows the ratio of BlueData EPIC compared to bare-metal performance across three test runs – overall and in each of the three BigBench benchmark phases (power, throughput, and load).
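For readers who want to reproduce this kind of comparison on their own clusters, here is a minimal sketch of how a performance ratio and an average percentage gain could be derived from per-run scores. It assumes a score where higher is better and averages the per-run gains; the white paper's exact averaging method may differ, and the function names and scores below are hypothetical, not our results.

```python
from statistics import mean


def performance_ratio(baseline_score, candidate_score):
    """Ratio of candidate to baseline score (higher is better).

    1.0 means parity; above 1.0 means the candidate outperformed
    the baseline.
    """
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return candidate_score / baseline_score


def percent_gain(ratio):
    """Convert a performance ratio into a percentage gain over baseline."""
    return (ratio - 1.0) * 100.0


# Hypothetical (bare-metal, EPIC) scores for three test runs. These are
# placeholder values for illustration, NOT published benchmark numbers.
example_runs = [(100.0, 102.0), (100.0, 101.0), (100.0, 103.0)]

ratios = [performance_ratio(bare_metal, epic) for bare_metal, epic in example_runs]
average_gain = mean(percent_gain(ratio) for ratio in ratios)

print(f"ratios: {[round(r, 3) for r in ratios]}")
print(f"average gain over baseline: {average_gain:.2f}%")
```

A ratio above 1.0 corresponds to a bar above the parity line in a higher-is-better chart.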

The performance gains over bare-metal are due to BlueData’s IOBoost and DataTap technologies.

However, our continued focus on (and success in) achieving maximum performance is also a testament to the ongoing collaboration between Intel and BlueData to investigate, benchmark, and improve our software platform. Together, the collaboration resulted in this unprecedented milestone: proof of comparable – and in some cases even better – performance for Big Data workloads running in a container-based environment versus a bare-metal environment.

We announced these exciting results today. And the details of the benchmarking methodology and results are published in a new Intel white paper posted here: “Bare-metal performance for Big Data workloads on Docker containers.”

So now we have the evidence to back up what we knew to be true all along. We have the benchmark results proving that real-world Big Data workloads can run as fast on virtualized / containerized environments (with BlueData EPIC) as they do on bare-metal environments. The proof is in the pudding.