
Hadoop Virtualization: It’s About Time (and Value)

“Whoever wishes to foresee the future must consult the past.” – Machiavelli

While many amazing technologies have shaped the data center, the history books will certainly keep a warm spot open for “Virtualization”. Virtualization is not a new idea (it dates all the way back to the 1960s), but it has had an incredible impact during the last decade, largely due to VMware’s innovative approach to challenging the “one workload, one box” paradigm. As of 2015, Gartner estimates that about 75% of x86 server workloads are virtualized.

There are few technologies that have had such a positive impact on customer TCO while also being a boon to the technology industry. Case in point: in 2004, approximately 6 million new server units were sold, and estimates put average server utilization at just 6-12% (or less).

Fast forward to 2015: we’re looking at an estimated 11 million new server units for the year, while typical server utilization has more than doubled over the decade (with 50% utilization or even higher now within reach). Companies like Intel and Dell are selling more processors and servers, while customers are getting more than 2x the utilization out of the technology. These customers often see TCO reductions of 50% or more, and they gain agility from much shorter server provisioning times.

Hadoop Virtualization or Bare Metal?

However, even in 2015, some workloads are still not virtualized. “Big Data” (and Apache Hadoop in particular) remains one of the last bastions of the dedicated physical server. The vast majority of Hadoop workloads still run on bare metal. But the time has come for virtualizing Hadoop (as well as related Big Data technologies like Apache Spark).

Bringing the Value of Virtualization to Big Data

Regardless of industry, company size, or geographic location, two questions apply to every enterprise Big Data strategy and project. The first is “how much will this cost?” The second is “how long will this take?” Total cost of ownership (TCO) is front and center in the Big Data space today. So is time to value: as I wrote in a previous blog post, Hadoop and related Big Data tools are too complex and take too long to deploy. And it’s no longer just Hadoop: here at BlueData, we’re seeing a huge spike in interest in deploying Spark. The same questions apply to Spark; it’s just earlier in the adoption cycle.

So why hasn’t the Hadoop community embraced virtualization? Concerns about I/O performance made virtualization a taboo subject when Hadoop was first introduced ten years ago. But a number of studies and performance tests have since demonstrated that virtualization is a viable (and attractive) option for Hadoop. And with the rapid adoption and popularity of container technologies like Docker, there are now lightweight approaches to virtualization that further minimize the I/O performance impact for Big Data workloads like Hadoop and Spark.
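
To make the container approach concrete, here is a minimal sketch of what “lightweight virtualization” can look like in practice: launching a standalone Spark master as a Docker container from Python. This is illustrative only; it assumes the Docker SDK for Python and the public apache/spark image, and it is not a description of any particular vendor’s implementation.

```python
# Minimal sketch: start a standalone Spark master in a Docker container.
# Assumptions: the Docker SDK for Python is installed (pip install docker),
# the Docker daemon is running locally, and the public apache/spark image
# is available. Image name, command path, and ports are illustrative.
import docker

client = docker.from_env()

# Launch the Spark master process in a container and expose its ports.
master = client.containers.run(
    "apache/spark",
    command="/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master",
    name="spark-master",
    ports={"7077/tcp": 7077, "8080/tcp": 8080},  # cluster port and web UI
    detach=True,
)
print("Spark master container running: %s" % master.short_id)

# Worker containers can be started the same way, pointing at
# spark://<master-host>:7077, to assemble a small standalone cluster.
```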

It’s about time we acknowledged that the dedicated bare metal server approach to deploying Hadoop is not only outdated; it’s also slow and inefficient. Intel and BlueData have teamed up to make it easier and more cost-effective for enterprises to adopt Hadoop and Spark. By leveraging the power of virtualization and container technology, we can help you achieve a significant reduction in the TCO of your Big Data project. You can spin up virtual Hadoop or Spark clusters in minutes (instead of the weeks or months it may take to build a physical cluster). And you still get the performance you’re used to with bare metal.
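
Part of the appeal is that nothing changes from the application’s point of view: a Spark job submitted to a containerized cluster looks just like one submitted to physical servers, apart from the master URL. Here is a minimal PySpark sketch, assuming PySpark is installed and a standalone master is reachable at the hypothetical address spark://spark-master:7077:

```python
# Minimal sketch: the same job code runs against a containerized cluster
# or a bare metal one; only the master URL differs. The address below is
# a hypothetical placeholder: point it at your own cluster.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("virtual-cluster-smoke-test")
        .setMaster("spark://spark-master:7077"))
sc = SparkContext(conf=conf)

# A trivial job to confirm that work is distributed across the executors.
evens = sc.parallelize(range(1000000)).filter(lambda n: n % 2 == 0).count()
print("Even numbers counted: %d" % evens)

sc.stop()
```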

It’s about time we applied the benefits of virtualization to Big Data. It’s about time to deployment. It’s about time to value. It’s about time to insights. For a deeper look at how to apply virtualization to Hadoop and Spark workloads in the enterprise, read Intel’s new white paper at this link or in the SlideShare below: