Hadoop Virtualization: The Next Big “V” of Big Data

If you’re familiar with Big Data, you’ve likely heard about the 4 V’s: Volume, Variety, Veracity and Velocity. The next big trend to disrupt Big Data and Hadoop deployments is yet another “V”: Virtualization. For more than a decade, traditional application workloads have taken advantage of the numerous benefits of virtualization such as elasticity, scalability, agility, ease of maintenance, and cost savings due to consolidation and better utilization of resources.

As Big Data continues to achieve mainstream status within enterprises, the virtualization of Hadoop and other Big Data workloads will become a key mechanism for organizations to future-proof their Big Data infrastructure investments.

Some early adopters and traditionalists in this market have advised against virtualizing Hadoop workloads – citing I/O performance degradation, since Hadoop was originally designed to run in a bare-metal, physical environment. This is similar to the situation that prevailed in the early 1990s, when data centers relied almost entirely on physical servers. That scenario changed completely with the arrival of server virtualization and hypervisors in the early 2000s, and today over 70% of all data centers run virtualized server environments. Moreover, as data centers rapidly upgrade from 1-Gbit Ethernet switches to 10-Gbit and even 40-Gbit or 100-Gbit switches, network bandwidth will no longer be a bottleneck for virtualization I/O performance.

Another interesting observation is that while network bandwidth has doubled and quadrupled, disk bandwidth is not progressing at the same rate. This creates a need for a smarter, more innovative approach that effectively decouples compute and storage for Big Data workloads. Yet this approach runs contrary to today’s traditional method of co-locating storage and compute on bare-metal physical servers with direct-attached storage (DAS).
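To make the decoupling idea concrete, here is a minimal sketch (not BlueData’s implementation) of a compute-only Spark job that reads its input from a shared, remote HDFS endpoint over the network rather than from local direct-attached disks. The hostname, port, and path are hypothetical placeholders.

```python
from pyspark import SparkContext

# Minimal sketch of compute/storage separation: the Spark job runs on a
# compute tier (e.g., virtual machines) that holds no data locally, and
# pulls its input from a shared HDFS endpoint over the network.
# "shared-storage-namenode" and the path are hypothetical placeholders.
sc = SparkContext(appName="decoupled-compute-example")

# Read from remote shared storage instead of local direct-attached disks.
events = sc.textFile("hdfs://shared-storage-namenode:8020/data/events")

# A trivial action to show that processing happens on the compute tier
# while the data stays on the storage tier.
print(events.count())

sc.stop()
```

In a co-located, bare-metal deployment the same path would point at data stored on the worker nodes’ own disks; moving it behind a network endpoint is what lets compute scale elastically and independently of storage.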

As evidenced by our ongoing discussions with over 200 enterprise organizations, IT infrastructure teams are beginning to realize that virtualizing Hadoop is a natural evolution for their Big Data infrastructure – similar to the evolution and adoption of virtualization for other workloads that began more than a decade ago. They recognize that virtualization can provide a highly scalable, elastic, and cost-effective platform for Big Data applications.

Moreover, depending on the workload and type of processing, virtualization can (in some cases) offer comparable performance to that of a bare-metal system. Even if virtualization imposes a small I/O performance penalty as compared to physical deployments, the benefits far outweigh the slight drop in performance. In addition to the obvious benefits of elasticity and multi-tenancy, virtualization provides an easy and cost-effective way for enterprises to build sandbox environments quickly for testing new applications, frameworks, and distributions.

Here at BlueData, we’ve developed technology that optimizes I/O performance – with patent-pending innovations – to deliver the benefits of virtualization for Hadoop with performance comparable to bare-metal. We’re working closely with all of the leading Big Data vendors (including analytics and visualization tools, the major Hadoop distributions, and data center infrastructure providers) to make it easier, faster, and more cost-effective to deploy Hadoop.

This leads us to the obvious question: will virtualization act as a key catalyst to speed up the adoption of Hadoop and Big Data in the enterprise? We believe it will, as more organizations move out of the “Big Data evaluation phase” and start running their Hadoop production workloads on purpose-built infrastructure platforms (like BlueData’s EPIC software) that leverage virtualization technology and are designed specifically for Big Data workloads.