Back to Blog

Severing the Link between Big Data Compute and Storage

Call it Big Data meets hyper-convergence, or the next generation virtual storage network, or Hadoop on a hypervisor, but the time has come for a serious look at running honest to goodness real-world Big Data workloads in a virtual environment. Not just test and dev mind you, but real production Hadoop workloads using virtualization.

Ever since we started running Big Data jobs, the mantra has been: “bring the compute to the data.”  The rationale for this was two-fold. First, the performance of local disk drives was better than the performance of the data center network.  And second, the majority of the time a MapReduce task was spent reading or writing data.

This sentiment drove the development of the Hadoop Distributed File Systems (HDFS) to using commodity servers with direct attached storage (DAS). It also drove the development of a distributed compute environment, (JobTracker, YARN), to move the processing to the local units of distributed storage. It caused  users to create physical server Hadoop installations where the software had rack awareness and the network was carefully crafted to avoid bottlenecks.

However, with the advent of 10 Gbit/s, 40 Gbit/s and faster networks, the availability of vastly improved data compression technology, as well as the development of shared in-memory data analysis tools such as Spark, these assumptions need to be revisited.

In 2011, AmpLab prophesied the end of the “bring compute to the storage” era with its Disk-Locality in Datacenter Computing Considered Irrelevant paper.  A year later, Microsoft brought the prophesy to life with its Flat Datacenter Storage (FDC), proving that network bandwidth was no longer a bottleneck to data movement in the data center.

Providers of Hadoop clusters in public cloud environments have also demonstrated this. Last year Amazon released new AWS instance configurations that when used with Amazon S3 provided better performance than running the same Spark jobs using local storage. This demonstrated the end of disk locality in the public cloud as well as on-premises.

Accenture Technology Labs also recently compared the performance of the Google Cloud Storage connector for Hadoop and found that it beat local disk HDFS performance (see the chart below for reference).

 
Results of Accenture Technology Labs running Hadoop on Google Cloud

 

At BlueData, we have observed similar performance during our comparison of Hadoop running on our EPIC platform in a virtualized on-premises environment versus Hadoop running with bare-metal on-premises. Stay tuned for more updates on that in the coming months.

Every enterprise that is evaluating its options for Big Data platforms needs to seriously consider virtualized environments as well as physical bare-metal clusters. Running Hadoop on virtualized infrastructure is no longer an experimental proposition reserved for non-critical workloads. It is a proven solution for the most demanding Hadoop jobs.