
The Elephant in the Big Data Room: Data Locality is Irrelevant for Hadoop

We’ve had some good feedback and a lot of interest resulting from our recent blog post on “Severing the Link between Big Data Compute and Storage”. To follow up, I’d like to provide a bit more detail about why it is no longer necessary to co-locate compute and data for Hadoop and other Big Data applications.

The Elephant in the Big Data Room

Hadoop infrastructure was originally designed around the goal of data locality and “bringing the compute to the data.” To recap, the historical case for co-locating data on disk and CPU for Hadoop rested on two assumptions:

  1. Local disks were faster than networks at delivering data to the CPU; and
  2. Hadoop-related tasks spent most of their time reading data.

Due to improvements in networking infrastructure, studies have shown that reading data from local disks is only about 8% faster than reading it from remote disks over the network. This is now relatively well known. What may be less well known is that this 8% number is decreasing – further eroding the basis for the first of these assumptions.

With the explosion of Big Data, the most pressing need in the data center is for additional storage capacity. And, as highlighted in this research paper on disk locality from U.C. Berkeley’s AMPLab, networks are getting much faster but disk storage isn’t. There is a practical limit to how many disks can be deployed, so technologies such as data compression and deduplication are increasingly used to cram even more data onto existing hard disks. As a result, Big Data tasks spend a smaller percentage of their time waiting for disk I/O requests to complete and more time decompressing and then processing their data. This invalidates the second assumption behind co-locating compute and storage.

The fewer bytes that need to be read across the network, the less time the transfer takes; the net result is that the same amount of uncompressed data can be delivered to the CPU more quickly over the network.
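To make that arithmetic concrete, here is a back-of-the-envelope sketch in Java. Every figure in it (data size, compression ratio, link speed, decompression rate) is a round, illustrative assumption rather than a measurement; only the relative proportions matter.

    // Back-of-the-envelope sketch of the compression argument above.
    // Every figure below is a round, illustrative assumption, not a measurement.
    public class LocalityArithmetic {
        public static void main(String[] args) {
            double uncompressedMB = 100_000;  // data an example task must process (~100 GB)
            double compressionRatio = 3.0;    // assumed 3:1 on-disk / on-the-wire compression
            double networkMBps = 1_100;       // assumed usable 10 GbE bandwidth
            double decompressMBps = 500;      // assumed per-core decompression rate

            double compressedMB = uncompressedMB / compressionRatio;
            double transferSec = compressedMB / networkMBps;        // shipping compressed bytes
            double decompressSec = uncompressedMB / decompressMBps; // expanding them on the CPU

            System.out.printf("Network transfer: ~%.0f s, decompression: ~%.0f s%n",
                    transferSec, decompressSec);
            // With these assumptions the network leg is a small fraction of total task
            // time before any real processing even starts; shrinking the I/O share of
            // a task also shrinks the benefit of keeping that I/O local.
        }
    }

With these made-up but plausible numbers, moving the compressed bytes over the network takes roughly 30 seconds while decompressing them takes roughly 200, so the question of where those bytes came from matters far less than it once did.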

Facebook’s real-world experience provides further validation for ending the co-location of disk and CPU. The AMPLab analysis of logs from Facebook’s data center showed that data locality provided little, if any, reduction in the runtime of Hadoop tasks.

Once it’s been accepted that the co-location of compute and storage is no longer required to run high-performance Big Data jobs, a number of intriguing questions, and possibilities, open up. For example, is it necessary to run the Hadoop Distributed File System (HDFS) on the same hosts as other Hadoop components? Consider running them on different hosts instead. Each host can then be configured and tuned for the specific Hadoop modules it will run: the HDFS NameNode and DataNodes on hosts optimized for storage, and the YARN ResourceManager and NodeManagers on hosts optimized for compute.
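As a rough sketch of what that separation looks like from the compute side, the client configuration simply points at HDFS and YARN daemons running on different sets of hosts. The hostnames below are hypothetical placeholders, and in a real deployment these values would normally live in core-site.xml and yarn-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class SplitClusterConfig {
        // Compute-side configuration for a cluster whose HDFS daemons run on
        // storage-optimized hosts and whose YARN daemons run on compute-optimized
        // hosts. Hostnames are hypothetical placeholders.
        public static Configuration computeSideConf() {
            Configuration conf = new Configuration();
            // The NameNode (and its DataNodes) live on the storage tier.
            conf.set("fs.defaultFS", "hdfs://storage-namenode.example.internal:8020");
            // The ResourceManager (and its NodeManagers) live on the compute tier.
            conf.set("yarn.resourcemanager.hostname", "compute-rm.example.internal");
            return conf;
        }
    }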

This raises another question. Is it really necessary to ingest data into an HDFS file system in order to make it available to a Hadoop application? After all, HDFS is a protocol as well as a file system implementation. As long as the HDFS protocol is maintained, there is no reason the data itself cannot reside on any storage system. The trick is to match the performance of a local HDFS file system, under real-world workloads, while accessing remote storage. This is possible today, and we’ve implemented it in enterprise Big Data deployments, using client-side caching to take advantage of the I/O patterns of individual Hadoop applications.
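A small sketch of why the protocol, rather than the data’s location, is what matters: Hadoop applications program against the FileSystem API, and the scheme in the URI selects the implementation behind it. The URI and path below are hypothetical placeholders, and the client-side caching mentioned above would sit beneath this API rather than in application code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ProtocolNotLocation {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // This application code is identical whether the URI resolves to a local
            // HDFS cluster, a remote one, or another shared store that exposes the
            // same FileSystem interface.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://remote-storage.example.internal:8020/"), conf);
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/data/events/part-00000")), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine()); // read the first record as a smoke test
            }
        }
    }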

Hadoop virtualization
This approach – a Hadoop cluster where hosts are specialized for running specific components and where I/O performance to remote storage is maintained via caching – is ideal for virtualization. And this leads to one more question. Is it possible to run Hadoop in a virtual environment with performance comparable to bare-metal? How about virtualization using containers as well as traditional virtual machines? Can security and QoS be maintained between multiple Hadoop clusters running in virtualized environments on common hardware?

The answer depends on achieving the necessary performance in traditional virtual machines and the necessary QoS and security in containers. With traditional virtual machines, the latency introduced by the standard mechanisms of storage and network emulation must be eliminated. This can be done by judicious use of para-virtualization and knowledge of the Hadoop execution environment. With containers, performance is not the issue, but rather security and QoS. Careful control of data access and resource partitioning not only can, but does, yield the necessary security and QoS.
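As one generic illustration of resource partitioning between co-located clusters, the sketch below carves out CPU and memory budgets with Linux cgroups v2. It is not BlueData’s implementation, just the kind of mechanism such QoS guarantees are typically built on; the group names and limits are hypothetical, and the program needs sufficient privileges to write under /sys/fs/cgroup.

    import java.nio.file.Files;
    import java.nio.file.Path;

    public class CgroupPartitionSketch {
        // Create a cgroup (v2) with a CPU quota and a hard memory ceiling.
        // cpuMax uses the "<quota> <period>" format, e.g. "400000 100000" = 4 CPUs.
        static void createGroup(String name, String cpuMax, String memoryMax) throws Exception {
            Path dir = Path.of("/sys/fs/cgroup", name);
            Files.createDirectories(dir);
            Files.writeString(dir.resolve("cpu.max"), cpuMax);
            Files.writeString(dir.resolve("memory.max"), memoryMax);
        }

        public static void main(String[] args) throws Exception {
            // Two hypothetical Hadoop clusters sharing one physical host.
            createGroup("hadoop-cluster-a", "400000 100000", "64G");
            createGroup("hadoop-cluster-b", "200000 100000", "32G");
            // Each cluster's processes are then attached by writing their PIDs to
            // /sys/fs/cgroup/<group>/cgroup.procs.
        }
    }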

The traditional Hadoop approach was based on the concept of moving compute closer to the data, with data locality as a key tenet. HDFS was typically deployed on a direct-attached storage (DAS) architecture, and Hadoop on shared storage didn’t work in this traditional model. The vast majority of on-premises deployments are on bare metal, and traditionalists have long warned against Hadoop virtualization due to concerns about I/O performance. It’s time to bust these myths.

The link between storage and compute in Hadoop clusters can be severed. It is possible to run Hadoop against any shared storage system. Hadoop can be virtualized while achieving performance comparable to bare metal. We can now hold these truths to be self-evident.

– by Tom Phelan, co-founder and Chief Architect, BlueData