The traditional Hadoop architecture was founded upon the belief that the only way to get good performance with large-scale distributed data processing was to bring the compute to the data. And in the early part of this century, that was true. The network infrastructure in the typical enterprise data center of that time was not up to the task of moving large amounts of data between servers. The data had to be co-located with the compute.
But now the network infrastructure in enterprise data centers, as well as that of the public cloud providers, is no longer a bottleneck for Big Data computing. It’s time to decouple Hadoop compute from storage. I’ve written about this in previous blog posts and our partners like Dell EMC have weighed in on this topic as well. Industry analysts have recognized this too, as noted in this recent IDC report on the benefits of separating compute and storage for Big Data deployments:
“Decoupling compute and storage is proving to be useful in Big Data deployments. It provides increased resource utilization, increased flexibility, and lower costs.” – Ritu Jyoti, IDC
In 2018, discussions about Big Data infrastructure no longer revolve around methods to reduce network traffic through the use of clever data placement algorithms; instead, there are now more discussions about how to reduce the cost of reliable, distributed storage.
The Hadoop open source community has brought this discussion to the forefront with the recent introduction of Apache Hadoop version 3.0. One of the key features of Hadoop 3 is Erasure Coding for the Hadoop Distributed File System (HDFS), as an alternative to the venerable HDFS 3x data replication. Under typical configurations, Erasure Coding reduces HDFS storage cost by ~50% compared with the traditional 3x data replication.
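To make that ~50% figure concrete, here is a minimal sketch (not from the post itself, just the underlying arithmetic) comparing the raw capacity needed under 3x replication versus a Reed-Solomon RS(6,3) erasure coding policy, the "6+3 striping" scheme discussed below:

```python
# Storage overhead: 3x replication vs. RS(6,3) erasure coding.
# With RS(6,3), every 6 data blocks carry 3 parity blocks, so the
# raw-to-logical ratio is (6 + 3) / 6 = 1.5x, versus 3.0x for
# triple replication.

def raw_storage_tb(logical_tb, data_blocks, parity_blocks):
    """Raw capacity needed to store `logical_tb` of logical data."""
    return logical_tb * (data_blocks + parity_blocks) / data_blocks

logical = 100  # TB of actual data (illustrative number)

replicated = raw_storage_tb(logical, 1, 2)     # 1 data + 2 replicas -> 300 TB
erasure_coded = raw_storage_tb(logical, 6, 3)  # RS(6,3)             -> 150 TB

savings = 1 - erasure_coded / replicated
print(f"3x replication: {replicated:.0f} TB raw")
print(f"RS(6,3) EC:     {erasure_coded:.0f} TB raw")
print(f"Savings:        {savings:.0%}")  # -> 50%
```

The 1.5x overhead of RS(6,3) is where the "~50% reduction" comes from: half the raw disk of 3x replication, while still tolerating the loss of any three blocks in a stripe.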
Over the past few years, the Hadoop community has discussed the potential storage cost savings that Erasure Coding would bring to HDFS; and many have questioned whether 3x data replication still makes sense, given the advancements in hardware and networks over the last 10 years. And now that it’s here, HDFS Erasure Coding does fundamentally change Hadoop storage economics – as outlined in this recent article in Datanami, including a brief interview with yours truly. It’s also an acknowledgement (finally!) from the Hadoop community that the data doesn’t have to be co-located with compute.
To get an appreciation of just how dramatic the shift in this conversation has been, let’s compare the performance numbers in a 2010 presentation from Yahoo on Hadoop scaling (here) with those in a more recent one describing HDFS with Erasure Coding (here).
In the first presentation, the Benchmarks slide below refers to the DFSIO benchmark with read throughput of 66 MB/s and write throughput of 40 MB/s. The performance numbers for the Sort benchmark are given with the caveat of being based on a “Very carefully tuned user application”. The use of 3x replication in HDFS was considered a powerful tool for data protection and performance.
The comparable Benchmark slide in the second presentation refers to the same DFSIO benchmark. The read throughput is 1262 MB/s for HDFS with 3x replication compared to 2321 MB/s on HDFS with Erasure Coding (6+3 Striping). This is with 30 simultaneous mappers and no mention of careful application tuning! The 3x replication used by HDFS is now viewed as an archaic, expensive, and unnecessary overhead for achieving (limited) data reliability.
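Putting the two sets of DFSIO read numbers side by side makes the shift easy to quantify; the short sketch below just computes the ratios from the figures cited above (the benchmark itself is not reproduced here):

```python
# DFSIO read-throughput figures (MB/s) cited in the two presentations.
yahoo_2010_read = 66    # 2010 Yahoo Hadoop scaling presentation
hdfs_3x_read = 1262     # HDFS with 3x replication (recent presentation)
hdfs_ec_read = 2321     # HDFS with EC, 6+3 striping, 30 simultaneous mappers

print(f"3x replication vs. 2010: {hdfs_3x_read / yahoo_2010_read:.0f}x faster")
print(f"EC vs. 3x replication:   {hdfs_ec_read / hdfs_3x_read:.1f}x faster")
```

Roughly a 19x improvement over the 2010 numbers for replicated HDFS alone, and nearly double that again with Erasure Coding – without any careful application tuning.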
HDFS with Erasure Coding (EC) utilizes the network for each file read and write. This is an implicit acknowledgement that the network is no longer a bottleneck to performance. Indeed, the primary performance impact of HDFS EC is due to its CPU cycle consumption rather than network latency. The use of Intel’s open source Intelligent Storage Acceleration Library (ISA-L) addresses that concern; this brief video here provides a high-level overview of how using EC with Intel ISA-L can ensure high performance.
A recent blog post here on Hadoop 3.0 from our partners at HPE also highlights the advantages of separating storage and compute resources for Big Data, including performance results with HDFS EC and the use of ISA-L. Bottom line, this demonstrates the potential for significant cost savings in storage (more than 6x lower $/TB in this case) without sacrificing performance.
At BlueData, we have long been advocates for the benefits of decoupling compute and storage for Big Data environments. Flexibility, agility, resource sharing, and scalability – as well as significant cost reduction – are some of the benefits that compute/storage separation brings to Hadoop and other Big Data deployments. Until recently, we have been part of a small yet dedicated group of technologists with this vision. With the introduction of HDFS EC in Hadoop version 3.0, our ranks have swelled with support from the Hadoop open source community.
To learn more, you can watch this on-demand webinar on decoupling compute and storage for Big Data – including the implications of Hadoop 3.0 and HDFS Erasure Coding: