I recently came back from a visit with several of our largest customers and prospective customers in New York City. This is the part of my job that I love the most: talking with executives and practitioners about real-world use cases for Hadoop and Big Data. So I thought I’d share some of my notes and observations in a blog post.
One caveat: these were big organizations. All of them are Fortune 500 companies, in financial services as well as other industries, and each with tens of billions in revenue. And they are all at different stages of their Big Data journey. But I kept hearing the same questions and the same themes in each of these conversations. Here are the questions and my thoughts.
Apache Hadoop and Data Locality
Our on-premises Hadoop infrastructure costs continue to rise, yet our hardware utilization remains low. What are our options?
There are a few different options to consider. One option is to continue to invest in custom Hadoop development to leverage the YARN scheduler and thereby increase cluster utilization on bare-metal servers. This may help to improve the CPU utilization for your physical hardware, but capital expenditures for infrastructure would still increase as you expand the Hadoop deployment.
Another option is to move your Hadoop deployment off-premises (e.g. to a public cloud Hadoop-as-a-Service offering such as Amazon’s Elastic MapReduce). In this scenario, you would of course reduce capital expenditures for infrastructure; but your operating expenditures would increase significantly over time.
A third option would be to virtualize your Hadoop infrastructure to achieve more efficient utilization of your on-premises hardware resources. Most environments have servers running well under capacity. With virtualization, you can share idle hardware resources across workloads and thus improve utilization, reduce capital expenditures, and achieve lower maintenance costs.
Network bandwidth has quadrupled since MapReduce was first introduced in 2005 and data locality for Hadoop became the mantra. Do we still need to co-locate the data with compute?
This is a hard conversation for many Hadoop traditionalists, since it fundamentally challenges the assumption that data locality is necessary for working with large data sets. But the simple answer is “No”. We can finally separate compute from storage; data locality is no longer relevant.
With technology advancements over the past decade – including the introduction of modern 10 Gbit/s network infrastructures into enterprise data centers – it is now possible to store data in a network shared storage device and not pay a performance penalty when running Hadoop jobs on remote compute servers.
We keep hearing that I/O performance is a bottleneck for Hadoop, but there is recent research that indicates otherwise. Why isn’t anyone talking about this?
We think they should be talking about it – and we are starting to hear more discussion on this topic. There is hard data indicating that I/O performance isn’t a significant issue. Here are just a few of the research reports on this topic:
Disk-Locality in Datacenter Computing Considered Irrelevant by the University of California, Berkeley
MinuteSort with Flat Datacenter Storage by Microsoft Research
Bottom line, the concerns about performance are now a myth. It’s time to separate compute and storage, challenging long-held Hadoop assumptions about virtualization and data locality.
How will Apache Spark evolve in relationship to the Big Data ecosystem? How should we prepare for this?
We’ve heard this question a lot recently. Apache Spark (an open-source in-memory cluster computing engine for large-scale data processing) is on fire. More and more organizations are beginning to work with Spark for a variety of different use cases.
In general, the reality is that the entire Big Data ecosystem (including Hadoop, Spark, Cassandra, MongoDB, NoSQL, distributed storage, Splunk, and a host of other products and technologies) is evolving at a breakneck pace. Enterprise IT organizations must remain flexible in the face of this rapidly changing environment. They can’t afford to get locked into a technology that might be obsolete in the next couple of years.
No one can predict the future, but it is possible to hedge your bets. So we suggest that these IT organizations choose a Big Data infrastructure that is “future proof” (i.e. that can run any Big Data application, either today or in the future); that is flexible (i.e. that can co-exist with existing infrastructure as well as new innovations); and that can scale (e.g. expanding compute and storage resources independently). Keeping these principles in mind will help ensure that you’re prepared to take advantage of the latest Big Data innovations as the ecosystem evolves.
Spark can read and write data from non-HDFS sources and run native without Apache Hadoop. What is the connection between Spark and Hadoop?
Hadoop is a general-purpose framework for processing large amounts of data; it is used by many Big Data applications. Spark is one specific application for processing Big Data. The connection between Spark and Hadoop is that Spark can be configured to run using the Hadoop YARN scheduler, and so it can co-exist in a Hadoop environment along with other Hadoop applications.
Why do we have to run Spark inside YARN? Why can’t we run Spark natively?
You can run Spark natively. It is not necessary to run it within the Hadoop YARN scheduler.
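To make the distinction concrete, here is a minimal sketch of the same Spark application submitted both ways. The application file, host names, and ports are hypothetical placeholders, not part of any specific deployment.

```shell
# Sketch: the same Spark application submitted two ways.
# "wordcount.py" and the host names below are hypothetical placeholders.

# 1) Inside a Hadoop cluster, letting YARN schedule the executors:
spark-submit --master yarn --deploy-mode cluster wordcount.py

# 2) Natively, with no Hadoop at all -- Spark's own standalone master
#    (or local threads) schedules the work instead:
spark-submit --master spark://spark-master:7077 wordcount.py
spark-submit --master "local[*]" wordcount.py
```

The application code is identical in both cases; only the `--master` setting determines whether YARN is involved.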
If Spark can do what Hadoop does and talk to HDFS sources, do we still need Hadoop?
The short answer here is “No”: you do not need Hadoop to run Spark.
We have well-established, enterprise-grade storage systems and policies in place. For Hadoop analytics, why do we need to copy all the data to HDFS?
It is no longer necessary to copy your data into a local HDFS file system in order to process it with Hadoop analytics. This is a major issue for many organizations – especially for larger organizations with significant existing investments in enterprise-class storage.
Data has gravity and it’s very difficult to move. Can we run Hadoop analytics on the data in our existing storage systems without moving or copying the data?
We hear this complaint quite frequently – it is painful to move data. And the answer to the question is “Yes”. With current network architectures, Hadoop can be run on existing storage devices without loss of performance. Enterprises shouldn’t have to copy and move their data.
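One way to see what “running Hadoop on existing storage” looks like in practice: Hadoop’s file system layer accepts multiple URI schemes, so its tools can address data that was never copied into HDFS. The mount point and bucket name below are hypothetical placeholders.

```shell
# Sketch: pointing Hadoop tools at data in place, with no copy into HDFS.
# The NFS mount point and bucket name are hypothetical placeholders.

# List files on an NFS-mounted enterprise array via the local filesystem scheme:
hadoop fs -ls file:///mnt/enterprise-nas/logs/

# List files directly in an S3-compatible object store via the s3a connector:
hadoop fs -ls s3a://analytics-bucket/raw-events/
```

Any Hadoop job that accepts input paths can consume these URIs the same way it would consume `hdfs://` paths.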
HDFS and MapReduce
It’s very inconvenient that MapReduce compute and underlying HDFS are packaged together. We would like to see any MapReduce compute talk to any HDFS storage. Is this possible?
We agree – and yes, it is possible. It isn’t necessary to limit MapReduce to processing data resident only in HDFS file systems, nor is it necessary to limit a given version of MapReduce to accessing data in a given version of the HDFS file system. Any version of MapReduce should be able to read and write data to any version of the HDFS file system. Moreover, we believe that MapReduce should be able to read and write data to any shared storage device.
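As a sketch of this decoupling, a single MapReduce job can already read its input from one storage system and write its output to another, because input and output locations are just URIs. The jar path, bucket name, host name, and scripts below are hypothetical placeholders.

```shell
# Sketch: one MapReduce (streaming) job reading from object storage
# and writing to an HDFS cluster. All names are hypothetical placeholders.
hadoop jar hadoop-streaming.jar \
  -input s3a://analytics-bucket/raw-events/ \
  -output hdfs://namenode:8020/results/events \
  -mapper ./map.py \
  -reducer ./reduce.py
```

The compute framework is indifferent to where the bytes live; it is the packaging of compute and storage together that creates the coupling.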
Can we use HDFS as a protocol to talk to MapReduce clusters, but not as a storage sub-system?
Again, the short answer is “Yes”. We like the HDFS protocol – it is the de facto standard for how Hadoop applications access data. However, the HDFS protocol need not dictate an HDFS implementation requiring NameNodes, DataNodes, and data copies in triplicate. Enterprise-grade shared storage systems have been around for decades. They are mature, feature-rich, and stable in ways that will take the open source version of HDFS many years to achieve.
There is a way to get the best of both worlds. At BlueData, we’ve developed a software platform that allows your unmodified Hadoop applications to continue to use the HDFS protocol, while your data can reside on your existing enterprise-quality storage systems. In doing so, we’ve tackled many of the challenges outlined above with regards to moving data and leveraging existing storage systems.
How is it that public cloud providers can run Hadoop workloads in virtualized environments, whereas we can only deploy Hadoop on bare-metal in our own data center?
The vast majority of enterprise Hadoop deployments today are on-premises (as outlined in this recent article by Alex Barrett) – running on dedicated, physical bare-metal servers.
These IT organizations need to provision multiple servers and multiple physical Hadoop clusters to handle multiple applications with different Quality of Service levels and security requirements. The typical result is Hadoop cluster sprawl, high capital expenditures on hardware, and low hardware utilization (often 30 percent or even less).
It’s time that these enterprises take a serious look at virtualization for Hadoop. Amazon’s Elastic MapReduce (EMR) has been doing this for years. With logical separation, you can run multiple Big Data applications (and multiple “tenants”) on the same physical server. And with BlueData’s software innovations, we’re seeing comparable I/O performance between virtual clusters and physical bare-metal clusters on-premises.
The benefits of virtualization for other applications are well-documented: greater agility, lower costs, less server sprawl, and higher utilization. It’s time to apply these benefits to on-premises Hadoop deployments.
Why can’t we have the agility and flexibility of Hadoop-as-a-Service (or “Big-Data-as-a-Service” more broadly) within our own on-premises data center?
This is one of the fundamental challenges that we see in the Hadoop and Big Data industry today. Hadoop-as-a-Service cloud offerings like Amazon EMR can be particularly attractive for development and testing, greenfield environments, and small and medium businesses. But for use cases where the data already exists in-house (and/or if security and compliance concerns prevent it from being moved to the cloud), those enterprise Big Data deployments will likely remain on-premises.
We started BlueData to address this challenge and provide a solution for all enterprises to benefit from the “as-a-service” model for their on-premises Big Data implementations. Their data scientists should be able to spin up virtual Hadoop or Spark clusters on-demand, rather than waiting for IT to provision and configure the required hardware. Big-Data-as-a-Service is a great model, but it’s not only for the public cloud. It’s time to demand greater agility and flexibility for on-premises Big Data deployments.
– by Kumar Sreekanti, co-founder and CEO, BlueData