It’s hard enough to gather all the data that an enterprise needs for a Hadoop deployment; it shouldn’t be hard to manage it as well. But if you follow the traditional Hadoop “best practices”, it is. In particular, upgrades to the Hadoop Distributed File System (HDFS) are excruciatingly painful.
By way of background, each version of Hadoop is composed of a compute side and a data side. The compute side refers to MapReduce and other data processing applications. The data side refers to HDFS. The compute side and the data side of Hadoop are closely linked. This is due to the Protocol Buffers and RPC connections used by the Hadoop Java code to communicate between the compute and data sides of Hadoop. If the versions of the compute side and data side do not match, then your applications will not run.
Let’s say your organization has a 100 node Hadoop cluster with 100 petabytes of data in HDFS running Hadoop version X and you want to try out the latest version of Hive, from Hadoop version Y, on some of that data.
Then you instantiate version Y of Hadoop on another system or virtual machine, and you configure your application to use remote HDFS access to your existing 100 petabytes of data. However, due to the version differences, your application from version Y gets an error trying to connect to the HDFS file system from version X. Now what do you do?
You can try and convince your IT organization to upgrade that 100 node cluster to version Y. This will not be easy. Hadoop can be very complex and time-consuming to install (if done the traditional way); once it’s operational, most enterprises do not want to upgrade big, bulky systems like this. Your IT team may not see the need for any of the newly added HDFS features and they’re likely to stick to a policy of “if it ain’t broke, don’t fix it.”
You can configure your new version Y cluster to access the data from the cluster running version X using HttpFS, or something similar which accesses HDFS over HTTP/S, but data transfer can be slow and you may run afoul of company network policies. There may be version differences in this path as well.
You can try copying the HDFS client jar files from the version X cluster to your version Y cluster and setting the Java CLASSPATHs appropriately. But this sort of thing is not for the fainthearted.
You can use tools like distcp or Apache Falcon to copy data from the version X HDFS filesystem to the version Y HDFS filesystem on your cluster. This takes time and requires you to have sufficient storage capacity in your new cluster. You still have to worry about the network police and now you may have the data security people on you as well. As one customer I recently spoke to told me “Every copy of my data is another security violation.”
Frankly, none of the choices above are good.
Now the Hadoop community strives to maintain compatibility between the compute (client) side and the data (server) side so that you will not have this problem. In fact, the policy around compatibility is clearly stated in the Hadoop wiki:
- Client-Server compatibility is required to allow users to continue using the old clients even after upgrading the server (cluster) to a later version (or vice versa). For example, a Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
- Client-Server compatibility is also required to allow users to upgrade the client before upgrading the server (cluster). For example, a Hadoop 2.4.0 client talking to a Hadoop 2.3.0 cluster. This allows deployment of client-side bug fixes ahead of full cluster upgrades. Note that new cluster features invoked by new client APIs or shell commands will not be usable. YARN applications that attempt to use new APIs (including new fields in data structures) that have not yet deployed to the cluster can expect link exceptions.
- Client-Server compatibility is also required to allow upgrading individual components without upgrading others. For example, upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
The policy states that “Both Client-Server and Server-Server compatibility is preserved within a major release” and “Compatibility can be broken only at a major release.”
But what is considered a major release? The changes to the HDFS code through Hadoop 2.7.2 are described here. In that list, you can see that there were incompatible changes made to HDFS in releases 2.5.0, 2.0.3, 2.0.2, as well as other “dot” releases. There are also two different Hadoop “lines” running at the same time: Apache Hadoop 2.7.x and 2.6.x: with version 2.6.4 having been released on 2/11/2016, later than the release of version 2.7.2 on 1/25/2016.
And what about the Hadoop distributions? How are the Apache Hadoop releases mapped to the Hadoop distribution releases by Hortonworks (HDP) and Cloudera (CDH)? For example, HDP 2.4.2 is based on Apache Hadoop 2.7.1 and HDP 2.1 is based on Apache Hadoop 2.4.0. Likewise CDH 5.7.1 is based on Apache Hadoop 2.6.0, and CDH 5.1.5 is based on Apache Hadoop 2.3.0.
Just to give one example, I recently met with a data architect from a large insurance company that had merged with another insurance company of comparable size; both companies were using Hadoop and had a lot of data in their respective HDFS implementations. The kicker was that one company was using a HortonWorks distribution and the other one was using Cloudera. Management had decreed that a single Hadoop distro was to be used. It was this architect’s job to figure out how to accomplish that with a minimum of risk and disruption to the enterprise. But how will he run their Hadoop applications against two different flavors and versions of HDFS?
And to make matters worse, there are also other sources of incompatibility between Hadoop versions and distributions besides HDFS.
STOP! All you want is for the latest version of your Hadoop applications to work against your existing data.
There is a better way. When my colleagues and I founded BlueData, we recognized this challenge – and we built a solution specifically designed to address this problem and eliminate the pain. One of the key components of our BlueData EPIC software platform is called DataTap (or “dtap” for short). With DataTap, your applications from Hadoop version Y can access and tap into data from HDFS built on Hadoop version X (or any version or distribution). Simple. Easy. Straightforward. Painless.
This allows you to try out the latest versions of Hadoop applications without having to go through the tedious and time-consuming process of upgrading your HDFS to the latest version, or copying data, or playing games with Java jar files. Similarly, when you upgrade the version of Hadoop running your HDFS, you can still use your existing and stable (i.e. older Hadoop version) applications against that data.
This screenshot from the BlueData EPIC software platform illustrates the ability to tap into multiple versions of HDFS (or even NFS storage) using DataTaps
There’s lots more to the BlueData EPIC software platform; this is just one of the many pain points that we solve. But in my conversations with enterprises that have embarked on their Hadoop journey, it’s one that I’ve seen time and again.
Now, with BlueData, you don’t have to deal with the pain of HDFS upgrades. You can mix and match the versions of your Hadoop applications against the versions of HDFS where your data is stored. See? It doesn’t have to be so hard after all.