Big Data Case Study: How to Solve Hadoop Infrastructure Challenges

Apache Hadoop has gone from a “nice to have” to the “must have” Big Data technology in enterprise IT today. Unlike traditional data marts built to serve specific business purposes, Hadoop by nature supports a diverse set of use cases and can handle a variety of structured and unstructured data. Different departments and teams throughout enterprise organizations are identifying Hadoop as a key requirement to meet their business objectives.

Hadoop Sprawl

In most large enterprises, Big Data initiatives begin within small isolated groups rather than as a centralized initiative. This often results in islands of Hadoop clusters, with each team managing its own environment. As these deployments grow, I’ve seen many organizations struggle to cope with the proliferation of Hadoop clusters, resulting in infrastructure sprawl, a lack of governance, and significant maintenance challenges.

However, the very nature of most Hadoop use cases requires the flexibility to try new things and the ability to access a variety of different data sources. So how do you control infrastructure sprawl while still empowering end users across multiple teams with a flexible Big Data environment?

There is a way, and it may help to provide a specific example. One of the organizations I’ve been working with recently is a Fortune 100 media and telecommunications company that faced precisely these issues, so I’ll share their story here.

Hadoop adoption in this company had grown organically across multiple different teams, starting with “science projects” and lab initiatives that quickly grew and expanded. As with most Big Data deployments today, they deployed Hadoop using on-premises infrastructure with bare-metal physical servers and direct attached storage (DAS). Each group worked with the IT team to provision the necessary physical servers, storage, networking, and software for their dedicated Hadoop cluster.

One of the initial use cases was a sandbox dev/test environment to enable ad-hoc analysis with Hadoop as well as Apache Spark by internal data science teams and developers. Another Hadoop use case was an initiative to analyze customer device and service usage, in order to improve their customer experience and increase customer satisfaction. In this latter example, they needed to provide data access to internal teams as well as external partners (media content providers) – each of which required an isolated environment.

Within a relatively short timeframe, the IT organization was faced with several challenges. Some of the key issues they struggled with included:

  • Limited IT infrastructure resources and staff;
  • Relatively little IT experience and skillsets in Hadoop or Spark;
  • Increasing IT overhead for managing multiple Big Data environments; and
  • The need to onboard multiple user groups (and a growing number of external partners) with access to their own dedicated Hadoop environment.

To address these issues and the increased demand for Hadoop infrastructure, they considered two initial options for their Big Data deployment going forward.

Option 1: Expand their existing on-premises Hadoop infrastructure. This was the desired option for the IT team, but it involved a number of unknowns and risks, including:

  • Adding more physical servers to expand their existing Hadoop clusters
  • Adding more nodes to install various business intelligence and analytical tools
  • Setting up additional networking and network configurations
  • Implementing the Hue web console for internal Hadoop users
  • Building a web-based application that would allow external users to use the system (example: login, uploads, downloads, query)
  • Providing security and isolation for access by various internal and external user groups

The additional infrastructure resources, capital expenditures, and staff required for this option made it a difficult proposition. And as their existing Big Data deployment expanded (with more user groups, more use cases, more applications, and more data), the growing complexity and burden on the IT team would continue to accelerate.

Option 2: Move their Hadoop deployment to a cloud service (e.g. Amazon Web Services’ Elastic MapReduce (EMR) or Microsoft’s Azure HDInsight). The IT organization also considered this option, and it was the preferred choice for some of their development teams. However, moving their Hadoop deployment to a cloud service like EMR would require:

  • Copying data to the cloud service
  • Maintaining and managing Hadoop clusters in the cloud
  • Managing user access to the cloud service
  • Building a web-based application that would allow external users to use the system (example: login, uploads, downloads, query)
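
For a sense of what this option looks like in practice, here is a minimal sketch of provisioning an EMR cluster with the AWS SDK for Python (boto3). The release label, instance types, IAM roles, and S3 bucket are illustrative placeholders, not details from this deployment.

```python
# Minimal sketch: provisioning a long-running EMR cluster with boto3.
# Release label, instance types, IAM roles, and the S3 log bucket are
# illustrative placeholders -- not values from the deployment described here.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="adhoc-analytics",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://example-bucket/emr-logs/",           # hypothetical bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,           # keep it up for ad-hoc use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```

Provisioning becomes a single API call, which is part of what appealed to the development teams; the catch, as described next, is getting the data into the cloud in the first place.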

There were perceived advantages to this “Hadoop-as-a-Service” option, namely on-demand, self-service provisioning and low upfront capital expenditures. But this option was eventually ruled out because the data needed for their initial Hadoop use cases was already on-premises. Data has gravity, and it’s painful to move; there was simply too much data to move or copy back and forth.
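
To put rough numbers on “data has gravity”: even a one-time bulk copy of a large on-premises dataset is measured in weeks. The dataset size and link speed in this back-of-the-envelope sketch are hypothetical, not figures from this company.

```python
# Back-of-the-envelope transfer time for a bulk copy to a cloud service.
# Dataset size and sustained throughput are hypothetical, chosen only to
# show the order of magnitude involved.
dataset_tb = 500           # hypothetical on-premises dataset size
effective_gbps = 1.0       # sustained WAN throughput after protocol overhead

seconds = (dataset_tb * 8 * 1e12) / (effective_gbps * 1e9)
print(f"~{seconds / 86400:.0f} days of continuous transfer")   # roughly 46 days
```

And that is before accounting for the ongoing copies needed to keep the cloud and on-premises data in sync.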

Fortunately, they realized that there is a third option: providing the benefits of Hadoop-as-a-Service with on-premises infrastructure. The following slide deck provides a brief summary of their evaluation and the decision to move forward with this third option:

So how did they do it? Did it alleviate their deployment struggles and growing pains?

Ultimately, they selected the BlueData EPIC software platform to virtualize their Hadoop infrastructure and provide on-demand access to virtual Hadoop clusters in a secure, multi-tenant model. It offers a number of benefits to address each of the challenges outlined above:

  • Consolidation and shared resources: By virtualizing their Big Data infrastructure, they can serve multiple groups on the same physical hardware and create completely independent, isolated tenant environments for each set of internal users or external users. Some of their business intelligence and analytical tools can also share the same physical hardware.
  • On-demand provisioning and secure multi-tenancy: The IT team can now enable self-service provisioning for their data scientists and other users, using the BlueData ElasticPlane user interface. These users – whether internal or external – can easily spin up virtual Hadoop or Spark clusters within minutes, within a secure and isolated tenant environment. They now have a similar user experience to that of Amazon’s EMR, but on-premises.
  • Data access: With BlueData’s unique DataTap functionality, they can now separate compute and storage (as an alternative to the traditional DAS approach for Hadoop). This means that the IT team can leverage their existing investments in enterprise-grade storage systems, eliminating the need to move or copy data. It also means that new groups of Hadoop users can access existing data in a centralized, shared storage environment; each of these tenants can run Hadoop analysis against this data using the tools of their choice (see the sketch after this list).
  • Performance: Initially, the IT team was hesitant to use virtualization for Hadoop due to I/O (input/output) performance considerations. While virtualization has been applied to many server environments and workloads, data-intensive workloads like Hadoop have typically not been virtualized because of these performance concerns. BlueData’s patent-pending IOBoost technology and caching were designed to solve exactly this problem, delivering performance comparable to bare metal along with the benefits of virtualization. In practice, the performance penalties were minimal; in fact, some ETL jobs now run faster than before.
  • Time to insights: Throughout the organization, each team had its own preferred tools for Big Data (including Spark, multiple Hadoop distributions, and various analytical tools). BlueData is pre-integrated with these tools, which allowed their data scientists and analysts to get up and running quickly and to try different ways of solving problems using their tools of choice.
  • Central control: Now the IT team has greater control and governance over the various Hadoop environments across the organization. It’s a simpler and more streamlined approach, with fewer infrastructure resources to manage and maintain. And they were able to hit a consolidation ratio of 8:1 between virtual machines and physical servers, significantly improving their server utilization.
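
To illustrate the compute/storage separation pattern behind the data access point above, here is a minimal PySpark sketch of a tenant reading from, and writing results back to, an existing shared storage system instead of cluster-local DAS. The storage URI, host names, paths, and column names are hypothetical; with DataTap the same pattern applies through its own storage connection rather than a plain HDFS URI.

```python
# Minimal sketch of compute/storage separation with Spark: the compute
# cluster is provisioned on demand, while the data stays in an existing
# shared store. Host names, paths, and column names are hypothetical;
# with DataTap the same pattern applies via its own storage connection.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("device-usage-analysis").getOrCreate()

# Read directly from the shared storage system -- no copy onto local DAS.
events = spark.read.parquet("hdfs://shared-namenode:8020/telemetry/device_events/")

# Each tenant runs its own analysis against the same source data.
daily_usage = events.groupBy("device_type", "event_date").count()

# Results land back in the tenant's own area of the shared store.
daily_usage.write.parquet(
    "hdfs://shared-namenode:8020/tenants/team_a/device_usage_daily/")
```

Because compute and storage are decoupled, a tenant’s cluster can be resized or torn down without touching the data, which is what makes the independent scaling described below possible.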

In summary, this Fortune 100 media and telecommunications company was able to solve their Big Data growing pains and infrastructure deployment challenges. They can offer the flexibility and simplicity of the Hadoop-as-a-Service model while leveraging their on-premises infrastructure. They can define their infrastructure template once and reuse it to onboard multiple teams with different Hadoop or Spark configurations, and they can scale storage and compute for Big Data independently.

There was no loss of functionality for existing users, who were able to continue with their work. New users had a fully functioning Big Data infrastructure platform, dedicated to them, in a very short amount of time. And they have an infrastructure setup that is future-proof: it can be expanded to additional use cases and teams as their Big Data needs grow, without interrupting existing users.

I presented this use case at the Strata + Hadoop World event in San Jose earlier this year, and it was captured on video. So if you want to learn more, check out the recording:

And if you’re ready to try it out for yourself, you can get access to a sandbox environment or download BlueData’s EPIC software for a test drive here.

– by Nanda Vijaydev, BlueData