
Get Started with a Dev/Test Lab for Hadoop and Spark

Today's explosion of available information is creating new opportunities for enterprises to leverage data. As just one example, enterprises can use this data to serve customers better, faster, and more cost-effectively. Data science teams and analysts can now deliver deeper insights into customer behavior and preferences. These insights can help develop more targeted marketing offers, or attract and retain customers by anticipating their needs with new products and better support. These and other use cases are driving the demand for "Big Data" analytics.

There are now many open source tools as well as commercial software applications designed to analyze Big Data, including new distributed processing platforms such as Apache Hadoop and Apache Spark. These new technologies can manage the growing volume, variety, and velocity of structured, semi-structured, and unstructured data, and they are typically less expensive to implement than traditional data warehousing and database technology.
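To make this concrete, here is a minimal PySpark sketch that reads semi-structured JSON data and runs a simple aggregation. The file name and field name are hypothetical, and any recent Spark environment should run it; it is meant only to show the style of analysis these platforms enable.

    # Minimal PySpark sketch: read semi-structured JSON and aggregate it.
    # Assumes PySpark is installed (pip install pyspark); the file name
    # "events.json" and the "event_type" field are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lab-demo").getOrCreate()

    # Spark infers a schema from the semi-structured JSON records.
    events = spark.read.json("events.json")

    # Count records per event type, a simple example of the kind of
    # aggregation a data science team might run during an evaluation.
    events.groupBy("event_type").count().show()

    spark.stop()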

However, Hadoop and Spark are complex environments that require multiple components, systems, and infrastructure resources. These components are available from the Apache Software Foundation as free open source software, but several commercial vendors also package and distribute their own distributions and applications. It can be time-consuming and challenging to evaluate each of these tools and get these new Big Data environments deployed and operational, even in a lab environment for initial development, testing, and quality assurance.

There are several challenges to getting started with Hadoop and/or Spark, including:

  • Lack of available knowledge and skills within the organization;
  • Whether and how to leverage existing IT assets and systems;
  • Confusion over which distributions and tools to use; and
  • Difficulty in procuring, installing, and implementing the right infrastructure.

Big Data Infrastructure: Barrier to Getting Started

Many Big Data deployments in the enterprise start within data science teams, not within IT. Infrastructure and systems are not these teams' area of expertise, so the infrastructure often becomes a stumbling block that stalls deployment. They usually follow a traditional approach to getting started with Hadoop and Spark infrastructure on-premises: dedicated physical servers, with direct-attached storage, for each new environment and user group. Setting up Hadoop distributions and the related ecosystem tools for each new user group requires new siloed infrastructure with little to no reusability.

For example, many organizations take the following approach when deploying Hadoop in a lab environment to evaluate multiple Hadoop distributions and tools. Each step brings unforeseen consequences, as summarized below:

Traditional Approach

  • Total elapsed time: approximately 4-6 months
  • Servers required: approximately 10-20 physical servers

If your organization is looking to set up a new Hadoop or Spark lab environment (e.g., for dev/test and evaluation of multiple Big Data tools and technologies), there is a better way.

A New Approach to Accelerate your Big Data Lab

In the summary above, we illustrated the traditional approach for deploying Hadoop in a dev/test lab environment to evaluate multiple distributions and tools. With the new Big Data Lab Accelerator solution from BlueData, we offer a much simpler, faster, and more cost-effective approach: a lab environment delivered in two weeks, at a fraction of the cost.

The summary below outlines how this new approach compares to the traditional one, with benefits at each step along the way:

New Approach

  • Total elapsed time: two weeks
  • Servers required: 5 physical servers or virtual machines

With the Big Data Lab Accelerator, your organization will have a ready-to-run dev/test lab environment to evaluate and experiment with multiple distributions, services, and tools on a shared, cost-effective infrastructure for multiple tenants.

The figure below illustrates an example environment for multiple teams (tenants), each with different use cases, distributions, services, and tools to evaluate, running on the BlueData EPIC software platform and shared infrastructure.

Figure: Big Data Lab

As shown in the illustration above, each tenant leverages shared virtualized infrastructure, with the ability to tap into shared data sets. Yet each team can run its own independent evaluation with different use cases and tools; the architecture provides secure, logical separation with compute isolation between tenants.
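To make the separation concrete, the short Python sketch below models that layout: per-tenant compute (clusters plus a resource quota) alongside a shared, read-only data layer. All names, quotas, and paths are illustrative assumptions; this models the concept rather than the BlueData EPIC API.

    # Hypothetical model of the multi-tenant layout described above.
    # Compute is isolated per tenant; only the data layer is shared.
    # Every name, quota, and path below is illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Cluster:
        name: str
        distribution: str  # e.g. a Hadoop distribution or "Spark standalone"
        nodes: int

    @dataclass
    class Tenant:
        name: str
        cpu_quota: int  # cores reserved for this tenant's clusters
        clusters: List[Cluster] = field(default_factory=list)

    # Shared data sets that every tenant can tap into (read-only).
    SHARED_DATA = ["hdfs://lab/shared/clickstream", "hdfs://lab/shared/logs"]

    marketing = Tenant("marketing-analytics", cpu_quota=32)
    marketing.clusters.append(Cluster("hadoop-eval", "Hadoop distro A", nodes=4))

    data_science = Tenant("data-science", cpu_quota=48)
    data_science.clusters.append(Cluster("spark-eval", "Spark standalone", nodes=6))

    # Each tenant runs its own clusters but sees the same shared data sets.
    for tenant in (marketing, data_science):
        print(tenant.name, [c.name for c in tenant.clusters], "->", SHARED_DATA)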

Now you can accelerate the deployment of Hadoop and Spark in an on-premises, multi-tenant lab environment for dev/test. BlueData provides a turnkey solution with the enterprise version of our software along with professional services to get your lab environment up and running in two weeks. You can provide your data scientists and analysts with the ability to spin up Hadoop or Spark clusters on-demand – to evaluate multiple Big Data analytics tools and distributions, for multiple use cases. We provide all the software and services you need to get started.
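As a rough sketch of what "on-demand" could look like from a user's perspective, the snippet below submits a cluster-creation request to a hypothetical REST endpoint. The URL, payload fields, and token are invented for illustration and are not the actual BlueData EPIC API.

    # Illustrative only: provisioning a cluster via a hypothetical REST API.
    # The endpoint, payload fields, and token are NOT the real EPIC API.
    import requests

    API_URL = "https://bigdata-lab.example.com/api/clusters"  # hypothetical
    payload = {
        "tenant": "data-science",
        "type": "spark",  # or "hadoop"
        "nodes": 6,
    }

    response = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": "Bearer <token>"},  # placeholder token
        timeout=30,
    )
    response.raise_for_status()
    print("Cluster created:", response.json())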

To learn more, download our solution brief on the Big Data Lab Accelerator.