
Docker, and Spark, and Hadoop. Oh My.

When we founded BlueData in the fall of 2012, we built our Big Data infrastructure software platform around the best open source hypervisor technology then available. We knew that virtualization was a key enabling technology to simplify Hadoop and other Big Data deployments. There were plenty of naysayers prophesying that Hadoop could never run in a virtual environment – citing concerns about I/O performance and the importance of data locality. But we’ve been very successful in using hypervisor-based virtualization technology to deploy high-performance, scalable, and elastic on-premises infrastructure for Hadoop (and now Spark) applications.

Having worked at VMware for ten years, I'm very familiar with hypervisor-based virtualization. At the time we founded BlueData, I was well aware of a nascent operating system virtualization technology, commonly known as containers, but it was not yet mature enough for enterprise use in a Big Data environment. Then, in March of 2013, Docker containers were introduced as open source and began to turn the virtualization landscape on its head.

Last year, I wrote a blog post outlining customer use cases that required hypervisor-based virtualization, and other scenarios where container-based virtualization (e.g. Docker) was more suitable. Now another year has passed, and the ability to run Big Data applications in a virtualized environment (with decoupled compute and storage) is quickly becoming accepted fact. Meanwhile, operating system virtualization has further matured as containers have moved into the mainstream.

And we here at BlueData have not been idle. Seeing the power of Docker containers, we've doubled down on our vision of running Big Data in a flexible, automated, and elastic virtual environment. One of the key challenges we faced was virtual network management: enabling clusters of containers running on different physical hosts to communicate with one another. Now, as highlighted in our recent announcement, the BlueData EPIC software platform will support both hypervisor-based virtualization via virtual machines and operating system virtualization via Docker containers.
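To make that networking challenge concrete, here is a minimal sketch of cross-host container networking using a Docker overlay network and the Docker SDK for Python. This illustrates the general problem, not BlueData EPIC's actual implementation; the network and container names are hypothetical, and overlay networks assume a Docker daemon running in swarm mode (or backed by an external key-value store).

```python
# Minimal sketch: an overlay network that lets containers on different
# physical hosts communicate by name. Not BlueData's implementation.
import docker

client = docker.from_env()

# Create an overlay network that can span physical hosts;
# "bigdata-net" is a hypothetical name used for illustration.
net = client.networks.create("bigdata-net", driver="overlay", attachable=True)

# Containers attached to the overlay can reach each other by name,
# even when they are scheduled on different physical hosts.
worker = client.containers.run(
    "ubuntu:16.04",                 # placeholder image
    command="sleep infinity",
    name="worker-1",
    network="bigdata-net",
    detach=True,
)
print(worker.name, "joined", net.name)
```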

The virtualization technologies used by virtual machines and containers are different, and each has its own strengths and weaknesses. The hypervisor-based virtualization used by virtual machines (VMs) provides strong fault isolation and security. An application running in one VM will not negatively impact the stability or performance of an application running in a different VM on the same host. The application can go so far as to crash the operating system running in its VM, and this will still have no impact on other VMs. And each VM can run a different operating system, giving the user maximum flexibility when selecting an application to perform a specific task. However, this security and flexibility come at the expense of extra CPU and memory consumption – and this overhead can ultimately mean higher costs.

The operating system virtualization used by containers consumes significantly less CPU and memory than hypervisor-based virtualization, because there is no overhead from running a complete installed operating system (OS) in each container. This lightweight approach can improve I/O performance and reduce costs – and containers can dramatically accelerate the application development lifecycle. However, an application running in one container may impact the uptime and performance of an application running in a different container on the same host. In fact, an error in an application running in one container could crash the underlying OS on the physical host and cause all the containers on that host to fail. And every container running on a given physical host must share the same OS kernel. This has its benefits, but it also puts some limitations on which applications can run within the containers on a given host.
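The shared-kernel point is easy to see for yourself. In this small sketch using the Docker SDK for Python, two containers built from different Linux distributions both report the same kernel release as the host itself, because containers virtualize the userland, not the kernel (the image tags are just examples):

```python
# Sketch: every container on a host shares the host's kernel.
# Two different userlands (Ubuntu, CentOS) report the same kernel
# release as the host.
import platform
import docker

client = docker.from_env()
host_kernel = platform.release()

for image in ("ubuntu:16.04", "centos:7"):
    kernel = client.containers.run(image, "uname -r", remove=True)
    print(f"{image}: {kernel.decode().strip()}  (host: {host_kernel})")
```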

Those enterprises seeking to run distributed Big Data applications in a virtualized on-premises environment now have multiple options to choose from. If they require a high level of security and fault isolation, they can opt to run their Big Data workloads on virtual machines. If they need fast cluster creation in a lightweight environment, they can run Big Data on containers. Enterprise IT no longer needs to allocate physical hosts and install Hadoop or Spark on bare-metal servers when standing up new Big Data applications in their data centers. Hadoop and Spark can now run on Docker containers, within a cluster of virtual machines, or even in a cluster of containers running on a cluster of virtual machines.
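As a rough illustration of how little ceremony a containerized deployment involves, the sketch below stands up a one-master, one-worker Spark standalone cluster on a single host with the Docker SDK for Python. The "example/spark" image name is a placeholder (any image with a Spark distribution on its PATH would do), and this is just the underlying idea – not how BlueData EPIC provisions clusters.

```python
# Hypothetical sketch: a minimal Spark standalone cluster in two
# containers on one host. "example/spark" is a placeholder image name.
import docker

client = docker.from_env()

# A user-defined bridge network gives the containers DNS by name.
client.networks.create("spark-net", driver="bridge")

master = client.containers.run(
    "example/spark",                # placeholder image
    command="spark-class org.apache.spark.deploy.master.Master",
    name="spark-master",
    hostname="spark-master",
    network="spark-net",
    detach=True,
)

worker = client.containers.run(
    "example/spark",
    command=("spark-class org.apache.spark.deploy.worker.Worker "
             "spark://spark-master:7077"),   # 7077 is Spark's default master port
    name="spark-worker-1",
    network="spark-net",
    detach=True,
)
print("cluster up:", master.name, worker.name)
```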

Challenges still remain. Can fault containment within a virtual OS (container) environment be improved? Can the CPU overhead of hypervisor-based virtualization be reduced? Can storage and networking QoS be improved in both operating system and hypervisor-based virtualization? We look forward to working with Docker and our other partners in the Big Data ecosystem to address these challenges. So stay tuned.

– by Tom Phelan, co-founder and Chief Architect, BlueData