Deep Learning with TensorFlow, GPUs, and Docker Containers

I work with a lot of data science teams at our enterprise customers, and in the past several months I’ve seen an increased adoption of machine learning and deep learning frameworks for a wide range of applications.

As with other use cases in Big Data analytics and data science, these data science teams want to run their preferred deep learning frameworks and tools in Docker containers on the BlueData EPIC software platform. So part of my job is trying out these cool new tools, making sure they run as they should on our platform, and helping develop new functionality to solve any challenges.

One of the most popular open source frameworks for deep learning and machine learning is TensorFlow. TensorFlow was originally developed by researchers and engineers working at Google to conduct machine learning and deep neural network research. However, it’s general enough to be applicable to many other use cases. Some other deep learning examples using TensorFlow include image recognition, natural language processing with free text data, and threat detection and monitoring.

“TensorFlow is an open-source software library for machine learning across a range of tasks. It is a system for building and training neural networks to detect and decipher patterns and correlations, analogous to (but not the same as) human learning and reasoning.” – Wikipedia

TensorFlow allows distribution of computation across a wide variety of heterogeneous systems, including CPUs and GPUs. To accelerate computation of TensorFlow jobs, several of the data science teams I’ve worked with use GPUs. However, GPUs are costly, and the resources need to be managed carefully. This is where we found some challenges that our software platform can help address.

Considerations for Deploying TensorFlow

Here are some of the challenges and considerations when deploying data science applications, and TensorFlow in particular, at large scale in the enterprise:

  • How to manage the deployment complexity (e.g. between OS, kernel libraries, and TensorFlow versions)
  • How to support transient cluster creation for the duration of a job
  • How to isolate resources in use and prevent simultaneous access requests
  • How to manage quotas and allocation for GPU-enabled and CPU resources in a shared, multi-tenant environment

The BlueData EPIC software platform can address these challenges for our customers – providing their data science teams with on-demand access to a wide range of different Big Data analytics, data science, machine learning, and deep learning tools. Using Docker containers, our Big-Data-as-a-Service software platform can support large-scale distributed data science and deep learning use cases in a flexible, elastic, and secure multi-tenant architecture.

Deep Learning with TensorFlow on the BlueData EPIC Platform

With our new fall release announced today, BlueData can now support clusters accelerated with GPUs and provide the ability to run TensorFlow for deep learning on GPUs or on Intel architecture CPUs. Using the BlueData EPIC software platform, data scientists can spin up instant TensorFlow clusters for deep learning running on Docker containers. BlueData supports both CPU-based TensorFlow, which runs on Intel Xeon hardware with the Intel Math Kernel Library (MKL), and GPU-enabled TensorFlow with NVIDIA CUDA libraries, CUDA extensions, and character device mappings for Docker containers.
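To make "character device mappings" a little more concrete: a container can only use a GPU if the host's NVIDIA device nodes are passed through to it. The sketch below builds a `docker run` argument list with `--device` flags. It assumes the usual NVIDIA device node names (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`), and the image name is hypothetical. It illustrates the general idea only and is not a description of how BlueData EPIC does this internally.

```python
from typing import List


def docker_gpu_command(image: str, gpu_indices: List[int]) -> List[str]:
    """Build a `docker run` argument list that maps NVIDIA character devices.

    Assumes the host exposes one /dev/nvidiaN node per GPU plus the shared
    control nodes /dev/nvidiactl and /dev/nvidia-uvm.
    """
    cmd = ["docker", "run", "--rm"]
    # Control devices shared by every GPU on the host.
    for node in ("/dev/nvidiactl", "/dev/nvidia-uvm"):
        cmd += ["--device", f"{node}:{node}"]
    # One character device per GPU assigned to this container.
    for index in sorted(set(gpu_indices)):
        node = f"/dev/nvidia{index}"
        cmd += ["--device", f"{node}:{node}"]
    cmd.append(image)
    return cmd


# "example/tensorflow-gpu" is a hypothetical image name used for illustration.
command = docker_gpu_command("example/tensorflow-gpu", [0, 1])
print(" ".join(command))
```

Passing only the device nodes for the GPUs assigned to a container is what keeps other containers on the same host from seeing them.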

The BlueData EPIC software platform can provide self-service, elastic, and secure environments for TensorFlow whether on-premises, in the public cloud, or some combination of the two in a hybrid architecture. All from the same interface, with the same user experience regardless of the underlying infrastructure.

As illustrated in the graphic below, this means that our customers can easily spin up instant TensorFlow clusters with BigDL for deep learning with BlueData – just as they do today for other Big Data analytics, data science, and machine learning environments. And they can specify placement of Docker containers running TensorFlow on infrastructure configured with GPUs or CPUs and in the public cloud or on-premises.

A few of the specific benefits offered by BlueData for deep learning with TensorFlow include:

On-Demand TensorFlow Clusters

With BlueData EPIC, users can create TensorFlow clusters on-demand with just a few mouse clicks. And with the host tagging introduced in our new fall release, they can create GPU-enabled or CPU-based clusters on hosts tagged with the hardware for their particular workload (as indicated in the screenshot below). Once created, the cluster will have one or many nodes of Docker containers deployed with TensorFlow software and the appropriate GPU and/or CPU acceleration libraries. For example, a GPU-enabled TensorFlow cluster would have NVIDIA CUDA and CUDA extensions within the Docker containers, whereas a CPU-based TensorFlow cluster would have Intel MKL packaged within the Docker image along with a Jupyter notebook.
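Tag-based placement can be pictured as a simple filter over the host inventory. The following is a minimal sketch, not the BlueData EPIC implementation: the `Host` class, the `hosts_matching` function, and the tag names (`accelerator`, `location`) are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


def hosts_matching(hosts: List[Host], required: Dict[str, str]) -> List[Host]:
    """Return hosts whose tags contain every required key/value pair."""
    return [
        host for host in hosts
        if all(host.tags.get(key) == value for key, value in required.items())
    ]


# Hypothetical inventory; tag names and values are illustrative only.
inventory = [
    Host("host-a", {"accelerator": "gpu", "location": "on-prem"}),
    Host("host-b", {"accelerator": "cpu", "location": "on-prem"}),
    Host("host-c", {"accelerator": "gpu", "location": "cloud"}),
]

gpu_hosts = hosts_matching(inventory, {"accelerator": "gpu"})
print([host.name for host in gpu_hosts])  # ['host-a', 'host-c']
```

Adding a second requirement, such as `{"accelerator": "gpu", "location": "cloud"}`, narrows the candidates further, which is how a cluster could be pinned to GPU hosts in one environment.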

Efficient GPU Resource Management

GPUs and specialized CPUs are generally not identified as a separate resource for Docker containers. BlueData EPIC handles this by managing a shared pool of GPUs across all host machines and allocating the requested number of GPUs to a cluster at cluster creation time. This exclusivity or isolation guarantees the quality of service for deep learning jobs and prevents multiple processing jobs from trying to access the same resource simultaneously.
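The pooling idea described above can be sketched in a few lines. This is a simplified model under my own assumptions (a hypothetical `GpuPool` class, GPUs identified by host name and index), not BlueData EPIC code. It shows the key property: a GPU belongs to at most one cluster at a time, and a request that cannot be satisfied fails without changing the pool.

```python
from typing import Dict, List, Tuple

Gpu = Tuple[str, int]  # (host name, GPU index on that host)


class GpuPool:
    """Tracks which GPUs are free and which cluster owns each allocated GPU."""

    def __init__(self, gpus_per_host: Dict[str, int]) -> None:
        self._free: List[Gpu] = [
            (host, index)
            for host, count in sorted(gpus_per_host.items())
            for index in range(count)
        ]
        self._owned: Dict[str, List[Gpu]] = {}

    def allocate(self, cluster: str, count: int) -> List[Gpu]:
        """Give `count` GPUs exclusively to `cluster`, or fail without side effects."""
        if cluster in self._owned:
            raise ValueError(f"{cluster} already holds GPUs")
        if count > len(self._free):
            raise RuntimeError(f"only {len(self._free)} GPUs free, {count} requested")
        granted, self._free = self._free[:count], self._free[count:]
        self._owned[cluster] = granted
        return granted

    def release(self, cluster: str) -> None:
        """Return a cluster's GPUs to the free pool."""
        self._free.extend(self._owned.pop(cluster, []))

    def free_count(self) -> int:
        return len(self._free)


pool = GpuPool({"host-a": 2, "host-b": 1})
print(pool.allocate("training", 2))  # GPUs now owned by "training" only
print(pool.free_count())             # GPUs left for other clusters
```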

For most enterprise organizations today, GPUs are a premium resource and need to be utilized efficiently. When a cluster is not in use or is finished running a job, BlueData EPIC can stop the cluster and assign the GPU to a different cluster. This allows users to create multiple clusters, in different tenant environments, and use GPUs only when they need them without deleting or recreating their clusters. There is also a mechanism to create a cluster for the duration of the job as a transient cluster.
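One way to think about a transient cluster is as a resource that is held only while a job runs. Here is a minimal sketch using a Python context manager. The `GpuBudget` class and `transient_cluster` function are hypothetical names of my own, and the pool is reduced to a simple counter. The release step is the same thing that stopping a cluster would trigger.

```python
from contextlib import contextmanager
from typing import Iterator


class GpuBudget:
    """Counts free GPUs in a shared pool (a deliberately simplified model)."""

    def __init__(self, total: int) -> None:
        self.free = total

    def acquire(self, count: int) -> None:
        if count > self.free:
            raise RuntimeError(f"{count} GPUs requested, {self.free} free")
        self.free -= count

    def release(self, count: int) -> None:
        self.free += count


@contextmanager
def transient_cluster(budget: GpuBudget, name: str, gpus: int) -> Iterator[str]:
    """Hold `gpus` GPUs only while the `with` body runs."""
    budget.acquire(gpus)
    try:
        yield name
    finally:
        # Runs even if the job raises, so GPUs are never leaked.
        budget.release(gpus)


budget = GpuBudget(total=4)
with transient_cluster(budget, "nightly-training", gpus=3) as cluster:
    print(cluster, "running with", budget.free, "GPU(s) left in the pool")
print("after the job:", budget.free, "GPU(s) free")
```

The `finally` clause is the important design choice: the GPUs go back to the pool whether the job succeeds or fails.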

Improved User Productivity

Once the TensorFlow cluster is created, the containers can be enabled with AD/LDAP-controlled SSH access and secure Jupyter notebooks.

Sample Jupyter notebooks are included with the TensorFlow cluster by default, for immediate validation and testing as shown in the screenshot below.

The samples shown in the screenshot above are from the following GitHub repo: https://github.com/nlintz/TensorFlow-Tutorials. These and other tutorials are available for users to get started and be productive immediately with TensorFlow.

The screenshot below is a sample reconstruction of digits based on input digit images from the MNIST dataset using TensorFlow libraries and graph plotting, running on the BlueData EPIC platform.

The next step is to extract the data sets and model predictions based on input images and models trained using TensorFlow GradientDescentOptimizer (as shown in the screenshot below):

This screenshot shows the results, comparing input images and output predictions:
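For readers curious about what a gradient descent optimizer does at each training step, the update rule is simple: move every parameter a small step against its gradient. The sketch below is plain Python with no TensorFlow, fitting a line to synthetic data. It is not the notebook code from the screenshots, and the function name, learning rate, and step count are my own choices for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float],
             learning_rate: float = 0.05, steps: int = 2000) -> Tuple[float, float]:
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # The gradient descent update: step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # synthetic data from y = 2x + 1
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

With this data the fitted values approach the true slope of 2 and intercept of 1. Training a neural network on MNIST applies the same kind of update to many more parameters, with the gradients computed by the framework.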

Ability to Update Running TensorFlow Clusters

New libraries and packages are constantly being introduced and the needs of data science teams are constantly changing, so BlueData EPIC provides a mechanism called “action scripts” that allows users to simultaneously update all nodes of a running cluster with new libraries and packages. Users can also submit Python jobs as interactive or batch jobs for long-running processes via the web-based UI or a RESTful API.
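The general pattern behind updating every node at once is a fan-out: run the same action against each node concurrently and collect the results. The sketch below shows that pattern with Python's standard thread pool. It is not the action script mechanism itself. The `run_on_all_nodes` and `install_package` functions, the node names, and the package name are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_on_all_nodes(nodes: List[str], action: Callable[[str], str]) -> Dict[str, str]:
    """Run `action(node)` on every node concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as executor:
        results = executor.map(action, nodes)  # preserves input order
        return dict(zip(nodes, results))


def install_package(node: str) -> str:
    # Hypothetical stand-in: a real action would run a command on the node,
    # for example over SSH. Here it only reports what it would do.
    return f"{node}: would run 'pip install example-package'"


report = run_on_all_nodes(["node-1", "node-2", "node-3"], install_package)
for line in report.values():
    print(line)
```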

Streamlined Operations for Deep Learning with TensorFlow

BlueData now enables our customers to create instant TensorFlow clusters on Docker containers and start running their deep learning algorithms in a matter of minutes, without the operational overhead of setting up, configuring, and managing their new TensorFlow environments. BlueData provides:

  • The ability to jumpstart their deep learning and spin up on-demand TensorFlow clusters
  • A way to easily manage and reuse shared infrastructure resources (e.g. GPUs) with better quality of service
  • Secure multi-tenant environments accessible via web-based UI and RESTful APIs, with enterprise-grade security controls
  • Managed access to data in HDFS, NFS, and other data sources within a tenant
  • The ability to create end-to-end pipelines with ETL, Hadoop, Spark, and TensorFlow clusters running on Docker containers

By adding support for TensorFlow and GPUs in this new release, BlueData is continuing to extend its ability to run the most commonly deployed Big Data analytics, data science, and machine learning tools on the BlueData EPIC platform for Big-Data-as-a-Service. And now data science teams can get started quickly with deep learning in a large-scale multi-tenant enterprise deployment with BlueData EPIC, while ensuring the security, flexibility, elasticity, performance, and cost-efficiency that they need.