Deep Learning with BigDL and Apache Spark on Docker

The field of machine learning – and deep learning in particular – has made significant progress recently and use cases for deep learning are becoming more common in the enterprise. We’ve seen more of our customers adopt machine learning and deep learning frameworks for use cases like natural language processing with free-text data analysis, image recognition systems, threat detection, fraud detection, and more.

And as with other use cases in Big Data analytics and data science, they want to run their preferred deep learning frameworks and tools in Docker containers on the BlueData EPIC software platform.

So What is Deep Learning?

“Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.” – Source: Wikipedia

In practice, this means that the features selected and the results produced can be comparable to, and in some cases better than, what you would expect from human experts in the selected field. The modeling techniques resemble the way human experts think, learn, and infer: they introduce depth through a pipeline of feature extractors in the model. Learning techniques can discover input features, learn and refine those features, and act on data or objects to produce results. There are only a handful of open source choices for deep learning. One of those is Intel’s BigDL framework, which offers a rich set of libraries in this space.

What is BigDL?

As described in a blog post by Intel, BigDL is a distributed deep learning library for Apache Spark that can run directly on top of existing Spark or Apache Hadoop clusters. You can write deep learning applications as Scala or Python programs.

  • Rich deep learning support. Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high-level neural networks; in addition, you can load pretrained Caffe or Torch models into the Spark framework, and then use the BigDL library to run inference applications on your data.
  • Efficient scale out. BigDL can efficiently scale out to perform data analytics at “big data scale” by using Spark as well as efficient implementations of synchronous stochastic gradient descent (SGD) and all-reduce communications in Spark.
  • Extremely high performance. To achieve high performance, BigDL uses Intel Math Kernel Library (Intel MKL) and multithreaded programming in each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open source Caffe, Torch, or TensorFlow on a single-node Intel Xeon processor (i.e., comparable with mainstream graphics processing units).
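
To make the "synchronous SGD with all-reduce" idea concrete, here is a minimal sketch in plain Python. It is not BigDL's implementation and uses none of BigDL's API: the workers are simulated with a list of data shards, `all_reduce_mean` is a stand-in for the real all-reduce step, and the toy problem (fitting y = 3x + 1) is an assumption chosen only so the result is easy to check.

```python
import random

def local_gradient(w, b, batch):
    """Mean-squared-error gradient for y = w*x + b on one worker's mini-batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    n = len(batch)
    return gw / n, gb / n

def all_reduce_mean(grads):
    """Stand-in for all-reduce: combine per-worker gradients into one average
    that every worker would receive."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

def sync_sgd(shards, rng, lr=0.1, steps=500, batch_size=32):
    w, b = 0.0, 0.0  # every worker holds an identical copy of the parameters
    for _ in range(steps):
        # 1. Each worker draws a mini-batch from its own shard and computes a gradient.
        grads = [local_gradient(w, b, rng.sample(shard, batch_size)) for shard in shards]
        # 2. Synchronous step: wait for all workers, then average their gradients.
        gw, gb = all_reduce_mean(grads)
        # 3. Every replica applies the same update, so the copies never diverge.
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = random.Random(0)
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(400))]
shards = [data[i::4] for i in range(4)]  # four simulated workers
w, b = sync_sgd(shards, rng)
print(f"w={w:.3f}, b={b:.3f}")
```

Because all workers wait for the averaged gradient before updating, the result matches what a single machine would compute on the combined mini-batch. That is the property that lets this style of training scale out across a cluster without changing the model's behavior.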

Some practical use cases for BigDL include image recognition, object detection, and natural language processing. The BigDL architecture takes advantage of distributed frameworks such as Spark and works seamlessly with the Hadoop Distributed File System (HDFS) and other storage, so organizations can use their existing Hadoop and Spark environments for data storage, data processing, and mining. The distributed nature of Spark allows users to run compute processing on large and complex datasets. In addition, libraries such as Intel MKL exploit the capabilities of the Intel CPU architecture to accelerate deep learning workloads.

Considerations for Deep Learning in a Containerized Environment

Here are some of the challenges and considerations when deploying data science applications, and the BigDL framework for deep learning in particular, in a containerized Big-Data-as-a-Service environment:

  • The need for BigDL-enabled Spark clusters may be transient or permanent. The cluster provisioning should be on-demand, easy, and consistent for short or long term clusters.
  • The end user experience is very important. If it works once with a certain configuration, it should work every time with the same configuration.
  • Depending on the situation, you may want to restrict or prioritize certain use cases over others.
  • You may want to run your Spark cluster with BigDL workloads on specific machines which are better equipped to handle heavy loads (e.g. with the latest Intel Xeon processors and sufficient memory).
  • Workloads may need to run in isolated environments for security purposes.
  • Users also need secure access to their data sources (e.g. HDFS, NFS, Amazon Web Services S3) as input and output.

Here at BlueData, we’ve worked to address all of these challenges, with a focus on providing our enterprise customers with the security, elasticity, and interoperability they need in this ever-expanding field of Big Data analytics and data science. Using Docker containers, our Big-Data-as-a-Service software platform delivers elastic on-demand environments for large-scale distributed data science and deep learning use cases.

Deep Learning with BigDL and Spark on the BlueData EPIC Platform

The BlueData EPIC software platform can provide self-service, elastic, and secure environments for deep learning whether on-premises, in the public cloud, or some combination of the two in a hybrid architecture. All from the same interface, with the same user experience regardless of the underlying infrastructure.

And with our new fall release announced today, the BlueData EPIC App Store now includes a pre-integrated application image for Intel’s BigDL running on Docker containers. As illustrated in the graphic below, this means that our customers can easily spin up instant Spark clusters with BigDL for deep learning with BlueData – either on-premises or in public cloud – just as they do today for other Big Data analytics, data science, and machine learning environments.

Now enterprises that want to ensure security, elasticity, multi-tenancy, and quota management can run BigDL on the BlueData EPIC platform for a wide range of deep learning use cases.

As noted above, the App Store in our new fall release includes a fully featured Docker-based application image for Spark + BigDL, with the ability to create an on-demand containerized cluster with just a few mouse clicks. The screenshot below shows how users can place these containers on specific hosts, enabled with CPU hardware acceleration for running deep learning workloads.

After cluster creation, the Docker containers are assigned to the selected hosts and all the associated services with the Spark cluster are automatically started as shown in the screenshot below.

The Spark clusters can be accessed via SSH by authorized command line users. Users now have access to an integrated deep learning environment, with MKL and other BigDL libraries embedded and referred to from various Spark applications – including sample BigDL notebooks (e.g. with Jupyter or Zeppelin). The full set of examples and tutorials on the BigDL GitHub are included in the Spark + BigDL image and available for users.

The screenshot below shows a Jupyter notebook with the sample applications from these BigDL tutorials, packaged with the pre-integrated Spark + BigDL application image in the latest release of BlueData EPIC.

Users can go from creating a cluster to running deep learning jobs within minutes, running on some of the most powerful hardware available using the purpose-built BigDL libraries. From the Jupyter notebook shown above, the screenshot below shows an example of running a logistic regression model for digit classification on the MNIST digits dataset, using BigDL and Spark with BlueData EPIC.
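
For readers who have not seen logistic regression applied to digit classification, here is a rough idea of what such a model does. It is sketched in plain Python rather than with BigDL or Spark, and it is not the notebook from the BigDL tutorials. The 5x3 bitmaps for the digits 0, 1, and 7 are hypothetical stand-ins for MNIST images, and the training loop is ordinary multinomial (softmax) logistic regression with per-sample SGD.

```python
import math
import random

# Hypothetical 5x3 bitmaps standing in for MNIST images (illustration only).
TEMPLATES = {
    0: "111101101101111",
    1: "010110010010111",
    7: "111001010010010",
}
LABELS = sorted(TEMPLATES)  # class index -> digit

def noisy_sample(bits, rng, flip=0.05):
    """Return the bitmap as floats, flipping each pixel with probability `flip`."""
    return [float(int(c) ^ (rng.random() < flip)) for c in bits]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict_proba(W, b, x):
    logits = [sum(wj * xj for wj, xj in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(logits)

def train(samples, n_classes, n_features, rng, lr=0.1, epochs=30):
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            p = predict_proba(W, b, x)
            for k in range(n_classes):
                g = p[k] - (1.0 if k == y else 0.0)  # d(cross-entropy)/d(logit k)
                b[k] -= lr * g
                for j in range(n_features):
                    W[k][j] -= lr * g * x[j]
    return W, b

def predict(W, b, x):
    p = predict_proba(W, b, x)
    return LABELS[p.index(max(p))]

rng = random.Random(0)
train_set = [(noisy_sample(TEMPLATES[d], rng), k)
             for k, d in enumerate(LABELS) for _ in range(60)]
W, b = train(train_set, len(LABELS), 15, rng)

test_set = [(noisy_sample(TEMPLATES[d], rng), d) for d in LABELS for _ in range(30)]
accuracy = sum(predict(W, b, x) == d for x, d in test_set) / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

On real MNIST data the same model has 784 input pixels and 10 classes, and a framework such as BigDL distributes the gradient computation across the Spark cluster instead of looping over samples on one machine.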

 

A Fast and Easy Path to Deep Learning with BigDL

Now BlueData enables our customers to create instant Spark + BigDL clusters on Docker containers – and get started with their deep learning algorithms in just a matter of minutes:

  • The clusters can be provisioned on-demand via the web-based UI or a RESTful API
  • Users can expand, shrink, stop, and restart clusters as needed
  • Users can become immediately productive with notebooks such as Jupyter and Zeppelin
  • Clusters are deployed as Docker containers and can run either on-premises or in the public cloud
  • All Docker containers are secure, and during deployment they can be automatically configured for login against corporate LDAP/AD servers
  • BlueData EPIC manages secure connections to HDFS, NFS, and other data sources; users access storage through connection names that abstract away the details set up by an administrator
  • A real-world pipeline employs a pre-processing cluster, an analytical or scientific computing cluster, and a columnar database or HDFS for storage; BlueData offers the unique ability to securely spin up, manage, and use all of these components simultaneously
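
As a rough illustration of what provisioning through a RESTful API can look like from a script, the sketch below builds (but does not send) an HTTP request using only the Python standard library. Everything specific in it is an assumption made for illustration: the host, the `/clusters` path, and the JSON field names are hypothetical and are not taken from the BlueData EPIC API documentation.

```python
import json
import urllib.request

def build_cluster_request(base_url, name, workers, image="spark-bigdl"):
    """Build (without sending) a POST request that describes a cluster.

    The URL path and JSON fields are hypothetical placeholders,
    not BlueData EPIC's real API.
    """
    payload = {"name": name, "image": image, "worker_count": workers}
    return urllib.request.Request(
        url=f"{base_url}/clusters",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# example.invalid is a reserved, non-resolvable host name used as a placeholder.
req = build_cluster_request("https://epic.example.invalid/api", "bigdl-demo", workers=4)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A request like this would typically be sent with `urllib.request.urlopen(req)` along with whatever authentication the platform requires; consult the product's API reference for the actual endpoints and fields.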

With support for BigDL, BlueData offers a fast and economical path to deep learning by utilizing x86-based Intel CPU architecture and the pre-integrated Spark clusters that BlueData EPIC provides out of the box. It’s just one of the many Big Data analytics, data science, and machine learning environments that can run on the BlueData EPIC platform for Big-Data-as-a-Service.