
Distributed Data Science with Spark 2.0, Python, R, and H2O on Docker

Here at BlueData, I’ve worked with many of our customers (including large enterprises in financial services, telecommunications, and healthcare, as well as government agencies and universities) to help their data science teams with their Big Data initiatives. In this blog post, I want to share some of my recent experiences in working with the data scientists at a Fortune 500 insurance company – helping them to take advantage of technologies like Spark 2.0, Python and R libraries, as well as machine learning tools like H2O.

Data Science Tools at a Fortune 500 Insurance Company

At this particular insurance company, their data science use cases include analyzing and understanding consumer behavior, customer responses, geography-based product offerings, potential new services, and more. The analytics they perform vary depending upon the types of data they receive.

The data science team had been using a variety of existing tools including:

  • SAS and other traditional software tools to solve specific analytical problems;
  • A large single-host physical server or virtual machine, typically with 64 GB or more of memory and 16 or more cores, for exploratory analysis; and
  • Laptop computers for ad-hoc analysis on smaller data sets.

Here are some of the issues this data science team faced in using these existing tools:

  • The software on each machine is limited to what that machine’s owner knows how to install and use;
  • Data scientists own the software stack as well as the time-consuming process of maintaining the stack;
  • New integrations and upgrades are a difficult process;
  • Scaling is limited to the size of the single machine;
  • It’s difficult to share and collaborate with other users in these environments; and
  • The team is new to distributed processing and the latest Big Data technologies, so there’s a steep learning curve.

They realized that this was a siloed and ad-hoc approach with a number of limitations. The team was looking for a more streamlined and repeatable way to empower their data scientists in a distributed environment – while also providing the flexibility to use their preferred tools and try out new technologies.

One Platform to Rule Them All

After a series of discussions, we decided to use two types of environments to address most of their use cases. We were able to provide each of these environments using out-of-the-box functionality available with the BlueData EPIC software platform, powered by Docker containers.  With BlueData, they also have the ability to extend and create other new environments with different tools as needed (for example, if their use case needs to handle streaming data, they can provision a Kafka cluster within their data science environments).

By way of background, the BlueData EPIC software platform makes it faster and easier to deploy the most commonly used Big Data frameworks and data science tools either on-premises or on the public cloud. These applications and software configurations can be deployed on bare-metal servers or virtual machines in the customer’s data center or on Amazon Web Services, while providing customers with full control to configure their own environments and securely use the right tool for the right job. The environments are deployed as Docker containers, created and managed by the BlueData EPIC software platform. This significantly improves the end user experience, reduces startup time, and optimizes performance and resource utilization for a variety of users and use cases. In addition, it allows users to collaborate and share their data and results.

The screenshot below shows some of the distributed Big Data frameworks (e.g. Spark 2.0.1 and other recent versions of Spark standalone, multiple Hadoop distributions, Kafka, Cassandra) as well as other data science tools (e.g. Jupyter notebooks, Python, R) that are available within the BlueData EPIC platform. Within BlueData EPIC’s “App Store”, we provide these as pre-configured Docker images that can be installed within minutes; we also offer an “App Workbench” that allows customers to create images with their own preferred applications and tools, or upgrade to the latest software versions.

[Screenshot: the App Store in the BlueData EPIC platform]

In the case of the insurance company, the users on the data science teams were given the choice of a local compute environment (i.e. on a large virtual machine) and a distributed compute environment (e.g. with Spark and/or Hadoop), both running on the same BlueData EPIC platform. As noted above, these users also have the ability to provision any number of different Big Data environments depending on their needs. This gives them the ability to continue to use their existing data science techniques, while also exploring new tools and migrating to newer Big Data frameworks as needed. The differences between the local and distributed environments are illustrated in the graphic below.

[Graphic: local vs. distributed compute environments]

For example, with the local compute environment, the users on the data science teams at this insurance company retained the ability to work in single node (sandbox) environments on a large virtual machine for some of their use cases:

  • The main difference now is that the large, single node environments are also provisioned using self-service options from a common Docker image.
  • All users of a particular tenant can access any sandbox instance with appropriate access controls.
  • All the existing code (e.g. R scripts, Python, Bash) runs unmodified.
  • These environments offer Jupyter notebooks with Python and R kernels pre-integrated.
  • RStudio (an integrated development environment for R) is also provided in some cases.
  • Users have the ability to easily add other kernels such as Julia, if required, to Jupyter notebooks.
  • In some cases, users may also prefer to have Apache Spark on the same node for testing Spark applications in a sandbox mode.
  • All of these environments are created as Docker containers within the BlueData EPIC platform, including Active Directory integration for security.
  • Each container also includes Hadoop Distributed File System (HDFS) client libraries for potential HDFS access. BlueData EPIC provides the installation of secure HDFS on local storage out-of-the-box. Most of our customers use this as a common storage area, outside their individual sandboxes, where data and results can be shared. They can also use any external HDFS, if available.
  • For users of Amazon AWS and S3, BlueData EPIC also includes pre-built S3a storage integration (with appropriate credentials) for each of the container clusters; a brief sketch of reading from both HDFS and S3 appears below this list.

In a nutshell, the features included in the local compute environment provide security, access control, self-service provisioning, web-based notebooks for ease of use, and the ability to load and share data from common distributed storage. This was a good starting point for the team.
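
To make the shared HDFS and S3 access a bit more concrete, here is a minimal sketch of reading the same data from both locations in a single-node sandbox that has Spark installed locally. This is an illustration only: the paths, bucket name, and file layout are hypothetical, and the same files could just as easily be read from plain Python or R.

    # Minimal sketch: reading shared data from HDFS and from S3 (via s3a) in a
    # single-node sandbox Spark session. Paths and bucket name are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sandbox-exploration")
             .master("local[*]")          # single-node (sandbox) mode
             .getOrCreate())

    # Read a CSV file from the shared HDFS storage area
    claims_hdfs = spark.read.csv("hdfs:///shared/claims/claims_2016.csv",
                                 header=True, inferSchema=True)

    # Read the same kind of file from S3 through the pre-configured s3a connector
    claims_s3 = spark.read.csv("s3a://example-bucket/claims/claims_2016.csv",
                               header=True, inferSchema=True)

    claims_hdfs.printSchema()
    print(claims_hdfs.count(), claims_s3.count())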

The obvious next step was to provide these data scientists with the ability to work on a true distributed Big Data framework such as Spark (with or without Hadoop). So the second type of environment, with distributed compute, included:

  • Apache Spark 2.0.1 (with standalone resource manager).
  • Spark clusters running on Python 3.5 with full Anaconda packages installed on all nodes.
  • All nodes of the Spark cluster configured with R.
  • “Sparkling Water” (H2O + Spark) added for additional model support (a brief sketch appears below this list).
  • A fully configured Zeppelin notebook with access to Spark R, PySpark, and H2O provided out-of-the-box.
  • Some users preferred to have a Jupyter notebook connect to the PySpark kernel, with Spark R in addition to local R and Python execution.
  • As with the local compute environment, these tools are created as Docker containers within the BlueData EPIC platform – including Active Directory integration for security as well as HDFS and AWS S3 integration.

All these capabilities are available for the data science teams with just a few clicks from BlueData EPIC’s simple web-based user interface.
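
To make the “Sparkling Water” piece more concrete, here is a minimal sketch of attaching H2O to a running PySpark session and handing it a Spark DataFrame. The HDFS path is hypothetical, and the exact pysparkling method names vary somewhat between Sparkling Water releases.

    # Minimal sketch: starting H2O ("Sparkling Water") on top of a running Spark
    # cluster from PySpark. Assumes the pysparkling package that ships with
    # Sparkling Water is on the Python path; the HDFS path is hypothetical.
    from pyspark.sql import SparkSession
    from pysparkling import H2OContext

    spark = (SparkSession.builder
             .appName("sparkling-water-example")
             .getOrCreate())

    # Launch (or reuse) an H2O cluster across the Spark executors
    hc = H2OContext.getOrCreate(spark)

    # Load data with Spark, then convert it to an H2O frame for model building
    policies = spark.read.csv("hdfs:///shared/policies/policies.csv",
                              header=True, inferSchema=True)
    policies_h2o = hc.as_h2o_frame(policies)   # asH2OFrame() in newer releases

    print(policies_h2o.dim)   # rows and columns as seen by H2O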

In the case of this insurance company, it seemed to work best to use Jupyter notebooks for the local compute environments and Zeppelin notebooks for the distributed Spark compute environments. The reason was that Zeppelin reuses the same set of cores for all Spark applications, no matter how many notes you create. With Jupyter, we found that every time the PySpark kernel was launched, it created a new shell and consumed additional cores. However, this may change in a later release of Jupyter notebooks.
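
If you do want several Jupyter PySpark sessions to coexist on the same standalone cluster, one option is to cap the cores each session can claim. The sketch below uses a hypothetical master URL and resource limits.

    # Minimal sketch: limiting how many cores a single PySpark kernel can grab on
    # a standalone Spark cluster, so several notebook sessions can coexist.
    # The master URL, core cap, and memory setting are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("jupyter-pyspark-session")
             .master("spark://spark-master:7077")
             .config("spark.cores.max", "4")          # per-application cap (standalone mode)
             .config("spark.executor.memory", "4g")
             .getOrCreate())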

Here’s an example screenshot showing the distributed compute environments within BlueData EPIC for the data science team at this insurance company (note: “spark2.0.1uber” is the Spark environment including both Zeppelin and Jupyter notebooks):

[Screenshot: distributed compute environments in BlueData EPIC]

The screenshot below shows the details of that “spark2.0.1uber” environment within the BlueData EPIC platform. BlueData EPIC’s easy-to-use interface provides pre-built links to Jupyter and Zeppelin notebooks, along with ssh access to Spark-shell, R, and Python.

[Screenshot: node details for the “spark2.0.1uber” environment]

Local or Distributed?

While I was working with this insurance company, we did encounter some challenges (and opportunities): for example, we needed to figure out when to use a distributed Spark cluster as opposed to a local Python or R environment for a given use case.

In some cases, the decision is simple: do everything on distributed Spark clusters because the data is only available on HDFS clusters or other large storage systems. Even with smaller data sizes, I’ve found that sharing your models, data, and results across a team of data scientists is much easier in a distributed Spark cluster model. But it does have its disadvantages: in particular, there is some administrative overhead involved in running distributed jobs with smaller data sets. So if your data or simulation can be done using a non-distributed environment, you don’t necessarily need to use a distributed Spark cluster.  The ability to upload your pipeline and analysis to a common storage system can help with collaboration and sharing across the team.

What does work really well with a distributed environment like Spark is the ability to do data pre-processing, model building, analysis, and visualization using a variety of different techniques with the same framework. For example, you can analyze your data using Scala, Python, R, and SQL; build models using R, Python, Spark MLlib, and H2O; and run and visualize your analysis using Zeppelin or Jupyter notebooks – all on the same Spark cluster, using shared data, as shown in the code samples below with Scala, Spark R, and SQL.

[Note: The dtap:// URL shown in the code below behaves similarly to the hdfs:// URL. The BlueData EPIC platform provides DataTap functionality (“dtap”) with IOBoost and caching on top of regular HDFS data access in order to speed remote data processing.]

[Screenshot: Zeppelin code sample 1]

 

[Screenshot: Zeppelin code sample 2]
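
If the screenshots are hard to read, here is a minimal PySpark rendering in the same spirit as the Scala, Spark R, and SQL samples above: load shared data through a DataTap (dtap://) URL, register it as a table, and query it with Spark SQL. The dtap path, table name, and columns are hypothetical.

    # Minimal sketch (hypothetical dtap path and columns): read shared data via a
    # DataTap URL, then analyze it with Spark SQL on the same cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dtap-sql-example").getOrCreate()

    claims = spark.read.csv("dtap://TenantStorage/shared/claims.csv",
                            header=True, inferSchema=True)
    claims.createOrReplaceTempView("claims")

    top_states = spark.sql("""
        SELECT state, COUNT(*) AS num_claims, AVG(claim_amount) AS avg_amount
        FROM claims
        GROUP BY state
        ORDER BY num_claims DESC
        LIMIT 10
    """)
    top_states.show()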

Users can also run PySpark with their existing scikit-learn (machine learning in Python) and other techniques on Spark clusters, as shown in the Zeppelin notebook screenshot below:

[Screenshot: Zeppelin code sample 3 (PySpark with scikit-learn)]
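
For those who can’t zoom into the screenshot, here is a minimal sketch of one common way to combine the two: fan a small hyperparameter grid out across the Spark cluster and train one scikit-learn model per setting. It uses a bundled scikit-learn sample dataset and is illustrative only, not the notebook from the screenshot.

    # Minimal sketch: distributing a scikit-learn hyperparameter sweep with
    # PySpark. Uses a bundled sample dataset; in practice the data would come
    # from the shared HDFS/DataTap storage described earlier.
    from pyspark.sql import SparkSession
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    spark = SparkSession.builder.appName("sklearn-on-spark").getOrCreate()
    sc = spark.sparkContext

    X, y = load_iris(return_X_y=True)

    def train_and_score(C):
        # Train one scikit-learn model per regularization value and return its CV score
        model = LogisticRegression(C=C, max_iter=200)
        return C, cross_val_score(model, X, y, cv=3).mean()

    # Each C value runs on a Spark executor; the small dataset ships with the closure
    results = sc.parallelize([0.01, 0.1, 1.0, 10.0]).map(train_and_score).collect()
    print(sorted(results, key=lambda r: -r[1]))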

 

Again, for this insurance company, the choice of notebook was important to the data science teams. So for some use cases, they chose to run PySpark while using the Jupyter notebook (rather than Zeppelin):

[Screenshot: PySpark in a Jupyter notebook]

 

One Size Does Not Fit All

If there’s one takeaway from my experience with this insurance company (and other enterprises that I’ve worked with over the years), it’s that data science never really works with a “one size fits all” cookie-cutter solution. Data science is a highly creative process. Data scientists, data analysts, and data engineers should be given an environment that allows them to select the appropriate stack (e.g. whether local or distributed, using R or Python and/or Spark, and with Jupyter or Zeppelin notebooks) and tools to use for each phase of data processing – for their particular use case.

The BlueData EPIC software platform provides this flexibility and choice, allowing data science teams to use their preferred frameworks and tools with an easy-to-use self-service interface and elastic, on-demand Big Data infrastructure. As in the case of this insurance company, BlueData’s customers are able to get started quickly and keep pace with the rapidly changing field of data science, predictive modeling, and machine learning.

Hopefully this example gave you a glimpse into the art of the possible with BlueData EPIC for your data science team.  And if you’d like additional information about the BlueData EPIC software platform, you can learn more here.