
Large-Scale Data Science Operations

Here at BlueData, I get the opportunity to meet with many data science teams working on very interesting projects in different industries across our customer base. These are definitely exciting times to be working in the field of data science, machine learning, and analytics.

The primary goal of a data science team is to understand the strategic objectives and goals of the business, and then convert those into a series of actionable recommendations that improve or predict outcomes by identifying complex relationships between different entities. In theory, data science follows a standard workflow with well-defined inputs and outputs, as indicated in the simple illustration to the right.

However, in reality, data science is a series of multiple workflows with several iterations and checkpoints along the way. What makes it more complex is that traditional analytics tools and technologies don’t account for today’s variation in data, including different formats, latency, frequency, size, and more. As a result, the newer data science techniques used for analyzing, correlating, and modeling don’t fit well with those traditional tools. And in practice, data science often looks more like the illustration below.

Source: https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Data Science Teams and their Challenges

Data science teams include users with a variety of backgrounds: statisticians, analysts, programmers, engineers, application developers, and pure data science experts, along with communicators and interpreters. They work alongside business users to solve various business problems. Most of them would love to spend their time working on the data and finding solutions. Unfortunately, they often spend an excessive amount of time on some of the following activities before they can even start their analysis:

  • Getting environments up and running with end-to-end security. Why is this so difficult? Can’t they reuse their environments? The answer is generally “not really”. Users often need different algorithms, new ways of reading data with different formats, or more libraries to augment their current environments. Trying to repurpose the same environment for every use case is not practical and is often time-consuming (if not a complete waste of time).
  • Access to data in full fidelity. Most users are accustomed to desktop-based tools and local analysis. Data is copied to their local laptops for modeling. With more companies moving data to distributed storage (such as Hadoop HDFS), local processing is becoming impractical due to security and resource availability.
  • Scaling and collaboration. Analysis can start small with a subset of data, but actual business problems need data in different formats and from various sources. That migration or expansion is not a problem that end users can solve on their own.
  • Extending the functionality of a running cluster. Users often have to download missing packages or libraries to work on their analysis. Sometimes they also have to tune the memory, core, and other settings to meet their needs. Trying to do this in one monolithic cluster creates conflicts with other jobs and users.

Data Science Operations

The series of multiple workflows with several iterations and checkpoints along the way is what I refer to as “data science operations”: it’s the operations side of what data science teams need to get done, and it typically involves working closely with engineering and IT operations teams to make it all work. These workflows (or phases) include:

  • Acquisition phase. Data can come from databases, Kafka streams, IoT streams, log files, image files, and more. Keeping track of data provenance, including sources, versions, and transformations, can be challenging without a proper system that provides security, auditing, and tracking. Users may have to go back to the sources and reacquire data when needed.
  • Analysis phase. This is the key phase where feature selection, model building, saving/loading models, and review of results occur. This can be done by one or many users, using one or more programming languages and techniques. Support for multiple, disparate environments, parallel computing, and polyglot programming is key to success in the Analysis phase. As described in the source article referenced for the graphic above, considerations for this phase include efficient resource management, error handling, and code management, with the ability to iterate as needed.
  • Reflection phase. Reflection in data science is the process of validating the findings, sharing the progress with users, and getting their feedback. It is important to have tools that support charting, creating reports, and feeding new data and models to run what-if analysis before finalizing the models.
  • Dissemination/Deployment phase. This is the last phase, and it requires operationalizing the models. Real use cases need the ability to learn from deployments and feed that information back to the Analysis phase for continuous improvement. Many learning systems are set up to continuously rebuild models based on production results.

In my job at BlueData, I often work closely with our customers to help them with their data science operations. In this blog post, I’ll highlight some specific scenarios where the BlueData EPIC software platform addresses the challenges of data science operations. Using Docker containers, our platform provides the ability to spin up instant environments for large-scale distributed data science …  and much more.

Many of our customers have data science teams that are in the process of migrating from their laptops or personal workstations to elastic, on-demand, and multi-tenant infrastructure (whether on-premises or in the public cloud) with fully-provisioned distributed data science environments tailored for their specific analysis. And many of these customers have provided us with feedback and input that led to some of the exciting new functionality and enhancements introduced in the latest release of our software announced today: BlueData EPIC version 3.0.

Data Science Environments with BlueData EPIC

For data science, jobs can range from long-running workloads to ad-hoc iterative analyses, so the need for compute can be either persistent or transient. Users should have the ability to decide that at the job level, and have both compute options available when needed.

Here’s one scenario of moving from a laptop-based analysis to a more scalable and distributed data science model using BlueData EPIC:

  • Users start with RStudio, Jupyter Notebook, and other IDEs for analysis on their laptops. Compute also runs locally on those laptops.
  • The next step is generally to keep the same UI (e.g. RStudio, Jupyter Notebook) and push the compute to remote/distributed nodes for scaling. This requires users to connect to a Hadoop or Spark cluster from their IDEs and notebooks (see the sketch after this list).
  • To operationalize their models, users deploy their models and algorithms on a distributed cluster and automate the pipeline with APIs and schedulers.
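
For that second step, the notebook or IDE stays put while the heavy lifting moves to the cluster. Here is a minimal PySpark sketch of that pattern, assuming a Spark standalone master; the master URL and resource settings are placeholders for whatever your own Hadoop or Spark cluster exposes.

    # Minimal sketch: point a local notebook session at a remote Spark cluster.
    # The master URL and resource settings are placeholders; substitute the
    # values exposed by your own Spark standalone or YARN cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("notebook-remote-compute")
        .master("spark://spark-master.example.com:7077")  # placeholder master URL
        .config("spark.executor.memory", "4g")
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )

    # The notebook only drives the job; the executors on the remote cluster do the work.
    df = spark.range(0, 1000000)
    print(df.selectExpr("avg(id)").collect())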

To support this multi-step analysis, users can create a tenant and start with a data science sandbox on the BlueData EPIC software platform. This is generally a one- or two-node RStudio + Spark instance or Jupyter Notebook + Spark instance, as shown below.

Using the new Quick Launch functionality in BlueData EPIC 3.0, users can quickly and easily deploy a new cluster within minutes using templates. Tenant administrators can pre-configure templates with different combinations of nodes, CPU, memory, storage, services, and more to fit different use cases. By using one of these templates, the end user can get their cluster provisioned with a single mouse click.

A fully provisioned cluster will have nodes and services pre-integrated and pre-configured for immediate use. Users can easily scale their sandbox, or use other, larger clusters created in BlueData EPIC, to meet their needs. A multi-node cluster can be orchestrated within minutes, with master/worker services and notebooks or IDEs (e.g. RStudio, Zeppelin, Jupyter), as shown below.

Not all environments should be up and running at all times. Depending on the use case, users can start or stop environments at any time using either the user interface or an API. Clusters can also be expanded or shrunk as needed.
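
As a purely hypothetical illustration (the host, endpoint paths, and payload below are placeholders, not the documented BlueData EPIC REST API), scripting a stop/start cycle for an idle sandbox might look something like this:

    # Hypothetical sketch only: the controller host, endpoint paths, and payloads
    # are placeholders, not the actual BlueData EPIC REST API. The point is simply
    # that start/stop can be scripted rather than clicked.
    import requests

    CONTROLLER = "https://epic-controller.example.com"   # placeholder host
    session = requests.Session()

    # Authenticate and grab a token (hypothetical endpoint and payload).
    resp = session.post(CONTROLLER + "/api/login",
                        json={"user": "datasci1", "password": "********"})
    session.headers.update({"Authorization": "Bearer " + resp.json().get("token", "")})

    # Stop an idle sandbox at the end of the day; start it again the next morning.
    session.post(CONTROLLER + "/api/clusters/sandbox-01/stop")
    session.post(CONTROLLER + "/api/clusters/sandbox-01/start")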

Access to Data with Security

Data is the most important asset during analysis. Limited or filtered data will skew the results and won’t provide a thorough view of the facts. With large volumes of data in a variety of formats (including JSON, Parquet, Avro, CSV, ORC, image files, and more), you need flexible and distributed storage such as HDFS or NFS to hold the data.

Data scientists now have to make that transition to access and process data in these formats. Downloading this data to their local storage is usually not an option; so a secure connection needs to be established between their notebooks/IDEs and the data sources, along with distributed compute frameworks to process the data.

With BlueData EPIC, clusters are pre-integrated with client software that connects to HDFS and NFS via our DataTap technology. DataTap is an abstraction layer that allows compute clusters to connect to a remote datastore, with full support for Kerberos and credential passthrough. It eliminates HDFS client mismatch issues and also provides a read-ahead and write-back cache for better performance. This enables RStudio, Jupyter Notebook, Zeppelin Notebook, gateway nodes, and other IDEs to securely access and process data, in place, with distributed compute clusters. Conceptually, DataTap is similar to using s3a:// or EMRFS to access data in Amazon S3 buckets from Amazon EMR compute clusters.

The following example shows a sample PySpark (Python on Spark) program reading data from HDFS (using DataTap: dtap), splitting the data into training and test datasets, and building the classification model at large scale using Spark.
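
As a rough sketch of that kind of program (the dtap:// path, feature columns, and choice of model below are illustrative placeholders rather than the exact code in the screenshot):

    # Sketch of a PySpark job that reads data in place over DataTap, splits it,
    # and trains a classification model on the Spark cluster. The dtap:// path,
    # column names, and model choice are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("dtap-classification").getOrCreate()

    # Read the source data over DataTap instead of copying it to a laptop.
    df = spark.read.csv("dtap://TenantStorage/data/transactions.csv",
                        header=True, inferSchema=True)

    # Split into training and test sets, then fit a simple classification pipeline.
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["amount", "age", "num_visits"], outputCol="features"),
        LogisticRegression(labelCol="label", featuresCol="features"),
    ])
    model = pipeline.fit(train)

    # Evaluate on the held-out data; the work runs on the cluster, not the notebook.
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")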

BlueData EPIC provides rich environments like the one above out-of-the-box, without having to download or configure any additional software.

Models can also be saved to HDFS and loaded back further in the pipeline (as illustrated in the screenshot below), by the same or a different user in a different cluster. This enables collaboration between users and stages of the pipeline.
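
Continuing the sketch above, persisting the fitted pipeline to shared storage and loading it back elsewhere might look like this (again, the dtap:// path is a placeholder):

    # Continuing the earlier sketch: persist the fitted pipeline so another user,
    # or a different cluster, can pick it up later. The dtap:// path is a placeholder.
    from pyspark.ml import PipelineModel

    model.write().overwrite().save("dtap://TenantStorage/models/churn-pipeline")

    # ... later, from another cluster in the same tenant:
    reloaded = PipelineModel.load("dtap://TenantStorage/models/churn-pipeline")
    scored = reloaded.transform(test)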

Collaboration and Scaling

True data pipelines have multiple stages, and each stage may need different tools and skills. With BlueData EPIC, users can create any number of clusters, based on the same or different images, in a specific tenant. So a pipeline that requires data pre-processing, followed by modeling, and then deployment on a large, long-running cluster can run each step in an independent compute cluster, all within one tenant. Here’s an example:

  • User A may create a Hortonworks cluster (e.g. HDP 2.5) to pre-process the data and create Hive tables.
  • User B may create an RStudio and Spark cluster and try a few models on the Hive tables (as indicated in the screenshot below). Once they’re comfortable, they can save the model to HDFS.
  • User C can schedule this on a larger cluster (e.g. Spark on Mesos or Spark on YARN) and run the model on a periodic basis (see the sketch after this list).
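
A hedged sketch of what User C’s scheduled job might look like, assuming Hive support is enabled on that cluster and reusing the placeholder model path from the earlier sketch (table and column names are placeholders too):

    # Sketch of a periodically scheduled scoring job (run by cron, Oozie, or any
    # scheduler). The Hive table, column names, and dtap:// paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel

    spark = (SparkSession.builder
             .appName("nightly-scoring")
             .enableHiveSupport()          # read the Hive tables created upstream
             .getOrCreate())

    new_data = spark.sql("SELECT * FROM analytics.daily_transactions")
    model = PipelineModel.load("dtap://TenantStorage/models/churn-pipeline")

    # Score the new data and write results back to shared storage.
    (model.transform(new_data)
          .select("customer_id", "prediction")
          .write.mode("overwrite")
          .parquet("dtap://TenantStorage/scores/daily"))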

All these can be created in a matter of minutes using the BlueData EPIC UI or REST API. And as I mentioned before, the models and data can be saved to remote storage to move through the analytical pipeline.

With BlueData EPIC 3.0, we’ve added several new out-of-the-box Docker-based application images in our App Store for common data science environments (e.g. Spark with RStudio, Jupyter, or Zeppelin). BlueData also offers an App Workbench tool that allows our customers to customize existing images or create their own application images.

Flexible Resource Management for Spark

BlueData EPIC has long provided the ability to deploy Spark on YARN or Spark in standalone mode. With BlueData EPIC 3.0, we’ve added the ability to deploy Spark on Apache Mesos – offering the flexibility to deploy distributed Spark using any of these standard resource management options.

Now you can use Mesos as the resource manager and scheduler for Spark jobs running on those specific clusters. Mesos will run on all the cluster nodes within Docker containers. Some of the use cases our customers are running on Mesos include the following (a configuration sketch follows the list):

  • Jobs that require different versions of Spark on the same cluster; the version may be decided at job submission time.
  • Simultaneous long-running jobs with variable resource requirements
  • Advanced monitoring and fine-grained resource tracking
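
For example, pointing a Spark job at the cluster’s Mesos masters instead of YARN or standalone mode is largely a matter of the master URL; the ZooKeeper hosts and resource settings below are placeholders:

    # Sketch: using Mesos as the resource manager for a Spark job. The zk:// URL
    # listing the Mesos masters and the resource cap are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-on-mesos-job")
             .master("mesos://zk://mesos-1:2181,mesos-2:2181,mesos-3:2181/mesos")
             .config("spark.cores.max", "16")   # cap this job's share of the cluster
             .getOrCreate())

    spark.range(0, 10000000).selectExpr("sum(id)").show()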

The screenshot below shows an example Spark on Mesos cluster. The Mesos master is running on 3 nodes, and agents will run on the rest of the nodes.

Extending Cluster Capability

One of the standard requirements while working on a real business problem in data science operations is the need to download and add new libraries and packages to a cluster. Users, even with access, find it difficult to extend their clusters. There are many reasons why this can be hard, for example:

  • They may not have all the dependencies available on a common cluster if they are using one monolithic environment.
  • Clusters are usually multi-node, so you have to log in to each node and run your commands, or rely on Chef/Puppet scripts (and this is generally not the domain of a data scientist).

With BlueData EPIC 3.0, we now offer the ability to submit Action Scripts from within the web-based user interface or via the API. If users are authorized to log in and make changes to the cluster, BlueData can use that same authorization and allow them to submit package install scripts or yum install scripts from the user interface. In one go, all the nodes in the cluster are updated with the necessary software.
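
An install-style Action Script can be as simple as the sketch below; this is just an illustration of a script that runs on every node, assuming pip is available there, with placeholder package names:

    #!/usr/bin/env python
    # Sketch of an install-style Action Script: it runs on every node in the
    # cluster, so a single submission updates them all. Package names are
    # placeholders; assumes pip is available on the nodes.
    import subprocess
    import sys

    PACKAGES = ["numpy", "scikit-learn", "nltk"]

    for pkg in PACKAGES:
        print("Installing %s on this node ..." % pkg)
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", pkg])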

Action Scripts is a very powerful new feature. Users can script job submissions, package installs, new software installation and configuration, and much more, as long as they have the necessary access to the cluster nodes. Running post-cluster-creation scripts is a common use case for short-lived clusters. The screenshots below show a set of these capabilities, using Action Scripts, on a running cluster within BlueData EPIC.

This screenshot shows a list of actions run by different users on a cluster:

You can install libraries using sudo, if your user has permissions:

Users can run Spark, Python, or any other job using Action Scripts. This submits the job to the cluster, as the logged-in user, and runs it in the background.

The script runs on all nodes and the logs are available within the BlueData EPIC web-based user interface to monitor the progress of the script:

Agility and Flexibility for Data Science Operations

In summary, with version 3.0, BlueData EPIC delivers powerful new capabilities to help IT operations, engineering, developers, and data scientists with large-scale distributed data science operations.

Keep in mind, there is a lot more to the lifecycle of data science operations in the enterprise beyond cluster provisioning and resource management. Based on my experience working with many of our customers, you’ll need functionality like the ability to ensure efficient use of resources by easily starting / stopping / shrinking / expanding clusters; fine-grained monitoring and the ability to audit user actions; and complete isolation between different groups of users in a multi-tenant architecture.

The BlueData EPIC software platform provides all those capabilities and much more, delivering the agility, flexibility, and ease-of-use that your data science teams need — along with the scalability, performance, security and control that your IT teams require.

And if you want to try it out for yourself, you can test drive BlueData EPIC on AWS.  Just apply here for a free two-week trial.