
Apache Spark Integrated with Jupyter and Spark Job Server

Apache Spark is clearly one of the most popular compute frameworks in use by data scientists today. For the past couple years here at BlueData, we’ve been focused on providing our customers with a platform to simplify the consumption, operation, and infrastructure for their on-premises Spark deployments – with ready-to-run, instant Spark clusters.

In previous blog posts, I’ve written about how to implement an enterprise-ready, on-premises Spark deployment (powered by Docker containers) – along with tools such as Zeppelin notebooks for exploration and visualization with Spark, as well as tools for specific use cases like building real-time pipelines with Spark Streaming, Kafka, and Cassandra.

As the adoption of Spark accelerates and usage increases, the ecosystem of tools and applications around Spark is growing rapidly. In talking with organizations deploying Spark, one of the challenges I’ve seen is how they can keep pace with these new innovations – and provide all the tools that their users need, depending on their role and use cases.

So in this post, I’ll talk about how we’re continuing to integrate new Spark ecosystem applications with the BlueData software platform – with the ability to easily add productivity tools like Jupyter and Zeppelin notebooks, as well as Spark Job Server for submitting and managing Spark jobs, to our customers’ Spark deployments.

Different Tools for Different Folks

Over the past couple years, I’ve worked with many enterprises and other organizations – across multiple industries – as they’ve adopted and implemented Spark for their data science and Big Data analytics initiatives. I’ve seen some common patterns and categories of users emerge, as outlined in the graphic below:

Spark Users and Ecosystem

Although data scientists are perhaps the most often discussed users, there are in fact other types of users and consumers of Spark in any deployment. Data science is an interdisciplinary field that employs multiple systems and processes; it also requires a combination of different skills and methodologies, with different roles and users, and different toolsets.

In my experience, these users typically fall into one or more of the following categories (each with their own set of preferred tools):

  • Data Scientists – use a combination of command line and web-based tools. In particular, the Jupyter notebook with Spark, Python, Julia, and R is popular among data scientists; the Zeppelin notebook is also gaining popularity among Spark users.
  • Developers – develop Spark applications using the Spark command line interface (CLI), as well as desktop development tools (like Eclipse and IntelliJ) that submit jobs to Spark clusters via APIs.
  • Analysts – use Spark with their preferred business intelligence applications (e.g. Tableau, MicroStrategy), which connect to Spark using Spark SQL (a minimal sketch of such a connection follows this list).
  • Data Engineers – use Spark for data processing (ETL), and they are often involved with various data sources and other operational tasks.
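For the analyst scenario, the connection typically goes through the Spark Thrift server, which speaks the same HiveServer2 protocol that BI tools use over JDBC/ODBC. Here's a minimal sketch in Python, assuming the PyHive package is installed and a Thrift server is reachable; the host name, port, username, and table are placeholders for illustration:

```python
# Minimal sketch: querying Spark SQL from Python the way a BI tool would,
# via the Spark Thrift server (HiveServer2 protocol). Requires PyHive.
# Host, port, username, and table name are placeholders.
from pyhive import hive

# Connect to the Spark Thrift server (10000 is its default port).
conn = hive.connect(host="spark-thrift.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Run a Spark SQL query; BI tools issue the same kind of SQL over JDBC/ODBC.
cursor.execute("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")
for category, cnt in cursor.fetchall():
    print(category, cnt)

cursor.close()
conn.close()
```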

The tools that each of these users work with also depends on their specific use case – whether for real-time analytics using Spark Streaming in conjunction with Kafka and other frameworks, or for graph processing and machine learning use cases.

Data science users need a variety of programming languages and skills to perform their tasks in Spark, and some of these tools have rather arcane ways of integrating with Spark. For example, we often hear the requirement to provide a unified workspace for data scientists that permits them to run Python, R, Spark and other tools without having to go through a painful software installation exercise for each. Sorting out the “right” versions of Python, notebooks, kernels, and dependent libraries required to run the various tools can be a very time-consuming process.

Moreover, as noted above, the entire landscape of tools is constantly evolving. So the software tools (and the versions of that software) that were “right” yesterday are not necessarily what is “right” today. Given the dynamic nature of this ecosystem – and the range of different users that want their own user-centric portfolio of tools – it can be challenging to keep up.

Give Your Users What They Want

With these challenges in mind, BlueData has added several enhancements to our software platform that enable our customers to support this diverse base of Spark users and their tools:

  • An App Store of commonly used ready-to-run Spark infrastructure packages for these various users (e.g. data scientists, developers).
  • An App Workbench that allows our customers to easily add their own Spark applications and tools, without custom frameworks or custom scheduler coding.

While data scientists and developers (myself included) are good at math and coding, they are not usually as well-versed in building the enterprise-grade infrastructure required to deploy their applications. This is where a self-service platform with simple orchestration and one-click provisioning for Big Data application deployment (i.e. the BlueData EPIC software platform, powered by Docker containers) really comes in handy. We’ve put in all the hooks our customers need to quickly deploy Docker images for popular data science and developer applications, as well as to add their own preferred tools.

As an example of ready-to-run Spark clusters, we provide Spark version 1.5, Hive, R, and Apache Zeppelin web-based notebooks as a base package (in a pre-configured Docker image) for data scientists and other Spark users in the App Store. With BlueData, individual users can create a single large cluster or multiple sandbox environments with our simple self-service interface. A fully configured Spark cluster can be created in minutes with just a few mouse clicks:

[Screenshot: creating a Spark 1.5 cluster]

Once the Spark cluster is provisioned, so are all the related services and tools; the complete environment is ready to use just as if it were running on its own dedicated physical server. This is a big win. Data scientists and developers spend a lot of time setting up the environments they need. Providing them with the ability to automatically spin up and reproduce standard Spark environments is a huge productivity gain.

But we recognize that these ready-to-run packages won’t necessarily meet every Spark user’s needs, every use case, and every organization’s unique requirements. So we provide the ability for our customers to modify and/or augment their App Store to meet the specific (and highly dynamic) requirements of their data scientists and data analyst teams. With our App Workbench, administrators for BlueData EPIC can leverage existing starter images and templates in the App Store to create new variations and versions (e.g. using a Spark 1.5 image to create a new Spark 1.6 image) or add their own preferred tools.

Jupyter: The Notebook Formerly Known as IPython

For example, while some users may prefer Zeppelin notebooks, other advanced users may prefer the popular Project Jupyter notebook (previously known as IPython notebook). Jupyter provides a browser-based graphical notebook interface to create and share documents with live code, equations, and visualizations for data science. It supports many programming languages, including those popular in data science such as Python, R, Julia, and Scala.
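To make the notebook experience concrete, here's a minimal sketch of driving Spark from a Jupyter cell with PySpark, assuming a Python kernel where pyspark is importable; the master URL and app name are placeholders (on a BlueData cluster, the master URL would come from the provisioned environment):

```python
# Minimal sketch of using Spark from a Jupyter notebook cell, assuming a Python
# kernel where PySpark is on the path. The master URL below is a placeholder.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("notebook-example") \
                  .setMaster("spark://spark-master.example.com:7077")
sc = SparkContext(conf=conf)

# A quick sanity check: count and sum a small distributed dataset.
rdd = sc.parallelize(range(1, 1001))
print(rdd.count(), rdd.sum())

sc.stop()
```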

Starting with the Spark image described above (with Spark 1.5, Hive, R, and Zeppelin), you can use the App Workbench to create a new Docker image that includes Spark version 1.6 and a Jupyter notebook with IJulia, IPython, and Spark kernel upgrades; we also added Spark Job Server to the package (more about that below). Then your users who prefer this data science toolkit can easily spin up this new Spark 1.6 cluster:

[Screenshot: creating a Spark 1.6 cluster]

With this functionality, our customers can deploy multiple different Spark clusters (on the same set of shared hardware, using Docker containers) and simultaneously support both basic and advanced use cases for their data science users.

And when this new Spark 1.6 cluster is provisioned by the user, links to all the related services and tools are also created. Highlighted below, you can see the links to Jupyter, Zeppelin, and Spark Job Server, along with command line access to the Spark master and other nodes of the cluster:

Your data scientists now have a variety of numerical computing, statistical modeling, and large-scale data modeling tools available at their fingertips. And they have powerful notebooks that allow for easy development and sharing.

In the screenshot below, you can see some of the backend kernels available with the Jupyter notebook (i.e. Julia, Python, Spark). Users can add more kernels as their needs grow.
[Screenshot: Jupyter notebook kernels]
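If you want to see programmatically which kernels a Jupyter installation knows about (or script the setup of new ones), jupyter_client exposes the kernel spec registry. A small sketch, with the kernel names depending on what's actually installed on the node:

```python
# Sketch: listing the kernels registered with Jupyter on a cluster node.
# Uses jupyter_client's KernelSpecManager; names depend on what is installed.
from jupyter_client.kernelspec import KernelSpecManager

ksm = KernelSpecManager()
for name, path in sorted(ksm.find_kernel_specs().items()):
    print(name, "->", path)

# Registering an additional Python environment as a new kernel is typically
# done from the shell, e.g.:  python -m ipykernel install --user --name my-env
```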

The Right Tool for the Job

For developers, submitting jobs remotely via an API is another common requirement. So we’ve made it easy to incorporate Spark Job Server (an open source project available on GitHub) to support remote development and job submission in your Spark environment. Spark Job Server lets your developers submit jobs to Spark through a simple RESTful interface, without the complex setup required to connect directly to the Spark master.
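To illustrate, here's a hedged sketch of that REST workflow using Python's requests library against the endpoints documented in the spark-jobserver project (upload a jar, then submit a job). The host, port, app name, jar path, and job class are placeholders, and parameter details can vary by Spark Job Server version:

```python
# Sketch of remote job submission against the Spark Job Server REST API
# (see the spark-jobserver project on GitHub). Host, port, app name, jar path,
# and job class are placeholders.
import requests

JOBSERVER = "http://jobserver.example.com:8090"

# 1. Upload the application jar under an app name.
with open("target/wordcount-assembly.jar", "rb") as jar:
    requests.post(JOBSERVER + "/jars/wordcount", data=jar).raise_for_status()

# 2. Submit a job synchronously and read the result from the JSON response.
resp = requests.post(
    JOBSERVER + "/jobs",
    params={"appName": "wordcount",
            "classPath": "com.example.WordCountJob",
            "sync": "true"},
    data="input.string = a b a c",   # Typesafe Config passed to the job
)
resp.raise_for_status()
print(resp.json())
```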

Now your developers can use their favorite IDE (e.g. Eclipse, IntelliJ) to develop applications and manage the job lifecycle using the Spark Job Server APIs. They can create reusable Spark contexts and chain jobs into workflows with minimal overhead. Here’s a screenshot of the Spark Job Server command line interface:

[Screenshot: Spark Job Server command line interface]

The Spark Job Server UI gives access to server details and the jobs submitted to the Spark cluster. In addition, it preserves the contexts created by remote clients to support repeated job submission.

[Screenshot: Spark Job Server web UI]
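Context reuse is driven through the same REST API: you create a named context once, then point subsequent job submissions at it so they share a SparkContext (and any cached data). A sketch along the same lines as above, with placeholder names and resource settings:

```python
# Sketch: creating a long-lived named context on Spark Job Server and reusing
# it for several jobs, avoiding SparkContext startup cost on every submission.
# Endpoint names follow the spark-jobserver REST API; values are placeholders.
import requests

JOBSERVER = "http://jobserver.example.com:8090"

# Create a shared context with some resource settings (one-time setup).
requests.post(JOBSERVER + "/contexts/shared-ctx",
              params={"num-cpu-cores": "4",
                      "memory-per-node": "2g"}).raise_for_status()

# Submit several jobs against the same context; they can build on cached data.
for text in ("a b c", "a a b"):
    resp = requests.post(JOBSERVER + "/jobs",
                         params={"appName": "wordcount",
                                 "classPath": "com.example.WordCountJob",
                                 "context": "shared-ctx",
                                 "sync": "true"},
                         data="input.string = " + text)
    print(resp.json())
```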

A Comprehensive Spark Environment for Multiple Users

With the BlueData software platform, we help our customers quickly deploy Spark on-premises and deliver a complete Spark-as-a-Service environment for their data scientists, developers, analysts, and other users. That includes the latest versions of Spark, graphical web-based notebooks, persistent tables, JDBC support for business intelligence tools, job server functionality for job management, and more. And to support enterprise-grade security requirements in a multi-tenant deployment, we provide multiple levels of access controls – including storage and data access controls, as well as LDAP-integrated access control for the Spark master and worker nodes.

Here at BlueData, we’re following the evolution of Spark very closely and we’re excited about the new innovations coming in Spark 2.0 – including Structured Streaming with continuous queries, ANSI SQL support, and unified DataFrame and Dataset APIs. We’re looking forward to helping our customers take advantage of these improvements as soon as they become available. With the BlueData EPIC platform, they can keep their existing Spark clusters running (whether standalone or on YARN), and simultaneously try out all the new and improved features Spark has to offer. And as the Spark ecosystem continues to expand, they can continue to add new tools and applications to their Spark environment – with our App Store and App Workbench functionality – to keep up with this dynamic and rapidly evolving landscape.

So if you’ll be at Spark Summit in San Francisco this week (June 6th through 8th), make sure you stop by the BlueData booth for a demo of how you can quickly and easily deploy a comprehensive on-premises Spark environment – including integrations with tools like Jupyter or Zeppelin notebooks and Spark Job Server. And don’t miss Lessons Learned from Running Spark on Docker on Wednesday June 8th at 2pm with Tom Phelan, co-founder and chief architect at BlueData.