A Quick Start Guide for Deploying Apache Spark with BlueData EPIC 2.0

Apache Spark has quickly become one of the most popular Big Data technologies on the planet. By now, you probably know that it offers a unified, in-memory compute engine that works with distributed data platforms such as HDFS. So what does that mean? It means that in a single program, you can acquire data, build a pipeline, and export the results using a rich set of APIs built specifically for parallel processing. Intermediate datasets, known as RDDs (Resilient Distributed Datasets), can be stored in memory or on disk, and Spark can run 10 to 100 times faster than Hadoop MapReduce, depending on how much of the work fits in memory.

At the heart of Spark is Spark Core: a fast, general-purpose distributed execution engine that supports Java, Scala, and Python APIs. Additional libraries built on top of it support diverse workloads such as streaming, SQL, and machine learning, all running on the same core. With Spark 1.4 and subsequent releases, it has become even more exciting, with more mature machine learning pipelines and support for SparkR.

For these reasons and more, the use of Spark by individual data scientists and data analysts is growing rapidly. But it’s not always so quick and easy to implement – at least not at scale.  Implementing an enterprise-ready, on-premises Spark deployment can be very complex and it requires expertise that is generally not available to all. In a previous blog post, I wrote about how to get started with deploying Spark on-premises and outlined a few different options. There are a number of different considerations to keep in mind – including requirements for security, multi-tenancy, resource management, scalability and more.

Apache Spark and BlueData EPIC 2.0 

Here at BlueData, we’re continuing to invest in our infrastructure software platform to make it easier, faster, and more cost-effective for enterprises to get up and running with a multi-tenant Spark deployment on-premises. In fact, today we announced the latest release of our BlueData EPIC software: version 2.0. My colleague, Anant, wrote a blog post providing an overview of the new release. For Spark, we’ve added several new features in EPIC 2.0 that incorporate the innovations in Spark 1.4 and improve user productivity with Spark:

  • Spark 1.4 is now packaged with the BlueData EPIC platform – along with Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR. We are also providing a preview of Streaming SQL for Spark. BlueData EPIC 2.0 provides a turnkey solution for deploying Spark 1.4 – including these new Spark services – running on Docker containers. EPIC can be installed on physical servers or VMs; the only requirement is a standard Linux operating system.
  • Apache Zeppelin (web-based notebooks) can now be automatically provisioned in BlueData EPIC with Spark 1.4 clusters. Business analysts, data analysts, and data scientists now have a user-friendly option for data exploration and visualization with Spark; they can also have a personal workspace and collaborate with other users. Notebook interpreters include Spark SQL and Hive, and the notebooks are pre-wired to a running Spark cluster. This makes it seamless for users, so they can get productive right away without having to know much about the underlying details of Spark cluster deployment and configuration.
  • Out-of-the-box integration is now provided with the Hive metastore for creating persistent tables. This persistence enables sharing of results between users as well as between processes. It can also make data and results available to programs outside of Spark.
  • SparkR is set up and enabled, with the R software installed on cluster nodes. Traditional R users can start analyzing data with R out of the box.

A Quick Start Guide for Apache Spark with BlueData EPIC

To show you how easy it is to get up and running with Apache Spark 1.4 using BlueData EPIC, here’s a quick start guide that walks through the key steps:

Creating a new Spark cluster

  • Log in to your BlueData tenant, then go to Clusters -> Create New Cluster.
  • On the “Create New Cluster” screen, select “Spark” as the cluster type. Then select Spark 1.4 as the distribution from the drop-down list.
  • Select the node flavors based on your data volume.
  • Click submit to create a new Spark 1.4 cluster.
BlueData EPIC “Create New Spark Cluster”

Accessing data in local HDFS

  • When you create a new Spark cluster, BlueData provides a cluster file system (cluster FS) by default for local use. This is a local HDFS instance dedicated to this cluster. Using external file systems is covered below.
  • After the cluster is created, click on the cluster link (Spark1.4 in this case) and click “Cluster FS Browser”.
  • You can create directories and add data in HDFS from the user interface.
Spark 1.4 Zeppelin Notebook

Accessing data from remote Hadoop clusters and NFS

  • Enterprise data is typically stored in several different storage systems. Depending on the pipeline from which the data originated, it can be in databases, NFS, Isilon, HDFS, Gluster, object store, and more.
  • BlueData believes that analytical capability should not be restricted to certain data sources. BlueData EPIC’s DataTap technology allows Spark and other processing engines to use data from any of these sources, including HDFS.
  • In addition, EPIC’s IOBoost functionality provides read-ahead and write-back caching to ensure performance. This makes it possible for Spark processes built for Hadoop to run unmodified against a number of other sources.
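
As a rough illustration of what running unmodified against other sources can look like, the sketch below applies the same PySpark function to two storage URIs. The `dtap://` URI form, the DataTap name `MyDataTap`, the file paths, and the helper names `is_error` and `count_error_lines` are all assumptions made for this example rather than values from a real EPIC deployment; `sc` stands for the SparkContext that a Spark shell or notebook already provides.

```python
# Hypothetical locations of the same log file; only the URI differs.
SOURCES = {
    "cluster_fs": "hdfs:///data/events.log",        # local cluster file system
    "datatap": "dtap://MyDataTap/data/events.log",  # assumed DataTap URI form
}


def is_error(line):
    """Return True for log lines that mention ERROR."""
    return "ERROR" in line


def count_error_lines(sc, uri):
    """Count ERROR lines in the text file at `uri` using SparkContext `sc`."""
    return sc.textFile(uri).filter(is_error).count()


# In a Spark shell or notebook where `sc` already exists:
# for name, uri in sorted(SOURCES.items()):
#     print("%s: %d" % (name, count_error_lines(sc, uri)))
```

The processing function never inspects the URI scheme, which is the point: switching storage systems would only mean changing the path that is passed in.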

Creating a personal workspace using Apache Zeppelin

  • Enterprises have business analysts, developers, data analysts, data scientists, and data engineers, among other users, who would like to use Spark for data processing. BlueData gives all of these different users the ability to use the tools of their choice. For example, business analysts may prefer a user-friendly, web-based notebook to shell scripts and the command line.
  • When you create an Apache Spark 1.4 cluster in BlueData EPIC, a Zeppelin notebook is immediately available (with a tutorial).
  • Users can click on the Zeppelin notebook link on the cluster detail screen.
  • Once they are in Zeppelin, individual users can create their own “note” (workspace) or try the tutorial to learn how to use Zeppelin notebooks.
Spark on BlueData with Zeppelin

Iterative analysis and visualization using Spark with Zeppelin

  • Click on your note. In this example, we are using a “Spark on BlueData” note.
  • A number of interpreters are supported out of the box to process Spark, SQL, Hive, Shell, and other types of code in a notebook.
  • Let’s look at a specific example. In this note, there is a %md (markdown) section with a short introduction to the note. Then we create a DataFrame from a JSON file in HDFS and run a SQL query on the data in that DataFrame.
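
The note in the example is not reproduced here, but a paragraph of that kind might look like the following PySpark sketch, using the Spark 1.4 DataFrame API. The file path, the table name `people`, the column names, and the helper `load_and_query` are hypothetical; `sqlContext` is assumed to be the SQLContext that the notebook's Spark interpreter (for example a `%pyspark` paragraph) makes available.

```python
# Hypothetical input file and query; adjust both to your own data.
JSON_PATH = "hdfs:///data/people.json"
QUERY = "SELECT name, age FROM people WHERE age >= 21 ORDER BY age"


def load_and_query(sqlContext, json_path, table_name, query):
    """Load a JSON file into a DataFrame, register it, and run a SQL query."""
    df = sqlContext.read.json(json_path)  # DataFrameReader, available since Spark 1.4
    df.registerTempTable(table_name)      # expose the DataFrame to SQL under this name
    return sqlContext.sql(query)          # the result is itself a DataFrame


# In a notebook paragraph where `sqlContext` already exists:
# result = load_and_query(sqlContext, JSON_PATH, "people", QUERY)
# result.show()
```

Because the registered table is temporary, it is visible only within the current Spark session; it disappears when the session ends.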

Running Spark shell using command line

  • Some data scientists and developers prefer working from the command line. It gives them more control over their environments and enables them to add other components that may not be available out of the box.
  • With BlueData EPIC, the Spark cluster provides SSH access to Spark-Shell, SparkR and other Spark projects. Users can create their own environments and work in their isolated workspace.
Command-line Spark shell and R

Sharing results between sessions and users with Hive persistence

  • Spark programs allow users with a variety of skillsets to process the data. The ability to store their results, with the metadata, is highly desirable to enable sharing with other users and processes.
  • BlueData offers pre-built integration with the Apache Hive metastore to store results as tables. Spark can also access Hive tables created outside of the Spark pipeline. Users can then run both Spark SQL queries and Hive queries on persistent tables.
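
As a minimal sketch of this kind of sharing, one session could persist a DataFrame as a table and another could query it later. The table name and the helper functions below are hypothetical, and the sketch assumes that `sqlContext` is a Hive-enabled context (a `HiveContext`) connected to the metastore, which is what makes the table outlive the session.

```python
def save_as_shared_table(df, table_name):
    """Persist a DataFrame as a metastore-backed table, replacing any old copy."""
    df.write.mode("overwrite").saveAsTable(table_name)  # DataFrameWriter, Spark 1.4+


def preview_query(table_name, limit=10):
    """Build a SQL statement that previews the first rows of a table."""
    return "SELECT * FROM {0} LIMIT {1}".format(table_name, int(limit))


def read_shared_table(sqlContext, table_name, limit=10):
    """Query a persistent table, possibly written by another user or session."""
    return sqlContext.sql(preview_query(table_name, limit))


# Session A, after computing a result DataFrame `adults`:
# save_as_shared_table(adults, "adults_by_age")
#
# Session B, later and possibly as a different user:
# read_shared_table(sqlContext, "adults_by_age").show()
```

Unlike a temporary table, a table saved this way is recorded in the metastore, so Hive queries outside of Spark should be able to see it as well.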

Submitting Spark jobs via the BlueData user interface

  • BlueData not only provides out-of-the-box capability to create Spark clusters; it also offers a friendly user interface for submitting Spark jobs without having to log in to the cluster or know cluster details such as the IP addresses and ports of running services.
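
For readers who have not yet packaged a Spark job, the kind of script one might submit is sketched below as a small PySpark word count. The script, its two command-line arguments, and the application name `WordCount` are assumptions made for illustration; how the EPIC interface accepts a script and its arguments is not shown here.

```python
import re
import sys


def tokenize(line):
    """Split a line of text into lowercase words."""
    return re.findall(r"[a-z0-9']+", line.lower())


def run(sc, input_path, output_path):
    """Count word occurrences in `input_path` and write them to `output_path`."""
    counts = (sc.textFile(input_path)
                .flatMap(tokenize)                  # one record per word
                .map(lambda word: (word, 1))        # pair each word with a count of 1
                .reduceByKey(lambda a, b: a + b))   # sum the counts per word
    counts.saveAsTextFile(output_path)


# Runs only when executed as a script with exactly two arguments:
#   <input path> <output path>
if __name__ == "__main__" and len(sys.argv) == 3:
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    try:
        run(sc, sys.argv[1], sys.argv[2])
    finally:
        sc.stop()
```

Keeping the per-record logic in plain functions such as `tokenize` makes it possible to check that logic locally before the job is ever sent to a cluster.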

In summary, the BlueData EPIC software platform offers a turnkey solution to deploy Apache Spark in an on-premises environment for multiple users (and multiple types of users). Enterprises can get started quickly and users can be productive with Spark in a short amount of time – with instant access to their own Spark clusters and all of the associated services, data, and tools they need. It’s EPIC. And you can try it out for free.