Building Complete Data Pipelines with Apache Spark: How to Get Started

Organizations in every industry are faced with the need to analyze and solve Big Data challenges in a timely manner. Many of them are trying new data platforms that can help them adapt quickly and respond to these new data challenges. They need to find the right balance of innovation and sustainability; it’s a continuous process that requires ongoing calibration to both existing and new platforms.

More and more of these organizations are viewing the delineation between their traditional data warehouse and the new Big Data platforms as a false dichotomy. It is not an “either/or” decision, since there are typically multiple use cases each with different response times, availability, and interoperability requirements.

I’ve worked with many different organizations – including large Fortune 500 enterprises, public sector entities, and smaller companies – that need to evolve their data strategy and data platforms to meet these new challenges. They have large sets of both internal and external data that are growing rapidly, and they need to explore data from multiple sources that don’t fit the traditional mold.

In the past, traditional data warehouses did a very good job of quickly answering the queries that they were built to answer. Some organizations may have also had separate systems for statistical analysis, and other systems to explore “out of the ordinary” insights. They used this data to examine a variety of use cases – for example, a customer’s interest in a product, their propensity to buy, frequency of visits, satisfaction index, and much more. But it could be difficult to pull this analysis together so that it told different sides of the same story (instead of different or conflicting stories).

Apache Hadoop has provided the ideal data platform for a unified, logical view of an entity like a customer. It supports a variety of data pipelines and computation techniques in the same platform. Hadoop (and MapReduce in particular) uses disk-based persistence for intermediate data and also for data across a pipeline of tasks. In general, this makes it a good fit for large-scale, cost-effective, batch processing workloads. But for some use cases, it may not be the best fit.

Apache Spark offers a unified, in-memory compute engine that works with a distributed data platform such as the Hadoop Distributed File System (HDFS). So what does that mean? It means that in a single program you can acquire data, build a pipeline, and export the results using a rich set of APIs specially built for parallel processing. Intermediate datasets (Resilient Distributed Datasets, or RDDs) can be stored in memory or on disk, and performance can be 10 to 100 times faster than disk-based MapReduce, depending on how much of the working data fits in memory.
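As a rough illustration of the “single program” idea, the PySpark sketch below reads text, builds a small word-count pipeline, keeps the intermediate RDD in memory (spilling to disk if needed), and writes out the result. The paths, the application name, and the helper names `tokenize` and `run_pipeline` are assumptions made for illustration; they are not taken from any particular deployment.

```python
import re
import sys


def tokenize(line):
    """Split one line of text into lowercase words."""
    return re.findall(r"[a-z0-9']+", line.lower())


def run_pipeline(input_path, output_path):
    """Acquire data, transform it, and export the result in one program."""
    # Imported here so tokenize() can be used without a Spark installation.
    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = SparkConf().setAppName("pipeline-sketch")  # hypothetical app name
    sc = SparkContext(conf=conf)
    try:
        # Acquire: input_path may be an hdfs:// or file:// URI.
        lines = sc.textFile(input_path)

        # Build: each transformation defines a new RDD; nothing runs yet.
        words = lines.flatMap(tokenize)
        # Keep the intermediate RDD in memory, spilling to disk if it does not fit.
        words.persist(StorageLevel.MEMORY_AND_DISK)
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

        # Export: this action triggers the computation.
        counts.saveAsTextFile(output_path)

        # A second action reuses the persisted RDD instead of re-reading the input.
        print("distinct words:", words.distinct().count())
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    run_pipeline(sys.argv[1], sys.argv[2])
```

A script like this would typically be launched with `spark-submit`, passing an input and an output location as the two arguments. Persisting `words` only pays off because two actions consume it; a pipeline with a single action would not need it.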

Apache Spark includes Spark Core, a fast and general distributed execution engine that supports Java, Scala, and Python APIs. Additional libraries, built on top, allow diverse workloads for Streaming, SQL, and Machine Learning running on the same core. With the recent release of Spark 1.4, this capability gets even more exciting with mature pipelines and support for SparkR.
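To make the “same core” point concrete, here is a minimal sketch that parses raw text with the core RDD API and then queries the same data with the SQL library, all in one program. It assumes a hypothetical comma-separated input of `customer,product,amount` lines, and it uses `SparkSession`, the entry point introduced in Spark 2.0; the Spark 1.x releases discussed here exposed the same functionality through `SQLContext`. The names `parse_visit` and `spend_per_customer` are invented for this example.

```python
def parse_visit(line):
    """Parse a 'customer,product,amount' line into a (str, str, float) tuple."""
    customer, product, amount = line.strip().split(",")
    return (customer, product, float(amount))


def spend_per_customer(input_path):
    """Return total spend per customer using the RDD and SQL APIs together."""
    # Imported here so parse_visit() can be used without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-sketch").getOrCreate()
    try:
        # Core API: read and parse the raw text as an RDD of tuples.
        visits = spark.sparkContext.textFile(input_path).map(parse_visit)

        # SQL library: give the same data column names and query it as a table.
        df = spark.createDataFrame(visits, ["customer", "product", "amount"])
        df.createOrReplaceTempView("visits")
        result = spark.sql(
            "SELECT customer, SUM(amount) AS total FROM visits GROUP BY customer"
        )
        return result.collect()
    finally:
        spark.stop()
```

The parsing step and the SQL query run on the same engine and the same cluster, so no data has to be exported to a separate system between the two.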

So how do organizations typically get started on their journey with Apache Spark?

I’ve seen three different categories of users (whether data scientists and business analysts or the IT teams that support them) who are interested in Spark:

  1. Existing Hadoop users who are looking to try Spark, for specific newer workloads that MapReduce may not be well suited for (e.g. interactive queries and iterative algorithms).
  1. Users who need the complete Hadoop stack as well as Spark in their Big Data environment.
  1. Users who are only interested in Spark standalone, and don’t have use cases for Hadoop at this time.

Sometimes two or three of the above may exist within the same organization. It’s not uncommon to see multiple user silos or project teams at different stages (and with different use cases with different requirements) in their Big Data initiatives.

At BlueData, we offer an infrastructure software solution to simplify on-premises Big Data deployments – and we’ve been working closely with many different organizations deploying Spark as well as Hadoop. For each of the scenarios above, we provide options to help you get started quickly and grow as your needs change over time:

  1. Existing Hadoop users can easily spin up new Spark compute clusters on BlueData, and utilize the same data as their on-premises Hadoop platform. They can scale Spark independently of Hadoop, with reduced stress on Hadoop nodes. Users have the ability to quickly create their own sandbox without affecting other processes; they can also use newer versions of Spark not yet supported by Hadoop vendors. They can access data from one or more Hadoop clusters from the same Spark cluster; upload data from their laptops to local ClusterFS and test various programs; and run Spark on HDFS as well as NFS, Gluster, and other storage systems. Administrators also have the ability to script Spark cluster life cycle management and pipelines using BlueData’s RESTful APIs.
  1. Users who need both Hadoop and Spark can get started right away with BlueData and scale on-premises based on their needs. They have the ability to create multiple compute clusters connecting to one or more data sources. They can share data across clusters within a given tenant; users can also share data across tenants by connecting to an external data source. And with minimal IT involvement, they’ll have a complete Hadoop + Spark environment; users can deploy new Spark or Hadoop clusters within minutes and get productive in a very short time.
  1. If users are only interested in using Spark and have data in existing sources such as NFS, new Spark clusters can be created from scratch in a matter of minutes. Data can be stored in NFS or another storage environment as the backend file system. They can take advantage of the full functionality of the BlueData software platform, including multi-tenancy, security, user management, isolation, and auditing – with the agility and efficiency of Spark-as-a-Service in an on-premises model.

The brief video below provides a quick tutorial about how to get started with Apache Spark using the BlueData EPIC software platform:

If you’ll be at Spark Summit in San Francisco this week, please stop by the BlueData booth to see a live demo and talk with one of our experts. And if you want to get started with your own personal multi-node Spark sandbox, you can download the free version of our software to run on your laptop at