If you’re following the Big Data space, you’ve most likely heard at least something about Apache Spark.
Originally developed in the AMPLab at UC Berkeley by Ion Stoica and Matei Zaharia, Spark is an open-source in-memory cluster computing engine for large-scale data processing. By keeping data in memory, Spark lets users run repeated queries quickly, making it particularly well-suited to iterative algorithms such as those used in machine learning.
Ion (a BlueData advisor) and Matei have gone on to move Spark out of the lab by founding Databricks. Here at BlueData, we introduced our Databricks-certified platform for Spark last fall. And last week, the Databricks team held the inaugural Spark Summit East event, where BlueData was one of the sponsors.
Spark is definitely on fire, with explosive growth in interest; it's now the hottest open-source project in big data analytics. RedMonk analyst Donnie Berkholz recently wrote a blog post highlighting this skyrocketing growth (see the graph below); he closed by stating, "Spark is a technology you can't afford to ignore if you're looking into modern processing of big datasets." One of the clear takeaways from last week's Spark Summit East is that data scientists love Spark and it's likely to overtake MapReduce in the near future.
However, as Spark adoption grows, we're also starting to see some predictable challenges. As with Hadoop and other Big Data technologies, Spark can be difficult to install and configure. Matthew Glickman from Goldman Sachs presented an excellent session at the summit, highlighting some of the benefits (citing Spark as the new "lingua franca" for Big Data analytics) as well as some of the challenges of deploying Spark in an enterprise environment. This particular slide from his session caught my eye:
As Matthew pointed out in this slide, “Getting machines provisioned to run Spark” is non-trivial. That’s why BlueData is focused on making Spark infrastructure easy to deploy on-premises. With the BlueData EPIC software platform, you can spin up virtual Spark clusters within minutes – providing secure, on-demand access to Big Data analytics and infrastructure. Here’s a brief demo that shows how:
No more hassle of finding and installing the right libraries; EPIC does this for you. No more replicating data: you can run Spark with or without the Hadoop ecosystem, and Spark programs can access data in HDFS or any shared storage system.
And in answer to the question on Matthew's slide about running HDFS and Spark on the same cluster, BlueData responds with a resounding "No, that's not necessary." Why tie your compute resources to your data resources unnecessarily? BlueData's stance has always been that compute and storage should be scaled independently. The BlueData EPIC platform supports this scaling, initially for Hadoop clusters and now for Spark as well, and provides high-speed access to remote data stores with its DataTap and IOBoost technologies.
Spark is a breakthrough Big Data technology with great potential. Here at BlueData, we’re focused on accelerating the adoption and benefits for our enterprise customers by making it easy to deploy Spark infrastructure on-premises: