Over the past several years, many Fortune 500 enterprises have implemented their Hadoop architectures with a data-centric approach. The common theme in these architectures is the build-out of one or more Hadoop clusters supporting the concept of a centralized data lake. In other words, they created a central storage repository that serves as the source of data for use cases such as data warehouse offload and batch processing of semi-structured (e.g. log, clickstream) or unstructured data (e.g. documents, videos). Because all types of data accumulate in this single storage repository, these organizations required a strong focus on enterprise security and data governance to control access to and use of this data.
The Data-Centric Model of Hadoop
The characteristics of this data-centric model include:
- A single large Hadoop cluster with an emphasis on Hadoop Distributed File System (HDFS) capacity
- Strong data management including ingestion, cataloging, transformation, and auditing
- Strong data security enforcing access controls and encryption
- A relatively small number of workloads or jobs (e.g. ETL offload, daily/weekly log processing)
- Bare-metal physical server infrastructure, with direct-attached storage (DAS)
Considerations and Challenges
While this “single cluster” approach has served its purpose for building out a highly governed and secure data lake, the natively available compute services on these clusters (e.g. the standard services in a Hadoop distribution such as MapReduce, Hive, Impala, and specific versions of Spark) are often insufficient to satisfy the ever-changing demands of modern data-driven applications, which are frequently developed and used by multiple different business groups simultaneously.
Here are two specific examples where I’ve seen this happen:
- A large information services provider, with a portfolio of over two dozen in-house data applications, migrated their data pipeline from a farm of SQL Server databases to a well-designed centralized Hadoop data lake. While the cost savings from their data consolidation were impressive, they were unable to create an operational model that enabled their development teams to leverage new products and tools (e.g. new releases of Spark, Jupyter notebooks, H2O, RStudio) that were essential to delivering their next generation data applications. The Hadoop cluster they used was not functionally or operationally optimized for a data-driven application development lifecycle.
- A leading healthcare organization built out what is probably the most sophisticated Hadoop data lake I’ve ever seen, with stringent security and data governance. But it was relatively inflexible when it came to meeting the needs of the data scientists and analysts in their business units. For example, one of their business units wanted to use their preferred BI/ETL tool; yet this tool required software to be installed on every Hadoop node of the data lake. While this was technically possible, it was simply not a sustainable model for the data lake operations team to install, manage, and upgrade specific software and tools for every business unit while also meeting IT objectives around data management, security, and governance.
Simply put, the pace of innovation around Big Data compute services such as Spark (as well as R, Python, H2O, web-based notebooks like Jupyter or Zeppelin, and a myriad of other analytics tools) places significant burdens on this data-centric approach and the traditional architecture of the centralized Hadoop data lake. As a result, I’m seeing a new model emerge where Big Data architects are seeking a more application-centric approach that decouples the applications (and compute services) from the data lake.
An Application-Centric Model for Big Data
There is now a fundamental paradigm shift underway in enterprise Big Data architectures: “compute” services are being offloaded from the compute resources of the centralized Hadoop data lake. Instead, these compute services (e.g. Spark and other new innovations in the Big Data ecosystem) are being placed in a separate “compute only” application environment.
In other words, I’ve seen more and more enterprise organizations embrace the idea of a Big Data application environment independent of their centralized Hadoop data lake, to provide flexibility for application developers and data scientists while still enforcing governed access to the data lake. A similar type of architecture has already been implemented in public cloud services such as AWS, where cloud storage (e.g. S3) acts as a centralized data lake and various compute services (e.g. EMR) or manually installed Spark clusters serve as separate “compute” clusters. Now this type of architecture can be implemented on-premises, either in a greenfield Big Data deployment or to complement and extend an existing Hadoop data lake deployment.
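To make the decoupling concrete: a separate “compute only” Spark cluster can simply point its Hadoop client configuration at the data lake’s NameNode and read data in place over the network, much as an EMR cluster reads from S3. A minimal sketch in `spark-defaults.conf` form, where the NameNode hostname and port are placeholders for your own data lake:

```
# spark-defaults.conf on the separate "compute only" Spark cluster.
# The NameNode address below is a placeholder for your data lake.
spark.hadoop.fs.defaultFS    hdfs://datalake-nn.example.com:8020

# Jobs on this cluster can then read data-lake paths in place, e.g.:
#   spark.read.parquet("hdfs://datalake-nn.example.com:8020/warehouse/clickstream")
```

Any `spark.hadoop.*` property is passed through to the underlying Hadoop client configuration, so the compute cluster needs no local HDFS of its own.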
The important thing is that this architecture helps deliver a more application-centric approach to Big Data analytics. The characteristics of this application-centric model include:
- The ability to use the “right” tool for each workload. In other words, it provides choice and flexibility for data scientists, analysts, and developers
- Support for different tools and/or clusters for different user groups or business units
- Decentralized administration of compute services
- Easy yet secure access to the Hadoop data lake
- Use of modern infrastructure options (e.g. Docker containers) for Big Data, rather than traditional bare-metal infrastructure with DAS
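On the last point, the idea of a disposable containerized compute environment can be sketched with a stock Spark image and Docker Compose (the image tag and service layout below are illustrative assumptions, not a BlueData configuration):

```yaml
# docker-compose.yml: a throwaway Spark standalone "compute" environment.
# Data stays in the remote data lake; nothing here hosts HDFS.
services:
  spark-master:
    image: apache/spark:3.5.1   # stock image; tag is an assumption
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "7077:7077"   # cluster port for workers and spark-submit
      - "8080:8080"   # master web UI
  spark-worker:
    image: apache/spark:3.5.1
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master
```

Tearing this environment down, or upgrading the image tag, never touches the data lake itself; that separation is precisely the flexibility the application-centric model is after.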
Considerations and Challenges
The key to success for any Big Data initiative is to ensure that your application developers, data scientists, and analysts have access to the right tools. Timely and cost-effective access to the right data is imperative; but more and more, it’s all about the applications. And yet there are significant challenges with providing the choice and flexibility these users need for Big Data analytics and application development.
Many of these challenges are in areas like operations, management, and security. In particular, it requires that you address questions such as:
- How can your organization provide these users with their choice of Big Data frameworks and tools, while also providing the right set of controls and security? Can you accomplish this without increased administrative and overhead costs?
- How can you streamline the provisioning of these different tools and clusters, while also ensuring stringent security and without requiring manual intervention for each request?
- Last but not least, how can you control which data sets in the Hadoop data lake are available to which compute clusters and/or user groups?
With the BlueData EPIC software platform, you can address these challenges and enhance a data-centric Hadoop implementation with a more application-centric approach – as shown in the diagram below:
Over the last couple years, we at BlueData have witnessed the evolution of Big Data architectures from the traditional data-centric Hadoop model to a more application-centric approach. Many large enterprises with existing Hadoop deployments are now seeking out ways to move their Big Data applications to production faster and more cost-effectively, while accessing data in their existing Hadoop data lake.
BlueData is a perfect fit for this new application-centric model, leveraging Docker containers to provide a more flexible and agile “cloud-like” experience for Big Data deployments on-premises. The BlueData EPIC software platform allows multiple teams and users to provision production-grade Hadoop and Spark clusters with their analytics tools of choice in just a few clicks. Using BlueData’s proprietary DataTap technology, they can easily and securely tap into the data that already resides in the Hadoop data lake – without moving or copying data.
As a result, they are no longer dependent on the compute services associated with a single Hadoop cluster and its rigid, limited architecture. The compute services are now offloaded from the centralized Hadoop data lake and placed in a separate and highly flexible containerized environment, powered by BlueData. And they don’t need to go through the complex and painful process of upgrading HDFS to leverage new versions of Hadoop and other Big Data ecosystem products.
The BlueData EPIC solution allows designated administrators to rapidly onboard new Big Data ecosystem products (whether the latest version of your Hadoop distribution, the latest version of Spark standalone, Jupyter or Zeppelin notebooks, or tools like Informatica and H2O) using Docker containers and our Application Workbench. Enterprises can quickly provide the right set of tools for the right job – providing a more responsive, easier-to-use, and agile application environment for data scientists, analysts, engineers, and application developers.
By running these offload clusters in Docker containers with BlueData, these enterprise deployments now have the agility to be more responsive to business users’ needs and meet their enterprise SLAs with security and control. Referring back to the specific examples I cited earlier:
- The large information services provider uses the BlueData EPIC platform to enable their data scientists and application developers to take advantage of the latest Big Data innovations (such as Spark, Jupyter or Zeppelin notebooks, machine learning tools like R and H2O, and/or newer versions of SQL-on-Hadoop) that aren’t available on their Hadoop data lake cluster.
- The leading healthcare organization is now able to offload the BI/ETL tool mentioned above to a containerized Hadoop cluster running on the BlueData EPIC platform. Specific data sets are exposed to this offload cluster via BlueData’s DataTap functionality in order to support in-place analytics – thereby eliminating both the need for data duplication and the requirement to install the BI/ETL software on each and every node.
To make this all work requires the same strong focus on enterprise security, data governance, and access controls that were requirements for the traditional data-centric Hadoop architecture. And over the past year, BlueData has released multiple innovations and enhancements (including our summer and spring releases) to ensure security and control for production Big Data deployments using Docker containers.
The newly announced fall release of the BlueData EPIC software platform is another significant milestone that allows enterprises to accelerate and streamline their evolution to this application-centric model – while also providing the enterprise-class security of the traditional data-centric Hadoop architecture. Some of the new functionality includes:
- Automated Kerberos setup for Hadoop ‘compute’ clusters: This ensures that when a specific developer or a team administrator creates a cluster, there is no additional management or administrative burden on the Hadoop team to secure that cluster with Kerberos.
- Automated management of AD/LDAP groups: While it can be relatively straightforward to ensure that these application-centric compute clusters share the same AD/LDAP system as the data lake, this release automates the entire lifecycle of managing AD/LDAP users and groups. For enterprises that have dozens of different teams or tenants consisting of dozens or even hundreds of developers and Big Data analysts, features like these are critical for operational controls (and for doing more with less).
- Integration with Linux privileged access management tools: The BlueData EPIC software platform is the foundation for the application-centric model, and now it can be regulated by these access management tools from installation to runtime – to improve security controls and record all privileged access sessions for auditability and compliance.
- Enhanced virtual networking and storage support: With this release, BlueData has introduced key performance and scalability enhancements to rapidly onboard more production workloads at scale using Docker containers.
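As a rough illustration of what automated Kerberos setup must produce for each new compute cluster, here is a small Python sketch that templates a Kerberos-enabled `core-site.xml`. The configuration keys are standard Hadoop security settings; the generator itself is hypothetical and not BlueData’s actual implementation.

```python
# Sketch: template a Kerberos-enabled core-site.xml for a new compute cluster.
# This illustrates what automated setup must produce; it is NOT BlueData's
# proprietary implementation. The NameNode address below is a placeholder.
from xml.sax.saxutils import escape

def kerberos_core_site(extra=None):
    """Return core-site.xml text enabling Kerberos authentication."""
    props = {
        # Standard Hadoop security keys for a Kerberized cluster:
        "hadoop.security.authentication": "kerberos",
        "hadoop.security.authorization": "true",
    }
    props.update(extra or {})
    body = "\n".join(
        f"  <property><name>{escape(k)}</name><value>{escape(v)}</value></property>"
        for k, v in sorted(props.items())
    )
    return f"<configuration>\n{body}\n</configuration>\n"

print(kerberos_core_site({"fs.defaultFS": "hdfs://datalake-nn.example.com:8020"}))
```

The point of automating this step is that a developer provisioning a cluster never has to hand-edit these files, and the Hadoop team never has to Kerberize the new cluster manually.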
The new fall release incorporates the feedback and lessons learned from our customers’ production deployments of Big-Data-as-a-Service (BDaaS) in an on-premises environment. And it addresses several security, networking, and infrastructure requirements for achieving the agility and flexibility benefits of an application-centric model for Big Data – while also ensuring the governance and control of the traditional data-centric approach for Hadoop.
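The AD/LDAP lifecycle automation described above ultimately reduces to continuously reconciling directory group membership with each tenant’s membership. A hypothetical Python sketch, with the directory lookup stubbed out as plain sets (group and user names are invented for illustration):

```python
# Sketch: reconcile an AD/LDAP group against a tenant's current members.
# In practice the directory side would come from an LDAP query; here it is
# stubbed with plain sets to show the reconciliation logic only.

def sync_plan(ldap_members, tenant_members):
    """Return (to_add, to_remove) needed to match the directory."""
    return sorted(ldap_members - tenant_members), sorted(tenant_members - ldap_members)

# Example: a "data-science" AD group vs. the tenant's current users.
ldap_group = {"alice", "bob", "carol"}
tenant = {"bob", "dave"}
to_add, to_remove = sync_plan(ldap_group, tenant)
print(to_add, to_remove)  # → ['alice', 'carol'] ['dave']
```

Running this reconciliation automatically on every change is what turns dozens of tenants and hundreds of users from an operational burden into a non-event.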
Learn More at Strata + Hadoop World in New York
If you’ll be at Strata + Hadoop World this week (September 27-29) in New York, you can see all of this in action. Just stop by the BlueData booth where we’ll be featuring demos of our new fall release.
And to learn more about security, networking, and other key considerations for deploying Big Data in an on-premises enterprise environment using Docker containers, mark your calendar at the conference for a session with my colleague Tom Phelan on “Lessons Learned Running Hadoop and Spark in Docker” at 2:05pm on Thursday September 29.