Distributed Machine Learning Environments with H2O on Containers

More and more enterprises are adopting machine learning in support of their AI and digital transformation initiatives. Here at BlueData, I’ve worked with many customers across multiple industries to implement their machine learning algorithms. Whether in financial services, insurance, life sciences, healthcare, manufacturing, retail, telecommunications, or government … adoption is accelerating across every sector for a wide range of use cases.

Example Use Case: Fraud Detection in Financial Services

As just one example, I’ve recently worked with several of our Fortune 500 customers in the financial services industry to help them deploy distributed machine learning environments for the detection and prevention of fraud involving credit cards, insurance, and more. By detecting sooner the activities that can lead to fraud, these customers can avoid significant losses and claims.

Implementing distributed machine learning for fraud detection requires several specific things, including:

  • Access to large datasets — e.g. using Hadoop Distributed File System (HDFS) as their data lake
  • A rich set of machine learning and deep learning algorithms — e.g. using powerful tools such as H2O and TensorFlow
  • Data preparation and sampling for model estimation and testing
  • Distributed computing frameworks — e.g. with Apache Spark
  • Testing stage and deployment frameworks
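To make the data preparation and sampling item a bit more concrete: fraud cases are usually a small minority of all records, so a naive random split can leave the test set with very few of them. The following is a minimal sketch of a stratified train/test split in plain Python. It is not taken from any customer deployment; the `stratified_split` helper, the `is_fraud` field, and the synthetic transactions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=0):
    """Split rows into (train, test), sampling each label group separately
    so that rare classes keep roughly the same share in both sets."""
    rng = random.Random(seed)

    # Group the rows by their label value (e.g. fraud vs. not fraud).
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        # Take the same fraction from every label group.
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Synthetic, made-up data: 1,000 transactions, 20 of them flagged as fraud.
transactions = [{"id": i, "is_fraud": i % 50 == 0} for i in range(1000)]
train, test = stratified_split(transactions, "is_fraud", test_fraction=0.2, seed=42)

fraud_in_train = sum(row["is_fraud"] for row in train)
fraud_in_test = sum(row["is_fraud"] for row in test)
print(len(train), len(test), fraud_in_train, fraud_in_test)
```

On a real data lake the same idea would normally be expressed with the sampling functions of the compute framework in use rather than in-memory Python lists.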

Fraud is a continuous threat, so a continuous fraud prevention program is required to safeguard organizations from the risk of fraud and reduce the time it takes to uncover fraudulent activity. To address this need, these enterprises require a proven solution that allows their data science teams to access data, train models, deploy, and retrain in an easy and seamless manner.

The data for a fraud detection program comes from several different sources and, as noted above, many enterprises store this data in their HDFS data lake. They may use compute frameworks like Spark to read and write these large datasets, run basic SQL queries, and build data pipelines on top of them.

This is where BlueData comes into the picture … we provide a container-based software platform that allows their data science teams to easily deploy sandbox environments, training clusters, and inference clusters on-demand — with secure access to this data. They can get up and running quickly with containerized machine learning environments — and scale out from sandbox and dev/test to production — in a multi-tenant architecture with a shared pool of resources using CPUs and/or GPUs.

Building sophisticated machine learning models for a use case like fraud detection also requires transforming domain knowledge into practical applications. This demands expertise in data science techniques, such as clustering, forecasting, and classification.

To provide this capability, some of the most popular tools we’ve seen time and again in our customer deployments come from H2O.ai: including open source H2O, H2O Sparkling Water for machine learning with Spark, and the automated H2O Driverless AI.

H2O.ai + BlueData for Machine Learning in the Enterprise

Today we’re proud to announce our partnership with H2O.ai, to support this customer demand and provide an end-to-end solution for machine learning in the enterprise. As noted above, the data science teams at many of our customers (in financial services as well as in healthcare, insurance, life sciences, and other industries) are turning to H2O.ai for data preparation, a rich set of machine learning models, and visualization capabilities.

BlueData and H2O.ai are natural partners in this area, helping our joint customers to implement a wide range of different use cases (including addressing fraud before it happens, as just one example). Our partnership includes integration of H2O.ai’s full suite of products with the container-based BlueData EPIC software platform. The result is a powerful combination to help customers rapidly deploy and scale their distributed machine learning environments, while ensuring enterprise-grade security and performance.

With BlueData + H2O.ai, our joint customers have an integrated solution for large-scale distributed machine learning pipelines — to implement sandbox and dev/test environments as well as production deployments. For example, now they can:

  • Quickly spin up containerized environments for developers — pre-provisioned with H2O libraries and integrated with Jupyter notebooks as well as other Python / R frameworks;
  • Ensure seamless support for H2O on CPUs as well as GPUs if needed, depending upon the use case;
  • Provide seamless integration with their enterprise AD / LDAP systems for secure access to these environments;
  • Offer on-demand connections to Spark and Hadoop clusters, running in containerized environments;
  • Deliver a secure connection to large datasets in their HDFS data lake and/or NFS enterprise storage;
  • Automatically mount NFS as local POSIX file systems to containers for shared storage; and
  • Deploy these environments on-premises, in multiple public clouds (AWS, Azure, and GCP), or in a hybrid architecture

Deploying H2O on Containers with BlueData

To show how this works, here’s a screenshot from the BlueData EPIC App Store with some example Docker-based images for H2O and other common tools.

As illustrated in this image, H2O is one of the many machine learning, deep learning, analytics, and data science environments that can run on the container-based BlueData EPIC platform.

Within a matter of minutes, data science teams can create an H2O cluster in a containerized environment — whether for open source H2O, H2O Sparkling Water (with Spark), or H2O Driverless AI.
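Once an open source H2O cluster is running, a data scientist would typically attach to it from a notebook. The sketch below shows roughly what that could look like with the H2O Python client. The host name, dataset path, and `is_fraud` column are hypothetical placeholders, and `h2o_endpoint` and `train_fraud_model` are illustrative helpers rather than part of H2O or BlueData EPIC. The actual endpoint for a cluster would come from the BlueData EPIC UI or API.

```python
H2O_DEFAULT_PORT = 54321  # default REST port used by open source H2O


def h2o_endpoint(host, port=H2O_DEFAULT_PORT, scheme="http"):
    """Build the base URL of a running H2O node (illustrative helper)."""
    return f"{scheme}://{host}:{port}"


def train_fraud_model(url, data_path, label="is_fraud"):
    """Attach to an existing H2O cluster and fit a simple classifier.

    Assumes the h2o package is installed in the notebook environment and
    that data_path points to a dataset the cluster can read.
    """
    # Imported lazily so h2o_endpoint() works without H2O installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.connect(url=url)                     # attach to the running cluster
    frame = h2o.import_file(data_path)       # load the data into the cluster
    frame[label] = frame[label].asfactor()   # treat the label as categorical

    train, valid = frame.split_frame(ratios=[0.8], seed=42)
    features = [col for col in frame.columns if col != label]

    model = H2OGradientBoostingEstimator(ntrees=50, seed=42)
    model.train(x=features, y=label, training_frame=train, validation_frame=valid)
    return model, model.auc(valid=True)


# Hypothetical host name; replace with the endpoint of your own cluster.
url = h2o_endpoint("h2o-node-1.example.com")
print(url)
```

Calling `train_fraud_model(url, "hdfs://namenode/path/to/transactions.csv")` would then train a gradient boosting model on the cluster and return it with its validation AUC. The path shown is a placeholder, not a real location.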

The screenshot below shows the creation of a new H2O Driverless AI cluster from the BlueData EPIC web-based UI:

Access to all the endpoints for the new H2O Driverless AI cluster is provided via the BlueData EPIC API:

The H2O Driverless AI environment can be automatically configured with access to HDFS and other data sources via secure Kerberized connections — as shown in the screenshot below — using BlueData EPIC’s DataTap functionality:

Now that it’s up and running, data scientists and developers can analyze, model, and visualize data using Driverless AI on the container-based BlueData EPIC platform:

An Accelerated Path to Distributed Machine Learning with BlueData and H2O.ai

With BlueData + H2O.ai, you can quickly spin up distributed environments for open source H2O, Sparkling Water, and Driverless AI on containers – and get started with your machine learning algorithms – in just a matter of minutes:

  • You can provision containerized H2O clusters on-demand, via BlueData EPIC’s web-based UI or RESTful APIs;
  • You can utilize H2O libraries and pre-integrated Spark clusters that BlueData EPIC provides out-of-the-box;
  • You can become immediately productive with integrated notebooks, such as Jupyter;
  • You can expand, shrink, stop, and restart clusters as needed — providing ultimate flexibility and elasticity;
  • Your clusters are deployed as containers, so you can run the same environments either on-premises or in the public cloud;
  • Your containers can be automatically configured for secure login through integration with corporate LDAP / AD servers;
  • You can access datasets from existing storage, through secure connections to HDFS, NFS, and other data sources; and
  • You have the unique ability to securely spin up, manage, and use all the components you need for a distributed pipeline: a pre-processing cluster, an analytical / scientific computing cluster, and storage in a columnar database or HDFS.

With BlueData + H2O.ai, you now have a fast and economical path to large-scale distributed machine learning on containers — to deliver faster time-to-value for your AI initiative. To learn more and see this in action with a demo, check out this recent presentation from H2O AI World in London: