Artificial Intelligence (AI) is today’s hot topic—and hot technology.
From news headlines to trade shows to boardrooms, everyone is talking about how AI can be used to transform entire industries and deliver groundbreaking business innovations. The use cases may be as mundane as automating the interaction with customers during entry-level support requests, or as sophisticated as features that assist drivers in avoiding accidents on the road. For each of these use cases, software development and data science teams are at work behind the scenes, supported by operations teams and IT infrastructure.
Over the past few years, I’ve spoken with many data science teams. I always ask them about their challenges and what would increase their productivity. Ultimately, they want to operationalize their work more quickly, connecting it to business processes for various use cases to drive business outcomes. In this context, operationalize means taking the components of a machine learning, deep learning, or predictive analytics model (i.e. the code, scripts, libraries, and metadata) and deploying them into a running state in “production.”
Slow model training processes hinder productivity
The answers I’ve received are often specific to each team’s unique circumstances and environment. But some consistent patterns come up repeatedly. In this blog, I’m focusing on one common to all data science teams: frustration with slow model training processes. I suspect this tops the list because it severely limits the amount of work that data science teams can take on, not to mention the quality of the outcomes they can deliver.
Model training processes can be slow for several reasons—but an important one is the need for access to the right compute resources. In an enterprise environment, the compute resources for model training must be able to process huge amounts of complex data (e.g. petabytes of images), which present punishing loads even for the best CPU-based systems commonly available to data science teams.
GPUs accelerate machine learning and deep learning
One readily accessible solution: use specialized compute resources and accelerators that are tuned and optimized for just these scenarios. The most common examples are GPUs, which were originally designed to support high-quality, real-time graphics. Since then, GPUs have evolved to become the computing accelerator of choice for compute-intensive applications such as machine learning and deep learning. Training the same model on a GPU rather than a traditional CPU can cut processing time from hours to minutes.
If this is the case, then why don’t organizations have GPUs available to everyone who needs them?
The short answer is that they try to. But in large enterprises with multiple data science teams, it’s not always that easy. GPUs are in high demand, infrastructure procurement can take months, and there’s no simple mechanism for sharing the existing GPU resources. The most common scenario is that the back-end infrastructure grows organically over time with siloed data science teams hoarding their personal GPU environments.
The problem is that, given the cyclical nature of a data scientist’s work and the all-or-nothing profile of training workloads (GPUs run at 100% utilization during training and sit at 0% otherwise), you could end up with your most valuable assets idle while other teams have projects that could benefit from the 100x GPU speed-up. The potential impact is significant: slower model development and reduced productivity from your data science teams due to underutilization of existing GPU investments.
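To make the siloed-versus-shared trade-off concrete, here’s a minimal back-of-the-envelope simulation. All the numbers (team count, GPUs per team, demand probability, job size) are hypothetical assumptions chosen for illustration, not measurements from any real cluster. Each hour, a team is either idle or wants to launch a training job that could use more GPUs than it owns; siloed teams are capped at their own hardware, while a shared pool lets busy teams borrow idle GPUs.

```python
import random

def gpu_hours_served(num_teams=4, gpus_per_team=2, hours=1000,
                     busy_prob=0.3, job_gpus=6, seed=7):
    """Compare GPU-hours of useful work served under siloed vs. pooled
    allocation. Each hour, a team is either idle or wants `job_gpus`
    GPUs at once (all-or-nothing training demand). Numbers are
    illustrative assumptions only."""
    rng = random.Random(seed)
    total_gpus = num_teams * gpus_per_team
    siloed = pooled = 0
    for _ in range(hours):
        # Cyclical, binary demand: 0 GPUs or a full training job.
        demands = [job_gpus if rng.random() < busy_prob else 0
                   for _ in range(num_teams)]
        # Siloed: each team is capped at the GPUs it owns.
        siloed += sum(min(d, gpus_per_team) for d in demands)
        # Pooled: demand draws on the whole cluster, so a busy team
        # can borrow GPUs that would otherwise sit idle.
        pooled += min(sum(demands), total_gpus)
    return siloed, pooled

siloed, pooled = gpu_hours_served()
print(f"siloed: {siloed} GPU-hours served, pooled: {pooled} GPU-hours served")
```

Under these toy assumptions, the pooled cluster always serves at least as many GPU-hours as the siloed setup, and noticeably more whenever a busy team’s job is larger than its private allocation—which is exactly the mismatch the utilization argument above describes.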
So what can you do about this? Watch my educational talk below, where I discuss this scenario and how HPE’s BlueData software can help your enterprise get the most out of your GPU resources to accelerate model training processes and improve the productivity of your data science teams.