Data Lakes: Keep Your Big Data Projects Out of the Swamp

Businesses are spending millions of dollars on Big Data initiatives (up to $41.5 billion in total by 2018, according to IDC), but the return on that investment is no sure thing.

What’s holding back the ROI? The IT infrastructure in most organizations today was not designed to handle Big Data workloads, the system requirements of tools in the Apache Hadoop ecosystem, or the changing needs of data scientists. Most existing infrastructure is rigid, complex, and expensive. While much of the recent Big Data spotlight has been on analytical applications and the data itself, positioning these projects for success requires new thinking about infrastructure as well.

New technology innovations have emerged with the promise of making Big Data and its underlying infrastructure more manageable, accessible, and useful. One approach, known as a data lake, has gotten a lot of attention. In theory, a data lake would hold large quantities and varieties of data, both structured and unstructured. It would store each dataset in its original, native format for later transformation. In other words, today’s data silos would be consolidated into a single, centralized data lake.
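To make “store now, transform later” concrete, here is a minimal schema-on-read sketch in PySpark. The paths, the clickstream dataset, and the field names are all hypothetical; this illustrates the general pattern, not any particular product’s implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Land the raw clickstream files in the lake untouched, in their
# native JSON format: no upfront schema, no ETL on the way in.
raw = spark.read.text("file:///incoming/clickstream/2015-06-01/")
raw.write.mode("append").text("hdfs:///lake/raw/clickstream/dt=2015-06-01")

# Transform later (schema-on-read): parse the JSON only when an
# analyst actually needs a structured view of the same bytes.
events = spark.read.json("hdfs:///lake/raw/clickstream/dt=2015-06-01")
events.filter(events.event_type == "purchase") \
      .groupBy("product_id") \
      .count() \
      .show()
```

The point of the pattern is that the bytes written on day one are exactly the bytes read later; interpreting them is deferred to query time.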

Listen to this conversation with BlueData’s CTO Tom Phelan and Vice President of Products Anant Chintamaneni as they discuss the ideal of the data lake and what’s actually practical with today’s technology.

So should you hold off on the data lake concept? As Gartner VP and distinguished analyst Andrew White warns, a data lake can end up as a collection of disconnected information silos that happen to sit in one place. When that happens, the data lake turns into a data swamp. Other challenges include the need to replicate data from existing storage systems in order to fill the lake (which means more storage capacity and more IT budget), as well as the performance penalty remote users pay when accessing centralized data.

Here at BlueData, we’ve been working on new technology to simplify and streamline Big Data infrastructure. We’re focused on taming the complexity of Hadoop and related Big Data technologies, a burden most enterprises simply do not have the IT and development staff to handle.

Our approach provides the benefits of a data lake: working with all types of data, both structured and unstructured, and eliminating data silos by centralizing access to that data. But our solution virtualizes your Big Data compute and storage, so unlike the classic data lake concept, there’s no need to copy all the data into a single centralized repository. It’s effectively a “logical” data lake.
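To illustrate what “logical” means here, consider a sketch in generic PySpark (this is not BlueData’s actual product interface; all endpoints, paths, and table names below are made up) where a job reads each silo in place and joins across them, with no bulk copy into a new central store:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-lake").getOrCreate()

# Each DataFrame points at data where it already lives: an on-prem
# HDFS cluster, an S3 bucket, and an NFS mount. Nothing is copied
# into a new central repository.
hdfs_orders = spark.read.parquet("hdfs://prod-cluster/warehouse/orders")
s3_logs     = spark.read.json("s3a://archive-bucket/weblogs/2015/*")
nfs_exports = spark.read.csv("file:///mnt/finance/exports", header=True)

# Register the silos under one namespace: a single "logical" lake
# that queries can span without the data ever moving.
hdfs_orders.createOrReplaceTempView("orders")
s3_logs.createOrReplaceTempView("weblogs")
nfs_exports.createOrReplaceTempView("finance")

spark.sql("""
  SELECT o.region, COUNT(*) AS order_count
  FROM orders o JOIN weblogs w ON o.session_id = w.session_id
  GROUP BY o.region
""").show()
```

The data stays where it already lives; only the query results move.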

To learn more about the opportunity and challenges for data lakes, view this on-demand webinar from Taneja Group, featuring a panel of experts from BlueData, IBM, and VMware: “How to Avoid Drowning in your Big Data Lake”.

Ready to dip your toe in the lake and see for yourself? Check out our free trial.