The idea of the data lake is relatively new – the term itself was first used in 2010. Interestingly, five years later, only 1.12% of respondents felt that this new concept was sufficiently defined and consistent on a detailed level. What’s more, the late Dave Needle (Amiga 1000 co-chief architect) characterized the “so-called data lakes” as “one of the more controversial ways to manage big data”. And in 2016, Forbes published an article entitled “Why Data Lakes Are Evil”.
However, the response to data lakes hasn’t been universally bad. Quite the contrary. Many have viewed data lakes as a sort of panacea – a magical solution that can solve all of their data problems.
So, a question arises – which perspective is closer to the truth? And is the situation in 2021 different from what it was a few years ago?
Let’s find out!
Data Lake Concept – What is a Data Lake?
The data lake concept was developed as an answer to recurring complaints about data storage. The new idea aimed to bring data from multiple business applications and data systems together in one place, in raw form, for future structuring and processing (which is immensely useful for operating data pipelines). In short, the data lake promised a one-stop repository where structured and unstructured data could be fast-tracked into business insights.
And whereas the popular data warehouses tend to force organizations into narrow data paradigms and silos, the data lake emphasizes a more holistic and expansive view of analytics. Data lakes emerged to fill the need for a scalable, low-cost data repository that would enable companies to store all data types easily, regardless of their source, and then analyze this data for evidence-based decision making.
So, why have they been viewed as “evil” in the first place?
Data Lake Technology Issues (And Flawed Perception)
One of the reasons why data lakes got so much negative press may be the fact that their purpose often is (or, hopefully, has been) misunderstood.
According to CIO, the controversy surrounding data lakes focuses on their perceived drawbacks. They are supposedly too difficult to manage, too unstructured and expansive. Nevertheless, it’s important to remember that data lakes offer key functionalities that make them uniquely valuable.
Data lakes are by no means a drop-in replacement for data warehouses. The two storage systems have different goals. Data warehouses are useful for storing structured data that has a clear purpose. As you know by now, data lakes mainly do the opposite. Because this difference is often misunderstood, some data lake projects fail: enterprises expect data lakes to be the answer to all of their problems. But they don’t necessarily have to be. If an organization has mainly structured data but decides to embrace data lakes because they’re a cool buzzword everybody in the industry is throwing around, then yes, somebody could say data lakes are nothing special and not worth the fuss.
Data Lake Storage – Benefits
So, let’s look closer at the key data lake benefits.
One of the crucial advantages of data lakes is that they store all data in one place at a low cost and in a scalable way. You can use the stored information any time you want, and you can wait until specific analytical needs arise without worrying that storage is costing you too much in the meantime.
Crucially, data lakes are immensely flexible when it comes to data ingestion. You don’t have to worry about structuring. You can start storing your data in its native format at any time, even while you’re still working on your ETL workload; you don’t have to wait until everything is complete.
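To make “storing data in its native format” a bit more concrete, here is a minimal sketch using only the Python standard library. It assumes a lake that is just a directory tree; the `raw/source=.../ingest_date=...` layout and the `ingest_raw` and `demo_ingest` names are illustrative assumptions, not a standard or any particular product’s API.

```python
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str,
               payload: bytes, ingest_date: date) -> Path:
    """Store a payload byte-for-byte in the raw zone, without parsing it."""
    # Hypothetical layout: one folder per source system and ingestion day.
    target_dir = (lake_root / "raw" / f"source={source}"
                  / f"ingest_date={ingest_date.isoformat()}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no schema, no transformation
    return target


def demo_ingest() -> list:
    """Ingest a CSV export and a JSON event into a temporary lake."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        day = date(2021, 3, 1)
        ingest_raw(root, "crm", "contacts.csv", b"id,name\n1,Ada\n", day)
        ingest_raw(root, "webshop", "order.json", b'{"order_id": 7}', day)
        # Return the stored files as paths relative to the lake root.
        return sorted(p.relative_to(root).as_posix()
                      for p in root.rglob("*") if p.is_file())


print(demo_ingest())
```

Note that the CSV and the JSON file land side by side without either being parsed. Structuring is deferred until an actual analytical need shows up.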
What’s more, in contrast to a data warehouse, a data lake excels at making large quantities of coherent data available to deep learning algorithms. And with a well-managed data lake structure, decision analytics and ML tasks can run much more quickly.
And, importantly, data lakes also offer wider data access. In many organizations, to be able to make data-driven decisions, executives need support in getting relevant data. For example, they have to ask a data department for a specific report. Data lakes have the potential to make data available to a whole organization. And the strength in that is hard to overlook.
Data Lake… or Rather Data Swamp!
The challenges related to data lakes are best illustrated by the data swamp – a highly disorganized data repository that’s pretty much useless. And beware – your data lake can turn into a data swamp without you even realizing it. When that happens, it becomes inaccessible or delivers little value.
The reason why data swamps exist is strongly rooted in the data lake’s biggest advantage – the fact that it can store all of your company’s data easily.
You can compare a data lake to a black hole: it will take everything you throw at it. However, that’s where the similarities end. Because while black holes can exist more or less forever and consume everything that crosses their path, data lakes are not that forgiving. At some point, if you feed them too much data, they’ll become unmanageable. And since the information stored in data lakes usually comes from all sorts of sources and is most likely unstructured, it will eventually turn into a chaotic, hideous cluster, unable to deliver any real benefits (or, in the worst-case scenario, it will even slow down operations that were supposed to be fast in the first place).
Luckily, you can avoid ending up with a data swamp. You only need to follow a few good practices.
First, collect less data. I know it’s tempting to store every bit of information – after all, it may become useful at some point, right? Well, not necessarily. Make sure you store only data that has a realistic chance of delivering value.
Next, make sure to use metadata – information that describes other data. Without metadata tags, it will be nearly impossible to find anything in your data lake, even if you know exactly what you’re looking for. Without a tagging system, you won’t be able to search for different kinds of data effectively… which will turn your data lake into a data swamp!
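As a rough illustration of what a tagging system buys you, the sketch below keeps a tiny in-memory catalog and looks datasets up by tag. The `Catalog` and `CatalogEntry` classes, the paths, and the tags are all hypothetical; a real lake would normally rely on a dedicated catalog service rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset stored in the lake."""
    path: str
    source: str
    tags: set = field(default_factory=set)


class Catalog:
    """A toy metadata catalog: register datasets, then search by tag."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, tags):
        # Normalize tags so that "Orders" and "orders" match each other.
        normalized = {tag.lower() for tag in tags}
        self._entries.append(CatalogEntry(path, source, normalized))

    def find(self, *tags):
        """Return paths of datasets that carry every requested tag."""
        wanted = {tag.lower() for tag in tags}
        return [entry.path for entry in self._entries
                if wanted <= entry.tags]


catalog = Catalog()
catalog.register("raw/webshop/orders_2021.json", "webshop", ["orders", "sales"])
catalog.register("raw/crm/contacts.csv", "crm", ["customers", "pii"])
catalog.register("raw/pos/receipts.parquet", "pos", ["Sales", "receipts"])

print(catalog.find("sales"))
print(catalog.find("sales", "orders"))
```

Without the tags, the only way to find the sales data would be to open each file and look, which is exactly how a swamp starts.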
What’s more, you should also set up a data cleaning strategy and automated processes to help maintain the data lake. If you do all of the above, you can sleep soundly.
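One possible building block of such an automated process is a retention sweep that flags datasets nobody has touched for a long time. The sketch below is only an assumption about how that could look: the `find_stale` function, the 365-day threshold, and the last-used dates are made up for illustration, and in practice the dates would come from your storage system’s access logs.

```python
from datetime import date, timedelta


def find_stale(last_used: dict, today: date, max_age_days: int) -> list:
    """Return paths whose last use is older than the retention window."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, used in last_used.items() if used < cutoff)


# Hypothetical "last used" dates per dataset.
last_used = {
    "raw/legacy/export_2019.csv": date(2020, 1, 15),
    "raw/webshop/orders_2021.json": date(2021, 2, 20),
}

stale = find_stale(last_used, today=date(2021, 3, 1), max_age_days=365)
print(stale)
```

A scheduled job could run a check like this and send the flagged paths to their owners for review, rather than deleting anything automatically.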
Data Lake Challenges in 2021
Returning to our initial question – how are data lakes perceived in 2021?
Luckily, I have a great way to answer. With the use of data!
Experts predict that the data lake market will grow from $7.9 billion in 2019 to $20.1 billion by 2024, with a shift to cloud-based platforms. A separate estimate reported by Business Wire values the data lakes market at USD 3.74 billion in 2020 and expects it to reach USD 17.60 billion by 2026, at a CAGR of 29.9% over the forecast period 2021 – 2026. The two sources disagree on the absolute size of the market, but both point to strong growth. So, clearly, data lakes are nowhere near dead. They’re also viewed as essential for certain kinds of projects or organizations. Start-ups are one example: for them, a data lake can be an economical option that still supports their operations with solid analytical capabilities.
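If you want to sanity-check growth figures like these yourself, the compound annual growth rate is simple to compute. The sketch below uses the standard CAGR formula; the `cagr` helper is my own, and the number of compounding years for each forecast (five and six) is an assumption about how the periods are counted, not something the reports state here.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of USD.
first_forecast = cagr(7.9, 20.1, years=5)     # 2019 -> 2024
second_forecast = cagr(3.74, 17.60, years=6)  # 2020 -> 2026 (assumed window)

print(f"{first_forecast:.1%}")
print(f"{second_forecast:.1%}")
```

Under these assumptions, the first forecast implies roughly 20.5% growth a year and the second roughly 29.5%. That is a little below the quoted 29.9%, which suggests the report counts its base year or base value slightly differently.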
However, there’s one important challenge I need to emphasize. To manage data lakes, you’ll need skilled professionals. Otherwise, you may end up with a data swamp. If that happens, you’ll not only fail to get useful insights but will also pay for something that has no value at all (so, clearly, the low cost will stop being a benefit).
Embrace Data Lakes – But Also Have a Plan For Them
In 2021, data lakes are useful and popular. However, their bad press didn’t come from nothing. If ill-managed, data lakes can indeed be described as evil.
But if you keep these issues in mind and define a clear purpose for your data lake and the data it contains, your data lake can become a useful and economical way to store information and get valuable insights.
Thank you for reading!