Witness how the hero of our story – The Corporation – joined the Data Revolution.
In 2015, the High-Level Panel appointed by Ban Ki-moon, Secretary-General of the United Nations, expressed the need for a “data revolution” in the context of the Millennium Development Goals (MDGs).
This data revolution was supposed to empower evidence-based decision making. As of 2021, it’s already embodied in many organizations. Today, even small and medium enterprises can enjoy the benefits of data-driven decision making (DDDM) – so much so that the phrase “data is the new oil” has already become a trope.
However, the road to becoming a modern, data-oriented enterprise is bumpy. Many challenges await on the way. Often, companies don’t even know where to start the journey.
So, to help you navigate your way through this path, today I’m going to tell you a story. It’s going to be different from what you might expect from a blog post. If you know the works of Gene Kim (and I’m sure you do!), the construction of this tale won’t surprise you. But don’t worry – I didn’t write a novel!
Without further ado, let’s dive into the mythical land of excels and big data.
4 Steps to Becoming Data-Driven
The main hero of our story is a Corporation. It’s operating in retail and keeps rapidly growing.
For some time now, managing the company has become increasingly difficult. When it was established 20 years ago, it started with a single store. Now, however, the Corporation is present in multiple locations around the globe, including franchise stores and some recently acquired brands.
As a result, daily operations generate terabytes of data. The cost of storing this data grows, yet its usefulness shrinks. Reports are filled with stale data and lack a general overview. The Corporation needs to do something about this quickly – otherwise, these problems will keep growing and will likely get out of control soon.
From a technological point of view, the Corporation’s systems are anything but homogeneous. After many years of growth and acquisitions, its IT ecosystem is a conglomerate of parts from over 10 companies. This poses a big threat to data quality and ownership. The data is not centralized – granting access to new employees requires adding them to 5 different authentication systems. And with operations spanning multiple regions, all these challenges make utilizing this data tricky.
Up until now, daily sales reports were generated manually by a team located in Australia (with the HQ in the UK). However, since sales results are growing swiftly, the spreadsheets are getting gigantic, and loading times are rising accordingly. Creating these reports consumes a lot of time, and doing them manually naturally causes a lot of errors. In the past, some data discrepancies even led to wrong business decisions. More mistakes will likely arise, since the Australian team exports the data without a single point of overview – they don’t even get a confirmation of whether the whole export was successful.
So, in this tricky case, what should the Corporation do to see its terabytes of data as an advantage – and not a liability?
Step 1 – Collecting and Visualizing Data
Let’s start with something fairly straightforward. It would be beneficial to reorganize the Corporation’s data into a comprehensible form. It would be wise to set a reasonable but ambitious goal. In our case, the Corporation could, for example, decide to get automated daily reports with a summary of all the sales from the previous day.
GOAL: Daily reports of sales from the previous day – generated in 4 hours or less.
There are a few challenges that the Corporation needs to overcome. The first major problem is the distribution of data – it’s spread all over the world. Moving all the data to a single location is out of the question; the cost and time needed to transfer it would be too high. Instead, the Corporation needs a place that stores and preprocesses the data per market, ideally processing it as close to the storage location as possible. Each per-market report should be created where the data is stored, eliminating data movement (which costs time and money).
To achieve this, the Corporation introduced an ETL process that automated the precomputation of sales data. Data from multiple sources is now reconciled, and data quality checks were added to the process, minimizing the number of data errors.
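To make this concrete, a data quality gate inside such an ETL process could look roughly like the sketch below. It is a minimal illustration, not the Corporation’s actual pipeline; the field names, accepted currencies, and error-rate threshold are all hypothetical:

```python
def validate_sale(record: dict) -> list[str]:
    """Return a list of data quality issues found in a single sales record.
    The field names here are hypothetical examples."""
    issues = []
    if record.get("store_id") is None:
        issues.append("missing store_id")
    if not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0:
        issues.append("invalid amount")
    if record.get("currency") not in {"GBP", "AUD", "USD", "EUR"}:
        issues.append("unknown currency")
    return issues

def run_quality_checks(records: list[dict], max_error_rate: float = 0.01):
    """Split records into clean and rejected sets; abort the whole load
    if the share of bad records exceeds the threshold."""
    clean, rejected = [], []
    for record in records:
        (rejected if validate_sale(record) else clean).append(record)
    error_rate = len(rejected) / max(len(records), 1)
    if error_rate > max_error_rate:
        raise ValueError(f"error rate {error_rate:.1%} exceeds threshold")
    return clean, rejected
```

The useful property of a gate like this is that it fails loudly: instead of silently loading broken rows into the warehouse (the old spreadsheet problem), the load stops and someone gets a clear signal.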
The results are stored in a DWH (data warehouse) per region, and the UK merges the regional results. Thanks to the automation, the rollout to the rest of the world is straightforward – new reports can be created one region at a time. A BI tool delivers the reports swiftly, presenting the results per market with both a general overview and drill-down to the details. There is an additional report dedicated to top management that merges results from different regions; this was a big challenge for the Data Analytics team – yet it was worth the extra effort.
Now, after only 4 hours, automated reports with relevant data are available for Board Members. Data from sales and clients doesn’t leave the region it was generated in… and regulators are happy about it!
The new data visualization helps with decision making; automated reports can be exported on demand. The BI tool provides a layer of security and user access control so that no unauthorized people can access the data.
A great start, isn’t it?
Step 2 – Streaming Data Processing
The Corporation can now celebrate its first data success. With the collection and visualization solution, the company can steer its strategy with data-driven techniques. The selected metrics are broadly visible, and the analytics tools help with making (and explaining) business decisions.
And since everything is working great, it’s time for the next step.
The Corporation now wants to incorporate new data sources. The goal is to get an even better overview of global operations. This time, however, unlike with the automated reports, a 4-hour delay won’t cut it.
One of the requested features is a notification that a given product is missing from a store. Speed is a crucial factor; with a quicker response, the goods can be delivered faster, and more items can be sold. After all, to restock a store, the shipper needs to schedule a new transport, deliver goods from another store, and contact a salesperson with further instructions.
Goal: Creating a way to notify the personnel when a critical alarm is raised (and allowing them to mark the alarm as resolved).
First off, every POS (Point of Sale) creates a stream of events about purchases in the shops. Thanks to the First Step, the information about items in the store warehouse is available in the Data Lake. Based on this information, an alarm can be triggered when a product is no longer available in a shop. Every event arriving on the stream is also stored in the Data Lake for future analytics purposes; every single action is tracked. Data-driven companies measure their processes and use this knowledge for improvement.
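The core of the alarm logic can be sketched in a few lines. This is a simplified, in-memory illustration – a real deployment would consume the events from a streaming platform – and the event shape (`store`, `product`, `quantity`) is an assumption for the example:

```python
def process_pos_event(event: dict, stock: dict, alarms: list) -> None:
    """Apply one purchase event to the stock view and raise an alarm
    when a product runs out. Hypothetical event shape:
    {"store": "UK-001", "product": "sku-42", "quantity": 1}"""
    key = (event["store"], event["product"])
    stock[key] = stock.get(key, 0) - event["quantity"]
    if stock[key] <= 0:
        alarms.append({"store": event["store"],
                       "product": event["product"],
                       "status": "OPEN"})

def resolve_alarm(alarms: list, store: str, product: str) -> None:
    """Let the personnel mark an alarm as resolved, e.g. after restocking."""
    for alarm in alarms:
        if alarm["store"] == store and alarm["product"] == product:
            alarm["status"] = "RESOLVED"
```

The point of the streaming approach is that this logic runs per event, the moment the purchase happens – not once every few hours on a batch export.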
Now, a new shipment dashboard is available, delivering quick notifications about alerts. Additionally, the DWH and Data Lake provide a single overview with all the information required to resolve the alarms.
So, great news – the Corporation has now gained its first experience with streaming technologies! The data produced live in one system is used in near real-time for analytics purposes. Numerous new possibilities are now available to explore; if processing data in batches or with a delay is not an option – streaming is the answer!
Step 3 – Advanced Analytics/Big Data Processing
The Corporation is getting more and more data-driven. The previous successes convinced Management that data in their company is precious and should be used frequently. Moreover, since employees experience the direct benefits in their daily work, they also started putting more care into collecting data.
Some time has now passed since the beginning of the data transformation, and information from many new data sources found its way to the Data Lake. The growth of the Corporation and its business is accompanied by extensive data collection.
After talking to the managers, the Data teams selected the 30 most needed and most time-consuming report operations in the company. They tried to implement a few of them within the current setup, but these analyses were so expensive that the results were obsolete before the computation finished. Because of this, some extremely valuable reports were assumed impossible to get.
Goal: Enabling heavy reports daily or on demand, delivered within 4 hours of the request. Outputs from these reports should be available via a visualization (BI) tool.
To overcome this challenge, the Corporation introduced an MPP (massively parallel processing) platform, with Apache Spark as the engine of choice. When a report is requested, a huge cluster spins up and starts processing petabytes of data. The cloud provider supports creating clusters on demand and shutting them down after the operations are done; cost reduction and automation are a big part of this plan.
After the results are saved, the cluster is shut down. A second, smaller cluster is used only for smaller on-demand queries (which now take minutes instead of hours). Stored reports are visualized with the company’s BI tool.
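The “spin up, compute, tear down” pattern is the heart of the cost savings, and it can be sketched as a context manager. The `submit_fn` call below stands in for a real cloud provider API – the names are hypothetical, and the fake API exists only so the sketch runs end to end:

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_cluster(size: int, submit_fn):
    """Keep a cluster alive only for the duration of the job, and shut it
    down even if the job fails. `submit_fn` is a stand-in for a real
    cloud provider API (hypothetical)."""
    cluster = submit_fn("create", size=size)
    try:
        yield cluster
    finally:
        submit_fn("delete", cluster_id=cluster["id"])

# A fake provider API so the sketch is runnable without a cloud account.
calls = []
def fake_api(action, **kwargs):
    calls.append(action)
    return {"id": "cluster-001"} if action == "create" else None

with ephemeral_cluster(size=100, submit_fn=fake_api) as cluster:
    # In reality: spark-submit the heavy report job against `cluster`.
    report = {"cluster": cluster["id"], "status": "saved"}
```

The `finally` block is what guarantees the cluster never keeps burning money after a failed job – the deletion runs whether the report succeeds or raises.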
Finally, all heavy reports were moved to Spark. Reports that were impossible to create in the past are now timely delivered.
Additionally, a graphical interface – Jupyter Notebook with Spark engine support – was provided to the BI and Data Engineering teams to develop new ideas around the data collected in the Data Lake. New ideas are coming to life, since SMEs (Subject Matter Experts) and Data Experts can now quickly iterate and prototype solutions.
Step 4 – Enrich Data and Operate ML Models on Production
The Corporation is getting there! It’s nearly ready to become a mature, data-driven enterprise.
Advanced analytics was the first step towards handling data at scale. The natural next step is Machine Learning – training models and operating them in production like any other piece of software.
Since Jupyter Notebooks were deployed as part of the Big Data initiative, the newly hired Data Scientists can leverage the data from the Data Lake and the computation power of the Big Data platform to create machine learning models that solve business problems. The Data Scientists ran experiments predicting sales volumes to minimize the number of times shelves are empty – which increases sales significantly.
Still, challenges arise – code sitting in a Jupyter notebook is not generating any value. Until it is deployed to production and used daily, no value is created. The Data Science team struggles with managing, hosting, and operating their models in the production environment.
Goal: Moving the Jupyter Notebook code from experiments to production – and kick-starting its use.
To achieve the above goal, the Corporation introduced a new approach to all Data Science projects. Thanks to MLOps tools, and by splitting projects into an experiment phase and a production phase with different tooling, the company gained the ability to solve business problems not only in notebooks but also in production. All the code provided by the Data Science team was moved to a code repository, and the notebooks were enhanced with an experiment-tracking mechanism that records the results of every execution.
Now the most valuable models will never get lost, and experiments are reproducible. If needed, Data Scientists can access additional computation resources such as GPUs for Deep Learning; these resources are spun up on demand, only for the period they are needed. Distributed training reduced the time to train models from multiple hours to minutes and improved their quality by searching a larger hyperparameter space.
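The idea behind experiment tracking is simple enough to sketch in a few lines. Real teams would use a dedicated tool (MLflow, for example) rather than this toy class, but the principle is the same: every run records its parameters and metrics, so the best model can always be found and reproduced:

```python
class ExperimentTracker:
    """A toy stand-in for an experiment-tracking tool: every run records
    its parameters and metrics, so results are reproducible and the best
    model is never lost."""
    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict) -> None:
        self.runs.append({"params": params, "metrics": metrics})

    def best_run(self, metric: str) -> dict:
        """Return the run that scored highest on the given metric."""
        return max(self.runs, key=lambda run: run["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
tracker.log_run({"lr": 0.01, "depth": 5}, {"accuracy": 0.87})
best = tracker.best_run("accuracy")  # the run with lr=0.01
```

Because the winning parameters are on record, “which settings produced the model we shipped?” stops being a question answered from memory.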
Data Scientists can promote experiments to production using an MLOps pipeline with a human in the loop. This means that each model created by the Data Science team can be promoted to production after passing a threshold for quality and compliance checks. Additional human validation of the model before deployment to production is a mechanism that builds up trust within the company.
Models are hosted in a Kubernetes cluster as endpoints consumed by applications – secured and monitored. A/B tests and canary releases were also introduced, allowing Data Scientists to test their models in production before a full user roll-out. Imagine being able to run experiments with a new model on a small subset of user requests, then decide – based on this early feedback – whether the model should be promoted from a canary release to a full release. A/B tests play a big role in deploying ML models to production with great certainty about the result.

The hosted model is treated like any other application and monitored accordingly. Additionally, drift detection was introduced to retrain a model when its performance degrades, sending an alert with the relevant information to the Data Scientists. Drift occurs when an ML model starts to perform poorly; the reasons are numerous and require further analysis, but one of the most common is performance degradation over time – luckily, in most cases, this can be solved with model retraining.
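Both mechanisms – canary routing and drift detection – reduce to surprisingly small pieces of logic. The sketch below is illustrative only; the function names, the 10% canary share, and the accuracy threshold are assumptions for the example:

```python
import hashlib

def route_request(user_id: str, canary_share: float = 0.1) -> str:
    """Deterministically send a fixed share of users to the canary model.
    Hashing the user id keeps each user on the same variant across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_share * 100 else "stable"

def detect_drift(recent_accuracy: list[float], baseline: float,
                 tolerance: float = 0.05) -> bool:
    """Flag drift when the rolling accuracy drops well below the baseline –
    in the story, this triggers retraining and alerts the Data Scientists."""
    rolling = sum(recent_accuracy) / len(recent_accuracy)
    return rolling < baseline - tolerance
```

Hashing (rather than random sampling per request) is the key design choice for the canary: a user who lands on the new model stays on it, so their experience is consistent while the experiment runs.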
As a result of these actions, experiments can now be executed and tracked freely, the models are hosted in production, and new insights are generated daily. MLOps speeds up the creation of new models and their deployment to production. Monitoring and drift detection, together with automated retraining, builds the stakeholders’ confidence in the ML-based solution. From now on, Data Scientists can use their well-known notebooks to experiment, develop, train, and deploy models to production within minutes. More and more business problems will be solved this way!
You should know that not every step is always a necessity. Depending on your business and operating model, one or more of them may turn out to be redundant.
Another important thing to keep in mind is that the journey to data-driven decision making is like climbing a mountain. Along the way, every step becomes more difficult. However, this increased difficulty comes with a greater reward; each of these steps delivers a higher return on investment and, in the long term, progressively lowers the operating costs.
The Corporation Has Matured Greatly – Now It’s Time for Your Company!
Finally, the efforts of our Corporation have paid off. From an outdated organization with manual (and error-prone) reporting, it has grown into a respectable enterprise that knows how to leverage data and modern technology to make the best decisions, boost revenue, and identify new opportunities.
If your organization is more like the Corporation at the beginning of its data journey than in its new form, you should also consider starting your journey towards data-driven decision making. Sure, this road is not easy; it should be taken step by step, and each step needs to be followed with validation and deliverable artefacts. But as our Corporation proves – it’s worth it.
Picking the right tools is crucial for success. The right tooling will solve your issues; the wrong tooling will create high costs without delivering the expected benefits.
Partnering up with a technology partner can also make this road much easier. If you’d like to hear more, book a consultation with me (free of charge) – I’ll gladly answer all your questions!