The beginning of the last decade saw the rise of Data Science. Multiple companies – including some leading brands – started employing Data Scientists. Businesses spent millions on developing this field at their organizations… and often contributed to some of the biggest MLOps failures in history.
First of all, back then, the relevant business tooling was very basic (or nonexistent), since Data Science had only just started emerging from the academic world. Funnily enough, even industry leaders didn’t fully understand how to make the most of their new information experts, getting nearly nothing out of their gigantic investments.
Luckily, the current data science landscape is completely different. Thanks to managed services and libraries, setting up an effective working environment for your Data Scientists is not an impossible challenge – quite on the contrary. So much so, that it can be explained in one (not even that long) blog post!
So, in this entry, I’m going to show you how, with a fairly simple combination of tools (that are usually used separately), you can make certain that your Data Scientists will quickly deliver value.
But First: Why Are the Right Tools so Important for Data Scientists?
If you’re not completely new to Data Science, you probably know that this field has specific challenges. For example, in software engineering, calling the same function three times with the same arguments delivers the same result each time. In Data Science, without the right environment, this is not the case: rerunning an experiment may not reproduce its results.
In short, without the right tools, projects can get lost, and experiments are unreliable.
With the right tooling, this won’t be an issue; a well-equipped working environment should guarantee that your Data Science department will be able to work:
- more effectively,
- on bigger data sets (multi-terabyte sets),
- with freedom of experimentation and confidence of replicating achieved results.
And to be more precise, with the work environment from this article, your Data Scientists get:
- a processing platform to handle high data volumes,
- enough computing power, with as little waste as possible,
- a code repository to store and share code,
- a way to keep track of models, parameters, and results,
- artefacts of old experiments,
- experiment results and comparison,
- effortless results visualization.
To sum up: you will never have your Data Scientists exchange notebook code via email again!
Embracing the Five Ideals
Before we jump into the specifics, I would also like to introduce you to an interesting concept that will prove useful from the perspective of our project. In his latest book – “The Unicorn Project” – Gene Kim introduced The Five Ideals of great software development work. These Ideals are:
- 1st Ideal: Locality and Simplicity
- 2nd Ideal: Focus, Flow and Joy
- 3rd Ideal: Improvement of Daily Work
- 4th Ideal: Psychological Safety
- 5th Ideal: Customer Focus
In short, they underpin what is required to create better value – sooner, safer, and happier. And since Data Scientists also rely on code, they deserve no less than Software Developers, in my opinion. That’s why I’m going to put a special emphasis on embodying these Ideals in the working environment I’m going to show you.
Let’s Start: The Kick-Off
So, let’s start. But wait… didn’t we forget something? Yes! We need a data set and a data science problem to work on. For this project, I’ve chosen the Predict Blood Donations warm up from DrivenData. It’s a small, straightforward dataset which will be perfect for our prototype (and maybe it will even convince you to donate blood – remember, you can save lives).
As a platform of choice, we’ll use Microsoft Azure. It offers a Managed Service called Azure Databricks, which is a Cloud-based, managed, scalable Spark cluster with nearly unlimited capabilities in processing power and storage – and all that for a decent price. The Spark cluster works with multiple programming languages: Python, R, Scala, Java, or SQL.
Most importantly, Azure Databricks will provide us with a few key capabilities:
- Data storage for raw data and partial results in columnar format,
- Spark Cluster with enough capacity,
- Databricks notebooks to run code.
Now, start by creating a cluster with enough capacity – but don’t waste too much time on configuration; you can change everything later on (except “Cluster Mode”). That way, we’re also embracing the previously mentioned 1st Ideal.
After hitting “Create Cluster” (and waiting a few minutes), you get a Spark cluster in the Running state. We’ll use this cluster at all stages of this project. Luckily, it’ll scale up or down, adjusting to your needs.
To have data to work on, now’s the time to upload it and create a table. A huge part of this process is assigning data types to table columns. Luckily, Databricks can help us out and do this boring task for us – simply allow it to “Infer schema” (and remember to double-check everything afterwards; although Databricks is pretty good at guessing the correct types, let’s not leave too much to luck). Ultimately, it’s a great example of the 2nd Ideal at work; we can focus on things that generate value and leave the repetitive, cumbersome tasks to computers.
The First Experiment
The cluster is ready, and the data is in place. It’s time to create a baseline for the project.
A simple model in PySpark will suffice. Load the data, create a pipeline, train the model, and then evaluate it… Ready, steady, go!
So, we’ve built a model; now it’s time to check the results. All data from this execution is already part of our run (which is an execution of the notebook and is stored in the form of an experiment log). So, just go to “Runs” in the top right section and explore cross-validation with grid search for parameters.
Each Run has an aggregate row whose “Source” points to the notebook (at the exact version that was executed). If the run was successful, the aggregate row is marked with a green tick. If not (errors), it’s red.
Each of the cross-validation executions has the parameters maxDepth and numTrees (Random Forest Classifier parameters) and the resulting F1 metric. One experiment can consist of thousands of runs, which can be filtered with SQL-like expressions. Each column is sortable, so you can extract only the information you need. Thanks to these two features, you can easily focus on the relevant “Runs”.
If you need to run a different experiment, you can use different models within the same pipeline; thanks to identical metrics, all models can be compared in the same environment.
Moreover, a really handy feature is comparing “Runs” within one experiment. This simplifies comparing parameters and achieved results. Each model can be cross-checked with other ones in an easily understandable way.
Finally, found at the bottom of the report, the Scatter Plot is a good place to analyze the correlation between parameters and metrics. The number of trees in RandomForestClassifier impacts the performance of the model; with the X-axis as the numTrees value and the Y-axis as the F1 metric, there’s a clear correlation between them. And what’s very handy, these graphs can be exported to PNG and support zooming and rescaling, which is perfect for a detailed look. The bar plot also allows reviewing the progress of metrics while comparing “Runs”.
Now’s the time to embody the 3rd Ideal – Improving Daily Work.
In most projects that involve coding, especially when more than one person is involved, there’s a need for code sharing and versioning. To have such an option in your new environment, add a Git integration in User Settings and select the Git provider.
Additionally, you can add notebooks to version control with “Revision history” and connect the notebook in Azure Databricks with the notebook stored in Git.
Now, our code is safe in the code repository! With its help, collaboration between your Data Scientists will be easier, and code reviews become possible; any changes to the code can now be reverted without losing valuable experiment results or the code itself. With this step, we’ve also achieved the 4th Ideal – not fearing experimentation, embracing new lessons, and, most importantly… having an environment to learn from your mistakes safely!
Finally, with this done, we only have one Ideal left – the 5th – which is all about aiming at better Customer Value. To achieve it, you should introduce more custom metrics – for example, the false-positive rate. In short, the default capabilities of MLflow can be enough for establishing the baseline, but you need more to be able to improve quickly and iteratively. To get there, you can log additional metrics with mlflow.log_metric("custom_metric", value); e.g., adding the mentioned false-positive rate would be mlflow.log_metric("fp_rate", value).
What’s very handy, metrics let you track key performance indicators (KPIs) – and this enables you to evaluate all your decisions based on what’s most important to your customers, bringing you close to the abovementioned focus on Customer Value.
One last hint: remember that the parameter length is capped at 500 characters. If you need to store longer values, try using artefacts and storing them as files (e.g., classification reports from sklearn).
We all have our preferences, and Data Scientists are no exception. Most of them are set on using one specific IDE; without it, they can lose efficiency and the ability to prototype rapidly. Introducing a specific IDE will be a further improvement within the 2nd Ideal.
Luckily, all the code is in Git and the repository can be cloned to any device, so the groundwork has already been done. Databricks allows using the Spark cluster by connecting your local machine to it; as a result, code written locally executes remotely on the Azure Databricks cluster. The integration component is called databricks-connect (and the documentation is available here: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect).
As a side note, remember never to keep plain-text secrets in the code – instead, environment variables are a decent place for them (like the Databricks API token).
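A tiny sketch of that habit (the variable name DATABRICKS_TOKEN is illustrative; check the databricks-connect documentation for the exact names your version reads):

```python
import os

# Read the API token from the environment instead of hard-coding it.
token = os.environ.get("DATABRICKS_TOKEN", "")
if not token:
    print("DATABRICKS_TOKEN is not set -- export it in your shell first.")
else:
    print("Token loaded (not printing it, obviously).")
```

Because the value never appears in the code, it also never lands in Git history.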
Next, you also need to change the configuration of the Databricks Cluster; be prepared for a restart of the server after adding the Spark configuration to it.
With a correctly configured integration (the databricks-connect test command will be handy!), it’s enough to add a single line of code to celebrate success. Now, the notebook previously working only on Azure Databricks will run from the IDE, using the same Spark cluster.
The Third Time’s a Charm – Improve Once More
In truth, the above statement is not 100% correct – because you should never stop improving. But from the perspective of our blog entry, this is the last upgrade we’ll be adding.
We can now add *.py files to Spark job executions (yes, using Python modules is possible with Databricks)! To add a new import dependency to Spark, importing alone won’t be enough – each of the files needs to be distributed to all nodes performing an operation. Add the needed modules to the Spark Context with sc.addFile("module.py") and sys.path.insert(0, SparkFiles.getRootDirectory()). Now, the import will work again.
What’s more, code split into small and meaningful parts contributes to a codebase that is easy to extend and manage. A single, super long notebook is hard to work with and forces Data Scientists to waste a lot of time searching for the right code to inspect. Data can be stored efficiently in data tables between notebooks. Unique table names provide immutability of the data and allow going back to old experiments – or invoking old experiments with new data.
And That’s Your Data Scientist Environment Done!
Leveraging all the above tools will enable your Data Scientists to rapidly prototype and deliver value to customers. It’s a good starting point for a Data Science project and can be adjusted further to individual needs. Conveniently, the most cumbersome tasks of creating this environment are already half-done by the managed service or libraries.
Now, your Data Scientists can start becoming more productive (and happier) while delivering customer value!