Data engineers can have a wide array of responsibilities. Usually, their main job is to make data useful and accessible to other data professionals – like data scientists. To achieve that, data engineers operate a data pipeline, which, in short, produces useful information from raw data in an automated way. Apart from that, data engineers often have additional responsibilities in the fields of big data, MLOps or visualization. Yet, none of these tasks can be performed without the proper tooling. That’s why in this article, I’m going to take you on a journey through the core data engineering tools. With these tools, you’ll be well-equipped to complete all the key data engineering tasks.
See also: What is Data Engineering? | PGS Software
Data Engineering Tools – Overview
First off, it’s worth pointing out that the web is filled with hundreds of data engineering tools and technologies (do you know the website Is it Pokémon or Big Data?). As a result, you’ll find numerous extensive articles about “the top 20 data engineering tools” or “five essential data engineering tools for 2021”. However, I’d like to offer something different – by taking a more tailored approach.
I won’t list the most popular tooling. Instead, I’ll take you through the core data engineering tasks and suggest the tools that will support you best in these endeavors.
For the purpose of this guide, I divided data engineering responsibilities into 4 main fields. These fields include operating the data pipeline, handling big data, creating MLOps models, and visualizing your findings.
If you already have a general understanding of what a data engineer does, you probably recognized that these fields rarely come separate. For example, big data can be part of ETL (or ELT, which is the foundation of data pipelines). Similarly, visualization is also a key part of providing data access (to data scientists, BI analysts, or marketing departments), which is also one of the steps in the data pipeline. However, sometimes you’ll have to approach these responsibilities separately. And that’s why in this article I’m also going to discuss them independently.
So, let’s jump into the specifics!
Is Python Enough for Data Engineering?
Let’s start with this – to operate most of the tools we’re going to talk about, you’ll need to use a programming language.
As of 2021, Python is something of a market standard in data engineering. Python is widely useful in many areas. It’s also fairly easy to learn. However, a question often arises – is Python enough for data engineering?
The answer is: it depends.
For most solutions, Python will suffice. It’s unarguably the most popular programming language for data wrangling – the key responsibility of data engineers. Additionally, it’s also good for data science. Yet, it’s not a universal, one-tool-fits-all solution. For example, parts of the Spark Streaming API are available only from Scala or Java. What’s more, for especially complex, performance-critical work, it can also be beneficial to know C++ (but it’s not a must). And in IoT, C and Rust are often replacing Python.
TL;DR: Python should be enough for most data engineering activities – however, not for all of them.
Now, having that out of the way, it’s time to take a look at the first broad category of data engineering responsibilities – the data pipeline.
1. Data Pipeline
As I’ve mentioned, operating data pipelines is one of the key data engineering responsibilities. In short, a data pipeline is a set of actions that ingest raw data from disparate sources and move the data to a destination for storage and analysis. Through a data pipeline, data engineers can transform chaos into comprehensible information that will become useful for other professionals (to take business decisions, feed ML models, and many, many more).
To start, data engineers need data. They can get it by connecting to a data source that delivers information in an automated, repeatable way.
This kicks off the main process within a data pipeline – the ETL workflow.
The letters ETL stand for extraction, transformation, and loading. As soon as the sets of unprocessed information start coming in, the data engineer will need to clean (and possibly enrich) this data to make it useful. After all, raw data without a common denominator won’t be very useful (if at all). This task may include, for example, unifying date formats to enable comparisons. Finally, after this step is ready, the data can be stored, visualized (for instance, for marketing purposes), or used in another way.
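To make the transformation step concrete, here’s a minimal sketch in plain Python of the date-unification task mentioned above. The record fields and known formats are hypothetical, purely for illustration:

```python
from datetime import datetime

# Hypothetical raw records from two sources with inconsistent date formats.
RAW_RECORDS = [
    {"source": "crm", "signup_date": "2021-03-15"},      # ISO format
    {"source": "webshop", "signup_date": "15/03/2021"},  # day/month/year
]

# Formats we expect to encounter; extend as new sources appear.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def normalise_date(value: str) -> str:
    """Try each known format and return the date in ISO form."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")

def transform(records):
    """The 'T' in ETL: unify every record onto one date format."""
    return [
        {**record, "signup_date": normalise_date(record["signup_date"])}
        for record in records
    ]

clean = transform(RAW_RECORDS)
# Both records now share the ISO format and can be compared directly.
```

Once every record speaks the same format, downstream comparisons, joins, and aggregations become trivial.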
The terms ETL and data pipeline are sometimes used interchangeably; however, I wouldn’t treat them as the same thing. A data pipeline can simply be any process that transports data from one system to another – it doesn’t imply any sort of data transformation. From that perspective, the data pipeline is a broader term. Any ETL will be part of a data pipeline, but not every data pipeline needs to be an ETL workflow.
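The distinction can be made concrete with a toy sketch in plain Python (the function names are hypothetical): both of the pipelines below move data from A to B, but only the second one transforms it on the way, making it an ETL workflow.

```python
# A data pipeline is any sequence of steps that moves data from A to B;
# an ETL workflow is a pipeline whose middle step transforms the data.

def extract(source):
    return list(source)

def load(rows, destination):
    destination.extend(rows)
    return destination

# A pipeline without transformation: still a data pipeline, not ETL.
def copy_pipeline(source, destination):
    return load(extract(source), destination)

# The same pipeline with a transform step in the middle: now it's ETL.
def etl_pipeline(source, destination, transform):
    return load([transform(row) for row in extract(source)], destination)

moved = copy_pipeline([1, 2, 3], [])
cleaned = etl_pipeline([1, 2, 3], [], transform=lambda x: x * 10)
```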
And finally, let’s look at the tools I would recommend for operating ETL workflows.
Apache Airflow has become a widely popular tool for orchestrating and scheduling data pipelines for batch processing. It can also monitor progress of workflows that last several days.
Airflow’s biggest advantage is its built-in integrations, which enable you to run tasks quickly. As a result, you often need to write just one line to get a job done. In comparison, in Argo, which we cover below, you need to create a whole step in dockerised form. Needless to say, this takes way more time.
The tool is open source.
Learn more here: https://airflow.apache.org/
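To make the orchestration idea concrete, here’s a toy illustration in plain Python (deliberately not Airflow’s actual API) of the core concept behind an Airflow DAG: tasks with dependencies, executed in an order that respects them.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A toy dependency graph in the spirit of an Airflow DAG:
# each task lists the tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# An orchestrator like Airflow runs tasks in an order that
# respects these dependencies (and retries or alerts on failure).
run_order = list(TopologicalSorter(dag).static_order())
```

Airflow adds scheduling, retries, monitoring, and its integration library on top of this basic dependency-ordering idea.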
Luigi enables you to build complex data pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more.
Citing the GitHub description, Luigi addresses all the plumbing typically associated with long-running batch processes. If you’re chaining and automating many tasks, failures will happen. Typically, this issue is related to long-running things like Hadoop jobs, running ML algorithms, dumping data from or to databases – and, really, anything else. Luigi helps solve these issues.
The tool is open source.
Learn more here: https://github.com/spotify/luigi
Argo Workflows is a workflow engine for orchestrating parallel jobs on Kubernetes. It enables you to run compute-intensive data processing jobs swiftly.
The solution was designed from the ground up for containers without the overhead and limitations of legacy VM and server-based environments. As a result, the solution is technology agnostic – which means that you can run workflows implemented in any language.
And, additionally, Argo’s website has a cool graphic design. (But that’s a sidenote, and it shouldn’t affect your decision-making, I guess).
Argo is open source.
Learn more here: https://argoproj.github.io/
After the ETL workflow is operational and usable data starts coming in, the data engineer needs to make this data available to others. This stage is called the data access layer.
There are several approaches a data engineer can take here, depending on how the given data will be consumed. First off, if the data will be used by external parties, access can be granted via an API. If it’s for internal use, the data can be stored either on Data Warehouses (DWH) or on blob storage. In both cases, security and access control are mandatory things to be considered.
Blob storage is the optimal choice if you don’t need quick access and care about keeping the cost of storage low. For example, this storage choice is useful if you want to feed data to ML models for training. Blob storage is a form of cold storage.
Data warehouses are quite different. They’re designed to store, filter, extract, and analyze large collections of data. Uploading data to a DWH implies structure, and enables multiple visualization options, which blob storage doesn’t provide.
As a side note, it’s important to point out that during the ETL phase data is also stored. After all, it has to be somewhere. However, I’m discussing data access separately, since it exists independently of ETL workflows.
AWS S3, Microsoft Azure Blob Storage, Google Cloud Storage
These 3 solutions are examples of popular Cloud-based blob storage solutions.
All of them offer intelligent tiering, which enables you to lower the cost of storage even more. They’re a great option to archive large data volumes.
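The cost difference behind tiering is easy to see with some back-of-the-envelope arithmetic. The prices below are hypothetical, purely for illustration – check your provider’s current pricing:

```python
# Illustrative (not real) per-GB monthly prices for a hot tier
# versus an archive tier of blob storage.
HOT_USD_PER_GB_MONTH = 0.023
ARCHIVE_USD_PER_GB_MONTH = 0.001

def monthly_cost(terabytes: float, usd_per_gb: float) -> float:
    """Monthly storage bill for a given volume at a given tier price."""
    return terabytes * 1024 * usd_per_gb

hot = monthly_cost(100, HOT_USD_PER_GB_MONTH)       # 100 TB kept "hot"
cold = monthly_cost(100, ARCHIVE_USD_PER_GB_MONTH)  # same data archived
savings = hot - cold
```

Under these assumed prices, archiving 100 TB that’s rarely accessed cuts the monthly bill by more than an order of magnitude, which is exactly why intelligent tiering matters for large archives.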
These solutions aren’t open source.
See more here: https://aws.amazon.com/s3/
Amazon Redshift is a fully managed Cloud data warehouse solution designed for large-scale data set storage and analysis. It also enables large-scale database migrations. Redshift’s database connects SQL-based clients and BI tools, making data available to users in real time.
Overall, Amazon Redshift is a great alternative to on-premises DWHs.
Amazon Redshift is not open source – it’s available only on the AWS Cloud.
Learn more here: https://aws.amazon.com/redshift/
Snowflake Data Warehouse
Snowflake DWH can be deployed on AWS, Azure, or Google Cloud infrastructure. It’s easy to operate; moving data into Snowflake with an ETL solution is a no-brainer. Snowflake also offers another service – Snowpipe – a convenient continuous data ingestion tool.
Apart from that, Snowflake is known for its architecture and data sharing capabilities. It allows storage and compute to scale independently.
Snowflake DWH isn’t open source – it’s a commercial, fully managed service.
Learn more here: https://www.snowflake.com/workloads/data-warehouse-modernization/
Finally, the last element of the data pipeline is streaming data. However, it’s not technically a step that follows setting up ETL workflows and data access – it’s another form of ETL, for continuously arriving data.
Streaming data is the continuous flow of data. The data can come in from all types of sources, different formats and volumes; and with stream processing technology, data streams can be processed, stored, analyzed, and acted upon instantly, as it’s generated in real time.
That’s why streaming is great… and expensive. If you don’t need to get an instant output out of your data pipeline, streaming isn’t necessary, and batch processing will suffice.
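The batch-versus-streaming trade-off can be sketched in plain Python (no streaming framework involved): the same transformation, applied either to a complete dataset at once or to records the moment they arrive.

```python
def process(record):
    """A stand-in transformation applied to each record."""
    return record * 2

def batch_job(dataset):
    """Batch: wait until the whole dataset exists, then process it in one go."""
    return [process(r) for r in dataset]

def stream_job(source):
    """Streaming: handle each record as soon as it is produced."""
    for record in source:
        yield process(record)

batch_result = batch_job([1, 2, 3])
stream_result = list(stream_job(iter([1, 2, 3])))
```

Both approaches produce the same output; the difference is latency – streaming yields each result immediately, while batch waits for the full dataset, which is why streaming infrastructure only pays off when you actually need instant results.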
Now, let’s look at a few streaming tools.
Apache Kafka is a distributed event streaming platform for high-performance data pipelines, streaming analytics, data integrations, and mission-critical applications.
Kafka has a wide user-base and great community; according to the platform’s website, 80% of all Fortune 100 companies use it.
Apache Kafka is open source.
Learn more here: https://kafka.apache.org/
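To illustrate the producer/consumer pattern that Kafka implements at scale, here’s a minimal sketch in plain Python, with a stdlib queue standing in for a Kafka topic (this shows the concept only, not Kafka’s client API):

```python
import queue
import threading

# A stdlib FIFO queue plays the role of a "topic".
events = queue.Queue()

def producer():
    """Publish a handful of hypothetical order events."""
    for i in range(5):
        events.put({"event_id": i, "payload": f"order-{i}"})
    events.put(None)  # sentinel: no more events

consumed = []

def consumer():
    """Read events off the queue as they arrive, until the sentinel."""
    while True:
        event = events.get()
        if event is None:
            break
        consumed.append(event["event_id"])

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```

Kafka adds what this toy lacks: durable, partitioned, replicated logs that many independent consumers can read at their own pace across a cluster.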
Amazon Kinesis enables to collect, process, and analyze real-time, streaming data for timely insights and quick reactions to new information. With it, ingesting, buffering, and processing streaming data is limited to seconds or minutes instead of hours or days.
Amazon Kinesis isn’t open source.
Learn more here: https://aws.amazon.com/kinesis/
2. Data Visualization
After all the data is usable and accessible, data engineers are also responsible for visualizing it, if needed. Visualizations are immensely important. After all, even if the information generated by the data pipeline is exactly what your enterprise needed to make a crucial data-driven decision, that decision will be hard to make if the conclusions drawn from the data are not clear.
There are several approaches to visualization.
First off, it can be done on a simple level, in exploratory data analysis (EDA) – even with plain Python. Python enables you to create sophisticated dashboards, but with a huge downside – it takes time. As a result, these simple visualizations are often only useful for supporting Machine Learning; since these reports are static, not much can be changed within them.
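As a taste of the “simple, static” end of this spectrum, here’s a tiny stdlib-only sketch: a text histogram of hypothetical daily order counts. In real EDA you would typically reach for libraries such as matplotlib or seaborn instead.

```python
import statistics

# Hypothetical daily order counts, purely for illustration.
daily_orders = [12, 15, 9, 20, 18, 14, 25]

def text_histogram(values, width=25):
    """Render a crude, static bar chart as text."""
    top = max(values)
    lines = []
    for i, v in enumerate(values):
        bar = "#" * round(v / top * width)
        lines.append(f"day {i}: {bar} ({v})")
    return "\n".join(lines)

chart = text_histogram(daily_orders)
mean_orders = statistics.mean(daily_orders)
```

The output is exactly the kind of static report described above: useful for a quick look, but nothing a business user could interact with or drill into.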
Next, a more complex visualization option is on-demand reports powered by custom BI tools. They are created mainly for Business Analysts. Contrary to EDA reports, they’re dynamic. After they’re set up, even non-technical people can change dashboards and cherry-pick the information they’re most interested in – like in Tableau.
Finally, there are also custom JS apps. They are mainly useful if you’re providing (or selling) your data to external users – for example, in the form of interactive map apps. These solutions are expensive; however, they offer nearly unlimited flexibility.
When it comes to off-the-shelf tools, there are several I would recommend using.
The abovementioned Tableau is an analytical and business intelligence platform. It offers interactive dashboards, quick responsiveness, and real-time data analysis features. Tableau has one of the widest user bases when it comes to visualization tools. It also offers good support.
Tableau isn’t open source.
Learn more here: https://www.tableau.com/
Microsoft Power BI
Microsoft Power BI enables you to create rich interactive reports with visual analytics. It enables you to develop deep, actionable insights for a broad range of scenarios.
Microsoft Power BI isn’t open source. It supports Python.
We’ve also written a detailed article on how to use Microsoft Power BI. Make sure to read it to learn more about this tool.
3. Big Data
Big data can either be part of the ETL process or a totally independent transformation. When it’s part of the ETL process, it speeds up data transformation – instead of transforming smaller data streams, big data tools allow you to work on huge loads of data at once.
Generally, the goal of this data engineering responsibility is to manage enormous data streams efficiently. This can be powered either by a single very strong computing machine or by many smaller, less powerful ones working together. The second approach is more modern and cost-effective – all in the spirit of high speed and easy access to data.
And there are tools that make it possible.
Hadoop Distributed File System (HDFS)
Hadoop is a distributed processing framework that manages data processing and storage for big data applications. HDFS is a key part of the many Hadoop ecosystem technologies discussed below. It provides a reliable means for managing pools of big data and supporting related big data analytics applications.
HDFS is tailored to work with enormous files. A typical HDFS file is measured in gigabytes to terabytes. As a result, the system should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster – supporting tens of millions of files in a single instance.
It’s based on Hadoop, which is open source.
Learn more here: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Apache Spark is a unified analytics engine for large-scale data processing. Although Spark works as a standalone solution, it can also be deployed with other tools (for example, on Kubernetes) to create better-performing, high-scaling environments.
Additionally, Spark powers a stack of libraries such as Spark SQL and the DataFrame API, and more. But, most importantly, Spark delivers high performance for both batch and streaming data – all with the help of a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
The engine is open source.
Learn more here: https://spark.apache.org/
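The processing model Spark parallelises across a cluster can be sketched in plain Python: map work over partitions independently, then reduce the partial results into one answer (this illustrates the model, not Spark’s API). A classic word count:

```python
from collections import Counter
from functools import reduce

# Hypothetical dataset split across two "partitions", as a cluster would hold it.
partitions = [
    ["big data big pipelines"],
    ["big data everywhere"],
]

def map_partition(lines):
    """Map: count words locally within one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge(a, b):
    """Reduce: combine per-partition counts into a global result."""
    return a + b

word_counts = reduce(merge, (map_partition(p) for p in partitions))
```

On a real cluster, Spark runs the map step on many machines at once and shuffles the partial results before reducing – that parallelism is where the speed-up on huge datasets comes from.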
Hive facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
Although also developed by Apache – just like Spark – and also dedicated to big data operations, Hive is a tool with a different purpose. While Spark is a framework for data analytics, Hive is a distributed, SQL-like data warehouse layer that operates on top of the Hadoop Distributed File System.
The database is open source.
Learn more here: https://hive.apache.org/
Other Apache Tools
The abovementioned tools are by no means all the big data tools (remember our Pokémon or Big Data example?). But to cover the entire Apache toolset would require a separate article. If you need to read more about big data technologies, visit the website of the Apache Foundation.
4. Machine Learning Operations
Many experts view MLOps as an in-between category. Some will say it starts during the ETL workflow, others that it starts at the data storage stage.
But regardless of this theoretical issue, MLOps has two phases – experiments and production.
The experimental phase requires swift access to data. During experiments, data scientists validate hypotheses on sample data; as soon as they find something, the production phase begins. Production requires efficient data access and not slowing down the transactional data infrastructure.
The experiment phase focuses on swift experimentation and iteration over ideas. Data scientists should get flexible working environments and proper data – and data engineers make it possible. After every breakthrough in the experiment phase, the production phase needs to handle all the heavy lifting. This includes automated model training and hyperparameter tuning, model deployment, and monitoring on production (and day-2 operations).
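One piece of that heavy lifting, hyperparameter tuning, can be sketched as a toy grid search in plain Python. The parameter names and the scoring function below are hypothetical stand-ins; in reality the score would come from training a model and evaluating it on a validation set.

```python
import itertools

# A hypothetical hyperparameter grid.
grid = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 64],
}

def train_and_score(learning_rate, batch_size):
    """Stand-in for training + validation; rewards values near (0.1, 64)."""
    return 1.0 - abs(learning_rate - 0.1) - abs(batch_size - 64) / 1000

# Try every combination and keep the best-scoring one.
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda params: train_and_score(**params),
)
```

Production MLOps platforms automate exactly this loop, but at scale: launching many training runs in parallel, tracking their scores, and promoting the winner.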
The following tools can help in these phases.
Kubeflow, a project initiated by Google, enables you to manage a set of open-source tools for MLOps and deploy them on Kubernetes.
Kubeflow is dedicated to making and managing ML workflows in a scalable way. The tool aims not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
The platform enables you to work on Machine Learning operations in both the experimental and production phases.
Kubeflow is open source.
See more here: https://www.kubeflow.org/
Like Kubeflow, MLflow is a platform for managing end-to-end Machine Learning lifecycles. In its concept, MLflow is organized into four components: tracking, projects, models, and model registry. Each of these components can be used on its own; however, they’re designed to work well together.
Citing its official website, the platform is designed to work with any machine learning library, determine most things about your code by convention, and require minimal changes to integrate into an existing codebase. At the same time, MLflow’s goal is to take any codebase written in its format and make it reusable and reproducible by multiple data scientists.
MLflow, just like Kubeflow, offers a simplified way of deploying your ML pipeline with experiment tracking and production workflow.
MLflow is open source.
See more here: https://mlflow.org/
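The idea behind MLflow’s tracking component can be illustrated with a minimal run tracker in plain Python (this is the concept only, not MLflow’s API): every run records its parameters and metrics, so results stay comparable and reproducible.

```python
import time

class RunTracker:
    """A toy experiment tracker: one entry per training run."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Record a run's hyperparameters and resulting metrics."""
        self.runs.append({
            "timestamp": time.time(),
            "params": params,
            "metrics": metrics,
        })

    def best_run(self, metric):
        """Return the logged run with the highest value of a metric."""
        return max(self.runs, key=lambda r: r["metrics"][metric])

# Hypothetical runs, purely for illustration.
tracker = RunTracker()
tracker.log_run({"lr": 0.01}, {"accuracy": 0.91})
tracker.log_run({"lr": 0.1}, {"accuracy": 0.95})
best = tracker.best_run("accuracy")
```

MLflow adds persistent storage, a UI, artifact logging, and a model registry on top of this basic record-and-compare idea.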
Sacred is a tool to configure, organize, log, and reproduce computational experiments. It adds only minimal overhead while encouraging experiments to be modular and configurable.
Most of all, Sacred helps to keep track of all the parameters of your experiment, run experiments for different settings easily, save configurations for individual runs in files or a database, and reproduce the achieved results.
Sacred is a good tool for experiment tracking; however, it offers fewer options than MLflow, which doesn’t only focus on tracking the experiment phase.
Sacred is open source.
See more here: https://sacred.readthedocs.io/en/stable/
Every Cloud provider comes with its own MLOps tool. The AWS Cloud offers Amazon SageMaker. With it, data engineers can prepare, build, train, and deploy high-quality ML models on the AWS Cloud quickly, thanks to a broad set of ML-tailored capabilities.
SageMaker is a good choice for running hyperparameter sweeps and general run orchestration.
Amazon SageMaker isn’t open source.
Learn more here: https://aws.amazon.com/sagemaker/
Top Data Engineering Tools – Summary
Working with the right tooling is essential to achieving success in data engineering. And the first major challenge in that area is identifying the tools that will be optimal for your needs.
We’ve listed many tools and technologies in this article, but remember – this was only an overview. Although the list mainly contains the most popular solutions, there are many, many others that can help you get the results you’re looking for.
See also: Data Exploration Workshop | PGS Software