In recent years, the term “big data” has become increasingly hard to avoid. Indeed, large companies already have dedicated positions for processing and analysing such data sets – but what about those that haven’t yet firmly established themselves in this area?
As a young software developer, as well as someone preparing for their AWS Certifications, I quickly became interested in the potential of Big Data and identified it as something I would like to include in my career path. Consequently, it wasn’t enough to simply learn and understand the theoretical issues – I needed to be able to put it into practice. Fortunately, PGS Software puts a lot of effort into ensuring their developers are fully prepared, even before we pass our AWS Certifications. As this is something I wanted to work with a lot in the future, I knew hands-on experience with internal projects would be a big milestone in my personal development.
The requirements for this task were quite clear: create a simple and efficient tool to analyse large amounts of data, complete with visual representation, for minimal financial cost.
Since this project was about big data, we first needed to decide what volume we were dealing with. We settled on roughly 100 GB, in a specific data type and format (which I’ll discuss shortly). While personal definitions of volume vary, I’ve always understood ‘big data’ to refer to anything upwards of tens of gigabytes, so 100 GB is an ideal place to start testing the performance of the tools used in this project. It’s also worth mentioning that the larger the volume of data, the more conclusions can be drawn – one of the premier advantages of utilising big data.
Before we could start, however, we first had to choose the type of data we would be using. In this case, as in many others, the amount of data helps determine the type. There are large collections of publicly available datasets in numerous areas, such as patent contracts, medical statistics, blockchains and – in our case – weather information. We chose weather data because it lends itself well to graphical representation.
We utilised data provided by NOAA (National Oceanic and Atmospheric Administration) – a collection of measurements that include, among other things, temperature values from various locations around the world, gathered from 1763 to the present year (2018). This is all collated in a folder of files, each of which covers a year’s worth of measurements.
Each of these files is saved in a CSV format, containing information on where the measurements were made, the exact date they were recorded, a description of the measured element and its value, as well as additional data regarding the measurement’s source, overall quality and more. To summarise, the data is very well prepared, clear to read and already sorted, with individual columns separated by commas.
As an example, here’s how the data structure appears in a single file:
Here are the descriptions for each individual column heading:
- id – an 11-character station identification code
- date – an 8-character date in the YYYY-MM-DD format. For example, 1986-05-29 refers to the 29th of May, 1986
- element – a 4-character indicator of the element type (e.g. TMAX is the maximum temperature)
- value – a 5-character data value for the element. In this case, the temperature in tenths of degrees Celsius
- mflag – a 1-character measurement flag
- qflag – a 1-character quality flag
- sflag – a 1-character source flag
- time – a 4-character time of observation in the hour-minute format (e.g. 0700 = 7:00 am)
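To make the layout concrete, here’s a short Python sketch that parses a single row in this format. The station id and values below are illustrative examples, not taken from the real data set:

```python
import csv
from io import StringIO

# A sample row following the column layout described above
# (illustrative values, not real measurements).
sample = "USW00014735,1986-05-29,TMAX,211,,,S,0700"

fields = ["id", "date", "element", "value", "mflag", "qflag", "sflag", "time"]
row = dict(zip(fields, next(csv.reader(StringIO(sample)))))

# Temperature elements are stored in tenths of degrees Celsius,
# so a TMAX value of 211 means 21.1 degrees.
celsius = int(row["value"]) / 10.0
print(row["id"], row["date"], row["element"], celsius)  # USW00014735 1986-05-29 TMAX 21.1
```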
Given our requirements, as well as the size of the data set, I needed a method of storing the data. Local storage is problematic because of the memory requirements, and loading such data volumes slows down many processes. It is also not a Serverless solution.
When comparing the costs of Cloud providers, especially against the costs of storing data on physical, external devices, storing information in the Cloud appeared to be the cheapest solution.
Because of this, I decided on one of Amazon’s solutions: the S3 bucket. This service is easy to write to, able to retrieve large volumes of data from various sources, has a high durability rate (99.999999999%), and provides comprehensive security and compliance capabilities. If that wasn’t enough, it’s also designed to be available 99.9% of the time and, most importantly for us, it’s completely Serverless.
Additionally, S3 offers flexibility via data management and access control. It also provides query-in-place functionality, allowing us to perform analyses directly on the data stored in the bucket.
Our bucket is divided into two folders: one contains compressed data, the other uncompressed. This separation was necessary for the tests described later.
Here, you can see the folder structure I set up for my S3 bucket:
The next challenge was that the data is dispersed across many files, with no schema from which to create a table. Carrying out advanced analyses on raw data is always difficult, which is why I needed a way to give it structure: AWS Glue was exceptionally helpful for this.
Glue is an ETL (extract, transform and load) service that enables the efficient preparation, loading and transferring of data between various forms of storage. It contains the Glue Data Catalog, which creates and stores the schemas and metadata for databases. It’s also a Serverless tool, and therefore has no infrastructure to configure and manage.
Furthermore, it’s worth mentioning that Glue can also create a scheduler, which fires the crawler on a cyclical basis – useful if the data is updated from time to time. For this project, we stayed with the on-demand option, because we were creating a proof of concept rather than a permanent, ongoing tool. Of course, such an option could easily be extended with a scheduler.
Glue creates a conventional database and table for us, based on the files that we store in S3. This is an entirely automated process, managed by AWS, and the user’s role is therefore reduced to just a few clicks. If we have good quality CSV files, the table we get as a result of running the crawler will contain named columns, complete with specific data types in their respective columns.
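The same crawler could also be defined programmatically rather than through the console. Below is a minimal sketch of the argument dict for boto3’s `create_crawler` call; the bucket, IAM role and database names are hypothetical placeholders:

```python
# Sketch of creating the Glue crawler via the API, mirroring the console
# steps described in the post. All names below are hypothetical.

def build_crawler_config(bucket: str, prefix: str) -> dict:
    """Builds the argument dict for Glue's create_crawler call."""
    return {
        "Name": "weather-data-crawler",
        "Role": "AWSGlueServiceRole-weather",   # IAM role assumed to exist
        "DatabaseName": "weather_db",           # target Data Catalog database
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

config = build_crawler_config("my-weather-bucket", "uncompressed/")
# With boto3 installed and AWS credentials configured, the crawler would
# then be created and started like this:
#   glue = boto3.client("glue")
#   glue.create_crawler(**config)
#   glue.start_crawler(Name=config["Name"])
print(config["Targets"]["S3Targets"][0]["Path"])  # s3://my-weather-bucket/uncompressed/
```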
Below, you can see the steps required to create a crawler in Glue:
The main goal of this project was to find a tool that can analyse large data sets efficiently, quickly and on a low budget. It also needed to be completely Serverless, as I wanted the entire infrastructure to be managed by AWS, not by us.
After researching the available options, Amazon Athena turned out to be a great fit for our needs.
AWS Athena is a service used to write interactive queries and analyse data stored directly in S3. Queries can be written in the Athena console, generating results immediately. It is also a cost-effective solution, because fees are only charged when queries are run. Like the previous AWS tools used here, Athena is Serverless. It is also compatible with Glue, so integrating these services was very simple to achieve.
In the Athena dashboard, we select the database and table, which appear automatically after running the crawler we configured earlier in Glue. The names of the database and table match those specified in Glue itself. In fact, our configuration work ends here, as Athena draws everything it needs from Glue. It is also another service with a simple and clear interface.
Overall, connecting Glue and Athena is very simple and required little effort on my part.
After selecting the appropriate table, we can start to write queries. Fortunately, Athena uses the SQL syntax, which gives me a huge advantage, as it is something I find very easy to work with.
Furthermore, we can save the queries that we create, and Athena even keeps a query history covering the last 45 days. This history shows both successful and unsuccessful queries, and allows us to download the results or note the ID of each query for future use. It’s also possible to save query results as .txt or .csv files.
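Queries can also be submitted through Athena’s API rather than the console. The sketch below builds the arguments for boto3’s `start_query_execution` call; the database, table and bucket names are hypothetical:

```python
# Sketch of submitting a query via the Athena API. Names are placeholders.

def build_query_request(database: str, output_bucket: str, sql: str) -> dict:
    """Builds the argument dict for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes each query's results as a CSV to this S3 location,
        # matching the downloadable results mentioned above.
        "ResultConfiguration": {
            "OutputLocation": f"s3://{output_bucket}/results/"
        },
    }

request = build_query_request(
    "weather_db",
    "my-weather-bucket",
    "SELECT * FROM weather_data LIMIT 10;",
)
# With boto3: boto3.client("athena").start_query_execution(**request)
print(request["ResultConfiguration"]["OutputLocation"])  # s3://my-weather-bucket/results/
```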
The image below shows the Athena dashboard and query editor:
When it comes to pure data analysis, results presented in new tables are not as easy to read or interpret as a visualisation. This is vital when presenting data to someone who lacks the knowledge to interpret the raw form. In other words, graphical data representation is essential – and it allows users to draw more interesting conclusions.
For our project, Amazon again managed to surprise us with the breadth and availability of its various tools. In this case, I decided to use QuickSight. This is a business intelligence tool that allows for the easy creation of visualisations, alongside simple data analysis. It also allows us to access data from multiple sources.
As with all other AWS services, integrating QuickSight with Athena is very simple and easy to do. The entire process takes just 3 steps:
- Start a new analysis
- Choose a data source, which can be our S3 bucket, an existing dataset saved in QuickSight, or one drawn directly from Athena.
- After the data source has been selected, we can create graphical visualisations in the form of various charts.
For an example of the final, visualised data, please see below:
In this graph, we can see changes in the minimum and maximum temperatures during the day, alongside average values, depending on the date. This particular visualisation shows the scope of data for 7 months in 2018 – this is a small section of data obtained as a result of a given query.
When it came to the architecture, I decided to go Serverless for several reasons. First of all, I wanted to focus on core functions and spend less time maintaining and troubleshooting. While some of this is unavoidable for most applications, taking the hardware necessities out of the equation allowed me to keep it to an absolute minimum.
Additionally, I was concerned about the efficiency and availability of the services I used. As mentioned near the start of this post, Serverless solutions are highly available and durable.
In the diagram below, you can see the current project architecture:
Like any project, this one needed thorough testing. Fortunately, because all of the services used here are completely managed by AWS, no advanced tests are required.
Instead, I was able to focus on performance testing both Glue and Athena.
Testing AWS Glue
For the sake of testing, I decided to manipulate the data a little and see how each service manages when the data is formatted, as well as when I use less data. As mentioned earlier, all data is stored in the S3 bucket in 2 separate folders. One folder contains uncompressed data, with compressed data stored in the other.
With this in mind, I performed tests for 2 scenarios:
- Testing the schema creation rate for uncompressed data
- Testing the schema creation rate for compressed data
Despite initial appearances, schema creation was fastest for the largest amount of data – the uncompressed set in the first scenario – taking about 30 seconds. For the second scenario, the schema was created in 33 seconds.
Below, I’ve included the table we received after running the crawler on the compressed data set:
I also checked how Glue behaves when a crawler only has 1 file at its disposal. In this case, schema creation took about 20 seconds, which was the fastest result. However, for a single file, the column names and data types were not as accurately defined as in instances that use a larger amount of data.
As we can see below, the columns have not been named, the column types differ from those in the image above, and there’s a different number of columns present.
Testing AWS Athena
For the Athena service, I wanted to test the efficiency of its analysis. I prepared 2 scenarios and, for each, wrote 3 types of queries, checking the run time of each.
The 2 scenarios are:
- Testing the compressed data set
- Testing the uncompressed data set
For each of these scenarios, I performed 3 types of queries:
- The first case: I selected 10 records for a given dataset
- The second case: I selected 1,000 records for a given dataset
- The third case: I selected the minimum and maximum temperature averages for Italy in the years 1763, 1891 and 2018. This represented approximately 30,000,000 records.
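As an illustration, the third case might look something like the query below, written in Athena’s SQL dialect. The table and column names follow the Glue schema described earlier and are assumptions; GHCN station ids begin with a two-letter country code (‘IT’ for Italy), and temperatures are stored in tenths of a degree:

```python
# A sketch of the third query; table and column names are assumptions
# based on the schema the Glue crawler produced.
third_case = """
SELECT element,
       AVG(value) / 10.0 AS avg_celsius   -- convert tenths of a degree to Celsius
FROM weather_data
WHERE id LIKE 'IT%'                        -- Italian stations
  AND element IN ('TMIN', 'TMAX')
  AND substr(date, 1, 4) IN ('1763', '1891', '2018')
GROUP BY element;
"""
print(third_case.strip())
```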
Here, I’ve presented the run time for each query (in seconds):
| Scenario | First Case | Second Case | Third Case |
As we can see, query performance is better for uncompressed data. This is probably caused by the high compression ratio of the algorithm used, as more CPU time is required to decompress data as the ratio increases.
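The trade-off is easy to demonstrate with Python’s standard library: repetitive CSV-like text compresses extremely well, but every query against the compressed set must first pay the CPU cost of decompression:

```python
import gzip

# Illustrative CSV-like payload (a repeated sample row, not real data).
original = b"USW00014735,1986-05-29,TMAX,211,,,S,0700\n" * 10_000

# Highly repetitive text compresses dramatically...
compressed = gzip.compress(original)
ratio = len(original) / len(compressed)
print(f"compression ratio: {ratio:.0f}x")

# ...but reading it back always requires a decompression pass first.
restored = gzip.decompress(compressed)
assert restored == original
```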
Of course, there is always more than one way to achieve any goal. As part of this project, I investigated similar solutions, so I could better compare Athena’s suitability.
Specifically, it’s worth discussing an alternative solution proposed by IBM. They propose using Apache Spark and iPython Notebook to create a weather analyser. This solution is based on the same weather data I have used, but it is not as simple as our Athena project.
It consists of 5 steps:
- First of all, you need to create a Spark instance.
- Next, you should create an iPython notebook on this instance.
- After this, the data set needs to be uploaded to the notebook.
- Following on, an RDD (resilient distributed dataset) needs to be created.
- Finally, you need to parse and analyse the data.
Furthermore, unlike our Athena project, IBM’s solution does not cover graphical data processing. Spark in this setup is also not queried with SQL syntax; instead, queries are written in Python using lambda functions.
Besides all of this, IBM’s answer is based on clusters, which need to be maintained and often require considerable financial effort. My solution uses SQL syntax, so we don’t need to write any Python code, making it easier to reuse. Finally, our option is completely Serverless, greatly reducing the costs of the project – even lower than often expected when you consider that we only pay for the amount of data being scanned.
Athena vs IBM: Costs
For each of the suppliers – AWS and IBM – and their respective solutions, we can use a so-called free trial, which allows us to create practically cost-free applications. However, as is typically the case, some features or services are not available in free trials.
The solution I created with AWS is slightly different from IBM’s solution, because the latter does not have a dedicated data processing service. What has been achieved through Athena in my project is instead carried out via the Apache Spark instance and iPython notebook for IBM.
As such, it’s hard to directly compare the costs or carry out an accurate estimate. To best compare costs between these Cloud providers, I carried out a calculation for both platforms, assuming the use of computational instances and storage, and set them against the costs of an on-premises solution.
| | On-premises | AWS | IBM |
|---|---|---|---|
| Server | $4.50 per month | $1.40 per month | $9.10 per month |
| Storage | $4.40 per GB/month | $0.023 per GB/month | $0.03 per GB/month |
| Analytics Engine | Unknown | Athena, $5 per TB scanned by a query | $0.70 per default compute instance hour |
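Athena’s pay-per-scan model makes per-query costs easy to estimate. Here is a quick sanity check of the $5 per TB figure from the table above:

```python
# Back-of-the-envelope Athena query cost at $5 per TB scanned.
PRICE_PER_TB = 5.00

def athena_query_cost(gb_scanned: float) -> float:
    """Cost in USD of a single query scanning the given amount of data."""
    return round(gb_scanned / 1024 * PRICE_PER_TB, 2)

# A worst-case query scanning the entire 100 GB uncompressed data set:
print(athena_query_cost(100))  # 0.49
```

In other words, even a query that scans the whole 100 GB data set costs well under a dollar, which is why scanning less data (see partitioning below) translates directly into savings.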
After completing this project, as well as the above comparison, some clear conclusions can be drawn.
In order to increase query efficiency – and further reduce costs in Athena – we can use data partitioning, which restricts the amount of data scanned by each query. Amazon Athena leverages Hive for this purpose, and data can be partitioned by any key. Targeted queries then only read the relevant partitions, improving performance where necessary.
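As a sketch of how this could look here (the bucket layout and table name are hypothetical), partitioning by year means storing files under Hive-style `key=value` prefixes, which Athena then exposes as a queryable column:

```python
# Sketch of year-based partitioning using the Hive 'key=value' S3 layout
# that Athena understands. Prefix and table names are hypothetical.

def partitioned_key(year: int, filename: str) -> str:
    """Builds an S3 key following the Hive partition convention."""
    return f"uncompressed/year={year}/{filename}"

print(partitioned_key(2018, "2018.csv"))  # uncompressed/year=2018/2018.csv

# A query filtering on the partition column then scans only that
# partition's files, reducing both run time and cost:
query = "SELECT * FROM weather_data WHERE year = 2018 LIMIT 10;"
```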
Furthermore, to check Athena’s performance, I also compared the solution to one that uses just the S3 bucket for data storage and QuickSight to visualise the results. From this, I determined that the latter is better at visualising smaller amounts of data: it is faster and the data is represented on the axes. With such small volumes, the S3 bucket and QuickSight alone are enough. However, when you want to scale to larger datasets, you also need Athena.
To summarise, all my initial assumptions were well met. Throughout this project, I managed to:
- Create a simple data analysis solution with low financial costs.
- Create a solution that’s completely Serverless, removing concerns regarding maintenance and scalability.
As for the AWS services, I can certainly say this was a great experience. Starting, integrating and configuring many of these services is fully automated, so, as a developer, I only needed to click a few times to receive a ready, versatile solution.
Practical Uses and Applications
As can be seen, AWS services allow us to quickly and effectively analyse the data we need – but the implications stretch far beyond weather reports. I would like to take a few moments to discuss some alternative uses for this form of data manipulation, as well as how to expand it to meet additional business needs.
Nearly every business these days generates data on a daily basis and most – if not all – stand to gain better knowledge if they harness this appropriately. A cost-effective, fast solution like ours readily enables deeper insights, regardless of scope or scale.
In fact, Amazon Athena itself is already used in a variety of industries and areas:
- Atlassian uses Athena, along with other AWS Analytics, as part of its custom-built self-service data lake.
- OLX, a popular online marketplace, utilises Athena throughout their organisation to improve their time to market, reduce costs and optimise their service.
- Movable Ink, a marketing platform, uses Athena to query 7 years’ worth of historical data, getting results in mere moments, all with the flexibility to further explore this data and generate deeper insights.
While a fast, Serverless data search engine is highly useful, we can also further automate the following steps. Since companies use this information to make informed decisions, we can integrate this setup with machine learning to make instant decisions based on a wealth of reliable data.
For example, our weather analysis model could be used in sales forecasting, primarily for companies dealing in FMCG (fast-moving consumer goods). Some items sell better, or worse, in certain weather conditions, and a machine learning algorithm could use this data, combined with sales figures, to learn what sells best in various conditions, helping companies adapt their offers and commercial strategy accordingly.
Transportation could also serve as another example. A taxi firm, for instance, could anticipate changes in weather, especially weather that correlates with increased taxi usage, and be prepared for it.
Of course, there are countless uses outside of weather analysis as well. In industries where fast reactions are vital, automated responses to data are essential. Consider financial institutions, such as wealth and hedge fund management: being able to adapt to market fluctuations, exchange rates and other factors that contribute to financial risk is crucial. Machine learning can use queries to pull all the data needed and make recommendations on the fly.
While AWS S3 Buckets and Quicksight are more than enough to handle smaller amounts of data, Athena provides an easy, cost-effective solution that scales with the demands of big data. Furthermore, it readily enables graphical visualisation, which is key in getting clear, readable results that enable you to make important decisions backed by well-informed conclusions.