
Example of an ETL job in AWS Glue, and query in AWS Athena

I hope with this post to show a simple end-to-end run-through of using AWS Glue to transform data from its raw format into a more queryable format, and then query it using AWS Athena.

I’m aiming to show this through the console only, with a focus on how to automate it with Terraform to come in a future post.

First of all, I downloaded a data set to use as my example. In this case it was some Apache logs, which I then broke down into 5 pieces to simulate the logs being uploaded at 5 different times.

Here are links to the data sets that I used:

GitHub

Or you can see my split versions here:

log 1 log 2 log 3 log 4 log 5

I then uploaded the data files into my S3 bucket, keeping the data in the log format.
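The upload could also be scripted. Here is a minimal boto3 sketch, assuming the split files are named access_log_1 to access_log_5 and go under a logs/ prefix in the craig-test bucket (the file names and prefix are placeholders, not what I actually used):

```python
import boto3

s3 = boto3.client("s3")

# The five split Apache log files (file names here are just an example)
log_files = [f"access_log_{i}" for i in range(1, 6)]

for name in log_files:
    # Keep the raw log format; the Glue crawler will work out the schema later
    s3.upload_file(name, "craig-test", f"logs/{name}")
    print(f"Uploaded {name} to s3://craig-test/logs/{name}")
```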

I then go to the AWS console, search for Glue, and head to the Databases tab, where I create a database called craig-test.

When I check the tables section, I see the message that no tables have been defined in the data catalogue, which makes sense because I have yet to do anything. So the next thing for me to do is create a crawler.

I create the crawler, pointing it at the S3 bucket, and set it to write to the data catalogue database that I created before.

Since I set the crawler to run on demand, I need to start it manually once it has been created.
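For reference, the same crawler setup can be sketched with boto3. The crawler name and role ARN below are placeholders, and leaving out a Schedule is what makes the crawler run on demand:

```python
import boto3

glue = boto3.client("glue")

# Create an on-demand crawler (no Schedule) that writes to the craig-test database.
# The crawler name, role ARN, and S3 prefix are placeholders.
glue.create_crawler(
    Name="craig-test-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    DatabaseName="craig-test",
    Targets={"S3Targets": [{"Path": "s3://craig-test/logs/"}]},
)

# Because there is no schedule, the crawler has to be started manually
glue.start_crawler(Name="craig-test-crawler")
```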

The Glue crawler pretty quickly determined that the logs were in Apache format, and was able to define the schema for me.

For the next stage, I wanted to create an ETL job that takes the logs in Apache format and transforms them into columnar data in Parquet format, so that it is much more efficient and cost effective to query in Athena.

I created a Glue job, added the data source (the craig-test bucket) and the destination (the craig-test-processed bucket), and selected that I wanted the destination to be in Parquet format. The Apache Spark code was auto-generated for me, which means I can easily tweak it and make changes if I need to.
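The generated script is a standard Glue PySpark job. Below is a simplified sketch of what it roughly looks like; the table name logs, the S3 paths, and the omission of the generated ApplyMapping step are my own simplifications rather than the exact code Glue produced:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: the job name comes in as an argument
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the Apache logs via the table the first crawler added to the craig-test
# database ("logs" is a placeholder for whatever the crawler named the table)
logs = glueContext.create_dynamic_frame.from_catalog(
    database="craig-test", table_name="logs"
)

# Write the same records out to the processed bucket as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=logs,
    connection_type="s3",
    connection_options={"path": "s3://craig-test-processed/"},
    format="parquet",
)

job.commit()
```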

I then run the job, and after a few minutes it finishes, although my first run failed.

Here are some of the logs generated in CloudWatch. They make it pretty easy to figure out what is going on, and they are detailed enough to let you intervene if things do not work as expected.

It was due to permissions: I forgot to allow the Glue role access to my output bucket in IAM.
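I made the change in the IAM console, but the fix amounts to attaching something along the lines of the policy below to the Glue role (the role name and policy name here are placeholders):

```python
import json

import boto3

iam = boto3.client("iam")

# Allow the Glue role to read and write the processed bucket.
# The role name and policy name are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::craig-test-processed",
                "arn:aws:s3:::craig-test-processed/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="my-glue-role",
    PolicyName="allow-processed-bucket",
    PolicyDocument=json.dumps(policy),
)
```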

I updated that, and tried running the job again.

I check the processed S3 bucket, and the Parquet files are there, ready for me to query using Athena.

The next stage, though, is to update the data catalogue to include the new data set in the transformed columnar Parquet format. So I go back to Glue and create a new crawler, but this time it crawls the S3 bucket I created containing the processed files.

Once it has run and the schema has been added to the data catalogue, it is time to move over to Athena for querying.

Note that it is fine to query the data in the raw log format without the ETL step of generating Parquet. The problem is that it is slow and costly, so it is likely you would also want to use something like Parquet. Once in Athena, I can query the data from the logs in a very fast, cost-efficient way, as if it were normal structured data.
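I ran my queries through the Athena console, but the same thing can be kicked off with boto3. In the sketch below the table name, the response column, and the results bucket are all assumptions about what the crawler and setup produce, not the exact names from my run:

```python
import boto3

athena = boto3.client("athena")

# Count requests per HTTP status code from the Parquet table the second crawler created.
# The table name, column name, and results bucket are assumptions.
query = """
    SELECT response, COUNT(*) AS hits
    FROM craig_test_processed
    GROUP BY response
    ORDER BY hits DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "craig-test"},
    ResultConfiguration={"OutputLocation": "s3://craig-test-athena-results/"},
)
```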

The difference with this simple query is in the Parquet data: if you look at the amount of data processed for each version, you can see how much more expensive querying the raw logs would be.

Written on November 29, 2019.