Defining AWS Glue jobs as Infrastructure-as-Code
How to develop ETL workflows as Python AWS Glue jobs with AWS CDK and enable local development
Using Infrastructure-as-Code (IaC) to deploy resources to the cloud is a no-brainer nowadays. The learning curve is a bit steeper at the start than with click-ops, but it pays off in the long term. In this post I want to help you get started with IaC applied to the data engineering domain on AWS.
I use my 'datalake repository' as the basis for this blog post. You can clone the repo and try it out yourself; please follow the required steps in the repo's README.
I use CDK with Python in this example. One of the major advantages of CDK is that you can define infrastructure in a general-purpose programming language. Since Python is the most-used language in data engineering, defining both the infra and the transformations in a single (and globally the most used) language keeps the technical knowledge required for developing ETL lower.
Glue job
The two basic ingredients of data engineering are transformation logic and data. Transformation logic (the T in ETL) is executed within AWS by a Glue job. I defined a CDK job construct that takes the following properties as arguments (a sketch follows the list):
- The job's script: this contains the logic that is executed when a job run is triggered
- The IAM role: the set of permissions of the Glue job within AWS
- Runtime parameters: an adaptable set of arguments that allows you to parameterize the execution of job runs
- Infrastructure parameters: options such as the number of retry attempts on failure and the job's timeout duration.
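A minimal sketch of what such a construct could look like, built on the low-level `CfnJob` resource (the property names, defaults, and Glue version here are illustrative; the construct in the repo may differ):

```python
from aws_cdk import aws_glue as glue
from constructs import Construct


class GlueJob(Construct):
    """Thin wrapper around CfnJob; property names and defaults are illustrative."""

    def __init__(
        self,
        scope: Construct,
        construct_id: str,
        *,
        script_location: str,      # s3:// URI of the job script
        role_arn: str,             # IAM role the job assumes
        default_arguments: dict,   # runtime parameters, e.g. {"--demo_argument": "value"}
        max_retries: int = 0,      # infrastructure parameters
        timeout_minutes: int = 60,
    ) -> None:
        super().__init__(scope, construct_id)
        self.job = glue.CfnJob(
            self,
            "Job",
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",  # Spark ETL job type
                python_version="3",
                script_location=script_location,
            ),
            role=role_arn,
            default_arguments=default_arguments,
            max_retries=max_retries,
            timeout=timeout_minutes,
            glue_version="4.0",
        )
```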
Job script
The Glue job script is the entry point of the job. It contains the Python (PySpark) commands that are executed by the job. I created a demo script to check that everything works; it does some job-argument parsing and logging, and it imports the 'demo_lib' package, which is provided to the job as a Python wheel.
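The gist of the demo script looks roughly like this (`demo_lib.run` is a hypothetical entry point standing in for the real one in the repo):

```python
import logging
import sys

from awsglue.utils import getResolvedOptions  # available inside the Glue runtime

import demo_lib  # shipped to the job as a Python wheel (see below)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Resolve the arguments the job was triggered with (falling back to the
# job's default arguments).
args = getResolvedOptions(sys.argv, ["demo_argument"])
logger.info("demo_argument = %s", args["demo_argument"])

# Hypothetical entry point; the real demo library lives in the repo.
demo_lib.run()
```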
The demo library
In the 'demo_lib' folder under 'libs' I created a Python package that contains some logic for the Glue job to use. The CDK code uploads this Python wheel to the bucket, and the Glue job installs the package during start-up.
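Assuming the built wheel ends up in `libs/demo_lib/dist`, the upload could be done with a `BucketDeployment` along these lines (the local path and key prefix are assumptions; check the repo for the actual code):

```python
from aws_cdk import aws_s3_deployment as s3deploy

# Sketch: copy the built wheel from libs/demo_lib/dist into the bucket
# under the "libs" prefix; 'bucket' is the s3.Bucket defined in the stack.
s3deploy.BucketDeployment(
    self,
    "DeployDemoLib",
    sources=[s3deploy.Source.asset("libs/demo_lib/dist")],
    destination_bucket=bucket,
    destination_key_prefix="libs",
)
```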
For more details about creating a Python package, and building it into a wheel, check my other blog post: martijn.sturm/use-your-own-python-packages-in-glue-jobs
A library published to PyPI
I created a Python package called 'glue-helper-lib' (repo, PyPI; `pip install glue-helper-lib`) that contains functionality that is useful for Glue jobs specifically. The Glue job below also uses this package; it is installed from PyPI during start-up of the job.
Let's define our resources...
In the CDK app file, we instantiate the resources to be deployed to AWS in a CDK Stack (a sketch follows the list):
- A bucket that contains:
- the demo library as a Python wheel
- the Glue job script
- An IAM role for the Glue job to assume, which:
- allows reading and writing objects in the bucket
- A Glue job
- Which refers to the Glue job script
- Contains some job arguments:
- A custom argument "demo_argument"
- Cloudwatch logging options
- Dependencies:
- The custom package that we put into the bucket
- A package that I published to PyPI called "glue-helper-lib"
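Put together, the stack could look roughly like this (resource names and object keys are illustrative; `GlueJob` is the construct sketched earlier, and the dependency arguments are covered in the sections below):

```python
from aws_cdk import Stack
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DatalakeStack(Stack):
    """Illustrative stack; not the repo's exact code."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket holding the job script and the demo library wheel
        # (uploaded with a BucketDeployment as shown earlier).
        bucket = s3.Bucket(self, "DatalakeBucket")

        # Role the Glue job assumes, allowed to read and write the bucket.
        role = iam.Role(
            self,
            "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        )
        bucket.grant_read_write(role)

        # The job itself, wired to the script and the role.
        GlueJob(
            self,
            "DemoJob",
            script_location=bucket.s3_url_for_object("scripts/demo_script.py"),
            role_arn=role.role_arn,
            default_arguments={
                "--demo_argument": "hello",
                "--enable-continuous-cloudwatch-log": "true",
                # dependency arguments are covered in the next sections
            },
        )
```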
Custom arguments
Glue jobs can be configured with 'default arguments'. These arguments are available at runtime, and you can override their values when triggering the job via an API call. I created a dataclass base class that lets you declare which arguments your Glue job expects and instantiate a dataclass with those arguments; check the 'glue-helper-lib' (repo, PyPI). I have written examples of how to use this, covering both the infra definition and the runtime logic.
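The idea looks roughly like this; note that this is a plain-dataclass illustration of the pattern, not the library's actual API:

```python
import sys
from dataclasses import dataclass, fields

from awsglue.utils import getResolvedOptions


@dataclass
class JobArguments:
    """Declare the arguments the job expects, once, as typed fields."""

    demo_argument: str

    @classmethod
    def resolve(cls) -> "JobArguments":
        # Look up each declared field among the job's runtime arguments.
        names = [f.name for f in fields(cls)]
        raw = getResolvedOptions(sys.argv, names)
        return cls(**{name: raw[name] for name in names})


args = JobArguments.resolve()
print(args.demo_argument)
```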
Runtime dependencies
You can use both Python wheels and pip-installable packages from PyPI; this example uses both. For packages from PyPI you can pin which version to install, which the datalake-lib package makes very convenient. For custom wheels uploaded to S3, the version is encoded in the wheel's file name.
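Wiring both dependency types into the job's default arguments could look like this (the bucket name, key, and versions are illustrative):

```python
# Sketch: both dependency mechanisms as Glue default arguments.
default_arguments = {
    # Custom wheel uploaded to S3; the version sits in the file name.
    "--extra-py-files": "s3://datalake-bucket/libs/demo_lib-0.1.0-py3-none-any.whl",
    # Pip-installable package from PyPI, pinned to a version.
    "--additional-python-modules": "glue-helper-lib==0.1.0",
}
```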
...deploy...
Run `cdk deploy`, making sure you have credentials configured for the right AWS account.
...and run!
Go to the AWS management console and navigate to the Glue jobs page. Make sure you select the same region you deployed the job to.
Trigger the job by clicking the 'Run' button. The job's status will update, and you can check the logs for the output. Within two minutes the job should have completed successfully!
Don't forget to clean up your resources after experimenting; running `cdk destroy` tears the stack down again. Although the resources are serverless, we prefer to keep our AWS accounts clean and tidy.
Conclusion
This post only showcases the basics of developing higher-quality infra in AWS. The next step would be to define some genuinely useful Glue jobs that apply these principles. I hope this blog post helps you get there. Please reach out to me if you need any assistance.