Data Engineering on AWS: Best Practices Overview

This blog post contains a list of best practices for data engineering on AWS. I will try to update this post regularly with new insights and best practices. Please note that this is not an exhaustive list. Am I missing an important one? Please let me know in the comments!

Serverless: Keep It Simple, Stupid

AWS has a serverless ETL service named AWS Glue. There are many benefits to using serverless services, but the most important one is simplicity. Unless you have a very, very good reason to spin up and manage an ETL cluster yourself, please don't. Just use Glue and enjoy being relieved of all the extra work it takes to get your ETL workflows running. If Glue does fall short, check out AWS EMR, but convince yourself that you really need it before you sidestep Glue.
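To give an impression of how little boilerplate this involves, here is a minimal sketch of a Glue PySpark job that reads CSV files from S3 and writes them back out as Parquet. The bucket paths are placeholders, not part of any real setup.

```python
# Minimal Glue PySpark job sketch: read CSV from S3, write Parquet back to S3.
# The bucket paths below are placeholders; replace them with your own.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw CSV files into a DynamicFrame.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the data back out as Parquet, ready for analytical querying.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/output/"},
    format="parquet",
)

job.commit()
```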

Keep your ETL logic DRY

Don't Repeat Yourself (DRY) is one of the most important principles in software engineering, and it should also be applied to data engineering logic. DRY can be applied across multiple separate AWS Glue jobs by making use of shared packages that are imported by each job. In AWS Glue, we can import Python packages in the form of Python wheels stored in Amazon S3, or by installing them from PyPI.

Please check out my other blog post on how to define your own Python packages and use them in multiple Glue jobs.
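As a quick illustration of the idea, the sketch below uses boto3 to point an existing Glue job at a shared wheel stored in S3 through the --additional-python-modules default argument (available from Glue 2.0 onwards). The job name, role ARN, bucket paths and the my_shared_etl package are all placeholders for your own setup.

```python
# Sketch: attach a shared Python wheel (and a PyPI dependency) to an existing Glue job.
# All names, ARNs and paths below are placeholders.
import boto3

glue = boto3.client("glue")

glue.update_job(
    JobName="my-etl-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/my-glue-job-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-etl-bucket/scripts/my_job.py",
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Glue installs these modules before the job script runs, so every job
            # that points at the same wheel shares the exact same helper code.
            "--additional-python-modules": (
                "s3://my-etl-bucket/wheels/my_shared_etl-0.1.0-py3-none-any.whl,"
                "pyyaml==6.0"
            ),
        },
    },
)
```

Inside each job script you can then simply import from my_shared_etl and keep the transformation logic in a single place.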

Use infrastructure-as-code

AWS Glue has multiple features that make data engineering tasks very user-friendly for people who are new to data engineering. Among these are Glue's drag-and-drop ETL GUI and interactive notebooks. These tools have enormous value for learning and ad-hoc experimentation. However, they cannot be the basis of production-grade data engineering workflows, for the following reasons:

  • There is no easy way to parameterize jobs created this way without duplicating all of the code
  • You cannot time-travel through the changes you made to your infrastructure and code
  • This approach lacks a CI/CD setup, which also means you are not unit-testing or integration-testing your infrastructure and your ETL logic (see the test sketch right after this list)
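
To make that last point concrete, here is a hypothetical pytest sketch of the kind of unit test you can only write once your ETL logic lives in version-controlled code. The clean_column_names helper is an imagined function from your own shared package, not a Glue API.

```python
# Hypothetical helper that would normally live in your shared ETL package.
def clean_column_names(columns: list[str]) -> list[str]:
    """Trim whitespace, lower-case column names and replace spaces with underscores."""
    return [c.strip().lower().replace(" ", "_") for c in columns]


# A unit test like this can run in a CI/CD pipeline, long before anything reaches AWS.
def test_clean_column_names():
    assert clean_column_names([" Order ID", "Customer Name"]) == ["order_id", "customer_name"]
```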

Nowadays, there are multiple good frameworks for using Infrastructure-as-Code (IaC) in combination with AWS Glue:

  • AWS Cloud Development Kit (CDK)
    • Allows you to define infrastructure in a fixed set of supported programming languages
    • Deploys infrastructure via CloudFormation stacks
  • Pulumi
    • Allows you to define infrastructure in a fixed set of supported programming languages
    • Can be used for multiple cloud providers
    • Uses AWS API calls to deploy
  • Terraform
    • Requires you to define infrastructure in the HashiCorp Configuration Language (HCL), which is additional syntax you need to learn and which can only be used for infrastructure-related tasks
    • Deploys via AWS API calls

In this blog post, I show how AWS CDK can be used to deploy Glue jobs and other related resources to an AWS account.
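
As a minimal sketch of what that can look like, the CDK stack below (in Python, using CDK v2) deploys a single Glue job together with the IAM role it runs under. The bucket, script location and resource names are placeholders, and a real stack will likely need extra permissions and job arguments.

```python
# Minimal CDK v2 sketch: one Glue job plus the role it assumes at runtime.
# Bucket names, script location and job name are placeholders.
from aws_cdk import Stack, aws_glue as glue, aws_iam as iam
from constructs import Construct


class GlueEtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Role that Glue assumes when running the job; grant it access to your
        # data buckets as needed.
        job_role = iam.Role(
            self,
            "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name(
                    "service-role/AWSGlueServiceRole"
                )
            ],
        )

        # CloudFormation-level Glue job construct. The job script itself is assumed
        # to have been uploaded to S3 beforehand, for example by the same pipeline.
        glue.CfnJob(
            self,
            "MyEtlJob",
            name="my-etl-job",
            role=job_role.role_arn,
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location="s3://my-etl-bucket/scripts/my_job.py",
            ),
            glue_version="4.0",
            worker_type="G.1X",
            number_of_workers=2,
        )
```

Deploying the stack is then a matter of running cdk deploy, and the same stack definition can be parameterized per environment (dev, test, prod) instead of being clicked together by hand.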

Conclusion

Although the principles in this blog post do require some setup before you get things to work, they pay off in the long term. In my experience, these investments are a must: if you try to cut corners on them, your data engineering workflows will quickly become unmanageable and hard to change when your company's needs and requirements evolve. I hope this blog post gives a good overview of how to get started. If you need any help setting this up, please reach out to me.