Scala Jobs on AWS Glue: A Practical Guide to Development, Local Testing and Deployment
As a data engineer, I love Spark and use AWS Glue as one of the main platforms for deploying Spark jobs at my company. There is a lot to like about Glue: it is highly scalable, cost-effective, and integrates easily with other AWS services such as CloudWatch and Step Functions, which makes orchestrating complex pipelines straightforward.
One thing I do not like is how the codebase for large Python-based PySpark jobs has to be packaged and managed in order to run on AWS Glue. This is not a Spark issue; it is a Glue issue.
PySpark jobs also suffer performance penalties when they use Python UDFs: each row must be serialized and moved from the JVM to a separate Python worker process for transformation, then serialized again on the way back to the JVM. This round trip is expensive and time-consuming. Here is a quick diagram of how the processes interact.
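A Scala UDF avoids that round trip entirely, because the function body executes inside the JVM on the executors. The following is a minimal sketch of this, not code from the article; the object name, column names, and sample data are all assumptions, and it needs Spark on the classpath to run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

// Hypothetical example: a Scala UDF stays in the JVM, so there is no
// serialization hop to a Python worker process as with a PySpark UDF.
object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("alice", "bob").toDF("name")

    // The closure below runs directly on the JVM executors.
    val upper = udf((s: String) => s.toUpperCase)
    df.select(upper(col("name")).as("upper_name")).show()

    spark.stop()
  }
}
```

Where possible, built-in functions from `org.apache.spark.sql.functions` are still preferable to any UDF, since Catalyst can optimize them; a Scala UDF simply removes the cross-process serialization cost, not the optimizer opacity.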
I have been wanting to move over to Scala for developing Glue jobs, but I always found it difficult to test them locally before deployment.
Below, I have outlined the steps I use to create a basic ETL jar and then test it locally using Docker.
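As a preview of the Docker-based workflow, the commands might look something like the sketch below. The image tag, mount paths, class name, and jar name are all assumptions here (AWS publishes Glue local-development images under `amazon/aws-glue-libs`; check the tag that matches your Glue runtime version before using it).

```shell
# Pull an AWS Glue local-development image (tag is an assumption;
# pick the one matching your target Glue version).
docker pull amazon/aws-glue-libs:glue_libs_4.0.0_image_01

# Submit a locally built ETL jar to the Spark runtime inside the
# container. The jar path and main class below are hypothetical.
docker run -it --rm \
  -v "$PWD/target:/home/glue_user/workspace" \
  amazon/aws-glue-libs:glue_libs_4.0.0_image_01 \
  spark-submit --class com.example.EtlJob \
  /home/glue_user/workspace/etl-job.jar
```

Running the job this way lets you iterate on the jar locally before uploading anything to S3 or creating a Glue job definition.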