In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in determining a company’s competitiveness. AWS Glue, a serverless service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool.
The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run extract, transform, and load (ETL) jobs. By using this API, you can automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks.
To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore how the updated AWS Glue Jobs API works in depth and demonstrate the new experience with the updated API.
JobMode property
A new property, JobMode, describes the mode of an AWS Glue job (script, visual, or notebook) to improve your UI experience. You can use the mode that best fits your preference. Some ETL developers prefer to create visual jobs using the AWS Glue Studio visual editor. Some data scientists prefer notebook jobs and use AWS Glue Studio notebooks. Some data engineers and developers prefer to write scripts in the AWS Glue Studio script editor or their preferred integrated development environment (IDE). After a job is created in the preferred mode, you can find it easily by filtering on the job mode on your saved AWS Glue jobs page. Additionally, if you are migrating existing IPython notebook files to AWS Glue Studio notebook jobs, you can now set the job mode explicitly, and do so for multiple jobs, using this new API property, as demonstrated in this post.
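As a rough illustration, the following sketch migrates a couple of hypothetical notebook files into notebook jobs using the AWS SDK for Python (Boto3). The file names, the role placeholder, and the assumption that a notebook job's script location points at the uploaded .ipynb object are illustrative only:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Build the AWS Glue assets bucket name used later in this post.
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name
assets_bucket = f"aws-glue-assets-{account_id}-{region}"

# Hypothetical list of local IPython notebook files to migrate.
notebooks = ["clean_orders.ipynb", "enrich_customers.ipynb"]

for nb in notebooks:
    # Upload each notebook to the notebooks/ folder of the assets bucket.
    key = f"notebooks/{nb}"
    s3.upload_file(nb, assets_bucket, key)

    # Create one notebook-mode job per file; JobMode is the new property.
    glue.create_job(
        Name=nb.removesuffix(".ipynb"),      # illustrative job name
        Role="<your-glue-service-role>",     # replace with your IAM role
        Command={
            "Name": "glueetl",
            "PythonVersion": "3",
            # Assumption: the notebook object serves as the script location.
            "ScriptLocation": f"s3://{assets_bucket}/{key}",
        },
        GlueVersion="4.0",
        JobMode="NOTEBOOK",
    )
```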
How the CreateJob API works with the new JobMode property
You can use the CreateJob API to create an AWS Glue script, visual, or notebook job. The following is an example of how it works for a visual job using the AWS SDK for Python (Boto3); replace <your-bucket-name> with your S3 bucket.
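The node bodies below are a simplified sketch of the CodeGenConfigurationNodes structure rather than a complete definition; the job name, role, paths, and dropped field names are placeholders:

```python
import json
import boto3

# Simplified visual DAG: node-1 reads CSV product review data from S3,
# node-2 drops fields that downstream systems don't need, and node-3
# writes the result to your S3 bucket. The node bodies are a minimal
# sketch, not a full CodeGenConfigurationNodes definition.
CODE_GEN_JSON_STR = """
{
  "node-1": {
    "S3CsvSource": {
      "Name": "S3 source",
      "Paths": ["s3://<public-reviews-bucket>/product_category=book/"],
      "Separator": "comma",
      "QuoteChar": "quote"
    }
  },
  "node-2": {
    "DropFields": {
      "Name": "Drop fields",
      "Inputs": ["node-1"],
      "Paths": [["marketplace"], ["vine"]]
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Name": "S3 target",
      "Inputs": ["node-2"],
      "Path": "s3://<your-bucket-name>/output/",
      "Format": "parquet"
    }
  }
}
"""

# Instantiate the AWS Glue client, load the JSON DAG, and create the job
# with JobMode set to VISUAL.
glue = boto3.client("glue")
response = glue.create_job(
    Name="product-reviews-visual-job",   # illustrative job name
    Role="<your-glue-service-role>",     # replace with your IAM role
    Command={
        "Name": "glueetl",
        "PythonVersion": "3",
        "ScriptLocation": "s3://<your-bucket-name>/scripts/",
    },
    GlueVersion="4.0",
    JobMode="VISUAL",
    CodeGenConfigurationNodes=json.loads(CODE_GEN_JSON_STR),
)
print(f"Created job: {response['Name']}")
```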
CODE_GEN_JSON_STR represents the visual nodes for the AWS Glue job. There are three nodes: node-1 uses an S3 source, node-2 performs the transformation, and node-3 uses an S3 target. The script instantiates the AWS Glue Boto3 client, loads the JSON, and calls the create_job API with JobMode set to VISUAL.
After you run the Python script, a new job is created. The following screenshot shows how the created job appears in the AWS Glue Studio visual editor.
There are three nodes in the visual directed acyclic graph (DAG): node-1 sources product review data for the product_category book from the public S3 bucket, node-2 drops some of the fields that aren’t needed by downstream systems, and node-3 persists the transformed data in your S3 bucket.
How CloudFormation works with the new JobMode property
You can use AWS CloudFormation to create different types of AWS Glue jobs by specifying the JobMode parameter on the AWS::Glue::Job resource. The supported job modes are SCRIPT, VISUAL, and NOTEBOOK.
In this example, you create an AWS Glue notebook job using AWS CloudFormation, which requires setting the JobMode parameter to NOTEBOOK.
1. Create a Jupyter notebook file containing your logic and code, and save the notebook file with a descriptive name, such as my-glue-notebook.ipynb. Alternatively, you can download the notebook file and rename it to my-glue-notebook.ipynb.
2. Upload the notebook file to the notebooks/ folder within the aws-glue-assets-<account-id>-<region> S3 bucket.
3. Create a new CloudFormation template to create a new AWS Glue job, specifying the NotebookJobName parameter as the same name as the notebook file. A sample snippet of the CloudFormation template follows this list.
4. Deploy the CloudFormation template. For NotebookJobName, enter the same name as the notebook file.
5. Verify that the AWS Glue job you created is listed and that it has the name you specified in the CloudFormation template.
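Here is a minimal sketch of such a template. The GlueServiceRole parameter and the assumption that the job's script location is the uploaded notebook object are illustrative, not prescriptive:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Create an AWS Glue notebook job from an uploaded .ipynb file
Parameters:
  NotebookJobName:
    Type: String
    Description: Same name as the notebook file uploaded to the notebooks/ folder
  GlueServiceRole:
    Type: String
    Description: ARN of the IAM role the AWS Glue job assumes (illustrative parameter)
Resources:
  GlueNotebookJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Ref NotebookJobName
      Role: !Ref GlueServiceRole
      JobMode: NOTEBOOK   # the new property: create this job in notebook mode
      GlueVersion: '4.0'
      Command:
        Name: glueetl
        PythonVersion: '3'
        # Assumption: the notebook object uploaded in step 2 serves as the
        # job's script location.
        ScriptLocation: !Sub 's3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/notebooks/${NotebookJobName}'
```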
The AWS Glue notebook shows the notebook job containing the existing cells that you had in the .ipynb file. You can review the job details to confirm it’s configured correctly.
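You can also confirm the mode programmatically. This minimal Boto3 sketch assumes the job name is the value you passed as NotebookJobName:

```python
import boto3

glue = boto3.client("glue")

# Fetch the job that CloudFormation created and confirm its mode.
job = glue.get_job(JobName="my-glue-notebook.ipynb")["Job"]
print(job["Name"], job.get("JobMode"))  # expected JobMode: NOTEBOOK
```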
Console experience
On the AWS Glue console, in the navigation pane, choose ETL Jobs to see all your ETL jobs listed. The list includes the columns Job name, Type, Created by, Last modified, and AWS Glue version, and you can sort and filter by these columns. The following screenshot shows how it looks.
We also enhanced the console experience with the introduction of JobMode. The Created by column on the console shows the JobMode of each job, and you can filter jobs created in VISUAL, NOTEBOOK, or SCRIPT mode, as shown in the following screenshot.
This new console experience helps you search and discover your jobs based on JobMode.
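You can do the same programmatically: each job returned by the GetJobs API carries its JobMode, so a short Boto3 script can group your jobs by mode. A minimal sketch (treating jobs that predate the property as SCRIPT is an assumption):

```python
from collections import defaultdict

import boto3

glue = boto3.client("glue")

# Enumerate all jobs and group them by JobMode (SCRIPT, VISUAL, or NOTEBOOK).
jobs_by_mode = defaultdict(list)
next_token = None
while True:
    kwargs = {"NextToken": next_token} if next_token else {}
    page = glue.get_jobs(**kwargs)
    for job in page["Jobs"]:
        # Older jobs may omit JobMode; defaulting to SCRIPT is an assumption.
        jobs_by_mode[job.get("JobMode", "SCRIPT")].append(job["Name"])
    next_token = page.get("NextToken")
    if not next_token:
        break

for mode, names in jobs_by_mode.items():
    print(mode, len(names), names[:5])
```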
Conclusion
This post demonstrated how the AWS Glue Jobs API works with the newly introduced JobMode property. With the new property, you can explicitly choose the mode of each job. The walkthrough detailed usage through the API, the AWS SDK for Python, and AWS CloudFormation. Additionally, the property makes it straightforward to search and discover your jobs quickly on the AWS Glue console.
About the Authors
Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.
Manoj Shunmugam is a DevOps Consultant in Professional Services at Amazon Web Services. He works with customers to build infrastructure using cloud-centered and container-based platforms in the AWS Cloud.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.