In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg tables using the native support of those data lake formats. This native support simplifies reading and writing your data for these data lake frameworks so you can more easily build and maintain your data lakes in a transactionally consistent manner. This feature removes the need to install a separate connector and reduces the configuration steps required to use these frameworks in AWS Glue for Apache Spark jobs.
These data lake frameworks help you store data more efficiently and enable applications to access your data faster. Unlike simpler data file formats such as Apache Parquet, CSV, and JSON, which can store big data, data lake frameworks organize distributed big data files into tabular structures that enable basic constructs of databases on data lakes.
Expanding on the functionality we announced at AWS re:Invent 2022, AWS Glue now natively supports Hudi, Delta Lake and Iceberg through the AWS Glue Studio visual editor. If you prefer authoring AWS Glue for Apache Spark jobs using a visual tool, you can now choose any of these three data lake frameworks as a source or target through a graphical user interface (GUI) without any custom code.
Even without prior experience using Hudi, Delta Lake or Iceberg, you can easily achieve typical use cases. In this post, we demonstrate how to ingest data stored in Hudi using the AWS Glue Studio visual editor.
To demonstrate the visual editor experience, this post introduces the Global Historical Climatology Network Daily (GHCN-D) dataset. The data is publicly accessible through an Amazon Simple Storage Service (Amazon S3) bucket. For more information, see the Registry of Open Data on AWS. You can also learn more in Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.
The Amazon S3 location
s3://noaa-ghcn-pds/csv/by_year/ has all the observations from 1763 to the present organized in CSV files, one file for each year. The following block shows an example of what the records look like:
The records have fields including ID, DATE, ELEMENT, and more. Each combination of ID, DATE, and ELEMENT represents a unique record in this dataset. For example, for a given ID and ELEMENT, the record with DATE 20220101 is unique.
In this tutorial, we assume that the files are updated with new records every day, and we want to store only the latest record per primary key (ID and ELEMENT) to make the latest snapshot data queryable. One typical approach is to do an INSERT for all the historical data and calculate the latest records in queries; however, this can introduce additional overhead in all the queries. When you want to analyze only the latest records, it’s better to do an UPSERT (update and insert) based on the primary key and DATE field rather than just an INSERT, in order to avoid duplicates and maintain a single updated row of data.
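To make the trade-off concrete, here is a minimal pure-Python sketch of the two approaches. It is not Hudi code. The `VALUE` field and all sample values are made up for illustration, and the primary key is assumed to be the pair (ID, ELEMENT).

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]
Key = Tuple[str, str]  # assumed primary key: (ID, ELEMENT)


def latest_at_query_time(rows: Iterable[Record]) -> Dict[Key, Record]:
    """INSERT-only approach: every query scans all history to find the latest rows."""
    latest: Dict[Key, Record] = {}
    for row in rows:
        key = (row["ID"], row["ELEMENT"])
        # DATE is a yyyymmdd string, so string comparison orders it correctly.
        if key not in latest or row["DATE"] > latest[key]["DATE"]:
            latest[key] = row
    return latest


def upsert(snapshot: Dict[Key, Record], batch: Iterable[Record]) -> None:
    """UPSERT approach: keep one row per key, replaced when a newer DATE arrives."""
    for row in batch:
        key = (row["ID"], row["ELEMENT"])
        current = snapshot.get(key)
        if current is None or row["DATE"] >= current["DATE"]:
            snapshot[key] = row


# Illustrative daily batches (values are invented, not real observations).
day1 = [
    {"ID": "AE000041196", "DATE": "20220101", "ELEMENT": "TAVG", "VALUE": "204"},
    {"ID": "AE000041196", "DATE": "20220101", "ELEMENT": "TMAX", "VALUE": "250"},
]
day2 = [
    {"ID": "AE000041196", "DATE": "20220102", "ELEMENT": "TAVG", "VALUE": "211"},
]

append_only: List[Record] = day1 + day2  # grows with every daily load
snapshot: Dict[Key, Record] = {}
upsert(snapshot, day1)
upsert(snapshot, day2)
```

Both approaches produce the same latest snapshot. The difference is that the append-only table keeps growing and must be re-scanned by every query, while the upserted table holds one row per key.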
To continue this tutorial, you need to create the following AWS resources in advance:
Process a Hudi dataset on the AWS Glue Studio visual editor
Let’s author an AWS Glue job to read daily records in 2022, and write the latest snapshot into the Hudi table on your S3 bucket using UPSERT. Complete the following steps:
- Open AWS Glue Studio.
- Choose Jobs.
- Choose Visual with a source and target.
- For Source and Target, choose Amazon S3, then choose Create.
A new visual job configuration appears. The next step is to configure the data source to read an example dataset:
- Under Visual, choose Data source – S3 bucket.
- Under Node properties, for S3 source type, select S3 location.
- For S3 URL, enter s3://noaa-ghcn-pds/csv/by_year/2022.csv.
The data source is configured.
The next step is to configure the data target to ingest data in Apache Hudi on your S3 bucket:
- Choose Data target – S3 bucket.
- Under Data target properties – S3, for Format, choose Apache Hudi.
- For Hudi Table Name, enter
- For Hudi Storage Type, choose Copy on write.
- For Hudi Write Operation, choose Upsert.
- For Hudi Record Key Fields, choose ID.
- For Hudi Precombine Key Field, choose DATE.
- For Compression Type, choose GZIP.
- For S3 Target location, enter s3://<your S3 bucket name>/<your S3 bucket prefix>/. (Provide your S3 bucket name and prefix.)
To make it easy to discover the sample data, and also make it queryable from Athena, configure the job to create a table definition on the AWS Glue Data Catalog:
- For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
- For Database, choose hudi_native.
- For Table name, enter ghcn.
- For Partition keys – optional, choose ELEMENT.
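If you later move from the visual editor to a script, the target and Data Catalog settings above correspond roughly to Hudi's Spark datasource write options. The following is a sketch under that assumption, not the script AWS Glue Studio generates. The helper `build_hudi_options` and the table name `my_hudi_table` are hypothetical. The database and catalog table names come from the Athena queries in this post.

```python
from typing import Dict


def build_hudi_options(
    table_name: str,
    database: str,
    catalog_table: str,
    record_key: str = "ID",
    precombine_key: str = "DATE",
    partition_key: str = "ELEMENT",
) -> Dict[str, str]:
    """Assemble Hudi write options that approximate the visual editor settings."""
    return {
        "hoodie.table.name": table_name,
        # Hudi Storage Type: Copy on write
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        # Hudi Write Operation: Upsert
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.partitionpath.field": partition_key,
        # Compression Type: GZIP
        "hoodie.parquet.compression.codec": "gzip",
        # Sync the table definition to the catalog (assumed equivalent of the
        # "Create a table in the Data Catalog" option).
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": catalog_table,
        "hoodie.datasource.hive_sync.partition_fields": partition_key,
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
    }


# "my_hudi_table" is a placeholder; use the Hudi table name you entered in the job.
options = build_hudi_options("my_hudi_table", "hudi_native", "ghcn")
```

In a Spark job, a dictionary like this is typically passed as `df.write.format("hudi").options(**options).mode("append").save(target_path)`, where `target_path` is your S3 target location.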
Now your data integration job is authored in the visual editor completely. Let’s add one remaining setting about the IAM role, then run the job:
- Under Job details, for IAM Role, choose your IAM role.
- Choose Save, then choose Run.
- Navigate to the Runs tab to track the job progress and wait for it to complete.
Query the table with Athena
Now that the job has successfully created the Hudi table, you can query the table through different engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.
To query through Athena, complete the following steps:
- On the Athena console, open the query editor.
- In the query editor, enter the following SQL and choose Run:
SELECT * FROM "hudi_native"."ghcn" limit 10;
The following screenshot shows the query result.
Let’s dive deep into the table to understand how the data is ingested and focus on the records with ID='AE000041196'.
- Run the following query to focus on the example records with ID='AE000041196':
SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';
The following screenshot shows the query result.
The original source file 2022.csv has historical records for this ID through 20221231; however, the query result shows only four records, one record per ELEMENT at the latest snapshot of the day 20221231. This is because we used the UPSERT write option when writing data, and configured the ID field as a Hudi record key field, the DATE field as a Hudi precombine field, and the ELEMENT field as the partition key field. When two records have the same key value, Hudi picks the one with the largest value for the precombine field. When the job ingested data, it compared all the values in the DATE field for each pair of ID and ELEMENT, and then picked the record with the largest value in the DATE field.
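The selection rule described above can be modeled in a few lines of plain Python. This is only a sketch of the rule, not of Hudi internals such as indexes, file groups, or payload classes. It assumes record keys are matched within each partition. The ELEMENT codes and dates are illustrative, and ties go to the incoming row as an arbitrary choice.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable

Row = Dict[str, str]
# partition value (ELEMENT) -> record key (ID) -> stored row
Table = DefaultDict[str, Dict[str, Row]]


def upsert(table: Table, batch: Iterable[Row]) -> None:
    """Keep, per partition and record key, the row with the largest DATE."""
    for row in batch:
        partition = table[row["ELEMENT"]]  # partition key field
        current = partition.get(row["ID"])  # record key field
        # DATE (yyyymmdd string) plays the role of the precombine field.
        if current is None or row["DATE"] >= current["DATE"]:
            partition[row["ID"]] = row


def dates_by_element(table: Table, record_id: str) -> Dict[str, str]:
    """Return the stored DATE for one ID in every partition that contains it."""
    return {
        element: rows[record_id]["DATE"]
        for element, rows in table.items()
        if record_id in rows
    }


STATION = "AE000041196"
ELEMENTS = ["PRCP", "TAVG", "TMAX", "TMIN"]  # illustrative element codes

# Two days of history per element, then a later batch with one newer day.
batch_2022 = [
    {"ID": STATION, "DATE": date, "ELEMENT": element}
    for element in ELEMENTS
    for date in ("20221230", "20221231")
]
batch_2023 = [{"ID": STATION, "DATE": "20230101", "ELEMENT": e} for e in ELEMENTS]

table: Table = defaultdict(dict)
upsert(table, batch_2022)
after_2022 = dates_by_element(table, STATION)

upsert(table, batch_2023)
after_2023 = dates_by_element(table, STATION)
```

After the first batch, the table holds four rows for the ID, one per ELEMENT, each with the largest DATE seen. Applying a later batch replaces those four rows in place rather than adding new ones.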
According to the preceding result, we were able to ingest the latest snapshot from all the 2022 data. Now let’s do an UPSERT of the new 2023 data to overwrite the records on the target Hudi table.
- Go back to the AWS Glue Studio console, modify the source S3 location to
s3://noaa-ghcn-pds/csv/by_year/2023.csv, then save and run the job.
- Run the same Athena query from the Athena console.
Now you see that the four records have been updated with the new records in 2023.
If you have further future records, this approach works well to upsert new records based on the Hudi record key and Hudi precombine key.
Now to the final step, cleaning up the resources:
- Delete the AWS Glue database hudi_native.
- Delete the AWS Glue table ghcn.
- Delete the S3 objects under the S3 target location you specified for the job.
This post demonstrated how to process Hudi datasets using the AWS Glue Studio visual editor. The AWS Glue Studio visual editor enables you to author jobs while taking advantage of data lake formats and without needing expertise in them. If you have comments or feedback, please feel free to leave them in the comments.
About the authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
Scott Long is a Front End Engineer on the AWS Glue team. He is responsible for implementing new features in AWS Glue Studio. In his spare time, he enjoys socializing with friends and participating in various outdoor activities.
Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.