
How Infomedia built a serverless data pipeline with change data capture using AWS Glue and Apache Hudi

March 15, 2023


This is a guest post co-written with Gowtham Dandu from Infomedia.

Infomedia Ltd (ASX:IFM) is a leading global provider of data-as-a-service (DaaS) and software-as-a-service (SaaS) solutions that empower the data-driven automotive ecosystem. Infomedia’s solutions help OEMs, NSCs, dealerships, and third-party partners manage the vehicle and customer lifecycle. They are used by over 250,000 industry professionals across 50 OEM brands in 186 countries to create a convenient customer journey, drive dealer efficiencies, and grow sales.

In this post, we share how Infomedia built a serverless data pipeline with change data capture (CDC) using AWS Glue and Apache Hudi.

Infomedia was looking to build a cloud-based data platform to take advantage of highly scalable data storage with flexible and cloud-native processing tools to ingest, transform, and deliver datasets to their SaaS applications. The team wanted to set up a serverless architecture with scale-out capabilities that would allow them to optimize time, cost, and performance of the data pipelines and eliminate most of the infrastructure management.

To serve data to their end-users, the team wanted to develop an API interface to retrieve various product attributes on demand. Performance and scalability of both the data pipeline and the API endpoint were key success criteria. The data pipeline needed sufficient performance to allow fast turnaround whenever data issues had to be corrected, and API endpoint performance directly affected end-user experience and customer satisfaction. When designing the data processing pipeline for the attribute API, the Infomedia team wanted a flexible, open-source solution for processing data workloads with minimal operational overhead.

They saw an opportunity in AWS Glue, which runs Apache Spark, a popular open-source big data processing framework, in a serverless environment for end-to-end pipeline development and deployment.

Solution overview

The solution involved ingesting data from various third-party sources in different formats, processing to create a semantic layer, and then exposing the processed dataset as a REST API to end-users. The API retrieves data at runtime from an Amazon Aurora PostgreSQL-Compatible Edition database for end-user consumption. To populate the database, the Infomedia team developed a data pipeline using Amazon Simple Storage Service (Amazon S3) for data storage, AWS Glue for data transformations, and Apache Hudi for CDC and record-level updates. They wanted to develop a simple incremental data processing pipeline without having to update the entire database each time the pipeline ran. The Apache Hudi framework allowed the Infomedia team to maintain a golden reference dataset and capture changes so that the downstream database could be incrementally updated in a short timeframe.

To implement this modern data processing solution, Infomedia’s team chose a layered architecture with the following steps:

  1. The raw data originates from various third-party sources as a collection of flat files with a fixed-width column structure. This raw input is stored in Amazon S3 in JSON format (the bronze dataset layer).
  2. The raw data is converted to the optimized Parquet format using AWS Glue. The Parquet data is stored in a separate Amazon S3 location and serves as the staging area during the CDC process (the silver dataset layer); a minimal sketch of this conversion appears below. The Parquet format improves query performance and reduces cost for downstream processing.
  3. AWS Glue reads the Parquet files from the staging area and updates Apache Hudi tables stored in Amazon S3 (the golden dataset layer) as part of incremental data processing. This step creates mutable datasets on Amazon S3 that store both the versioned history and the latest set of records.
  4. Finally, AWS Glue populates Amazon Aurora PostgreSQL-Compatible Edition with the latest version of the records. This dataset serves the API endpoint. The API itself is a Spring Java application deployed as a Docker container in an Amazon Elastic Container Service (Amazon ECS) environment on AWS Fargate.

The following diagram illustrates this architecture.
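To make step 2 concrete, here is a minimal sketch of an AWS Glue PySpark job that reads the raw JSON from the bronze layer and rewrites it as Parquet in the silver staging area. The bucket names and prefixes are hypothetical placeholders, not Infomedia's actual configuration.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical bucket names and prefixes, for illustration only.
BRONZE_PATH = "s3://example-bronze-bucket/raw/"
SILVER_PATH = "s3://example-silver-bucket/staging/"

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON files (bronze layer) into a DynamicFrame.
raw_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [BRONZE_PATH]},
    format="json",
)

# Write the same records back out as columnar Parquet (silver layer),
# which is cheaper and faster to scan in downstream jobs.
glue_context.write_dynamic_frame.from_options(
    frame=raw_dyf,
    connection_type="s3",
    connection_options={"path": SILVER_PATH},
    format="parquet",
)

job.commit()
```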

AWS Glue and Apache Hudi overview

AWS Glue is a serverless data integration service that makes it easy to prepare and process data at scale from a wide variety of data sources. With AWS Glue, you can ingest data from multiple data sources, extract and infer schema, populate metadata in a centralized data catalog, and prepare and transform data for analytics and machine learning. AWS Glue has a pay-as-you-go model with no upfront costs, and you only pay for resources that you consume.

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. It allows you to comply with data privacy laws, manage CDC operations, reinstate late-arriving data, and roll back to a particular point in time. You can use AWS Glue to build a serverless Apache Spark-based data pipeline and take advantage of the AWS Glue native connector for Apache Hudi at no cost to manage CDC operations with record-level insert, updates, and deletes.
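As an illustration of these record-level capabilities, the following is a minimal sketch of a Hudi upsert from a Glue Spark job, assuming the job runs with Hudi support enabled (for example, via the --datalake-formats job parameter on AWS Glue 4.0). The table name, key fields, and S3 paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the silver-layer staging data (hypothetical path).
staging_df = spark.read.parquet("s3://example-silver-bucket/staging/")

# Hudi write options: the record key identifies a row, the precombine field
# picks the newest version when duplicates arrive, and "upsert" merges
# changes into the existing table instead of rewriting it.
hudi_options = {
    "hoodie.table.name": "vehicle_attributes",                 # hypothetical
    "hoodie.datasource.write.recordkey.field": "vehicle_id",   # hypothetical
    "hoodie.datasource.write.precombine.field": "updated_at",  # hypothetical
    "hoodie.datasource.write.operation": "upsert",
}

(
    staging_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-gold-bucket/hudi/vehicle_attributes/")
)
```

An upsert like this is what lets the golden layer stay current without a full rewrite: only the changed records are merged into the existing table.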

Solution benefits

Since the start of Infomedia’s journey with AWS Glue, the team has seen several benefits over their previous self-managed extract, transform, and load (ETL) tooling. With AWS Glue’s horizontal scaling, they were able to seamlessly scale the compute capacity of their data pipeline workloads by a factor of five, increasing both the volume of records and the number of datasets they could process for downstream consumption. They were also able to take advantage of AWS Glue’s built-in optimizations, such as pre-filtering with pushdown predicates, which saved the team valuable engineering time that would otherwise go into tuning data processing jobs.
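As a sketch of that pre-filtering, Glue's push_down_predicate argument is evaluated against the table's partition columns in the AWS Glue Data Catalog before any data is read, so non-matching Amazon S3 partitions are never listed or loaded. The database, table, and partition column names below are hypothetical.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions whose ingest_date satisfies the predicate are read;
# all other partitions are skipped before any data leaves Amazon S3.
filtered_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="automotive_db",                # hypothetical database
    table_name="vehicle_attributes_silver",  # hypothetical table
    push_down_predicate="ingest_date >= '2023-03-01'",
)
```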

In addition, Apache Spark-based AWS Glue enabled developers to author jobs using concise Spark SQL and Dataset APIs. This allowed rapid upskilling of developers who were already familiar with database programming. Because they work with higher-level constructs across entire datasets, they spend less time on low-level implementation details.
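To give a flavor of that higher-level style, the sketch below registers the staging data as a temporary SQL view and expresses a transformation declaratively over the whole dataset; the path, view, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the silver-layer staging data (hypothetical path) as a SQL view.
spark.read.parquet("s3://example-silver-bucket/staging/") \
    .createOrReplaceTempView("staging")

# A set-based, database-style transformation: no row-level loops,
# no low-level plumbing.
latest_attributes = spark.sql("""
    SELECT vehicle_id, attribute_name, attribute_value
    FROM staging
    WHERE attribute_value IS NOT NULL
""")
```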

The AWS Glue platform has also been cost-effective compared with running self-managed Apache Spark infrastructure: an initial analysis showed estimated savings of 70% over running a dedicated Spark cluster on Amazon EC2 for the same workload. Furthermore, the AWS Glue Studio job monitoring dashboard gives the Infomedia team detailed job-level visibility, making it easy to summarize job runs and understand data processing costs.

Conclusion and next steps

Infomedia will continue to modernize their complex data pipelines using the AWS Glue platform and other AWS analytics services. Through integration with services such as AWS Lake Formation and the AWS Glue Data Catalog, the Infomedia team plans to maintain primary reference datasets and democratize access to high-value datasets, allowing for further innovation.

If you would like to learn more, please visit AWS Glue and AWS Lake Formation to get started on your data integration journey.


About the Authors

Gowtham Dandu is an Engineering Lead at Infomedia Ltd with a passion for building efficient and effective solutions on the cloud, especially involving data, APIs, and modern SaaS applications. He specializes in building microservices and data platforms that are cost-effective and highly scalable.

Praveen Kumar is a Specialist Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interest are serverless technology, streaming applications, and modern cloud data warehouses.


