Amazon EMR is a big data service from AWS for running Apache Spark and other open-source applications to build scalable data pipelines cost-effectively. Monitoring the logs generated by jobs deployed on EMR clusters is essential for detecting critical issues in real time and identifying root causes quickly.
Pushing those logs into Amazon CloudWatch lets you centralize them and derive actionable intelligence to address operational issues without provisioning servers or managing software. You can immediately begin writing queries with aggregations, filters, and regular expressions. In addition, you can visualize time series data, drill down into individual log events, and export query results to CloudWatch dashboards.
To ingest logs that are persisted on the Amazon Elastic Compute Cloud (Amazon EC2) instances of an EMR cluster into CloudWatch, you can use the CloudWatch agent. This provides a simple way to push logs from an EC2 instance to CloudWatch.
The CloudWatch agent is a software package that autonomously and continuously runs on your servers. You can install and configure the CloudWatch agent to collect system and application logs from EC2 instances, on-premises hosts, and containerized applications. CloudWatch processes and stores the logs collected by the CloudWatch agent, which further helps with the performance and health monitoring of your infrastructure and applications.
In this post, we create an EMR cluster and centralize the EMR step logs of the jobs in CloudWatch. This will make it easier for you to manage your EMR cluster, troubleshoot issues, and monitor performance. This solution is particularly helpful if you want to use CloudWatch to collect and visualize real-time logs, metrics, and event data, streamlining your infrastructure and application maintenance.
Overview of solution
The solution presented in this post assumes that the EMR step concurrency level is set to 1, which means that only one step runs at a time on the cluster. If the step concurrency level is set to a value greater than 1, the solution may not work as expected. We highly recommend verifying your EMR step concurrency configuration before implementing this solution.
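As a minimal sketch of that verification, the following Python function inspects a response shaped like the output of the EMR DescribeCluster API, which reports a `StepConcurrencyLevel` field. The sample response and the cluster ID `j-EXAMPLE` are placeholders, and treating a missing field as 1 (the EMR default) is an assumption of this sketch.

```python
def step_concurrency_is_one(describe_cluster_response: dict) -> bool:
    """Return True when the cluster is configured to run one step at a time."""
    cluster = describe_cluster_response.get("Cluster", {})
    # Assumption: a missing field is treated as 1, the EMR default.
    return cluster.get("StepConcurrencyLevel", 1) == 1


# Placeholder response; in practice this would come from a DescribeCluster call.
sample_response = {"Cluster": {"Id": "j-EXAMPLE", "StepConcurrencyLevel": 1}}
print(step_concurrency_is_one(sample_response))
```

A cluster that reports a higher value falls outside the configuration this post covers.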
The following diagram illustrates the solution architecture.
The workflow includes the following steps:
- Users start an Apache Spark EMR job, creating a step on the EMR cluster. Using Apache Spark, the workload is distributed across the different nodes of the EMR cluster.
- On each node (EC2 instance) of the cluster, a CloudWatch agent watches the log directories, capturing new entries in the log files and pushing them to CloudWatch.
- Users can view the step logs by accessing the different log groups from the CloudWatch console. The step logs written by Amazon EMR are as follows:
  - controller – Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
  - stderr – The standard error channel of Spark while it processes the step.
  - stdout – The standard output channel of Spark while it processes the step.
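The agent side of this workflow can be sketched as a small Python helper that builds the `logs` section of a CloudWatch agent configuration file. The `collect_list` structure with `file_path`, `log_group_name`, and `log_stream_name` follows the agent's configuration format, and the stream names match the ones used later in this post. The step log directory and the log group name are assumptions made for illustration, not values taken from the CloudFormation template.

```python
import json

# The three step log files written by Amazon EMR.
STEP_LOG_FILES = ("stdout", "stderr", "controller")


def build_agent_config(log_group_name: str, steps_dir: str) -> dict:
    """Build the logs section of a CloudWatch agent configuration."""
    collect_list = [
        {
            # The wildcard stands for the per-step subdirectory.
            "file_path": f"{steps_dir}/*/{name}",
            "log_group_name": log_group_name,
            "log_stream_name": f"step-{name}",
        }
        for name in STEP_LOG_FILES
    ]
    return {"logs": {"logs_collected": {"files": {"collect_list": collect_list}}}}


# Hypothetical log group name and an assumed step log directory.
config = build_agent_config(
    log_group_name="/hypothetical/emr/i-0123456789abcdef0",
    steps_dir="/mnt/var/log/hadoop/steps",
)
print(json.dumps(config, indent=2))
```

Each entry maps one step log file to its own log stream inside a single log group.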
We provide an AWS CloudFormation template in this post as a general guide. The template demonstrates how to configure a CloudWatch agent on Amazon EMR to push Spark logs to CloudWatch. You can review and customize it as needed to include your Amazon EMR security configurations. As a best practice, we recommend including your Amazon EMR security configurations in the template to encrypt data in transit.
You should also be aware that some of the resources deployed by this stack incur costs when they remain in use.
In the next sections, we go through the following steps:
- Create and upload the bootstrap script to an Amazon Simple Storage Service (Amazon S3) bucket.
- Use the CloudFormation template to create the following resources: an IAM role, an IAM instance profile, a Systems Manager parameter, and an EMR cluster.
- Monitor the Spark logs on the CloudWatch console.
This post assumes that you have an AWS account and an S3 bucket in which to store the bootstrap script.
Create and upload the bootstrap script to an S3 bucket
For more information, see Uploading objects and Installing and running the CloudWatch agent on your servers.
To create and upload the bootstrap script, complete the following steps:
- Create a local file named bootstrap_cloudwatch_agent.sh with the following content:
- On the Amazon S3 console, choose your S3 bucket.
- On the Objects tab, choose Upload.
- Choose Add files, then choose the bootstrap script.
- Choose Upload, then choose the file name of the uploaded script.
- Choose Copy S3 URI. We use this value in a later step.
Provision resources with the CloudFormation template
Choose Launch Stack to launch a CloudFormation stack in your account and deploy the template:
This template creates an IAM role, IAM instance profile, Systems Manager parameter, and EMR cluster. The cluster starts the Spark PI estimation example application. You will be billed for the AWS resources used if you create a stack from this template.
The CloudFormation wizard will ask you to modify or provide these parameters:
- InstanceType – The type of instance for all instance groups. The default is m4.xlarge.
- InstanceCountCore – The number of instances in the core instance group. The default is 2.
- EMRReleaseLabel – The Amazon EMR release label you want to use. The default is emr-6.9.0.
- BootstrapScriptPath – The S3 path of your CloudWatch agent installation bootstrap script that you copied earlier.
- Subnet – The EC2 subnet where the cluster launches. You must provide this parameter.
- EC2KeyPairName – An optional EC2 key pair for connecting to cluster nodes, as an alternative to Session Manager.
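If you prefer to launch the stack programmatically rather than through the wizard, the parameters above can be assembled as follows. This is a sketch under stated assumptions: the helper name `build_stack_parameters`, the bucket name, and the subnet ID are hypothetical. The output uses the `ParameterKey`/`ParameterValue` shape that the CloudFormation CreateStack API accepts, and the defaults are the ones listed above.

```python
from typing import Dict, List, Optional

# Defaults as listed in the parameter descriptions above.
DEFAULTS = {
    "InstanceType": "m4.xlarge",
    "InstanceCountCore": "2",
    "EMRReleaseLabel": "emr-6.9.0",
}


def build_stack_parameters(
    bootstrap_script_path: str,
    subnet: str,
    ec2_key_pair_name: Optional[str] = None,
    **overrides: str,
) -> List[Dict[str, str]]:
    """Assemble stack parameters, enforcing the required values."""
    if not bootstrap_script_path.startswith("s3://"):
        raise ValueError("BootstrapScriptPath must be an S3 URI")
    if not subnet:
        raise ValueError("Subnet is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    values = {**DEFAULTS, **overrides}
    values["BootstrapScriptPath"] = bootstrap_script_path
    values["Subnet"] = subnet
    if ec2_key_pair_name:  # optional, so only sent when provided
        values["EC2KeyPairName"] = ec2_key_pair_name
    return [{"ParameterKey": k, "ParameterValue": str(v)} for k, v in values.items()]


# Hypothetical bucket name and subnet ID, for illustration only.
params = build_stack_parameters(
    "s3://hypothetical-bucket/bootstrap_cloudwatch_agent.sh",
    "subnet-0123456789abcdef0",
)
print(params)
```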
Monitor the log streams
After the CloudFormation stack deploys successfully, on the CloudWatch console, choose Log groups in the navigation pane. Then filter the log groups by the prefix
The ID in the log group corresponds to the EC2 instance ID of the EMR primary node. If you have multiple EMR clusters, you can use this ID to identify a particular EMR cluster, based on the primary node ID.
In the log group, you will find three log streams.
The log streams contain the following information:
- step-stdout – The standard output channel of Spark while it processes the step.
- step-stderr – The standard error channel of Spark while it processes the step.
- step-controller – Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
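Once the streams are populated, you can search them with CloudWatch Logs Insights. The helper below is a small sketch that composes a query string restricted to one of the streams listed above. The function name and the `ERROR` pattern are illustrative assumptions, and the query would be run against the log group of your cluster.

```python
def build_insights_query(stream_name: str, pattern: str, limit: int = 20) -> str:
    """Compose a Logs Insights query for one stream and one regex pattern."""
    return "\n".join(
        [
            "fields @timestamp, @message",
            f'| filter @logStream = "{stream_name}"',
            f"| filter @message like /{pattern}/",
            "| sort @timestamp desc",
            f"| limit {limit}",
        ]
    )


# Example: the most recent lines in the step-stderr stream that mention ERROR.
query = build_insights_query("step-stderr", "ERROR")
print(query)
```

You can paste the printed query into the Logs Insights query editor on the CloudWatch console.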
Clean up
To avoid future charges in your account, delete the resources you created in this walkthrough. The EMR cluster incurs charges as long as it is active, so terminate it when you're done.
- On the CloudFormation console, in the navigation pane, choose Stacks.
- Choose the stack you launched (EMR-CloudWatch-Demo), then choose Delete.
- Empty the S3 bucket you created.
- Delete the S3 bucket you created.
Conclusion
Now that you have completed the steps in this walkthrough, you have the CloudWatch agent running on your cluster hosts and configured to push EMR step logs to CloudWatch. With this setup, you can effectively monitor the health and performance of your Spark jobs running on Amazon EMR, detecting critical issues in real time and identifying root causes quickly.
You can package and deploy this solution through a CloudFormation template like this example template, which creates the IAM instance profile role, Systems Manager parameter, and EMR cluster.
To take this further, consider creating a metric filter on a log group and using the resulting metric in a CloudWatch alarm. You could combine that alarm with others into a composite alarm, or configure alarm actions such as sending Amazon Simple Notification Service (Amazon SNS) notifications to trigger event-driven processes such as AWS Lambda functions.
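As a starting point for such an alert, the following sketch builds the request body for a metric filter that counts log events containing a given term. The field names follow the CloudWatch Logs PutMetricFilter API. The filter name, metric name, namespace, and log group name are hypothetical, and the sketch only constructs the request without sending it.

```python
def build_metric_filter_request(log_group_name: str, term: str = "ERROR") -> dict:
    """Build a PutMetricFilter-style request that counts matching log events."""
    return {
        "logGroupName": log_group_name,
        "filterName": "hypothetical-step-errors",
        # A bare term matches log events that contain it.
        "filterPattern": term,
        "metricTransformations": [
            {
                "metricName": "StepErrorCount",         # hypothetical metric name
                "metricNamespace": "Hypothetical/EMR",  # hypothetical namespace
                "metricValue": "1",                     # add 1 per matching event
            }
        ],
    }


# Hypothetical log group name, for illustration only.
request = build_metric_filter_request("/hypothetical/emr/i-0123456789abcdef0")
print(request)
```

An alarm on the resulting metric could then notify an SNS topic when the count crosses a threshold you choose.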
About the Author
Ennio Pastore is a Senior Data Architect on the AWS Data Lab team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. Ennio has over 10 years of experience in data analytics. He helps companies define and implement data platforms across industries, such as telecommunications, banking, gaming, retail, and insurance.