Key Alerts for Serverless Applications

Enrico Portolan
6 min read · Jul 10, 2022


AWS Lambda is the core serverless compute service of AWS. It runs your code without servers to manage, offloading infrastructure management from you and making sure your code runs and scales smoothly. At the same time, you need an effective way to monitor your functions and understand when something is going wrong.

One of the main challenges for Serverless developers is the Observability of their applications.

A famous quote from Werner Vogels, CTO of AWS, says:

“Everything fails, all the time” — Werner Vogels

To monitor Lambda, you need to understand how functions are invoked, how they are used, and how they interact with other AWS services. A burst of requests can throttle your function, and a long-running execution can cause your function to time out.

In this blog post, I will explain how you can monitor your Lambda function with the following key metrics:

  • Concurrent Execution
  • Errors
  • onFailure Destination
  • Dead Letter Queue (DLQ)
  • Iterator Age

Let’s use the following Serverless Application as an example for our observability exercise:

AWS Serverless Application Architecture

The users will interact through the API Gateway which will call a Lambda function (Producer). The Lambda function will queue the requests into an Amazon SQS Queue which will be consumed by a second Lambda function (Consumer). The Consumer function will write items in a DynamoDB table and push messages into an Amazon SNS Topic.

You can quickly notice that there are multiple moving parts, from the services that invoke Lambda to how Lambda processes requests. For example, the SQS trigger can fail to invoke the Consumer function, or the Consumer function can fail to write an item to DynamoDB. In the next sections, we will look at how to monitor these behaviors.

Concurrent Execution

First of all, let’s define what concurrency is in Lambda. In AWS Lambda, the unit of scale is a concurrent execution: the number of executions of your function code that are happening at any given time. By default, each AWS account has a limit of 1,000 concurrent executions per Region (this quota can be raised through a service quota increase).

There are two types of concurrency controls available:

  • Reserved concurrency — guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function.
  • Provisioned concurrency — initializes a requested number of execution environments so that they are prepared to respond immediately to your function’s invocations. Note that configuring provisioned concurrency incurs charges to your AWS account.

Reserving concurrency has the following effects.

  • Other functions can’t prevent your function from scaling — All of your account’s functions in the same Region without reserved concurrency share the pool of unreserved concurrency. Without reserved concurrency, other functions can use up all of the available concurrency. This prevents your function from scaling up when needed.
  • Your function can’t scale out of control — Reserved concurrency also limits your function from using concurrency from the unreserved pool, which caps its maximum concurrency. You can reserve concurrency to prevent your function from using all the available concurrency in the Region, or from overloading downstream resources.
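To make this concrete, reserved concurrency can be set with a single Lambda API call. The sketch below uses boto3; the function name `producer-fn` is a placeholder for the Producer function in the example architecture, not something defined in this post.

```python
def reserved_concurrency_request(function_name: str, limit: int) -> dict:
    """Build the arguments for Lambda's PutFunctionConcurrency API call."""
    return {
        "FunctionName": function_name,
        "ReservedConcurrentExecutions": limit,
    }


# To apply it for real (requires boto3 and AWS credentials with
# lambda:PutFunctionConcurrency permission):
#
# import boto3
# boto3.client("lambda").put_function_concurrency(
#     **reserved_concurrency_request("producer-fn", 25)
# )
```

Setting the value to 0 is also a useful emergency switch: it stops all invocations of the function without deleting anything.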

If you hit the concurrency limit, the Lambda function will throw a throttle error. Amazon CloudWatch tracks two useful metrics for this: Throttles and ConcurrentExecutions. You can create a CloudWatch alarm on either metric, adding a condition on the number of throttles or concurrent executions allowed. In this example, I created a Lambda function with a reserved concurrency of 25, and I want an alarm when the ConcurrentExecutions metric hits 80% of that capacity. The condition is the following:

Average ConcurrentExecutions >= 20 for 1 datapoints within 1 minute

Another alarm you could set is on the Throttles value, checking whether the function has been throttled N times within one minute.
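The alarm condition above can be scripted rather than clicked together in the console. This is a minimal sketch using boto3’s `put_metric_alarm`; the function name and SNS topic ARN are placeholders you would replace with your own.

```python
def concurrency_alarm_params(function_name: str, threshold: int,
                             sns_topic_arn: str) -> dict:
    """Alarm: Average ConcurrentExecutions >= threshold for 1 datapoint
    within 1 minute (80% of a reserved concurrency of 25 -> threshold 20)."""
    return {
        "AlarmName": f"{function_name}-concurrent-executions",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Average",
        "Period": 60,                 # one-minute evaluation window
        "EvaluationPeriods": 1,       # 1 datapoint is enough to alarm
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# To create the alarm (requires boto3 and AWS credentials):
#
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **concurrency_alarm_params(
#         "producer-fn", 20,
#         "arn:aws:sns:us-east-1:123456789012:ops-alerts")
# )
```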

Errors

There are two types of errors that can occur with Lambda: invocation errors and function errors. Invocation errors occur when a service fails to invoke your function, for example because of missing permissions or because the function has reached its concurrent execution limit. Function errors occur when there is an error in your function code or when your function times out. It’s important to understand that the Lambda Errors metric only counts function errors.

To calculate the error rate, divide the value of Errors by the value of Invocations. Note that the timestamp on an error metric reflects when the function was invoked, not when the error occurred.

Error Rate: (100 * Errors) / MAX(Invocations, Errors)

How should we use this metric to create an alarm? Base it on your service SLA: if, for example, your SLA is 99%, set the error alert threshold to 1%. The steps are the same as above — create a CloudWatch alarm that notifies your team when the error rate exceeds the threshold.
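CloudWatch supports the error-rate formula above directly through metric math, so the alarm can fire on the computed percentage rather than on raw counts. A sketch with boto3, again with placeholder names; the `MAX([...])` term mirrors the formula and avoids dividing by zero when there are no invocations.

```python
def error_rate_alarm_params(function_name: str, threshold_pct: float,
                            sns_topic_arn: str) -> dict:
    """Metric-math alarm on the Lambda error rate:
    100 * Errors / MAX([Invocations, Errors]) > threshold_pct."""
    dims = [{"Name": "FunctionName", "Value": function_name}]

    def stat(metric_id: str, metric_name: str) -> dict:
        # One raw-metric entry; ReturnData=False because only the
        # computed error_rate expression drives the alarm.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda",
                           "MetricName": metric_name,
                           "Dimensions": dims},
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "EvaluationPeriods": 1,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / MAX([invocations, errors])",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
    }


# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **error_rate_alarm_params(
#         "producer-fn", 1.0,
#         "arn:aws:sns:us-east-1:123456789012:ops-alerts")
# )
```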

Set onFailure Destination

When the Lambda function is triggered by asynchronous events, such as an S3 event or a queue message, the caller does not wait for a response from the function code. Instead, you can configure onSuccess and onFailure destinations to control what Lambda does with each result.

The following example shows a function that is processing asynchronous invocations. When the function returns a success response or exits without throwing an error, Lambda sends a record of the invocation to an EventBridge event bus. Lambda sends an invocation record to an Amazon SQS queue when an event fails all processing attempts.

The action point is to create an onFailure destination for your Lambda function.
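The destination setup described above (EventBridge on success, SQS on failure) maps to a single `PutFunctionEventInvokeConfig` call. A hedged sketch with boto3; the function name, event bus, and queue ARNs are illustrative placeholders.

```python
def destinations_config(on_success_arn: str, on_failure_arn: str) -> dict:
    """DestinationConfig payload for Lambda's PutFunctionEventInvokeConfig."""
    return {
        "OnSuccess": {"Destination": on_success_arn},
        "OnFailure": {"Destination": on_failure_arn},
    }


# import boto3
# boto3.client("lambda").put_function_event_invoke_config(
#     FunctionName="producer-fn",
#     MaximumRetryAttempts=2,  # Lambda retries async events up to twice
#     DestinationConfig=destinations_config(
#         "arn:aws:events:us-east-1:123456789012:event-bus/success-bus",
#         "arn:aws:sqs:us-east-1:123456789012:failed-invocations",
#     ),
# )
```

A nice property of the failure record is that it includes the request payload and the error, so the SQS consumer has everything it needs to retry or triage.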

Dead Letter Queue

As an alternative to the onFailure destination, you can configure your function with a dead-letter queue to save discarded events for further processing. A DLQ is very similar to an onFailure destination but is part of a function version’s configuration. Events in a DLQ can be reprocessed by setting the queue as an event source for another Lambda function.

It’s possible to configure an Amazon SQS queue or an Amazon SNS topic as a DLQ. For this point, you can set an alarm on the DeadLetterErrors metric, which counts the times Lambda fails to deliver an event to the DLQ; to monitor how many messages actually land in an SQS DLQ, alarm on the queue’s ApproximateNumberOfMessagesVisible metric instead.
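Both steps — attaching the DLQ and alarming on delivery failures — are small API calls. A sketch with boto3; `consumer-fn`, the queue ARN, and the SNS topic are placeholders from the example architecture.

```python
def dlq_config(target_arn: str) -> dict:
    """DeadLetterConfig payload for Lambda's UpdateFunctionConfiguration."""
    return {"TargetArn": target_arn}


def dead_letter_errors_alarm_params(function_name: str,
                                    sns_topic_arn: str) -> dict:
    """Alarm whenever Lambda fails to deliver even one event to the DLQ."""
    return {
        "AlarmName": f"{function_name}-dead-letter-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "DeadLetterErrors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# import boto3
# boto3.client("lambda").update_function_configuration(
#     FunctionName="consumer-fn",
#     DeadLetterConfig=dlq_config(
#         "arn:aws:sqs:us-east-1:123456789012:consumer-dlq"),
# )
# boto3.client("cloudwatch").put_metric_alarm(
#     **dead_letter_errors_alarm_params(
#         "consumer-fn", "arn:aws:sns:us-east-1:123456789012:ops-alerts")
# )
```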

Iterator Age

When you use Amazon Kinesis or DynamoDB Streams, you need to monitor the age of messages to spot performance problems with your function. If a message is too old, it means the function is not keeping up with the rate of incoming messages. What is the impact? You can experience data loss, because data in the stream is kept for only 24 hours by default.

How can we monitor this? AWS CloudWatch tracks the IteratorAge metric, and you can set an alarm on it.
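As a sketch, an IteratorAge alarm looks like the earlier ones; note that IteratorAge is reported in milliseconds, so the threshold here of one hour gives you time to react well before the 24-hour retention window runs out. Names and ARNs are placeholders.

```python
ONE_HOUR_MS = 60 * 60 * 1000  # IteratorAge is reported in milliseconds


def iterator_age_alarm_params(function_name: str, threshold_ms: int,
                              sns_topic_arn: str) -> dict:
    """Alarm when the oldest record read from the stream lags more than
    threshold_ms behind real time."""
    return {
        "AlarmName": f"{function_name}-iterator-age",
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",   # worst-case lag, not the average
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **iterator_age_alarm_params(
#         "consumer-fn", ONE_HOUR_MS,
#         "arn:aws:sns:us-east-1:123456789012:ops-alerts")
# )
```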

Conclusion

In this blog post, we have seen five different metrics that Serverless developers can use to monitor their applications.

Let me know what you think in the comments. Follow me on Twitter and Youtube for more!
