In our last post, we covered CloudWatch Metrics in great detail. We looked at how to use it to monitor the health of our Lambda functions, including how to set up service dashboards and alerts. In this post, we will focus on its sister service, CloudWatch Logs. We’ll see how it works and how to get the most out of it.
Whenever our function writes to stdout or stderr, the messages are collected asynchronously, without adding to our function’s execution time. They are then shipped to CloudWatch Logs and organized into log groups.
Every function has a matching log group, named after the function itself with the /aws/lambda/ prefix (i.e. /aws/lambda/<function-name>).
Inside each log group are log streams. Each stream maps to a concurrent execution of the function, with concurrent executions writing to their own streams in parallel.
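To make this concrete, here is a minimal sketch in Python (boto3) that reads the latest events from a function’s log group; the function name is a hypothetical placeholder:

```python
import boto3

logs = boto3.client("logs")

# every Lambda function logs to /aws/lambda/<function-name>
log_group = "/aws/lambda/my-function"  # hypothetical function name

# each log stream corresponds to one concurrent execution of the function
streams = logs.describe_log_streams(
    logGroupName=log_group,
    orderBy="LastEventTime",
    descending=True,
    limit=1,
)["logStreams"]

# fetch the latest events from the most recently active stream
events = logs.get_log_events(
    logGroupName=log_group,
    logStreamName=streams[0]["logStreamName"],
    limit=20,
)["events"]

for event in events:
    print(event["timestamp"], event["message"])
```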
It’s worth mentioning that we can also easily ship logs from ECS tasks and API Gateway to CloudWatch Logs. Some services, such as Kinesis Firehose, even have built-in support for sending their service logs to CloudWatch Logs.
In terms of pricing, CloudWatch Logs charges for both ingestion and storage.
At $0.50 per GB ingestion and $0.03 per GB per month for storage, CloudWatch Logs is significantly cheaper than most of its competitors. The system also autoscales based on traffic, is highly resilient, and can store data for practically as long as we want. So, what’s the catch?
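As a rough back-of-the-envelope calculation at those published rates (the 100 GB/month log volume is an arbitrary assumption):

```python
# rough monthly CloudWatch Logs bill at the published rates
INGESTION_USD_PER_GB = 0.50
STORAGE_USD_PER_GB_MONTH = 0.03

gb_ingested = 100  # assumption: 100 GB of logs per month, retained one month
monthly_cost = gb_ingested * (INGESTION_USD_PER_GB + STORAGE_USD_PER_GB_MONTH)
print(f"~${monthly_cost:.2f}/month")  # ~$53.00/month
```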
The big limitation of CloudWatch Logs, its Achilles’ heel, is its inability to query logs across multiple log groups, and this applies to CloudWatch Logs Insights too. That is why most people forward their logs from CloudWatch Logs to a managed ELK (Elasticsearch, Logstash, Kibana) stack. To do that, we need to use subscription filters.
A subscription filter lets us stream the log events to a destination within AWS. The supported destinations are:
- A Kinesis stream
- A Kinesis Firehose delivery stream
- A Lambda function
Despite the plural “Subscription filters” in the CloudWatch Logs console, there can currently be only one subscription filter per log group. So, choose wisely!
If we’re not running hundreds or even thousands of concurrent Lambda executions, it’s easiest to send our logs to a Lambda function.
However, as our scale increases, we need to concern ourselves with the number of concurrent Lambda executions. Since CloudWatch Logs is itself an async event source for Lambda, we don’t have much control over its concurrency. It’s possible for our log-shipping function to consume too much concurrency and cause business-critical functions to be throttled. We can put a concurrency limit on the log-shipping function, but that puts us at risk of losing logs. A much better solution (when running at scale) is to send the logs to Kinesis or Kinesis Firehose instead.
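As a sketch of what this looks like in practice, here is how we might subscribe a log group to a Kinesis stream with boto3; the log group name, stream ARN, and IAM role ARN are all hypothetical placeholders:

```python
import boto3

logs = boto3.client("logs")

# subscribe the function's log group to a Kinesis stream; the role must
# grant CloudWatch Logs permission to write to that stream
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-function",  # hypothetical
    filterName="ship-to-kinesis",
    filterPattern="",  # empty pattern forwards every log event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/logs-stream",  # hypothetical
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-kinesis",  # hypothetical
)
```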
Kinesis Firehose supports a number of predefined destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. We don’t need to write any custom code: just point the delivery stream at the destination and we’re done! We can also use Lambda to enrich and transform the data first. However, we’re limited to those few destinations.
Kinesis, on the other hand, offers more flexibility. We can process the data with multiple Lambda functions: one to ship logs to our ELK stack, another for some other purpose. And we can still forward the events to a Kinesis Firehose delivery stream, again without having to write any custom code.
Fortunately, whichever destination we choose, we have open-sourced tools that make it easy to apply the subscription to all our Lambda log groups.
One of the most underappreciated features of CloudWatch Logs is the ability to turn logs into metrics and alerts with metric filters.
To create a new metric filter, select the log group, and click “Create Metric Filter”.
Then we need to set a filter pattern to select the relevant log messages. For more details, see the official documentation on the filter and pattern syntax.
After targeting the right log messages, we need to turn them into metrics. On the following screen, we can configure that conversion.
If we use a Metric Value of 1, then the metric is essentially a count of the matched log messages. This is all we need for many use cases, such as counting the number of errors. But what if we want to collect custom latency metrics?
To capture latency values, we need a pattern that captures different parts of the log message. The following example captures the latency value and its unit in named variables.
We can then reference these named variables when we define the metric.
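Here is a minimal sketch of the same setup via boto3, assuming (hypothetically) that the function logs space-delimited lines such as `latency 142 ms`; the metric name and namespace are also placeholders:

```python
import boto3

logs = boto3.client("logs")

# the pattern below captures the latency value and unit in named fields
logs.put_metric_filter(
    logGroupName="/aws/lambda/my-function",  # hypothetical
    filterName="function-latency",
    filterPattern='[event="latency", latency, unit]',
    metricTransformations=[
        {
            "metricName": "my-function-latency",  # hypothetical
            "metricNamespace": "MyApp",           # hypothetical
            # reference the named field captured by the pattern;
            # use a literal "1" instead to simply count matched messages
            "metricValue": "$latency",
        }
    ],
)
```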
Once the metric filter is created, we can see the custom metric in the CloudWatch Metrics console.
Metric filters are a powerful tool, but they are not without limitations.
While you can parameterize the metric value using named fields, you can’t parameterize the metric name. This means you need to create a metric filter for every custom metric you intend to log from the function. As our application logic evolves and we need to record more custom metrics, this becomes easy to forget. Also, there is a hard limit of 100 metric filters per log group.
CloudWatch Logs Insights
CloudWatch Logs’ built-in query capability is severely limited, so there was quite a lot of fanfare when AWS announced CloudWatch Logs Insights at re:Invent 2018. It gives you the ability to write SQL-esque queries, to sort the results, and even to generate stats from matched log messages.
You can also visualize the search results in the console, and even add the results to a CloudWatch dashboard.
Recently, AWS has even integrated basic insights about our Lambda functions directly into the Lambda management console. Now, we can see summaries of the most recent invocations and the most expensive invocations.
If you’re logging structured data in JSON, then CloudWatch Logs Insights can even discover the fields in your JSON message. You can reference these fields in your query, just as you would do in an ELK stack.
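For example, here is a sketch of running an Insights query programmatically via boto3; the log group, and the `level` and `latencyMs` fields, are assumptions about how the function logs:

```python
import time
import boto3

logs = boto3.client("logs")

# a hypothetical Insights query: assuming the function logs JSON with
# "level" and "latencyMs" fields, find the slowest recent invocations
query = """
fields @timestamp, @message, latencyMs
| filter level = "info"
| sort latencyMs desc
| limit 20
"""

resp = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # hypothetical; one group at a time
    startTime=int(time.time()) - 3600,       # last hour
    endTime=int(time.time()),
    queryString=query,
)

# poll until the query finishes
while True:
    results = logs.get_query_results(queryId=resp["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({f["field"]: f["value"] for f in row})
```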
Sadly, you can only query one log group at a time, which, to me, means it’s not very useful for debugging live issues in a non-trivial system. In these systems, a user transaction or feature flows through many Lambda functions, which is why we need things like correlation IDs to help us find related logs. When you can only search the logs of one function at a time, there is nothing to correlate.
That said, it can still be very useful for understanding the behaviour of a function, once you have somehow pinpointed which function to investigate.
In this post, we took a deep dive into CloudWatch Logs and explored its many features:
- Subscription filters for streaming log events to a destination in real time
- Metric filters for creating CloudWatch Metrics from log events
- CloudWatch Logs Insights for analyzing log messages with ad-hoc queries
We also discussed the limitations of these features and of CloudWatch Logs in general. Chief among them: you can only analyze logs for one function at a time.
You should also consider adopting good logging practices, which are not specific to CloudWatch Logs. One of the easiest and most effective things you can do to make your logs more useful is to write structured logs.
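For example, a bare-bones structured logger can be as simple as printing one JSON document per line (the field names here are illustrative):

```python
import json
import time

def log(level, message, **fields):
    # one JSON document per line; CloudWatch Logs Insights can
    # discover these fields automatically
    print(json.dumps({
        "level": level,
        "message": message,
        "timestamp": int(time.time() * 1000),
        **fields,
    }))

log("info", "order processed", orderId="abc-123", latencyMs=142)
```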
Many people spend more on CloudWatch Logs in production than they do on Lambda invocations. To keep the cost of logging in check, you shouldn’t log at debug level in production. Instead, you should sample debug logs for a small percentage of transactions. To make these good practices easy to follow, I built and open-sourced the dazn-lambda-powertools while I was at DAZN. Its logger and middlewares support structured logging, correlation IDs, and sampled debug logging out of the box. In fact, you can even sample debug logs for an entire transaction that spans multiple functions.
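The sampling idea itself is straightforward. Here is a minimal sketch of it, not the powertools implementation, with an arbitrary 1% sample rate:

```python
import json
import random

SAMPLE_RATE = 0.01  # assumption: keep debug logs for ~1% of invocations

def handler(event, context):
    # decide once per invocation; to sample a whole transaction, this
    # decision would instead travel with the correlation ID so that
    # downstream functions honour it too
    debug_enabled = random.random() < SAMPLE_RATE

    if debug_enabled:
        print(json.dumps({"level": "debug", "message": "incoming event", "event": event}))

    # ... business logic ...
    return {"statusCode": 200}
```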