Container Services: Logging and Reporting
Logging and Exception Reporting are two obvious operational features every service must include. As we learned in the last post, Application Support in Docker-based Microservice Environments, building and deploying microservices makes implementing log infrastructure significantly more difficult.
The most difficult problem DevOps teams need to overcome is the transient nature of containers. Teams should not use aggregation* mechanisms that assume locality (e.g. log rotation on the file system, aggregators watching specific directories on the host machine). However, we often need to know where a container is running for debugging purposes. So while we should not assume we know where a service is deployed, we do need to propagate metadata about the container along with its log entries.
* In the context of logging, aggregation is the transformation (if necessary), movement, and storage of logs from nodes to one or more external repositories.
Generally, there are two strategies used in logging: Passive and Active. In Active Logging, the microservice participates in the logging infrastructure, making network connections to intermediate aggregators, sending data to third-party logging services (e.g. Loggly), or writing directly to a database or index. If the microservice does anything other than output logs to stdout or stderr, it's using Active Logging. Microservices that use Passive Logging, on the other hand, are unaware of the logging infrastructure and simply log messages to standard outputs.
Engineers should be aware of the implications of their chosen strategy. Active Logging is now considered an antipattern that should be avoided, while Passive Logging has flourished with a myriad of frameworks and technologies being built to support the pattern.
In this post, we will explore the implications of Active Logging and why it should be avoided. Then we will discuss Passive Logging and how to implement the practice in a Docker environment. Finally, we will conclude with a brief conversation about effective logging infrastructure, specifically, features and practices that should be present to support evolving microservice architectures.
Active Logging should be avoided
Active Logging is the practice of having services participate directly in the logging infrastructure. This may include writing logs to the filesystem, sending them over the network to an aggregator (e.g. Logstash, Fluentd), or pushing entries to a SaaS service (e.g. Loggly). While Active Logging provides developers a lot of choice in how they handle logging, I contend there are three reasons for avoiding the practice:
1. Active loggers can cause a service to fail
If you don't properly use or configure a logger, you could potentially crash the process. For instance, what happens when your aggregation endpoint goes down and the microservice can't send log entries to it? Does the microservice buffer logs until it runs out of memory? Or should the microservice purge old records? What if the logging client causes an uncaught runtime exception?
I admit to being a little hyperbolic here, but I hope you understand the point I'm making. Adding components that communicate with other endpoints increases the complexity of a service, thereby increasing the likelihood of failures.
2. Active loggers tend to function differently in every environment
If you are using a service like Loggly, I doubt your boss is going to let you stream logs from your local development machine to the SaaS product. You are also unlikely to run the full logging infrastructure on your MacBook that can barely run four Docker containers (this is why I switched to a Linux laptop).
Most companies rarely achieve feature parity between development, testing, and production, particularly when it comes to logging. Instead, applications are left to dynamically configure the logger based on their deployment environment. Personally, I don't like doing this; it's not uncommon to experience problems between environments with simple components, let alone something as complex as a multi-output logger.
3. Active loggers are sensitive to infrastructure changes
The final thought to consider is what happens when you want to change your log infrastructure. For instance, what if your CTO jumps on the Cloud Native bandwagon and forces all projects to standardize on Fluentd when your application was using Logstash? Or you decide to add file system logging to all services so you can have local access to log entries? To accommodate these changes, you are going to need to change all of your microservices and redeploy the system!
Passive Logging is the recommended practice
Passive Logging is the output of log data to standard interfaces, typically stdout and stderr. This strategy leaves the deployment infrastructure responsible for aggregating data to the file system, external aggregators, and data stores. The advantage of Passive Logging is that it is simple, portable, and environment agnostic. These features allow you to change the underlying logging platform without having to recompile/redeploy services.
More importantly, Docker is designed to facilitate Passive Logging. A native feature of the runtime is the Docker Logging Driver. This is a piece of configuration that can be specified when you run a container to swap out the default implementation (json-file) with another transport:
# Example from: https://docs.docker.com/config/containers/logging/gelf/
$ docker run \
--log-driver gelf --log-opt gelf-address=udp://1.2.3.4:12201 \
alpine echo hello world
Docker currently offers 11 different logging drivers (with none being an option):
(Figure: current drivers for Docker CE v17.12.)
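If you want to see which driver a daemon currently defaults to, or change the default for every container on a host (rather than per docker run invocation), the daemon configuration supports this as well. A quick sketch (the gelf address is just a placeholder):
$ docker info --format '{{.LoggingDriver}}'
json-file

# /etc/docker/daemon.json -- sets the default driver for all containers on this host
{
  "log-driver": "gelf",
  "log-opts": {
    "gelf-address": "udp://1.2.3.4:12201"
  }
}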
Implementing Passive Logging in an application does not mean you start using System.out.println or console.log instead of classic log utilities. Applications still need to output structured log events (typically in JSON) that can be parsed by downstream aggregators. Instead, I encourage you to simply configure the logger to write to the stdout stream.
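As a minimal sketch, here is what that might look like in a Node.js service using the pino library (any logger that emits JSON to stdout will do; the field names are just examples):
// logger.ts -- structured log events on stdout (pino writes JSON to stdout by default)
import pino from "pino";

const logger = pino({ level: "info" });

// Attach contextual fields to each entry; downstream aggregators parse the JSON.
logger.info({ orderId: "ord-42", route: "/orders" }, "order created");
// => roughly: {"level":30,"time":1514764800000,"pid":1,"hostname":"...","orderId":"ord-42","msg":"order created"}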
Once your application is exporting structured log entries to the console, you will need to implement the aggregation infrastructure. In the next section, we will discuss how to use standard deployment environments to support this use case.
Implementing Passive Logging
As demonstrated above, implementing Passive Logging in Docker is as easy as switching out the logging driver when you start a container. Let's take a moment to look at how this is done in realistic deployment scenarios:
Amazon Elastic Container Service (ECS)
ECS allows users to change the logging driver, making Passive Logging really easy. The configuration to change the logging driver is located in the ECS Task Definition, which usually maps to a microservice deployment (the service + any sidecar containers).
A common deployment pattern is to have a local instance of Logstash (or Fluentd) deployed to every ECS host in the infrastructure that can receive log entries from the local Docker daemon and forward those entries to the rest of the log infrastructure.
We can then set the logging driver information in the ECS Task Definition to use the GELF driver to forward logs to the local Logstash instance (the labels and tag options attach extra metadata to each log entry):
{
  "family": "order-service",
  "containerDefinitions": [
    {
      "name": "microservice",
      "image": "rclayton/order-service:1.20.1",
      "essential": true,
      "portMappings": [
        { "containerPort": 80 }
      ],
      "memory": 500,
      "cpu": 10,
      "logConfiguration": {
        "logDriver": "gelf",
        "options": {
          "gelf-address": "tcp://172.17.0.1:5000",
          "gelf-tcp-max-reconnect": "5",
          "gelf-tcp-reconnect-delay": "2",
          "labels": "v1_20_1",
          "tag": "orders"
        }
      }
    }
  ]
}
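Assuming the definition above is saved to a file (the file name here is just an example), registering it is a one-liner with the AWS CLI:
$ aws ecs register-task-definition \
    --cli-input-json file://order-service-task.json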
In this example, we are assuming there is a local Logstash agent running on the ECS Host ready to receive log data from containers launched on that host:
$ docker run -d --name="logstash" \
--volume=/etc/logstash/logstash.conf:/etc/logstash/logstash.conf \
--restart=always \
logstash:latest \
-f /etc/logstash/logstash.conf
With the /etc/logstash/logstash.conf configuration looking something like this:
input {
  gelf {
    port => 5000
  }
}

# Add transformations as needed
filter {
  # For example: if the "level" property is not present,
  # add it with the default value "debug"
  if ![level] {
    mutate {
      add_field => {
        "level" => "debug"
      }
    }
  }
}

output {
  # We will just assume we are communicating with a remote aggregator
  tcp {
    host => "10.0.1.12"
    port => 5000
  }
}
I'm hard-coding IP addresses here, but this is definitely not the way to do it in production. In a future article, I will demonstrate how to integrate Consul for service discovery.
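If you want to sanity-check the pipeline before pointing containers at it, Logstash can validate a configuration file and exit. A quick sketch using the same image and mount as above:
$ docker run --rm \
    --volume=/etc/logstash/logstash.conf:/etc/logstash/logstash.conf \
    logstash:latest \
    -t -f /etc/logstash/logstash.conf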
"Houston, we have a problem..."
If you override the logging driver to use something other than json-file or journald, you will notice that docker logs no longer works. That's because Docker needs to be able to source logs from the local node to display them in the terminal (see the note here). If preserving the local log entries is important, an alternative to overriding the default logging driver is to use a special daemon called Logspout:
Logspout
In the project's own words:
Logspout is a log router for Docker containers that runs inside Docker. It attaches to all containers on a host, then routes their logs wherever you want. - https://github.com/gliderlabs/logspout
Logspout is great for use cases where you cannot (or don't want to) change the Docker logging driver, but need to forward logs to another process. A great example of this is when you want to preserve logs for local debugging purposes (e.g. docker logs) but also want logs shipped to a remote destination.
A typical Logspout deployment looks like this:
$ docker run -d --name="logspout" \
--volume=/var/run/docker.sock:/var/run/docker.sock \
--restart=always \
gliderlabs/logspout \
raw://192.168.10.10:5000
Another advantage of using Logspout (over the Docker logging driver) is that you can remove the need for a local aggregator like Logstash and push logs directly to an endpoint on a remote host. The default Logspout build supports a limited number of outputs (TCP, TLS, UDP, HTTP, syslog), but you can find third-party modules for Kafka, Redis/Logstash, Logstash, and GELF.
In the past, I've used the logspout-logstash module with great success. The only issue is that you will also want to colocate a Logstash node on the local machine (we will discuss why later). You can avoid this approach if you use the newer logspout-redis-logstash module, though I hesitate to recommend it because I've never used it in production.
Kubernetes
Kubernetes does not allow the logging driver to be configured at any level (cluster, service, pod, etc.) (see this discussion for reasons why). Instead, the Kubernetes documentation recommends various approaches including running application sidecars with the logging agent (e.g. Fluentd, Logstash) or using a DaemonSet to deploy a single aggregator on each node in the cluster. The latter approach is the preferred method since it provides better utilization of node resources.
Since Kubernetes does not allow us to transparently route logs from the Docker daemon to the aggregator, we have to find an alternative approach. From my experience, there are only two effective ways of doing this:
- Use a local logging agent to aggregate Docker logs from /var/lib/docker/containers.
- Use Logspout to aggregate logs from the Docker daemon.
Helm, the Kubernetes package manager, makes the first option really easy. The quickest way to aggregate logs from the local Docker daemon is to use one of the fluent-* charts (i.e. packages). fluent-elastic, for instance, is used to directly forward Docker logs to ElasticSearch. fluent-cloudwatch forwards logs to AWS CloudWatch. Finally, fluent-bit is a more general-purpose forwarder where users can configure logs to be delivered to a larger set of outputs (though it requires more configuration). All of the fluent-* charts are deployed as DaemonSet processes that source logs from /var/lib/docker/containers; the only difference is how they aggregate the logs.
$ export ES_HOST=whereever.es.is.at
$ helm install --name my-release \
--set elasticsearch.host=$ES_HOST,elasticsearch.port=9200 \
incubator/fluentd-elasticsearch
Exports all container logs on Kubernetes nodes to the ElasticSearch instance at whereever.es.is.at:9200.
The second option is to deploy Logspout as a DaemonSet and forward logs to an aggregator either on the local node or somewhere else on the network. There's a discussion with some example code on Github if you decide to go this route: https://github.com/gliderlabs/logspout/issues/202.
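For reference, a minimal sketch of such a Logspout DaemonSet might look like the following (the aggregator address, labels, and image tag are assumptions; adjust them for your cluster):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logspout
spec:
  selector:
    matchLabels:
      app: logspout
  template:
    metadata:
      labels:
        app: logspout
    spec:
      containers:
        - name: logspout
          image: gliderlabs/logspout:latest
          # Route container logs to an aggregator; the address is illustrative.
          args: ["raw://logstash.logging.svc.cluster.local:5000"]
          volumeMounts:
            - name: docker-socket
              mountPath: /var/run/docker.sock
      volumes:
        - name: docker-socket
          hostPath:
            path: /var/run/docker.sock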
Effective Logging Infrastructure
Now that we understand how to minimize a microservice's awareness of the deployment environment (by introducing transparent aggregation), I would like to share some lessons I learned in supporting logging infrastructure in production.
You may be tempted to skimp on your logging infrastructure. For instance, why should you place an aggregator on every node/host in your cluster when you can simply deploy one? Why not just write directly to the destination log storage?
Many initial microservice architectures can get away with these practices. However, as you add more services to your environment, your single aggregator is going to become overloaded. This is especially true if your aggregator has to perform transformations of log entries as they are received. You will also need to account for failures. One problem I frequently dealt with was remote datastores running out of disk space. Another possible issue is the data store becoming overloaded, especially if you have every aggregator writing directly to it.
Instead of listing all the problems you might encounter with your logging infrastructure, I would rather propose a general architecture:
1. Deploy a single aggregator to each Docker host.
Kubernetes allows you to do this easily with DaemonSets. In ECS and other environments, you will probably need to deploy this when the host is initialized (using prebaked AMIs or cloud-init), as sketched below.
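For example, a cloud-init user-data fragment like the following could launch the node-level Logstash aggregator at boot (a sketch; the pipeline content is elided and the paths are assumptions):
#cloud-config
write_files:
  - path: /etc/logstash/logstash.conf
    content: |
      # ... node-level Logstash pipeline from the earlier example ...
runcmd:
  - >
    docker run -d --name logstash --restart=always
    --volume=/etc/logstash/logstash.conf:/etc/logstash/logstash.conf
    logstash:latest -f /etc/logstash/logstash.conf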
2. Use sidecar deployments for applications with specific logging needs.
If you need to apply log transformations for an application with a non-standard log event model, you may want to sidecar a process that can normalize the log entries before they are shipped to the node-level aggregator. For instance, you may want to transform Apache logs into Logstash JSON format prior to shipping them to the log infrastructure (see the sketch below). This prevents the need to redeploy the logging infrastructure every time an application's log transformation rules are updated.
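A sidecar Logstash pipeline for that Apache example might look roughly like this (the log path and aggregator address are assumptions):
input {
  file {
    path => "/var/log/apache2/access.log"
  }
}

filter {
  # Parse Apache combined-format access logs into structured fields
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  # Forward the normalized entries to the node-level aggregator
  tcp {
    host => "172.17.0.1"
    port => 5000
  }
}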
3. Containers route logs to an aggregator located on the same host.
This allows host-specific data (IP, hostname, cluster ID, select environment variables, etc.) to be appended to every log message seen by the aggregator. Log transformations are also offloaded to edge nodes, reducing the computation required in the infrastructure. Most importantly, the aggregator serves as the bridge to the rest of the logging infrastructure.
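As an illustration, appending host metadata in the node-level aggregator could look something like this (field names and environment variables are examples; Logstash substitutes ${VAR} references from the environment):
filter {
  mutate {
    add_field => {
      "host_ip"    => "${HOST_IP:unknown}"
      "host_name"  => "${HOSTNAME:unknown}"
      "cluster_id" => "${CLUSTER_ID:unknown}"
    }
  }
}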
4. Use local aggregators to report exceptions.
A lot of engineers use specialized clients within microservices to report exceptions. I think this function is better left to the aggregator, especially if you have a standardized structure for log entries. I typically check the log level of each message and if it is above a specific level (critical, fatal, etc.), I route the message to an additional endpoint (Slack, HipChat, Email, etc.).
While we could also perform the exception reporting in remote aggregators, I prefer to report the errors as soon as possible. If you wait for the message to reach a remote node, it may be too late to react (especially if the message is lost or buffered for a long period of time).
# Logstash configuration example
output {
  # Send message to HipChat if a critical or fatal log message is observed
  if [level] == "critical" or [level] == "fatal" {
    hipchat {
      room_id => 1312342
      token => "asdhfkasdhfkhasdfkhasdfasd"
      trigger_notify => true
    }
  }
  # ...
}
5. Write log entries from local aggregators to a remote buffer.
Instead of shipping logs directly to remote aggregators or data stores, ship them to a buffer (Redis is commonly used for this purpose, but Kafka is also a great option). This practice saved my organization a week's worth of log data once when our ElasticSearch cluster ran out of disk space. Using a remote buffer can also alleviate spikes in log traffic that lead to extra strain on destination data stores.
# Logstash configuration example
output {
  redis {
    host => "elasticcache.buffer"
    data_type => "list"
    key => "logstash"
  }
}
6. Employ remote aggregators to route log entries from the buffer to destination indices and storage.
Remote aggregators are responsible for pulling log entries out of a remote buffer and storing them in destination data stores like ElasticSearch and S3. Typically, you will have significantly fewer remote aggregators than local ones. The number depends on how you store data in the backend.
For instance, if you ship logs to S3, you may only want a single aggregator for that purpose (combining all log entries for a specific duration). If you are inserting into ElasticSearch, the number of aggregators needed will be based on how fast you can drain the buffer, as well as SLAs around how quickly log entries need to be available in the index.
# Logstash configuration example
input {
  redis {
    host => "elasticcache.buffer"
    type => "log"
    data_type => "list"
    key => "logstash"
  }
}

output {
  # Amazon hosted ElasticSearch
  amazon_es {
    hosts => ["logs-asdfjkasdfkhaiweruihwyehaskdf.us-west-2.es.amazonaws.com"]
    region => "us-west-2"
  }

  s3 {
    region => "us-west-2"
    bucket => "com.example.logs.production"
    size_file => 2048 # bytes
    time_file => 5 # minutes
  }
}
Conclusion
Log aggregation is essential to supporting production services, but it becomes particularly important in a microservice environment where the scale and diversity of the environment are much greater. Developers should avoid aggregating log entries directly in microservices. This form of Active Logging causes services to become too aware of their environment, making them less portable and potentially more vulnerable to failures. Instead, DevOps teams should employ Passive Logging technologies, which transform application stdout and stderr streams into log events and transport them to remote logging infrastructure.
We also covered a few basic examples of how to configure Passive Logging infrastructure in ECS and Kubernetes. For the most part, the discussion centered around how to bridge container logs with a remote aggregator and did not thoroughly cover how to deploy the logging infrastructure -- I plan to share some of these practices in a later post. However, we concluded with an architectural discussion of what the logging infrastructure should look like, specifically how data should flow from a Docker host to the aggregation pipeline and finally to storage. I cannot emphasize enough the importance of building reliability and scalability into your logging infrastructure. In a microservice architecture, every client request you receive is likely to generate 10 or more log entries across your services. Therefore, you need to consider whether your log infrastructure has the capacity to scale as the number of requests to your system increases.
Thank you again for making it to the end of another long post! I hope you have taken away some ideas from my experiences. Should you have any questions or comments about this post, please don't hesitate to reach out to me on Twitter: @richardclayton.