Introducing Nagios-Dropwizard
Super simple Nagios checks via Dropwizard Tasks.
On a project I'm currently working on, I needed a framework to expose Nagios health checks from within a Dropwizard service. You can find the framework and a check_url.py Nagios check script on GitHub here: Nagios-Dropwizard.
Why not use Dropwizard Health Checks?
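The health check facility in Dropwizard didn't provide enough information for our purposes: a Dropwizard health check reports only healthy or unhealthy, plus an optional message. It was also not idiomatic to Nagios, which expects one of four states (OK, WARNING, CRITICAL, UNKNOWN) and, optionally, performance data.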
How does it work?
The Nagios-Dropwizard framework is built on top of the Dropwizard Task mechanism. The assumption is that Nagios (i.e. check_url.py) will call specific tasks on the Dropwizard admin port and that each task will return a properly formatted Nagios health check. The output is then parsed by check_url.py, which translates the message into the appropriate exit code.
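To make the exchange concrete, here is a minimal sketch of the single-line output a check returns, in the "STATUS - message | perfdata" shape used by the example output later in this post. The names NagiosLine and format are made up for illustration and are not part of Nagios-Dropwizard; in the framework itself you return a MessagePayload rather than building the string by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of Nagios-Dropwizard.
final class NagiosLine {

    // Builds "STATUS - message | label=value label2=value2".
    // The " | ..." section is omitted when there is no performance data.
    static String format(String status, String message, Map<String, Number> perfData) {
        StringBuilder line = new StringBuilder(status).append(" - ").append(message);
        if (!perfData.isEmpty()) {
            StringJoiner joiner = new StringJoiner(" ");
            perfData.forEach((label, value) -> joiner.add(label + "=" + value));
            line.append(" | ").append(joiner);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Number> perfData = new LinkedHashMap<>();
        perfData.put("count", 25);
        // Prints: OK - Queue is OK at 25 items. | count=25
        System.out.println(format("OK", "Queue is OK at 25 items.", perfData));
    }
}
```

Everything before the pipe is the human-readable status; everything after it is performance data that Nagios can graph.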
Writing a Nagios Check Task in Dropwizard.
The framework provides a convenient supertype, com.bericotech.dropwizard.nagios.NagiosCheckTask, that developers can extend. NagiosCheckTask requires subtypes to implement the performCheck method, which receives the request parameters and is expected to return a Nagios MessagePayload object.
For example, say we had a task queue we wanted to monitor:
public class QueueCheckTask extends NagiosCheckTask {

    // Queue sizes above these thresholds raise the corresponding level.
    static final int CRITICAL = 80;
    static final int WARNING = 50;

    private final Queue queue;

    public QueueCheckTask(Queue queue) {
        // Tasks must have names in Dropwizard and this
        // is a constructor requirement of the framework.
        // I wish I could make it more obvious.
        super("check-queue");
        this.queue = queue;
    }

    @Override
    public MessagePayload performCheck(
            ImmutableMultimap<String, String> requestParameters)
            throws Throwable {

        int itemCount = queue.size();

        Level level;
        if (itemCount > CRITICAL) {
            level = Level.CRITICAL;
        } else if (itemCount > WARNING) {
            level = Level.WARNING;
        } else {
            level = Level.OK;
        }

        String message = String.format(
            "Queue is %s at %s items.", level, itemCount);

        return MessagePayload.builder()
            .withLevel(level)
            .withMessage(message)
            .withPerfData(
                PerfDatum.builder("count", itemCount).build()
            )
            .build();
    }
}
To register the Nagios check task with Dropwizard, simply add it to the environment as a Dropwizard Task:
environment.addTask(new QueueCheckTask(queue));
Or, if you are using the Fallwizard framework, you only need to define it as a Spring bean and it will be registered with Dropwizard automatically.
<bean class="my.namespace.QueueCheckTask" c:queue-ref="queueBean" />
Using the check_url.py Nagios check script.
Checking the status of the Nagios check task is easy using the check_url.py script. Assuming you have the Dropwizard server running, simply execute:
python check_url.py -u admin -p password -H localhost -P 8081 \
    -U tasks/check-queue
Calling this, you should receive a message like:
Ok - Queue is OK at 25 items. | count=25
The exit code will also be mapped to the appropriate value (in this case, 0).
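For reference, the standard Nagios plugin exit codes are 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. As a rough illustration of that translation step, written in Java to match the rest of this post, the mapping could look like the sketch below. This is not the actual check_url.py logic, and ExitCodes.fromOutput is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical illustration of the status-to-exit-code translation;
// this is not the code used by check_url.py.
final class ExitCodes {

    // Reads the first word of the check output and maps it to the
    // standard Nagios plugin exit code. Anything unrecognized is UNKNOWN.
    static int fromOutput(String output) {
        String status = output.trim().split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        switch (status) {
            case "OK":       return 0;
            case "WARNING":  return 1;
            case "CRITICAL": return 2;
            default:         return 3; // UNKNOWN
        }
    }

    public static void main(String[] args) {
        // Prints: 0
        System.out.println(fromOutput("Ok - Queue is OK at 25 items. | count=25"));
    }
}
```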
Passing parameters.
Your status checks don't have to be static. If you need to, you can pass parameters to the status check, which will be available in the ImmutableMultimap<String, String> requestParameters parameter of the performCheck method.
I've included a couple of utility functions that pull the first value of a parameter out of requestParameters, one of which throws an error if the parameter does not exist:
// Returns a Guava Optional, which is absent if the parameter was not supplied.
Optional<String> queueName = getParameter(requestParameters, "queueName");
// Alternatively, throws an UnsatisfiedParameterException if the parameter
// is not found.
String requiredQueueName = getMandatoryParameter(requestParameters, "queueName");
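If you're curious what helpers like these involve, here is a self-contained sketch that uses only the JDK. A Map<String, List<String>> stands in for Guava's ImmutableMultimap, java.util.Optional for Guava's Optional, and IllegalArgumentException for the framework's UnsatisfiedParameterException. The names ParamSketch, firstParameter, and mandatoryParameter are mine for this sketch; the framework's real helpers are getParameter and getMandatoryParameter, and their internals may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the framework's parameter helpers.
final class ParamSketch {

    // Returns the first value supplied for the parameter, if any.
    static Optional<String> firstParameter(Map<String, List<String>> params, String name) {
        return params.getOrDefault(name, Collections.emptyList()).stream().findFirst();
    }

    // Returns the first value, or throws if the parameter was not supplied.
    static String mandatoryParameter(Map<String, List<String>> params, String name) {
        return firstParameter(params, name).orElseThrow(
            () -> new IllegalArgumentException("Missing required parameter: " + name));
    }

    public static void main(String[] args) {
        Map<String, List<String>> params =
            Collections.singletonMap("queueName", Arrays.asList("orders", "billing"));
        // Prints: orders
        System.out.println(mandatoryParameter(params, "queueName"));
        // Prints: false
        System.out.println(firstParameter(params, "missing").isPresent());
    }
}
```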
Error Handling.
NagiosCheckTask does not require derived classes to trap exceptions (throw away!). If an exception is thrown by a derived class, NagiosCheckTask will wrap the exception and return a MessagePayload indicating that the service (or at least this check) is Level.CRITICAL.
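The wrapping behaviour described above amounts to a catch-all around the check. Here is a stripped-down sketch of the idea with stand-in types; SafeCheck, Result, and run are invented for this illustration and are not the framework's classes, and the message format is an assumption.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration of "any exception becomes CRITICAL";
// these are stand-in types, not the Nagios-Dropwizard classes.
final class SafeCheck {

    enum Level { OK, WARNING, CRITICAL, UNKNOWN }

    static final class Result {
        final Level level;
        final String message;

        Result(Level level, String message) {
            this.level = level;
            this.message = message;
        }
    }

    // Runs the check; anything it throws is converted into a CRITICAL result
    // so the caller always gets a well-formed answer.
    static Result run(Callable<Result> check) {
        try {
            return check.call();
        } catch (Throwable t) {
            return new Result(Level.CRITICAL, "Check failed: " + t);
        }
    }

    public static void main(String[] args) {
        Result result = run(() -> {
            throw new IllegalStateException("queue unavailable");
        });
        // Prints: CRITICAL
        System.out.println(result.level);
    }
}
```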
The check_url.py script, however, behaves differently. There, the convention is that if an error occurs, the status of the check is UNKNOWN. I chose this convention because it's entirely possible that the check_url.py script is misconfigured or that there's a connection problem. I don't believe the same applies to a failure in a NagiosCheckTask, which tends to indicate some sort of failure within the system.
That's it for now. I would love to hear what you think.