If you run infrastructure or application monitoring, you really care about three things: a fast and scalable core that schedules checks, manages state changes and sends notifications; a collection of check plugins; and a way to visualise performance data. This post is about the middle piece, and why and how we ended up writing our own check plugin collection.
The first piece is solved by Icinga. The third piece is typically Grafana on top of a time-series database like InfluxDB or Prometheus.
The middle piece is where it gets hard and tedious in practice. The Icinga project does not ship check plugins out of the box, so most users start with the classic Monitoring Plugins (formerly Nagios Plugins). That has consequences.
Imagine a minimal Linux server. To improve uptime and to make troubleshooting bearable, the following metrics are the ones you want:
ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...).
Of the metrics above, you get maybe two out of the box: free disk space, and some form of network latency. On top of that you get a pile of plugins for niche signal-strength wireless gear, game servers and Novell NetWare.
Looking closer, the plugins are written in three different languages (C, shell, Perl), are of different age and quality, and differ in their parameters, behaviour and output. The dependency closure of the full plugin set is non-trivial; on a slim base install, the plugin package can be larger than the Icinga core itself.
Searching outside the classic set is not much better. The data centre niche is full of one-shot plugins released years ago, with very specific feature subsets, often without proper error handling, and written in yet more languages (Ruby, Go), which adds yet more dependencies.
After years of patching and writing custom plugins, we started a new check collection from scratch with a few rules of thumb:
WARN: something needs to be done, no need to panic.
CRIT: only when someone has to get up at night and react now.
Before kicking the project off, we wrote down the design rules:
Shared functionality lives in our own Python library, used by the plugins.
The legacy check_ping is still often used to determine whether a host is alive. It comes with significant limitations, though. Since other services and downstream hosts depend on host state, a ping plugin has to be reliable and tolerant. Our ping plugin is exactly that:
Our ping plugin is also fast: it sends five pings inside one second by default, which gives it the shortest execution time among ping checks we've benchmarked.
Imagine a CPU-usage check that reports 100 % (CRIT), then 20 % (OK), then 90 % (CRIT). Without time-window awareness, you get permanent flapping and people stop trusting state changes. Better: alert only when the threshold has been exceeded for a configurable amount of time, the same idea as Prometheus' for: 5m. Implemented in cpu-usage, disk-io, network-io, procs, docker-stats, podman-stats, php-fpm-status, plus the platform-specific siblings fortios-cpu-usage, fortios-network-io, fortios-ha-stats and qts-cpu-usage.
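One way to sketch the "exceeded for a configurable amount of time" idea is to require several consecutive measurements above the threshold before alerting. The class name `SustainedThreshold` and its parameters are hypothetical and not taken from the plugins. A real check would also have to persist the history between runs, since each check run is a separate process; this in-memory sketch leaves that out.

```python
from collections import deque

# Standard Nagios/Icinga plugin return codes.
STATE_OK = 0
STATE_CRIT = 2


class SustainedThreshold:
    """Report CRIT only after `runs` consecutive values above `threshold`.

    Hypothetical helper for illustration; not the plugins' actual code.
    """

    def __init__(self, threshold, runs):
        self.threshold = threshold
        # Keep only the most recent `runs` measurements.
        self.history = deque(maxlen=runs)

    def update(self, value):
        """Record one measurement and return the resulting state."""
        self.history.append(value)
        window_full = len(self.history) == self.history.maxlen
        if window_full and all(v > self.threshold for v in self.history):
            return STATE_CRIT
        return STATE_OK


if __name__ == "__main__":
    checker = SustainedThreshold(threshold=80, runs=3)
    for cpu_usage in (100, 20, 90, 95, 99):
        print(cpu_usage, checker.update(cpu_usage))
```

With a threshold of 80 and three runs, the spiky sequence 100, 20, 90 stays OK; only the sustained run 90, 95, 99 flips the state to CRIT.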
Before writing a new plugin, we look at upstream Monitoring Plugins, existing tools doing the job, or the Linux kernel itself, and port the ideas into our own plugin collection. Examples:
disk-smart: in parts a port of GSmartControl (originally C).
mysql-* plugins (e.g. mysql-perf-metrics, mysql-innodb-buffer-pool-size): inspired by mysqltuner, split into specialised single-purpose checks.
Sometimes you want to be informed about something without it being a real fault, e.g. a new release on GitHub or a security advisory. Nagios and Icinga have no NOTICE state, so we simulate it:
WARN.
OK until the next change.
Implemented in feed.
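A minimal sketch of such a simulated NOTICE, assuming it means "WARN for one check run when something changed, then OK until the next change". The function name `notice_state`, the JSON state file and the decision to stay silent on the very first run are assumptions for illustration, not a description of how feed is implemented.

```python
import json
from pathlib import Path

# Standard Nagios/Icinga plugin return codes.
STATE_OK = 0
STATE_WARN = 1


def notice_state(latest_id, state_file):
    """Return WARN once when `latest_id` changed since the last run, else OK.

    `latest_id` is any string identifying the newest item, for example a
    release tag. `state_file` remembers the last seen value between runs.
    """
    path = Path(state_file)
    last_seen = None
    if path.exists():
        last_seen = json.loads(path.read_text()).get("last_seen")

    if latest_id == last_seen:
        return STATE_OK  # nothing new since the previous run

    # Remember the new item so that the next run reports OK again.
    path.write_text(json.dumps({"last_seen": latest_id}))

    # Assumption: the first run only records a baseline and stays quiet.
    return STATE_OK if last_seen is None else STATE_WARN
```

Because the new value is stored before returning, the WARN lasts for exactly one check run in this sketch.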
Comparing locally installed software against external resources (GitHub releases, vendor channels) requires being polite to those resources: even if you run the check every minute, don't fetch external URLs every minute. Use a local cache to keep traffic minimal. Implemented in matomo-version, nextcloud-version (which uses the official Nextcloud update channel rather than GitHub), rocketchat-version and others.
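The caching idea can be sketched roughly as follows. The helper names `cached_fetch` and `http_get`, the JSON cache file and the one-day default lifetime are all assumptions made for this example; the plugins may well cache differently.

```python
import json
import time
import urllib.request
from pathlib import Path


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def cached_fetch(url, cache_file, ttl=86400, fetch=http_get, now=None):
    """Return the body for `url`, contacting the server at most once per `ttl` seconds.

    `fetch` and `now` are injectable so the function can be tested offline.
    """
    now = time.time() if now is None else now
    path = Path(cache_file)

    if path.exists():
        entry = json.loads(path.read_text())
        # Serve from the cache while the entry is fresh and for the same URL.
        if entry.get("url") == url and now - entry["fetched_at"] < ttl:
            return entry["body"]

    # Cache miss or expired entry: ask the external resource and store the answer.
    body = fetch(url)
    path.write_text(json.dumps({"url": url, "fetched_at": now, "body": body}))
    return body
```

With this shape, a check scheduled every minute still only contacts the external resource once per `ttl` period; every other run is answered from the local file.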
systemd-unit as the Swiss army knife
Replaces a number of legacy plugins (services, mounts, devices, timers). Some of the questions it answers:
Plugins that help locate the source of a problem rather than just signal that something is off:
about-me: a one-stop overview of host, distribution, uptime, load and many other facts.
procs: filters processes via regex (name, argument, user) and with --top returns the n largest by CPU time or memory; --lengthy adds all platform-specific memory_info() fields to the table.
The plugins run on any platform with Python 3.9+ (Linux, Windows, macOS, FreeBSD). For Windows we also ship pre-compiled binaries, so a Python install on the target is not required.
main.
When upgrading, always use an official release, and keep the plugins and the Python library in sync.
We follow a small set of conventions to keep the plugins consistent:
ruff format.
pydoc works.
pylint runs without --disable flags: code is expected to satisfy the full linter, not only a curated subset.
For details, see the Contributing Guidelines.