Linuxfabrik Monitoring Plugins: Behind the Scenes


If you run infrastructure or application monitoring, you really care about three things: a fast and scalable core that schedules checks, manages state changes and sends notifications; a collection of check plugins; and a way to visualise performance data. This post is about the middle piece: why and how we ended up writing our own check plugin collection.

The first piece is solved by Icinga. The third piece is typically Grafana on top of a time-series database like InfluxDB or Prometheus.

The middle piece is where it gets hard and tedious in practice. The Icinga project does not ship check plugins out of the box, so most users start with the classic Monitoring Plugins (formerly Nagios Plugins). That has consequences.

What should actually be monitored?

Imagine a minimal Linux server. To improve uptime and to make troubleshooting bearable, the following metrics are the ones you want:

  • CPU stats: user, system, iowait, idle (percentages).
  • Memory usage: used, buffered, cached, free (percentages).
  • Disk I/O: operations and bytes transferred per unit of time.
  • Free disk space on the local mounts.
  • File descriptors versus the system limit.
  • TCP connections by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT, ...).
  • Network I/O: bytes received, bytes sent.
  • Network latency.
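
As an illustration of the first item, CPU percentages on Linux can be derived from two samples of the aggregate "cpu" line in /proc/stat. The sketch below is a minimal example and not the collection's actual code; the function names are made up for this post, and the sample lines are hard-coded so that it runs anywhere.

```python
# Hypothetical helper names; the field order follows the "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn an aggregate 'cpu ...' line from /proc/stat into a dict of counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(value) for value in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Return the share of each CPU state between two samples, in percent."""
    deltas = {key: after[key] - before[key] for key in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        # No time has passed between the samples.
        return {key: 0.0 for key in FIELDS}
    return {key: round(100.0 * value / total, 1) for key, value in deltas.items()}


# Two samples, taken a few seconds apart on a real system.
first = parse_cpu_line("cpu 100 0 50 800 50 0 0 0 0 0")
second = parse_cpu_line("cpu 150 0 70 900 80 0 0 0 0 0")
usage = cpu_percentages(first, second)
print(usage["user"], usage["system"], usage["iowait"], usage["idle"])
```

The counters in /proc/stat only ever grow, so a single sample says little; the difference between two samples is what yields a usage figure.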

The shortcomings of the classic Monitoring Plugins

Of the metrics above, the classic set covers maybe two out of the box: free disk space, and some form of network latency. On top of that you get a pile of plugins for niche use cases such as the signal strength of wireless gear, game servers and Novell NetWare.

Looking closer, the plugins are written in three different languages (C, shell, Perl), vary in age and quality, and differ in their parameters, behaviour and output. The dependency closure of the full plugin set is non-trivial; on a slim base install, the plugin package can be larger than the Icinga core itself.

Looking for a replacement

Searching outside the classic set is not much better. The data centre niche is full of one-shot plugins released years ago, each covering a very specific feature subset, often without proper error handling, and written in yet more languages (Ruby, Go), which adds yet more dependencies.

The consequence: a fresh check collection

After years of patching and writing custom plugins, we started a new check collection from scratch with a few rules of thumb:

  • Collect every metric that helps with troubleshooting.
  • Alert only on what actually requires action.
    • WARN: something needs to be done, no need to panic.
    • CRIT: only when someone has to get up at night and react now.
  • Keep configuration to a minimum.
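
To make the WARN/CRIT rule concrete: a check plugin reports its state through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and can attach performance data after a pipe character. The following is a minimal sketch of that convention; get_state and format_output are illustrative names, not functions from our library.

```python
# Conventional Nagios/Icinga plugin exit codes.
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def get_state(value, warn, crit):
    """Return the plugin state for a value where higher means worse."""
    if value >= crit:
        return STATE_CRIT
    if value >= warn:
        return STATE_WARN
    return STATE_OK


def format_output(label, value, warn, crit):
    """Build the state and one output line with performance data attached."""
    state = get_state(value, warn, crit)
    text = {STATE_OK: "OK", STATE_WARN: "WARNING", STATE_CRIT: "CRITICAL"}[state]
    # Perfdata layout: 'label'=value[unit];warn;crit
    return state, f"{text} - {label} is {value}%|'{label}'={value}%;{warn};{crit}"


state, line = format_output("usage", 85, 80, 90)
print(line)
# A real plugin would finish with sys.exit(state).
```

The metric is always emitted as performance data, even when the state is OK. That is the "collect everything, alert on little" rule in practice.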

What we want from each plugin

Before kicking the project off, we wrote down the design rules:

  • Focus on Icinga and Nagios.
  • Focus on the Red Hat ecosystem (the typical "minimal server" perspective), but stay open to Debian, Ubuntu and others.
  • Language: Python.
  • Fast and reliable execution.
  • Uniform behaviour and terse, precise output: a "used" value means the same thing in every plugin.
  • Self-configuring or auto-detecting where possible, with sane defaults, so the plugin runs without parameters in the common case.
  • Provide enough metrics to support troubleshooting, not just "OK / NOT OK".
  • Avoid third-party dependencies when reasonable, because pulling exotic libraries into a heterogeneous data centre is painful.
  • Released into the public domain (UNLICENSE).

Shared functionality lives in our own Python library, used by the plugins.

Some patterns from the collection

Host aliveness

The legacy check_ping is still often used to determine whether a host is alive. It comes with significant limitations, though. Since other services and downstream hosts depend on host state, a ping plugin has to be reliable and tolerant. Our ping plugin is exactly that:

  • If 4 out of 5 packets are lost (or 99 out of 100), the host is alive. As long as a single packet makes it through, return OK.
  • Warning thresholds on packet-loss ratios make no sense for pure host aliveness.

Our ping plugin is also fast: it sends five pings inside one second by default, which gives it the shortest execution time among ping checks we've benchmarked.
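
The "one packet is enough" rule can be sketched as follows. This is a simplified illustration that only inspects the summary line printed by the system's ping command; it is an assumption about how such a check could work, not the plugin's actual implementation, and host_is_alive is a name made up for this post.

```python
import re

STATE_OK, STATE_CRIT = 0, 2

# Matches the summary line of iputils ping ("5 received") as well as the
# BSD/macOS variant ("5 packets received").
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def host_is_alive(ping_output):
    """Return True as soon as at least one reply made it back."""
    match = SUMMARY.search(ping_output)
    if match is None:
        # No summary line at all: treat the host as down.
        return False
    return int(match.group(2)) > 0


def aliveness_state(ping_output):
    """Map the ping result to a host state, with no WARN level in between."""
    return STATE_OK if host_is_alive(ping_output) else STATE_CRIT


lossy = "5 packets transmitted, 1 received, 80% packet loss, time 804ms"
dead = "5 packets transmitted, 0 received, 100% packet loss, time 4100ms"
print(aliveness_state(lossy), aliveness_state(dead))
```

There is deliberately no threshold on the loss ratio: for host aliveness, 80 % loss and 0 % loss both mean "up".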

Time periods instead of flapping

Imagine a CPU-usage check that reports 100 % (CRIT), then 20 % (OK), then 90 % (CRIT). Without time-window awareness, you get permanent flapping and people stop trusting state changes. Better: alert only when the threshold has been exceeded for a configurable amount of time, the same idea as Prometheus' for: 5m. Implemented in cpu-usage, disk-io, network-io, procs, docker-stats, podman-stats, php-fpm-status, plus the platform-specific siblings fortios-cpu-usage, fortios-network-io, fortios-ha-stats and qts-cpu-usage.
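
A minimal sketch of the idea, assuming the plugin keeps the most recent measurements between runs (here simply a list; how the real plugins persist their history is not shown). The function name and the count parameter are illustrative.

```python
STATE_OK, STATE_WARN, STATE_CRIT = 0, 1, 2


def get_state_over_time(history, warn, crit, count=5):
    """Alert only if each of the last `count` samples exceeded a threshold."""
    recent = history[-count:]
    if len(recent) < count:
        # Not enough data yet to judge a sustained condition.
        return STATE_OK
    if all(value >= crit for value in recent):
        return STATE_CRIT
    if all(value >= warn for value in recent):
        return STATE_WARN
    return STATE_OK


# The spiky series from the text no longer alerts ...
print(get_state_over_time([100, 20, 90], warn=80, crit=90, count=3))
# ... while a sustained overload does.
print(get_state_over_time([95, 96, 97], warn=80, crit=90, count=3))
```

With a check interval of one minute, count=5 corresponds roughly to the "for: 5m" behaviour mentioned above.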

Don't reinvent the wheel, port well-known tools

Before writing a new plugin, we look at the upstream Monitoring Plugins, at existing tools that already do the job, or at the Linux kernel itself, and port the ideas into our own collection. Examples:

  • disk-smart: in parts a port of GSmartControl (originally C++).
  • The mysql-* plugins (e.g. mysql-perf-metrics, mysql-innodb-buffer-pool-size): inspired by mysqltuner, split into specialised single-purpose checks.

Communicate with Icinga (acknowledgement as confirmation)

Sometimes you want to be informed about something without it being a real fault, e.g. a new release on GitHub or a security advisory. Nagios and Icinga have no NOTICE state, so we simulate it:

  1. New event: the plugin returns WARN.
  2. The operator acknowledges the warning ("Got it").
  3. On the next run, the plugin queries the Icinga API for the acknowledgement status.
  4. If acknowledged, the plugin returns OK until the next change.

Implemented in feed.
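
The four steps can be sketched as a small state machine. This is an illustration under assumptions, not the code of the feed plugin: the query against the Icinga API is replaced by a callable passed in from outside, and the stored dictionary stands in for whatever the plugin persists between runs.

```python
STATE_OK, STATE_WARN = 0, 1


def notice_state(current_event, stored, is_acknowledged):
    """Return (state, new_stored) for the newest event identifier.

    current_event:   identifier of the newest item, e.g. a release tag
    stored:          dict saved by the previous run (empty on the first run)
    is_acknowledged: callable returning True if the operator acknowledged
                     the warning (stand-in for the Icinga API query)
    """
    if stored.get("event") != current_event:
        # Step 1: something new appeared, so raise a warning.
        return STATE_WARN, {"event": current_event, "confirmed": False}
    if stored.get("confirmed"):
        # Step 4: already confirmed earlier, stay OK until the next change.
        return STATE_OK, stored
    if is_acknowledged():
        # Steps 2 and 3: the operator said "Got it"; remember that locally.
        return STATE_OK, {"event": current_event, "confirmed": True}
    return STATE_WARN, stored


stored = {}
state, stored = notice_state("v2.0", stored, lambda: False)  # new event
state, stored = notice_state("v2.0", stored, lambda: True)   # acknowledged
print(state, stored)
```

The confirmation is stored locally in this sketch because an acknowledgement normally ends once the service returns to OK, so a later run cannot rely on it still being set.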

Checking for application updates

Comparing locally installed software against external resources (GitHub releases, vendor channels) requires being polite to those resources: even if you run the check every minute, don't fetch external URLs every minute. Use a local cache to keep traffic minimal. Implemented in matomo-version, nextcloud-version (which uses the official Nextcloud update channel rather than GitHub), rocketchat-version and others.
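
A minimal sketch of such a cache, assuming a JSON file and a time-to-live in seconds. The function name, the file format and the default TTL are assumptions for illustration; the actual plugins may cache differently. The network request is passed in as a callable, so the example runs without touching any external resource.

```python
import json
import time
from pathlib import Path


def cached_fetch(key, fetch, cache_file, ttl=8 * 3600, now=None):
    """Return the cached value for `key`, calling `fetch()` only if it expired."""
    now = time.time() if now is None else now
    try:
        cache = json.loads(Path(cache_file).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # No cache yet, or an unreadable one: start from scratch.
        cache = {}
    entry = cache.get(key)
    if entry is not None and now - entry["timestamp"] < ttl:
        return entry["value"]
    # Cache miss or expired entry: this is the only place that goes "outside".
    value = fetch()
    cache[key] = {"timestamp": now, "value": value}
    Path(cache_file).write_text(json.dumps(cache))
    return value
```

Called every minute with the default TTL, the external URL would be requested three times a day instead of 1440 times.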

systemd-unit as the Swiss army knife

The systemd-unit plugin replaces a number of legacy plugins (services, mounts, devices, timers). Some of the questions it answers:

  • Does the service exist?
  • Is the service running?
  • Is the service stopped and disabled?
  • Is the service with a particular instance name running?
  • Is a path mounted?
  • Is a device plugged in?
  • What is the current state of a timer job?
  • What is the state of a service that depends on a timer job?
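
Most of these questions boil down to comparing a few unit properties against expected values. The sketch below parses KEY=VALUE output as printed by "systemctl show UNIT --property=LoadState,ActiveState,SubState,UnitFileState" and checks it; the function names and parameters are made up for this post and do not mirror the plugin's real interface. The sample output is hard-coded so that the example runs without systemd.

```python
STATE_OK, STATE_CRIT = 0, 2


def parse_systemctl_show(output):
    """Parse the KEY=VALUE lines printed by 'systemctl show' into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def check_unit(props, expected_active="active", expected_enabled=None):
    """Compare the unit's properties with what the operator expects."""
    if props.get("LoadState") == "not-found":
        return STATE_CRIT, "unit does not exist"
    problems = []
    if props.get("ActiveState") != expected_active:
        problems.append(
            f"ActiveState is {props.get('ActiveState')}, expected {expected_active}"
        )
    if expected_enabled is not None and props.get("UnitFileState") != expected_enabled:
        problems.append(
            f"UnitFileState is {props.get('UnitFileState')}, expected {expected_enabled}"
        )
    if problems:
        return STATE_CRIT, "; ".join(problems)
    return STATE_OK, f"ActiveState is {expected_active}"


sample = "LoadState=loaded\nActiveState=active\nSubState=running\nUnitFileState=enabled\n"
print(check_unit(parse_systemctl_show(sample)))
```

"Is the service stopped and disabled?" then becomes check_unit(props, expected_active="inactive", expected_enabled="disabled"). Mounts, devices and timers are units as well, which is why one plugin can cover them all.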

Debugging and troubleshooting helpers

Plugins that help locate the source of a problem rather than just signal that something is off:

  • about-me: a one-stop overview of host, distribution, uptime, load and many other facts.
  • procs: filters processes via regex (name, argument, user) and with --top returns the n largest by CPU time or memory; --lengthy adds all platform-specific memory_info() fields to the table.
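
The filter-then-rank idea behind procs can be sketched like this. The process list is hard-coded and the function names are illustrative; how the plugin actually collects process data is not covered here.

```python
import re


def filter_procs(procs, name=None, user=None):
    """Keep only processes whose name and user match the given regexes."""
    name_re = re.compile(name) if name else None
    user_re = re.compile(user) if user else None
    result = []
    for proc in procs:
        if name_re and not name_re.search(proc["name"]):
            continue
        if user_re and not user_re.search(proc["user"]):
            continue
        result.append(proc)
    return result


def top(procs, key, n):
    """Return the n processes with the largest value for `key`."""
    return sorted(procs, key=lambda proc: proc[key], reverse=True)[:n]


# Sample data; "rss" is resident memory in bytes, "cpu_time" in seconds.
procs = [
    {"name": "httpd", "user": "apache", "rss": 300_000_000, "cpu_time": 12.5},
    {"name": "httpd", "user": "apache", "rss": 120_000_000, "cpu_time": 40.0},
    {"name": "mariadbd", "user": "mysql", "rss": 900_000_000, "cpu_time": 99.1},
    {"name": "sshd", "user": "root", "rss": 8_000_000, "cpu_time": 0.3},
]
print([proc["name"] for proc in top(procs, "rss", 2)])
print(len(filter_procs(procs, name=r"^httpd$", user="apache")))
```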

Platforms and how to get the plugins

The plugins run on any platform with Python 3.9+ (Linux, Windows, macOS, FreeBSD). For Windows we also ship pre-compiled binaries, so a Python install on the target is not required.

When upgrading, always use an official release, and keep the plugins and the Python library in sync.

Development standards

We follow a small set of conventions to keep the plugins consistent:

  • EAFP: easier to ask for forgiveness than permission.
  • PEP 8 coding style, enforced by ruff format.
  • Docstrings on the library functions, so that pydoc works.
  • pylint runs without --disable flags: code is expected to satisfy the full linter, not only a curated subset.
  • Unit tests on every plugin where the test surface allows it.
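
As a short illustration of the EAFP convention: instead of checking up front whether a file exists, is readable and contains a number, the code simply tries and handles the failure. The helper below is a made-up example, not a function from our library.

```python
def read_int(path, default=0):
    """Read an integer from a file, falling back to `default` on any problem."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # Missing file, missing permission or non-numeric content.
        return default


print(read_int("/nonexistent/path/for/this/example"))
```

Besides being shorter, this avoids the race between "check" and "use" that look-before-you-leap code has when files can disappear between the two steps.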

For details, see the Contributing Guidelines.

We can help

Need help building or running your Icinga monitoring? Have a look at our Icinga Subscriptions and Service & Support plans, and get in touch.
