Linuxfabrik Monitoring Plugins - Behind the Scenes

If you implement infrastructure or application monitoring, you mainly care about three things. First, you want a fast and scalable core that queues check execution and manages state changes as well as notifications. Second, you need a collection of check plugins. Third, you might want to visualize performance data from the past or do some kind of trend monitoring.

The perfect tool for the first part is Icinga2. The last part can be done using InfluxDB and Grafana, for example.

When it comes to the second part, the Icinga project does not provide any check plugins out of the box, so most people start with nagios-plugins from monitoring-plugins.org. That choice has consequences, so this article covers the art of check plugins, the drawbacks of nagios-plugins, and why and how Linuxfabrik implemented a replacement.

What should be monitored first

Imagine a minimal Linux server. This is what IMO should be monitored to increase uptime and help with troubleshooting:

  • CPU stats (user, system, iowait & idle percentages)
  • Memory usage (used, buffered, cached & free percentages)
  • Disk I/O (operations & amount of data transferred per unit time)
  • Free disk space on the local mounts
  • File descriptors vs. max system limit
  • TCP connections by state (ESTABLISHED, CLOSE_WAIT, TIME_WAIT)
  • Network I/O (bytes received, bytes sent)
  • Network latency
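To make one item from this list concrete, here is a minimal sketch of how "file descriptors vs. max system limit" could be computed on Linux. It relies on /proc/sys/fs/file-nr, which holds three numbers (allocated handles, allocated but unused handles, system maximum). The function names are made up for this example and are not taken from the Linuxfabrik plugins.

```python
def file_descriptor_usage(file_nr_text):
    """Return allocated file handles as a percentage of the system limit.

    `file_nr_text` is the content of /proc/sys/fs/file-nr:
    "<allocated> <allocated but unused> <maximum>".
    """
    allocated, _unused, maximum = (int(x) for x in file_nr_text.split())
    return round(100.0 * allocated / maximum, 1)


def read_file_nr(path='/proc/sys/fs/file-nr'):
    """Read the raw counters from the kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Use a fixed sample here so the example also runs on non-Linux systems.
sample = '3200\t0\t100000'
print('{}% used'.format(file_descriptor_usage(sample)))
```

Reporting a percentage instead of the raw counter keeps the value comparable between hosts with very different limits.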

Drawbacks of nagios-plugins

When you try to cover the list above with nagios-plugins-all, all you get is:

  • Free disk space on the local mounts
  • Network latency (some kind of it)

That's it. Besides that, you also get a bunch of mysterious plugins for checking the signal strength of some special wireless equipment, or even game servers and Novell NetWare servers.

If you have a closer look, the plugins are written in three different languages (C, shell script, Perl), are of different age and quality and differ noticeably in configuration options, check behavior and output details. Assuming you installed nagios-plugins-all on CentOS 7 Minimal, the size of nagios-plugins-all including all (Perl-)dependencies is twice as big as the size of the Icinga2 Core.

Searching for a replacement

With regard to datacenter monitoring, searching for a replacement or complement is even worse: thousands of authors just released one check plugin years ago with a very special feature subset. Most of those plugins are not "Enterprise Grade" in terms of error handling for example, and are written in even more languages like Ruby, Go etc., which leads to more library or tool dependencies.

The Consequence

After years of using all kinds of plugins (including self-written ones) while the world has moved on, we started writing a new check collection from scratch, with the following global rules of thumb in mind:

  • Collect all metrics that help with troubleshooting.
  • Alert only on those that require an action.
    • WARN: something has to be done, but no need to get a heart attack.
    • CRIT: only where it is absolutely necessary to get up at night and react immediately.
  • Spend as little time and effort as possible with check plugin configuration.
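These rules of thumb can be sketched as a small threshold helper. The exit codes 0 to 3 follow the usual Nagios/Icinga plugin convention. The function name and the idea of passing None to disable a threshold are assumptions made for this illustration.

```python
# Conventional plugin exit codes.
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def get_state(value, warn=None, crit=None):
    """Map a measured value to a plugin state.

    A threshold of None means "collect the metric, but never alert on it".
    """
    if crit is not None and value >= crit:
        return STATE_CRIT
    if warn is not None and value >= warn:
        return STATE_WARN
    return STATE_OK


# Disk usage in percent: warn at 90, page someone at 95.
print(get_state(92.5, warn=90, crit=95))
# A metric that is only collected for troubleshooting never alerts.
print(get_state(99.0))
```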

Specifications

Before we kick-started this project, we defined some essential software requirements, which outline functional and non-functional requirements and describe user interactions that the new check plugins must provide for perfect interaction.

Excerpt:

  • Focus on Icinga2.
  • Focus on Red Hat Software Stack (develop from a "CentOS Minimal Storage Server" perspective).
  • ... but be open to Ubuntu & Co.
  • Language: Python.
  • Create fast and reliable check plugins.
  • Uniform behaviour: report the same (for example always "used") in a short and precise manner.
  • The check should be "self-configuring" or have some "auto-detecting" capabilities, and should use best-practice defaults, so that it runs on the command line without parameters (wherever possible).
  • Provide some metrics to help you troubleshoot your system.
  • Try to avoid 3rd party dependencies (it can be difficult to get any external libs on different OS within complex environments).
  • Publish using the UNLICENSE.

Python, Python 2 and Python 3

Today (2020-05-26), our checks are written in Python 2, because in datacenter environments (where those checks are mainly used) python still tends to mean python2. In CentOS 7, Python 2.7 is the default, and Python 3 became available in CentOS 7.8. In CentOS 8 there is no default; you have to specify whether you want Python 3 or 2. Upstream support for Python 2 has officially ended, but Python 2 remains available in CentOS 8 until the late 2020s. Nevertheless, providing a Python 3 variant of each check is on our roadmap.

If we have to use 3rd party libraries for various reasons, we stick to official versions. At the time of writing, some check plugins need the Python libs:

Other shared functions are located in our self-written Python Library, for example dealing with:

  • base: mostly state, time or string related functions
  • cache: a kind of Redis-like key-value-store using SQLite
  • icinga: plugin functions for calling the Icinga API (get_service(), set_downtime() etc.)
  • url: fetch(), fetch_json() with timeout, proxy and TLS handling
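To give an idea of what a "Redis-like key-value store using SQLite" can look like, here is a minimal sketch with optional expiry. It is not the code of the Linuxfabrik cache library. The class name, table layout and method signatures are assumptions made for this example.

```python
import sqlite3
import time


class Cache:
    """A tiny key-value store with optional time-to-live, backed by SQLite."""

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache '
            '(key TEXT PRIMARY KEY, value TEXT, expires REAL)')

    def set(self, key, value, ttl=None, now=None):
        """Store value under key; expire it after ttl seconds (None = never)."""
        now = time.time() if now is None else now
        expires = None if ttl is None else now + ttl
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                'INSERT OR REPLACE INTO cache (key, value, expires) '
                'VALUES (?, ?, ?)', (key, value, expires))

    def get(self, key, now=None):
        """Return the stored value, or None if it is missing or expired."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            'SELECT value, expires FROM cache WHERE key = ?',
            (key,)).fetchone()
        if row is None:
            return None
        value, expires = row
        if expires is not None and expires <= now:
            return None
        return value


cache = Cache()
cache.set('latest-release', '1.2.3', ttl=3600)
print(cache.get('latest-release'))
```

Using a file path instead of ':memory:' lets the values survive between two runs of a check, which is the whole point for a plugin that is started fresh on every execution.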

Currently Supported Platforms

We tested the checks on:

  • CentOS 7
  • CentOS 8
  • Fedora 30+
  • Ubuntu Server 16+

Linuxfabrik Monitoring Plugins in a Nutshell

Host Aliveness

In the wild, check_ping is mainly used for checking host aliveness. Because services (and, in host hierarchies, other hosts) depend on a host state, a new ping plugin has to be reliable and tolerant:

  • If we lose 4 out of 5 (or 99 out of 100) packets, the host is still alive. Therefore, if only one packet finds its way to the host and back, return OK.
  • From a host aliveness perspective, it makes no sense to configure and warn against packet loss ratios.

Besides that, our ping check is fast: it sends five pings in one second (by default), so it has the shortest plugin execution time amongst all ping checks.
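The "one reply is enough" rule can be sketched as follows. This is an illustration, not the Linuxfabrik ping check. It assumes the summary line printed by common ping implementations ("5 packets transmitted, 1 received, ..."), and the function name is hypothetical.

```python
import re

STATE_OK, STATE_CRIT, STATE_UNKNOWN = 0, 2, 3

# Matches "5 packets transmitted, 1 received" (iputils) as well as
# "5 packets transmitted, 1 packets received" (BSD-style ping).
SUMMARY = re.compile(r'(\d+) packets transmitted, (\d+) (?:packets )?received')


def host_alive_state(ping_output):
    """Return OK as soon as at least one reply came back."""
    match = SUMMARY.search(ping_output)
    if match is None:
        return STATE_UNKNOWN  # output not understood, don't guess
    received = int(match.group(2))
    return STATE_OK if received > 0 else STATE_CRIT


lossy = '5 packets transmitted, 1 received, 80% packet loss, time 805ms'
print(host_alive_state(lossy))
```

Note that packet loss deliberately plays no role in the result; a lossy link is a topic for a separate service check, not for host aliveness.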

Time Periods

Imagine a "cpu usage" check that reports 100% usage (resulting in CRIT), then 20% (OK), then 90% (CRIT) and so on. Because the check plugin doesn't consider past results, the service flaps annoyingly, and everyone working with Icinga gets used to state changes. It would be better if the check only alerted when the value has been above the WARN/CRIT threshold for a specific amount of time, much like Prometheus does with its "for: 5m" construct. This behaviour is currently implemented in:

  • cpu-usage
  • disk-io
  • fortios-cpu-usage
  • fortios-network-io
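One possible way to implement such a time period is to keep the last few measurements (for example in a local cache) and to alert only if all of them exceed the threshold. The sketch below assumes exactly that; the function name and the `count` parameter are made up for this example and say nothing about how the listed checks do it internally.

```python
STATE_OK, STATE_WARN, STATE_CRIT = 0, 1, 2


def sustained_state(history, warn, crit, count=5):
    """Alert only if the last `count` values are all above a threshold.

    `history` is a list of past measurements, oldest first, including
    the current one.
    """
    recent = history[-count:]
    if len(recent) < count:
        return STATE_OK  # not enough data yet to call it sustained
    if all(value >= crit for value in recent):
        return STATE_CRIT
    if all(value >= warn for value in recent):
        return STATE_WARN
    return STATE_OK


# The flapping example from above: a single spike no longer alerts.
print(sustained_state([100, 20, 90], warn=80, crit=90, count=3))
# Three high values in a row do.
print(sustained_state([95, 96, 97], warn=80, crit=90, count=3))
```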

Don't reinvent the Wheel – instead port Well-Known Tools

Before implementing a new check, we always have a look at the source code of monitoring-plugins.org, existing tools that do the job today or even the Linux kernel and try to port the ideas according to our Development Guidelines. Examples:

  • disk-smart, which is in parts a port from GSmartControl (written in C)
  • mysql-stats, which is ported from mysqltuner (written in Perl)

Communicate with Icinga

Sometimes we just want to be informed about something, for example a new release on GitHub or a news item on a security portal. Unfortunately there is no simple NOTICE state in Nagios or Icinga, so one way to simulate this functionality is:

  • New event – the check fires a WARN state.
  • An operator acknowledges the WARN in Icinga (which means: "Thank you, I got it").
  • During the next evaluation, the check fetches the Icinga API to get its acknowledged state.
  • If it was acknowledged, the check returns OK from now on (until it determines a new result).

This behaviour is currently implemented in:

  • feed
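The decision logic behind these four steps can be sketched as a pure function. How the acknowledgement is fetched from the Icinga API and where the state is stored between runs is left out on purpose. The function name and the layout of the stored dictionary are assumptions made for this example.

```python
STATE_OK, STATE_WARN = 0, 1


def notice_state(latest_id, stored, is_acknowledged):
    """Decide the state for a "notice"-style check.

    latest_id:       identifier of the newest item (e.g. a feed entry)
    stored:          what the previous run remembered, or None on first run;
                     a dict like {'id': ..., 'acked': bool}
    is_acknowledged: whether the service is currently acknowledged in Icinga
    Returns (state, data_to_store_for_next_run).
    """
    if stored is None or stored['id'] != latest_id:
        # New event: warn and forget any earlier acknowledgement.
        return STATE_WARN, {'id': latest_id, 'acked': False}
    if stored['acked'] or is_acknowledged:
        # The operator has seen it: stay OK until something new shows up.
        return STATE_OK, {'id': latest_id, 'acked': True}
    return STATE_WARN, stored


state, memo = notice_state('release-2.0', None, False)
print(state)
state, memo = notice_state('release-2.0', memo, True)
print(state)
```

The sketch remembers the acknowledgement locally, on the assumption that the acknowledgement in Icinga does not outlive the return to OK.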

Checking for Application Updates

When checking for application updates, we have to compare external resources (for example releases on GitHub) to locally installed software. Always be nice when using external resources: even if the check runs every minute, don't fetch external URLs that rarely change on every run. Use a local cache to minimize traffic. This behaviour is currently implemented in:

  • matomo-version
  • nextcloud-version (we don't fetch GitHub — we use the official Nextcloud Update channels to check for new versions)
  • rocket.chat-version
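A rough sketch of this pattern: ask the external resource at most once per caching period, and compare versions numerically rather than as strings. The `fetch` callable stands in for whatever retrieves the latest version from the vendor. It and all other names here are placeholders for this illustration, and a plain dict replaces the persistent cache.

```python
import time


def parse_version(text):
    """Turn 'v3.10.1' or '3.10.1' into (3, 10, 1) for numeric comparison."""
    return tuple(int(part) for part in text.strip().lstrip('v').split('.'))


def latest_version(fetch, cache, ttl=86400, now=None):
    """Return the latest version, calling fetch() at most once per ttl."""
    now = time.time() if now is None else now
    entry = cache.get('latest')
    if entry is not None and now - entry[0] < ttl:
        return entry[1]  # still fresh, no external request
    version = fetch()
    cache['latest'] = (now, version)
    return version


def update_available(installed, latest):
    return parse_version(latest) > parse_version(installed)


cache = {}
fake_fetch = lambda: 'v3.10.1'  # stands in for a real HTTP request
latest = latest_version(fake_fetch, cache)
print(update_available('3.9.2', latest))
```

Comparing tuples of integers matters here: as plain strings, '3.9.2' would sort after '3.10.1'.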

systemd-unit

The "Swiss Army Knife" among our checks: it replaces legacy plugins checking for services, mounts, devices etc.

Some popular questions this check can answer:

  • Does the service exist?
  • Is the service running?
  • Is the service stopped and disabled?
  • Is the service with the special instance name running?
  • Is a path mounted?
  • Is a device plugged in?
  • What is the current state of a timer job?
  • What is the current state of a service depending on a timer job?
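Several of these questions boil down to asking systemd for the state of a unit, because services, mounts, devices and timers are all units. The following sketch uses `systemctl is-active` for that. It is a simplified illustration and not the systemd-unit check itself; the function names and the `query` parameter (which makes the logic testable without systemd) are assumptions.

```python
import subprocess

STATE_OK, STATE_WARN = 0, 1


def systemctl_is_active(unit):
    """Return the output of `systemctl is-active <unit>`, e.g. 'active'."""
    result = subprocess.run(
        ['systemctl', 'is-active', unit],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True)
    return result.stdout.strip()


def unit_state(unit, expected='active', query=systemctl_is_active):
    """Compare the actual unit state with the expected one."""
    actual = query(unit)
    if actual == expected:
        return STATE_OK, '{} is {} (as expected).'.format(unit, actual)
    return STATE_WARN, '{} is {}, but supposed to be {}.'.format(
        unit, actual, expected)


# On a systemd host, calls could look like this:
#   unit_state('sshd.service')                     # is the service running?
#   unit_state('boot.mount')                       # is /boot mounted?
#   unit_state('cups.service', expected='inactive')
```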

Debugging and Troubleshooting

Checks that provide some additional information to assist you in debugging and troubleshooting:

  • about-me
  • top3-most-memory-consuming-processes
  • top3-processes-opening-more-file-descriptors
  • top3-processes-which-caused-the-most-io
  • top3-processes-which-consumed-the-most-cpu-time

Where to get the Monitoring Plugins

Releases:

Please ensure that you always use an official release, and always the same release for the checks and the libraries.

Our Development Guidelines

Besides defining deliverables and development patterns like "naming conventions" or "prefer percentages over absolute values to assist users in comparing different systems with different absolute sizes", we also make use of some established Python coding styles:

  • EAFP: Easier to ask for forgiveness than permission.
  • We recently started to use PEP 8 — Style Guide for Python Code.
  • Not long ago we started to document our Python Libraries using docstrings, so that calls like pydoc lib/base.py work.
  • To further improve code quality, we recently started using Pylint with pure pylint for the libraries, and with pylint --disable=C0103,C0114,C0116 for the check plugins, on a more regular basis.
  • More and more checks come with Unit Tests.
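As a small illustration of the EAFP style combined with a docstring, the sketch below simply tries to read and parse a file and handles the failure, instead of testing for existence and permissions first. The function is an example written for this article, not part of the Linuxfabrik libraries.

```python
def read_load_average(path='/proc/loadavg'):
    """Return the 1-minute load average, or None if it cannot be read.

    EAFP: just try to open and parse the file, then deal with whatever
    goes wrong, rather than checking os.path.exists() and friends first.
    """
    try:
        with open(path) as f:
            return float(f.read().split()[0])
    except (OSError, IndexError, ValueError):
        # File missing or unreadable, empty, or not a number.
        return None


print(read_load_average())
```

Besides being shorter, this avoids the race between checking for a file and actually opening it.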

For details, have a look at CONTRIBUTING.rst.

How to implement a "SELinux Mode Check" in three easy Steps

Hands on: we want to implement a simple plugin that checks the current SELinux enforcement state. If it is not equal to the default (enforcing) or the state given via a parameter, it fires a warning.

A first iteration that does nothing, simply returns OK and serves as a development template looks like this:

01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03: 
04: import argparse
05: import sys
06: from traceback import print_exc
07: 
08: from lib.globals import STATE_UNKNOWN, STATE_OK
09: import lib.base
10: 
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051501'
13: 
14: DESCRIPTION = '''Lorem ipsum.'''
15: 
16: 
17: def parse_args():
18:     parser = argparse.ArgumentParser(description=DESCRIPTION)
19:     parser.add_argument(
20:         '-V', '--version',
21:         action='version',
22:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
23:     )
24:     return parser.parse_args()
25: 
26: 
27: def main():
28:     try:
29:         args = parse_args()
30:     except SystemExit as e:
31:         sys.exit(STATE_UNKNOWN)
32: 
33:     lib.base.oao('It works.', STATE_OK)
34: 
35: 
36: if __name__ == '__main__':
37:     try:
38:         main()
39:     except Exception as e:
40:         print_exc()
41:         sys.exit(STATE_UNKNOWN)

On lines 04 to 06 we import some Python core libraries. On lines 08 and 09 we do the same for some of the Linuxfabrik libs.

After defining how to parse command line arguments, in the main() function at line 33 we simply say "Over and Out (oao)", print "It works." and fire OK.
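To get a feeling for what an "over and out" helper does, here is a guess at a minimal version. This is not the code of lib.base.oao(); the signature is inferred only from how the function is called in this article.

```python
import sys

STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def oao(msg, state=STATE_OK, always_ok=False):
    """Over and out: print the check result and exit with the plugin state.

    With always_ok=True the message is still printed, but the exit code
    is forced to OK, so the check informs without alerting.
    """
    print(msg.strip())
    if always_ok:
        sys.exit(STATE_OK)
    sys.exit(state)
```

Because such a helper never returns, any code after a call to it only runs if that call was not reached.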

Now, let's improve.

01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03: 
04: import argparse
05: import sys
06: from traceback import print_exc
07: 
08: from lib.globals import STATE_UNKNOWN, STATE_OK
09: import lib.base
10: 
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051901'
13: 
14: DESCRIPTION = '''Lorem ipsum.'''
15: 
16: CMD = 'getenforce'
17: DEFAULT_SELINUX_MODE = 'enforcing'
18: 
19: 
20: def parse_args():
21:     parser = argparse.ArgumentParser(description=DESCRIPTION)
22:     parser.add_argument(
23:         '-V', '--version',
24:         action='version',
25:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
26:     )
27:     return parser.parse_args()
28: 
29: 
30: def main():
31:     try:
32:         args = parse_args()
33:     except SystemExit as e:
34:         sys.exit(STATE_UNKNOWN)
35: 
36:     stdout, stderr, retc = lib.base.coe(lib.base.shell_exec(CMD))
37:     if (stderr or retc != 0):
38:         lib.base.oao('Bash command `{}` failed.\nStdout: {}\nStderr: {}'.format(
39:             CMD, stdout, stderr), STATE_UNKNOWN)
40:     selinux_mode = stdout.strip().lower()
41: 
42:     lib.base.oao('It works.', STATE_OK)
43: 
44: 
45: if __name__ == '__main__':
46:     try:
47:         main()
48:     except Exception as e:
49:         print_exc()
50:         sys.exit(STATE_UNKNOWN)

Line 16 defines the shell command we want to use, line 17 what we expect if we don't get a command line argument from the operator later on.

Line 36 uses shell_exec() to execute the external command, returning the complete output as strings (stdout, stderr) and the program exit code (retc). It is surrounded by "continue or exit" (lib.base.coe), meaning if anything fails, the check exits here, returning UNKNOWN and the system error message.

The last thing we have to do is to provide a help text and some real-world command line params, and check against them:

01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03: 
04: import argparse
05: import sys
06: from traceback import print_exc
07: 
08: from lib.globals import STATE_UNKNOWN, STATE_OK, STATE_WARN
09: import lib.base
10: 
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051901'
13: 
14: DESCRIPTION = '''Checks the current mode of SELinux against a desired mode,
15:     and returns a warning on a non-match.'''
16: 
17: CMD = 'getenforce'
18: DEFAULT_SELINUX_MODE = 'enforcing'
19: 
20: 
21: def parse_args():
22:     parser = argparse.ArgumentParser(description=DESCRIPTION)
23:     parser.add_argument(
24:         '-V', '--version',
25:         action='version',
26:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
27:     )
28:     parser.add_argument(
29:         '--always-ok',
30:         dest='ALWAYS_OK',
31:         action='store_true',
32:         default=False,
33:     )
34:     parser.add_argument(
35:         '--mode',
36:         default=DEFAULT_SELINUX_MODE,
37:         dest='SELINUX_MODE',
38:         choices=['enforcing', 'permissive', 'disabled'],
39:     )
40:     return parser.parse_args()
41: 
42: 
43: def main():
44:     try:
45:         args = parse_args()
46:     except SystemExit as e:
47:         sys.exit(STATE_UNKNOWN)
48: 
49:     stdout, stderr, retc = lib.base.coe(lib.base.shell_exec(CMD))
50:     if (stderr or retc != 0):
51:         lib.base.oao('Bash command `{}` failed.\nStdout: {}\nStderr: {}'.format(
52:             CMD, stdout, stderr), STATE_UNKNOWN)
53:     selinux_mode = stdout.strip().lower()
54: 
55:     if selinux_mode == args.SELINUX_MODE.lower():
56:         lib.base.oao('SELinux mode is {} (as expected).'.format(
57:             selinux_mode), STATE_OK)
58:     lib.base.oao('SELinux mode is {}, but supposed to be {}.'.format(
59:         selinux_mode, args.SELINUX_MODE), STATE_WARN, always_ok=args.ALWAYS_OK)
60: 
61: 
62: if __name__ == '__main__':
63:     try:
64:         main()
65:     except Exception as e:
66:         print_exc()
67:         sys.exit(STATE_UNKNOWN)

Assuming you save this as mycheck, you can call it like so:

./mycheck
./mycheck --version
./mycheck --help
./mycheck --mode permissive
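The "continue or exit" idea used around shell_exec() in the listings can be sketched like this. Both functions are simplified stand-ins and not the Linuxfabrik library code; in particular, the (success, payload) return convention is an assumption made for this example.

```python
import shlex
import subprocess
import sys

STATE_UNKNOWN = 3


def shell_exec(cmd):
    """Run cmd without a shell.

    Returns (True, (stdout, stderr, returncode)) if the command could be
    started, or (False, error message) if not.
    """
    try:
        p = subprocess.run(
            shlex.split(cmd),
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True)
    except OSError as e:
        return False, 'Executing `{}` failed: {}'.format(cmd, e)
    return True, (p.stdout, p.stderr, p.returncode)


def coe(result, state=STATE_UNKNOWN):
    """Continue or exit: unwrap a (success, payload) tuple.

    On success the payload is returned, otherwise the error message is
    printed and the plugin exits with the given state.
    """
    success, payload = result
    if success:
        return payload
    print(payload)
    sys.exit(state)
```

With this convention, the calling code stays linear: every line after coe() can rely on the wrapped call having succeeded.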

What's Next

The check plugins and the libraries are constantly evolving. We are publishing new releases on a frequent basis, so stay informed.
