This addon installs hardware monitoring tools and configures Nagios checks for system hardware and storage monitoring.
Vendors supported:
- Dell (MegaRAID)
- Gigabyte (LSI SAS)
- Huawei (LSI SAS)
- Supermicro (LSI SAS)
- NVMe cards from Intel and Samsung (see README)
Tools supported:
- freeipmi tools (ipmimonitoring, ipmi-sel, etc.)
- megacli
- mdadm
- nvme
- sas2ircu
- sas3ircu
- ilorest
This charm installs various hardware system monitoring tools and configures Nagios NRPE checks. It will only work for bare-metal installations on specific hardware.
Currently supported hardware is:
- Any controller supported by the megaraid_sas driver (i.e., any controller handled by the MegaRAID CLI)
- Supermicro: LSI SAS3008 RAID card with sas3ircu (Broadcom's SAS3IRCU_P16)
- Huawei: LSI SAS2308 RAID card with sas2ircu (Huawei FusionServer Tools InfoCollect)
- SSD cards: Intel's PCIe Data Center SSD, Samsung's NVMe controllers for SM961/PM961 and 172Xa/172Xb.
- Linux software RAID (mdadm)
- IPMI as implemented by freeipmi (enable_ipmi config option is enabled by default)
- ipmiseld from the freeipmi suite (enable_ipmiseld is enabled by default) for logging system event log entries to syslog
Still in the backlog: the hp-health logic needs to be backported to support Hewlett-Packard gen8 and older equipment (HP controllers with hpacucli).
Other hardware on the roadmap:
- Huawei's ES3000 V2 PCIe SSD Card with hio_info (Huawei ES3000 V2 Driver)
- S.M.A.R.T. Monitoring tool (smartctl)
juju deploy ubuntu
juju deploy hw-health
juju deploy nrpe
juju add-relation ubuntu nrpe
juju add-relation ubuntu hw-health
juju add-relation hw-health nrpe
The Charmstore version already ships a resource. However, this resource is empty, to avoid violating software redistribution licenses. To be useful, a new resource must be attached that includes your hardware manufacturer's RAID tools:
- Option 1:
juju deploy hw-health --resource tools=/tmp/zipfile.zip
- Option 2:
juju attach-resource hw-health tools=/tmp/zipfile.zip
In both cases, zipfile.zip must be built in one of the following ways:
zip /tmp/zipfile.zip megacli sas2ircu sas3ircu
zip /tmp/zipfile.zip megacli
etc.
SEL entries can be filtered by date, so that SEL content can keep being monitored without the need to clear it.
To filter out all current SEL entries, use the ack-sel action. This excludes all SEL entries older than today from the IPMI check.
The ack-sel action optionally takes a date parameter. SEL entries older than that date will be ignored in the check.
The show-sel action also obeys the date filter.
juju run-action hw-health/8 ack-sel --wait
# or
juju run-action hw-health/8 ack-sel date=2019-08-24 --wait
# view
juju run-action hw-health/8 show-sel --wait
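The effect of the date filter described above can be pictured with a small sketch. This is only an illustration of the stated behaviour (entries older than the acknowledged date are ignored), not the charm's code; the function name entries_after_ack and the (date, message) tuple shape are hypothetical.

```python
from datetime import date


def entries_after_ack(entries, ack_date=None):
    """Return the SEL entries that would still be considered after ack-sel.

    `entries` is a list of (entry_date, message) tuples; this shape is an
    assumption made for illustration. When no date is given, today is used,
    mirroring the default behaviour described for the ack-sel action.
    """
    cutoff = ack_date if ack_date is not None else date.today()
    # Entries strictly older than the cutoff date are ignored.
    return [entry for entry in entries if entry[0] >= cutoff]


if __name__ == "__main__":
    sel = [
        (date(2019, 8, 23), "older entry"),
        (date(2019, 8, 25), "newer entry"),
    ]
    for entry_date, message in entries_after_ack(sel, date(2019, 8, 24)):
        print(entry_date.isoformat(), message)
```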
To clear the filter (ie., consider all SEL entries present), you must use the
Under the hood, the filtering is done by appending a date filter parameter to the check_ipmi_sensor NRPE plugin. The charm will do the right thing if a --seloptions parameter is already present via the ipmi_check_options config. However, the SEL filtering set by the ack-sel action takes precedence over a date filter set manually through the ipmi_check_options config. For example:
$ date
Wed Mar 10 15:36:36 UTC 2021
$ juju config hw-health ipmi_check_options='--seloptions --date-range=07/02/2019-now'
$ juju run-action hw-health/0 ack-sel
... will cause SEL entries older than the 10th of March 2021 to be ignored.
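The precedence rule in this example can be sketched roughly as below. This is a simplified model rather than the charm's implementation: the helper names are hypothetical, and the MM/DD/YYYY layout is assumed from the --date-range value shown in the example.

```python
import re
from datetime import date

# Matches a manually configured date range such as --date-range=07/02/2019-now.
DATE_RANGE_RE = re.compile(r"--date-range=[^\s'\"]+")


def date_range_option(since):
    """Build a date-range option covering `since` until now (assumed format)."""
    return "--date-range={}-now".format(since.strftime("%m/%d/%Y"))


def effective_date_range(check_options, ack_date=None):
    """Return the date range that applies, or None if no filter is set.

    A date set through the ack-sel action wins over any --date-range found
    in the ipmi_check_options string.
    """
    if ack_date is not None:
        return date_range_option(ack_date)
    match = DATE_RANGE_RE.search(check_options)
    return match.group(0) if match else None


if __name__ == "__main__":
    manual = "--seloptions --date-range=07/02/2019-now"
    print(effective_date_range(manual, date(2021, 3, 10)))
    print(effective_date_range(manual))
```

With the values from the example, the first call yields a range starting on 10 March 2021, while the second falls back to the manually configured range.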
Known Limitations and Issues
The charm's only install method is via Juju resources. There are plans to support snaps, but the Snap Store only supports strictly confined snaps, and hardware monitoring tools need special permissions that are still under development.
The "tools" resource must be attached in ZIP format, and the hardware monitoring tool(s) must be at the top level of the archive tree.
Building the tools.zip resource
In order to build the tools.zip resource it is necessary to source the binaries from the respective vendor support pages.
You will then have to extract, rename, and compress the binaries to obtain the following structure:
$ zipinfo tools.zip
Archive:  tools.zip
Zip file size: 1204457 bytes, number of entries: 3
-rwxr-xr-x  3.0 unx  2720320 bx defN 19-Jan-16 11:31 megacli
-rwxrwxr-x  3.0 unx   559164 bx defN 19-Jan-16 11:31 sas2ircu
-rwxrwxr-x  3.0 unx   562560 bx defN 19-Jan-16 11:31 sas3ircu
3 files, 3842044 bytes uncompressed, 1204005 bytes compressed:  68.7%
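As an alternative to the zip command, a flat archive like the one listed above could be produced and checked with Python's standard zipfile module. This is a minimal sketch; the function names build_tools_zip and nested_entries are hypothetical and not part of the charm.

```python
import os
import zipfile


def build_tools_zip(zip_path, tool_paths):
    """Write each tool into the archive root, dropping its source directory."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in tool_paths:
            # Using the bare file name as arcname keeps every tool at the
            # first level of the archive tree.
            archive.write(path, arcname=os.path.basename(path))


def nested_entries(zip_path):
    """Return the archive members that are not at the top level."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if "/" in name]
```

An empty list from nested_entries("/tmp/zipfile.zip") suggests the archive layout matches what the charm expects; it does not validate the binaries themselves.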
Two more zip resources may be needed for functional tests to succeed:
- tools-checksum.zip replaces the megacli tool with an empty file.
- tools-missing.zip removes the megacli tool from the resource.
$ zipinfo tools-checksum.zip
Archive:  tools-checksum.zip
Zip file size: 547860 bytes, number of entries: 3
-rwxr-xr-x  3.0 unx        0 bx stor 19-Jan-16 11:35 megacli
-rwxr-xr-x  3.0 unx   559164 bx defN 19-Jan-16 11:31 sas2ircu
-rwxr-xr-x  3.0 unx   660560 bx defN 19-Jan-16 11:31 sas3ircu
3 files, 1219724 bytes uncompressed, 547408 bytes compressed:  55.1%
$ zipinfo tools-missing.zip
Archive:  tools-missing.zip
Zip file size: 547718 bytes, number of entries: 2
-rwxr-xr-x  3.0 unx   559164 bx defN 19-Jan-16 11:31 sas2ircu
-rwxr-xr-x  3.0 unx   660560 bx defN 19-Jan-16 11:31 sas3ircu
2 files, 1219724 bytes uncompressed, 547408 bytes compressed:  55.1%
Note: vendor tools may be updated over time. The charm verifies that the shared binaries match a set of known checksums. If you feel a checksum is missing, please file a bug (see link below) and it will be added.
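To get a feel for the checksum verification mentioned in the note, here is a rough sketch of how a binary could be compared against a set of known digests. The hash algorithm (SHA-256 here) and the known_checksums set are assumptions; this document does not state which digest the charm uses or where its list lives.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_tool(path, known_checksums):
    """Check a tool binary against a set of accepted digests (assumed SHA-256)."""
    return sha256_of(path) in known_checksums
```

Under this model, an empty megacli file such as the one in tools-checksum.zip would fail the check unless the digest of an empty file were itself in the accepted set.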
The manufacturer option must be left in auto mode.
Please contact the Nagios charmers via the "Submit a bug" link.
Upstream Project Name
- (boolean) Enable debug logging.
- (boolean) Enable the use of freeipmi tools to monitor hardware status.
- (boolean) Enable the logging of IPMI system event log (SEL) data to syslog
- (string) Space separated list of extra deb packages to install.
- (string) List of signing keys for install_sources package sources, per charmhelpers standard format (a yaml list of strings encoded as a string). The keys should be the full ASCII armoured GPG public keys. While GPG key ids are also supported and looked up on a keyserver, operators should be aware that this mechanism is insecure. null can be used if a standard package signing key is used that will already be installed on the machine, and for PPA sources where the package signing key is securely retrieved from Launchpad.
- (string) List of extra apt sources, per charm-helpers standard format (a yaml list of strings encoded as a string). Each source may be either a line that can be added directly to sources.list(5), or in the form ppa:<user>/<ppa-name> for adding Personal Package Archives, or a distribution component to enable.
- (string) YAML set of lines. Each item in the list becomes a line in an exclude file, and each line excludes one sensor. Specify the sensor name and type, pipe-delimited, for example: "System Chassis Chassis Intru|Physical Security". If the first character of a line is '~', the name is treated as a regular expression. For example, to exclude all sensor names from CPU0 to CPU9: "~CPU[0-9] Temp|Temperature".
- ["~.*Link Down|Slot/Connector"]
- (string) Additional options to be passed to check_ipmi_sensor. For non-standard IPMI implementations you might, for example, need "--seloptions --assume-system-event-records".
- (string) Choose the tools to get deployed (hp, dell, supermicro, huawei) or leave the charm to self discover the tools needed to run hardware health checks. The special value "test" is only useful during testing to bypass installation restrictions and override hardware detection. Only "auto" and "test" are currently implemented.
- (string) Used by the nrpe subordinate charms. A string that will be prepended to the instance name to set the host name in Nagios. For instance, the hostname would be something like: juju-myservice-0. If you're running multiple environments with the same services in them, this allows you to differentiate between them.
- (string) A comma-separated list of nagios servicegroups. If left empty, the nagios_context will be used as the servicegroup
- (string) The status of service-affecting packages will be set to this value in the dpkg database. Valid values are "install" and "hold".
- (string) YAML set of lines. Each item in the list excludes entries from the system event log. Specify the name and type, pipe-delimited, to exclude an entry, for example: "System Chassis Chassis Intru|Physical Security". If the first character of a line is '~', the line is treated as a regular expression.
- ["~ACPI State|System ACPI Power State"]
- (string) How often snapd handles updates for installed snaps. The default (an empty string) is 4x per day. Set to "max" to check once per month based on the charm deployment date. You may also set a custom string as described in the 'refresh.timer' section here: https://forum.snapcraft.io/t/system-options/87
- (int) Amount of time allowed for scripts to run before exiting.
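The exclude-line format used by the sensor exclude option above (name and type separated by a pipe, with a leading '~' marking the name as a regular expression) can be illustrated with a short sketch. The matching rules are an assumption based on the option description, not the actual check_ipmi_sensor logic: the line is split on its last pipe, and regular expressions are applied with an unanchored search. The function name is_excluded is hypothetical.

```python
import re


def is_excluded(sensor_name, sensor_type, exclude_lines):
    """Return True if a sensor matches any "name|type" exclude line."""
    for line in exclude_lines:
        # Split on the last pipe so a regex name may itself contain '|'.
        name_part, _, type_part = line.rpartition("|")
        if type_part != sensor_type:
            continue
        if name_part.startswith("~"):
            # A leading '~' means the rest of the name is a regular expression.
            if re.search(name_part[1:], sensor_name):
                return True
        elif name_part == sensor_name:
            return True
    return False


if __name__ == "__main__":
    excludes = ["~CPU[0-9] Temp|Temperature", "~.*Link Down|Slot/Connector"]
    print(is_excluded("CPU3 Temp", "Temperature", excludes))
    print(is_excluded("Fan1", "Fan", excludes))
```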