Configuring the collector – step by step

The collector reads its initial configuration from the default configuration file cluster.conf, located in the directory /etc/pscollect. The collector runs as the non-privileged user pscd; therefore, this file should be owned by and readable for this user.

Comments start with -- and extend to the end of the line. The complete configuration is written in Lua, a powerful scripting language. Refer to http://www.lua.org for more details.

The configuration is divided into separate steps, each describing a particular configuration aspect. The following chapters describe these steps in detail.

The collector must be restarted to activate the newly created or modified configuration. To do so, run the command

    /etc/init.d/pscollect restart
      

Configuring the collector – basics

The first section of the collector configuration file defines global configuration entries required by the collector. These entries typically do not have to be modified. Note the final call of the init() function: it is important to bring the collector into a well-defined state.

    --# -*- lua -*-
    --# pscollect configuration
    --#
    --# Lua 5.1 Reference Manual: http://www.lua.org/manual/5.1/
    --#

    --# Set root password to abc321 via:
    --# echo "passwd root abc321" | psget /var/run/pscollect/socket    (1)

    debug=false

    --# Load defaults
    include "/opt/parastation/config/sys_cluster.pscc"

    --# Overwrite defaults?
    -- put('config/db_root', '/var/lib/pscollect')
    -- put('config/bin_dir', '/opt/parastation/bin')
    -- put('config/rcmd', 'ssh')


    --# Initialize
    init()
          

Example 4.1. Basic collector configurations


Configuring the collector – step 1

This section defines whether the collector accepts connections from any node or only connections initiated by clients on the local node.

    --# ###########################################################
    --# Step 1: Should any node be able to read information from
    --#         collector (required, security)?
    --# Accept connections only from localhost:
    listen {host="localhost", port=4000}                               (1)
    --# Or accept connections from any host:
    --# (required if your webserver does not run on the same node)
    --listen {host="0.0.0.0", port=4000}
    --# Accept local socket connections:
    listen {socket="/var/run/pscollect/socket"}                        (2)
      

Example 4.2. Configuring collector accessibility


The configuration entry (1) above tells the collector to accept connections from localhost, via TCP port 4000, and (2) configures it to open a socket in the local file system.

When connected through a local socket, the collector checks the user id of the peer process. If the owner is either root or the same user as that running the collector (normally pscd), that client automatically gains administrator privileges, without needing to provide a password. This can, e.g., be used to change the password, as shown at the beginning (1) of the configuration file.

See also the section called “Configuring basic GridMonitor GUI parameters”.

Configuring the collector – step 2

Within this section, the cluster name is defined:

    --# ###########################################################
    --# Step 2: define first cluster (required)
    cluster("Cluster1")                                                (1)
      

Example 4.3. Configuring cluster name


The entry cluster (1) tells the collector about a cluster called Cluster1. All further configuration steps refer to this cluster. The entry has to be modified to match your actual cluster name.

Configuring the collector – step 3

Within this section, all nodes belonging to a particular cluster are defined.

    --# ###########################################################
    --# Step 3: Define host list for first cluster (required)
    --#   Note: To enable reading S.M.A.R.T data from disk drives,
    --#         add the next two lines to the file /etc/sudoers
    --#         (on each node):
    --#         ------
    --#         Defaults:pscd   passwd_tries=0, !syslog
    --#         pscd    ALL=NOPASSWD: /usr/sbin/smartctl \
    --#           -[iAH] /dev/[sh]d[a-z], /usr/sbin/smartctl \
    --#           -d ata -[iAH] /dev/[sh]d[a-z]
    --#         ------
    --#         Using S.M.A.R.T data is optional
    --#   Warning: With some systems/controllers/disks, reading
    --#         S.M.A.R.T data may hang your system.
    --#         Test it!!!
    --#         Define host list, one by one:
    host("localhost")          -- replace with real hostname           (1)
    host("master")                                                     (2)
    host("node01")
    host("node02")
    host("node03")

    --# Or use a "lua for loop"?
    --# (defining nodes 'cnode1' up to 'cnode16')
    for n = 1, 16 do                                                   (3)
        host("cnode"..n)
    end

    --# Use node names with leading '0'
    --# (defining nodes 'node-01' up to 'node-16')
    for n = 1, 16 do
        host("node-"..string.format("%0.2d",n))                        (4)
    end
      

Example 4.4. Configuring host list


The entry host (1) announces a new host to the cluster. All hosts announced using this function show up as cluster nodes within the graphical user interface. The name localhost should be replaced by the actual node name, e.g. master.

Note

It's generally not a good idea to list a node called localhost, as this name is ambiguous within a cluster. Use the cluster-internal real name of the node, like master or node01.

To enable the pscollect command to remotely run the required agent script, ssh auto-login for the user pscd must be configured on the newly announced host; see the section called “Configuring the collector – enable pscd auto-login” for more information.

To announce a number of nodes at once, a Lua loop may be used (3). Quite often, the numbering part of the node names includes leading zeros, like node01. Line (4) shows how to generate appropriate node names.

At least one host entry is required.
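Putting these pieces together, a complete host list for a cluster with a front end called master and 16 compute nodes named node-01 up to node-16 (all names are examples only and have to be adjusted) might look like this:

    host("master")
    for n = 1, 16 do
        host("node-"..string.format("%0.2d",n))
    end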

Configuring the collector – step 4

Within this section, the nodes providing ParaStation process management information are defined.

    --# ###########################################################
    --# Step 4.1: Hostname of ParaStation master node  (opt.)
    ps4_host("localhost")                                              (1)
      

Example 4.5. Configuring ParaStation host name


The entry ps4_host (1) tells the collector to connect to the ParaStation daemon psid on localhost to gather ParaStation process management information, e.g. which jobs are currently active. Using localhost is ok, as this host name will not show up in the graphical user interface. This entry is optional and defaults to localhost.
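If the ParaStation daemon psid runs on a different node than the collector, the entry has to name that node explicitly. A minimal sketch, assuming psid runs on the front end master:

    ps4_host("master")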

    --# ###########################################################
    --# Step 4.2: Hostname of ParaStation accounting node (opt.)
    --#     Note: To provide accounting information, grant the
    --#           user 'pscd' read access to all ParaStation
    --#           accounting files on the accounting host:
    --#           chmod g+rx /var/account;
    --#           chmod g+r /var/account/*;
    --#           chgrp -R pscd /var/account
    ps4_acc_host("localhost")                                          (1)
      

Example 4.6. Configuring ParaStation accounting host name


The entry ps4_acc_host (1) tells the collector to connect to the host localhost to read ParaStation job accounting information. Using localhost is ok, as this host name will not show up in the graphical user interface. This entry is optional and is disabled by default.

Note

The user pscd must be able to read the ParaStation accounting files located in the directory /var/account.

Configuring the collector – step 5

Within this section, the nodes providing batch system information are defined.

Note

Currently, only Torque is supported.

    --# ###########################################################
    --# Step 5.1: Hostname of TORQUE server node (optional)
    pbs_host("localhost")                                              (1)
      

Example 4.7. Configuring batch server name


The entry pbs_host (1) tells the collector to connect to the Torque server on localhost to gather batch job information. Using localhost is ok, as this host name will not show up in the graphical user interface. This entry is optional and is disabled by default.

    --# ###########################################################
    --# Step 5.2: Hostname of TORQUE accounting node (optional)
    pbs_acc_host("localhost")                                          (1)
      

Example 4.8. Configuring batch accounting host name


The entry pbs_acc_host (1) tells the collector to read job accounting information collected by Torque from host localhost. Using localhost is ok, as this host name will not show up in the graphical user interface. This entry is optional and is disabled by default.

Configuring the collector – step 6

Within this section, the virtual sensor subsystem is configured. These virtual sensors read real sensors on a node either via IPMI or via the lmsensors package. At least one of these two sensor sources must be configured.

    --# ###########################################################
    --# Step 6: Define hardware sensor sources (required)
    --#         Either IPMI (6.1) or lmsensors (6.2) must be used!

    --# ###########################################################
    --# Step 6.1: Define IPMI sensor sources (optional)

    --# ###########################################################
    --# Step 6.1.1: Define IPMI user access information in file
    --#             /etc/pscollect/ipmiuser (file is required)
    --#             File format: 'user:password'                       (1)
    --#       Note: This file should be readable only for user
    --#             'pscd'.
      

Example 4.9. Setting up the IPMI authentication


Within this step, the authentication information used to connect to a baseboard management controller (BMC) via IPMI is defined. This information is stored within a separate file /etc/pscollect/ipmiuser. The file contains only a single line with username and password, separated by a colon.

Note

This file should only be readable by the user pscd.
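As an illustration, an /etc/pscollect/ipmiuser file for a purely hypothetical IPMI user admin with password secret would contain the single line:

    admin:secret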

    --# ###########################################################
    --# Step 6.1.2: Map IPMI hosts (BMCs) to hosts (nodes)
    --#             (required, if IPMI), e.g.
    --#             ipmi_host("hostname","ipmi_host_addr") or
    --#             ipmi_host_p("hostname","ipmi_host_name",
    --#                         "username","password")
    --#             If no username/password is provided, the file
    --#             /etc/pscollect/ipmiuser will be consulted
    --#             (see 6.1.1)
    --#       Note: If your local BMC does not respond to requests
    --#             from the local host, e.g. ping from master to
    --#             master-bmc does not resolve the BMC address,
    --#             use the special IPMI host name "localhost".
    --#             Using this name, the ipmitool uses the 'open'
    --#             interface, which requires proper kernel
    --#             module support. Try
    --#               chkconfig -a ipmi
    --#               /etc/init.d/ipmi start
    --#             In addition, the user 'pscd' must be able to
    --#             run ipmitool as user root. Add the next two
    --#             lines to the file /etc/sudoers:
    --#             ------
    --#             Defaults:pscd   passwd_tries=0, !syslog
    --#             pscd    ALL=NOPASSWD: /usr/bin/ipmitool -A none \
    --#               -I open -H localhost [a-zA-Z/]*, \
    --#               /usr/bin/ipmitool -A none -I open \
    --#               -H localhost -S [a-zA-Z /]*
    --#             ------

    --# Single IPMI host
    --ipmi_host("node01","192.168.44.1")                               (1)
    --ipmi_host("node02","node02-ipmi")
    --ipmi_host("master","localhost")      -- see Note above

    --# 50 IPMI hosts in one loop
    for n = 1, 50 do                                                   (2)
      ipmi_host("node"..n,"192.168.44."..n)
    end
      

Example 4.10. Configuring IPMI host mapping


The ipmi_host entry (1) tells the collector to read IPMI information for host node01 from the BMC with address 192.168.44.1. Names like node01-bmc may also be used instead of the BMC IP address.

The entry (2) shows an example of how to map a number of nodes (node1 up to node50) using a Lua loop.
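The loop technique can be combined with the leading-zero formatting shown in step 3. The following sketch, assuming nodes node01 up to node16 with BMCs reachable under the names node01-bmc up to node16-bmc, maps all of them at once:

    for n = 1, 16 do
        local name = "node"..string.format("%0.2d",n)
        ipmi_host(name, name.."-bmc")
    end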

    --# ###########################################################
    --# Step 6.1.3: Define IPMI chassis (optional), e.g.
    --#             ipmi_chassis("hostname","ipmi_chassis_addr")
    ipmi_chassis("chassis1","192.168.20.1")                            (1)
      

Example 4.11. Configuring IPMI chassis mapping


The ipmi_chassis entry (1) maps a Chassis BMC controller managing multiple blade servers to a chassis name.

Note

This is currently not supported in the GridMonitor GUI!

    --# ###########################################################
    --# Step 6.1.4: Map IPMI server sensor data (required, if IPMI),
    --#             e.g. map_ipmihost("hostname","virtualsensor",
    --#                              "realsensor")
    --#             Required virtual sensors are:
    --#               TempCPU1,
    --#               TempCPU2,
    --#               TempNode,
    --#               FAN1,
    --#               FAN2,
    --#               FAN3,
    --#               FAN4

    --# To map 16 nodes called node01 up to node16 at once, use:
    for n = 1, 16 do
        host = "node"..string.format("%0.2d",n)
        map_ipmihost(host,"TempNode","Ambient_Temp")                   (1)
        map_ipmihost(host,"TempCPU1","Temp1")
        map_ipmihost(host,"TempCPU2","Temp2")
        map_ipmihost(host,"FAN1","fan1")
        map_ipmihost(host,"FAN2","fan2")
        map_ipmihost(host,"FAN3","fan3")
        map_ipmihost(host,"FAN4","fan4")
    end

    --# Mapping suitable for 16 Dell server SC1435 called node01
    --# up to node16:                                                  (2)
    --for n = 1, 16 do
    --    host = "node"..string.format("%0.2d",n)
    --    map_ipmihost(host,"TempNode","Ambient_Temp")
    --    map_ipmihost(host,"TempCPU1","Temp1")
    --    map_ipmihost(host,"TempCPU2","Temp2")
    --    for i = 1, 2 do
    --        map_ipmihost(host,"FAN"..i,    "FAN_MOD_"..i.."A_RPM")
    --        map_ipmihost(host,"FAN"..i +2, "FAN_MOD_"..i.."B_RPM")
    --        map_ipmihost(host,"FAN"..i +4, "FAN_MOD_"..i.."C_RPM")
    --        map_ipmihost(host,"FAN"..i +6, "FAN_MOD_"..i.."D_RPM")
    --    end
    --end

    --# Mapping suitable for a Dell server PE1950:                     (3)
    --map_ipmihost("node01","TempNode","Ambient_Temp")
    --# Note: Temp3 and Temp4 seem to be constant (40), so
    --#       ignore them for now!
    --for i = 1, 2 do
    --   map_ipmihost("node01","TempCPU"..i,"Temp"..i)
    --end
    --for i = 1, 4 do
    --    map_ipmihost("node01","FAN"..i,     "FAN_MOD_"..i.."A_RPM")
    --    map_ipmihost("node01","FAN"..i +4 , "FAN_MOD_"..i.."B_RPM")
    --    map_ipmihost("node01","FAN"..i +8 , "FAN_MOD_"..i.."C_RPM")
    --    map_ipmihost("node01","FAN"..i +12, "FAN_MOD_"..i.."D_RPM")
    --end
      

Example 4.12. Configuring IPMI host sensor mappings


Within this section, the mappings from IPMI sensor values to virtual sensor values are defined. For example, entry (1) maps the IPMI sensor called Ambient_Temp to the virtual sensor name TempNode. Using Lua loops is a convenient way to map a group of BMCs at once.

Use the parameter browser (ipmi->sdr->list) to list all available sensor names and values.

The entry (2) shows an example of how to map the IPMI sensors of a Dell SC1435 server. Similarly, the entry (3) shows the mapping for a Dell PE1950 server.

    --# ###########################################################
    --# Step 6.1.5: Map IPMI chassis sensor data (optional), e.g.
    --#             map_ipmichassis("hostname","virtualsensor",
    --#                             "realsensor")
    map_ipmichassis("chassis1","TempChassis","Temp1")                  (1)
      

Example 4.13. Configuring IPMI chassis sensor mappings


The map_ipmichassis entry (1) maps a Chassis BMC controller sensor value called Temp1 to the virtual sensor TempChassis.

Note

This is currently not supported in the GridMonitor GUI!

    --# ###########################################################
    --# Step 6.2: Define lmsensors sources (optional), e.g.
    --#           map_lmhost("hostname","Virtualsensor",
    --#                      "Realsensor")
    --#           Required virtual sensors are:
    --#               TempCPU1,
    --#               TempCPU2,
    --#               TempNode,
    --#               FAN1,
    --#               FAN2,
    --#               FAN3,
    --#               FAN4
    --#           To list your available sensors, use sensors
    --for n = 1, 16 do
    --    host = "node"..string.format("%0.2d",n)
    --    map_lmhost(host,"TempCPU1","temp1")                          (1)
    --    map_lmhost(host,"TempCPU2","temp2")
    --    map_lmhost(host,"TempNode","temp3")
    --    map_lmhost(host,"FAN1","fan1")
    --    map_lmhost(host,"FAN2","fan2")
    --    map_lmhost(host,"FAN3","fan3")
    --    map_lmhost(host,"FAN4","fan4")
    --end
      

Example 4.14. Configuring lmsensors sensor mappings


The map_lmhost entry (1) maps the sensor called temp1, read via lmsensors on node node01, to the virtual sensor TempCPU1 of node node01.

As with IPMI values, using a Lua for loop is handy for mapping a group of identical nodes at once.

Use the parameter browser (hosts->sensors) to list all available sensor names.

Configuring the collector – step 7

This section describes how to configure SNMP managed network switches.

    --# ###########################################################
    --# Step 7: Define SNMP managed switches (optional), e.g.
    --#         snmp("addr")
    snmp("switch1")                                                    (1)

    --# Switch with non-default arguments:
    snmp("switch2", { host="sw4", version = "1" })                     (2)

    --# Switch with all arguments:
    snmp("switch3",                                                    (3)
        {
            host = "s",           -- snmp source (defaults to name)
            community = "public", -- snmp community
            version = "2c",       -- version ("1","2c" or "3")
            table_expire = 15,    -- cache expire in s
            value_expire = 5,     -- cache expire in s

            timeout = 500,        -- connection timeout in ms
            retries = 3           -- connection retries
        }
    )
      

Example 4.15. Setting up SNMP devices


The entry snmp (1) announces an SNMP-manageable device using default connection values. Likewise, entry (2) announces an SNMP device called switch2 using the address sw4 and SNMP protocol version 1. Entry (3) shows all available options of the snmp call.
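As with hosts, a Lua loop may be used to announce several switches at once. A short sketch, assuming four switches named switch1 up to switch4 that are reachable under these names with default settings:

    for n = 1, 4 do
        snmp("switch"..n)
    end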

Configuring the collector – step 8

This section describes how to configure an additional cluster within the collector.

    --# ###########################################################
    --# Step 8: Define second cluster (optional)
    --#         Repeat steps 2 to 7
    --cluster "Cluster2"
    --host "front2"
    --host "c2node1"
    --host "c2node2"
    --host "c2node3"
    --ps4_host "front2"
      

Example 4.16. Configuring an additional cluster


These entries show how to configure an additional cluster, managed by this collector. Just define a new cluster entry and repeat all required and optional configuration steps from step 2 up to step 7.

Configuring the collector – step 9

This section describes how to configure monitoring of parameter limits and saving parameters into the database.

    --# ###########################################################
    --# Step 9: Define monitoring limits and parameters stored
    --#         into database (required)


    ...

    --# ###########################################################
    --# Step 9.1: Save load1 values to DB (required) and define
    --#           monitor limit (optional)
    --#           Monitor all clusters and all hosts
    parameter("cluster/*/hosts/*",                                     (1)
        {
            monitor = { intern = true, group = "crit" },
            poll = 30,                                                 (2)
            load1 = {
                save_history = compress_load, -- required              (3)
    --#         Enable overload warnings: (max > nbr of cores)
    --          monitor = { max = 2.1, group = "warn" },               (4)
                poll = 300                                             (5)
            },
            memfree = {                                                (6)
                save_history = compress_min, -- required
    --#         Enable memory shortage warnings:
    --          monitor = { min = 20000, group = "warn" },
                poll = 600
            },
            swapfree = {                                               (7)
                save_history = compress_min, -- required
    --#         Enable swap shortage warnings:
    --          monitor = { min = 20000, group = "warn" },
                poll = 600
            }
        }
    )
      

Example 4.17. Defining general monitors


The entry (1) defines a monitor for the parameters load1, memfree and swapfree for all hosts on all clusters. The connection to each host is checked every 30 secs (2). Lost connections will be reported using the critical event group ("crit").

The parameter load1 is stored in the database (3) and may be compared to an upper limit of 2.1 (4). Exceeding the maximum value would be reported using the event group "warn". Monitoring of the upper limit is currently disabled. This parameter is read, compared and stored every 300 secs (5). The entry for parameter load1 is required.

Analogous to load1, monitors for the parameters memfree (6) and swapfree (7) are pre-defined. Every 10 minutes, both values are stored in the database. Monitoring of the minimum values is disabled within this example.
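To actually enable one of these limits, uncomment the corresponding monitor line. For example, to be warned whenever the free memory of a node drops below 20000 (the threshold is taken from the commented example above and has to be adapted to your nodes), the memfree block would read:

    memfree = {
        save_history = compress_min,                -- required
        monitor = { min = 20000, group = "warn" },  -- warn on memory shortage
        poll = 600
    },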

    --# ###########################################################
    --# Step 9.2: Monitor and save all required sensor limits
    --#           (required)
    --#     Note: This configures all nodes identically
    --#           using '.../hosts/*'
    parameter("cluster/*/sensors/hosts/*",
        {
            monitor = { intern = true, group = "crit" },
            poll = 30,
            TempCPU1 = {
    --#         save parameters to DB (for diagrams)
                save_history = compress_max_hi,
    --#         warn if temperature exceeds 60 (Celsius?)
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            TempCPU2 = {
                save_history = compress_max_hi,
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            TempNode = {
                save_history = compress_max_hi,
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            FAN1 = {
                save_history = compress_max,
    --#         warn if fan speed drops below 3000 rpms
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN2 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN3 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN4 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            }
        }
    )
      

Example 4.18. Defining virtual sensor monitors


Example 4.18 shows how to monitor and store the virtual sensor parameters. This entry is highly recommended. If configured, it will inform the administrator about fan or thermal problems.

Note

Virtual sensor parameters may be provided by either IPMI or lmsensors data. For details on how to map these entries, refer to the previous section.

    --# ###########################################################
    --# Step 9.3: Define required load status (required)
    parameter("cluster/*/stat",
        {
            load1 = {
                max = {
                   save_history = compress_load,
                   poll = 60
                },
                min = {
                   save_history = compress_load,
                   poll = 60
                },
                avg = {
                   save_history = compress_load,
                   poll = 60
                }
            }
        }
    )
      

Example 4.19. Defining load1 minimum, maximum and average


This monitor calculates, every 60 secs, the minimum, maximum and average of the load1 values of all hosts and saves them to the database. This monitor is required by the GridMonitor GUI and must not be modified!

Configuring the collector – step 10

This section describes how to configure the event notification system of the ParaStation GridMonitor.

Each parameter within the collector holds an internal state, e.g. unavailable or high. When transitioning from one state to another, events will be generated, which may be added to event groups.

Currently, only two event groups (warn, crit) are used. Refer to the previous section for details on how to configure monitors and assign them to event groups.

    --# ###########################################################
    --# Step 10: Define event notification (required)

    --# ###########################################################
    --# Step 10.1: Define event notification for critical events
    --#            (required)
    lput("parameter/event/crit",                                       (1)
        {
            collect_time = 60,      -- collect events for 60 sec       (2)
            exec_time = 30*60,      -- max. 1 mail per 30 min          (3)

            exclude_states = {                                         (4)
                "ok"                -- dont send mails for state "ok"
            },

            unavailable = 3,        -- Warn after 3 read failures      (5)
    --        exec = event_system_call( \                              (6)
    --            "cat >> /tmp/pscollect.events",  \
    --             "Warnings:", "")
    --        exec = event_system_call( \                              (7)
    --            "env DISPLAY=:0 xmessage -file -",  \
    --            "Warnings:", "")
            exec = event_system_call( \                                (8)
                "mail root -s \"Cluster Cluster1 Critical Events\"",  \
                "Critical events:", "")
        }
    )

    --# ###########################################################
    --# Step 10.2: Define event notification for warning events
    --#            (required)
    lput("parameter/event/warn",                                       (9)
        {
            collect_time = 120,     -- collect events for 120 sec
            exec_time = 60*60,      -- max. 1 mail per 60 min

            exclude_states = {
                "ok"                -- dont send mails for state "ok"
            },

            unavailable = 3,        -- Warn after 3 read failures
    --        exec = event_system_call( \
    --            "cat >> /tmp/pscollect.events",  \
    --            "Warnings:", "")
    --        exec = event_system_call( \
    --            "env DISPLAY=:0 xmessage -file -",  \
    --            "Warnings:", "")
            exec = event_system_call( \
                "mail root -s \"Cluster Cluster1 Warnings\"",  \
                "Warnings:", "")
        }
    )
      

Example 4.20. Defining event notification


The entry (1) defines the configuration for events within the group crit. Initially, events are collected for 60 secs (2) before an event handling call is executed. After this initial collect time, further events of this type are collected for 1800 secs (30 min) (3) before the next event handling call is issued. This ensures that the email system (see below) and the administrator will not be flooded in case of catastrophic errors.

The exclude_states list (4) defines the states which will not be reported, e.g. all ok states. The next entry (5) defines how many consecutive read failures may occur before a connection is declared dead. The entries (6) and (7) give examples of how to act on crit events. The entry (8) defines the actual action taken when a crit event handling call is issued. In this example, an email is sent notifying the administrator. The command will be executed as user pscd.

Similar to the entry (1), the entry (9) defines the timeouts and the action taken for the event group warn.
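The notification command itself may be adapted to the local environment. A sketch of an alternative crit handler, sending mail to a hypothetical dedicated administrator address instead of root:

    exec = event_system_call(
        "mail admin@example.com -s \"Cluster Cluster1 Critical Events\"",
        "Critical events:", "")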