The collector reads its initial configuration from the default
      configuration file cluster.conf, located
      in the directory /etc/pscollect.
      The collector runs as the non-privileged user
      pscd; therefore, this file must be owned
      and readable by this user.
    
      Comments start with -- and are ignored
      until the end of the line.
      The complete configuration is written in Lua, a powerful
      scripting language. Refer to http://www.lua.org for
      more details.
    
The configuration is divided into several steps, each describing a particular configuration aspect. The next chapters describe these steps in detail.
The collector must be restarted to activate the newly created or modified configuration. To do so, run the command
    /etc/init.d/pscollect restart
      
        The first section within the collector configuration file defines
        global configuration entries required by the collector. These
        entries typically do not have to be modified. Note the final
        call of the init() method. This is
        important to bring the collector into a well-defined state.
      
    --# -*- lua -*-
    --# pscollect configuration
    --#
    --# Lua 5.1 Reference Manual: http://www.lua.org/manual/5.1/
    --#
    --# Set root password to abc321 via:
    --# echo "passwd root abc321" | psget /var/run/pscollect/socket    (1)
    debug=false
    --# Load defaults
    include "/opt/parastation/config/sys_cluster.pscc"
    --# Overwrite defaults?
    -- put('config/db_root', '/var/lib/pscollect')
    -- put('config/bin_dir', '/opt/parastation/bin')
    -- put('config/rcmd', 'ssh')
    --# Initialize
    init()
          Example 4.1. Basic collector configurations
This section defines whether the collector accepts connections from any node or only connections initiated by clients on the local node.
    --# ###########################################################
    --# Step 1: Should any node be able to read information from
    --#         collector (required, security)?
    --# Accept connections only from localhost:
    listen {host="localhost", port=4000}                               (1)
    --# Or accept connections from any host:
    --# (required if your webserver does not run on the same node)
    --listen {host="0.0.0.0", port=4000}
    --# Accept local socket connections:
    listen {socket="/var/run/pscollect/socket"}                        (2)
      Example 4.2. Configuring collector accessibility
        The configuration entry (1) above tells the collector to accept
        connections from localhost, via TCP
        port 4000, and
        (2) configures it to open a socket
        in the local file system.
      
	When connected through a local socket, the collector checks the
	user id of the peer process. If the owner is
	either root or the same user as the one
	running the collector (normally pscd), that
	client automatically gains administrator privileges without
	needing to provide a password. This can, e.g., be used to change
	the password, as shown at the beginning
	(1) of the configuration file.
      
See also the section called “Configuring basic GridMonitor GUI parameters”.
Within this section, the cluster name is defined:
    --# ###########################################################
    --# Step 2: define first cluster (required)
    cluster("Cluster1")                                                (1)
      Example 4.3. Configuring cluster name
        The entry cluster (1)
        tells the collector about a cluster called
        Cluster1.
        All further configuration steps refer to this cluster.
        The entry has to be modified to match your actual cluster
        name.
      
Within this section, all nodes belonging to a particular cluster are defined.
    --# ###########################################################
    --# Step 3: Define host list for first cluster (required)
    --#   Note: To enable reading S.M.A.R.T data from disk drives,
    --#         add the next two lines to the file /etc/sudoers
    --#         (on each node):
    --#         ------
    --#         Defaults:pscd   passwd_tries=0, !syslog
    --#         pscd    ALL=NOPASSWD: /usr/sbin/smartctl \
    --#           -[iAH] /dev/[sh]d[a-z], /usr/sbin/smartctl \
    --#           -d ata -[iAH] /dev/[sh]d[a-z]
    --#         ------
    --#         Using S.M.A.R.T data is optional
    --#   Warning: With some systems/controllers/disks, reading
    --#         S.M.A.R.T data may hang your system.
    --#         Test it!!!
    --#         Define host list, one by one:
    host("localhost")          -- replace with real hostname           (1)
    host("master")                                                     (2)
    host("node01")
    host("node02")
    host("node03")
    --# Or use a "lua for loop"?
    --# (defining nodes 'cnode1' up to 'cnode16')
    for n = 1, 16 do                                                   (3)
        host("cnode"..n)
    end
    --# Use node names with leading '0'
    --# (defining nodes 'node-01' up to 'node-16')
    for n = 1, 16 do
        host("node-"..string.format("%0.2d",n))                        (4)
    end
      Example 4.4. Configuring host list
        The entry host (1)
        announces a new host to the cluster.
        All hosts announced using this function show up as cluster
        nodes within the graphical user interface.
        The name localhost should be replaced by
        the actual node name, e.g. master.
      
          It's generally not a good idea to list a node called
          localhost, as this name is ambiguous
          within a cluster.
          Use the cluster-internal real name of the node, like
          master or node01.
        
        To enable the pscollect command to remotely
        run the required agent script, the ssh
        auto-login for the user pscd must be
        configured on the newly announced host; see the section called “Configuring the collector – enable
      pscd auto-login” for more
        information.
      
        To announce a bunch of nodes at once, a Lua loop may be used (3).
        Quite often, the numbering part of the node names includes
        leading zeros, like node01.
        Line (4) shows how to generate appropriate
        node names.
      
        At least one host entry is required.
      
Within this section, the nodes providing ParaStation process management information are defined.
    --# ###########################################################
    --# Step 4.1: Hostname of ParaStation master node  (opt.)
    ps4_host("localhost")                                              (1)
      Example 4.5. Configuring ParaStation host name
        The entry ps4_host (1)
        tells the collector to connect to the ParaStation daemon
        psid on localhost to
        gather ParaStation process management information, e.g. which jobs are
        currently active.
        Using localhost is ok, as this host name
        will not show up in the graphical user interface.
        This entry is optional and defaults to
        localhost.
      
    --# ###########################################################
    --# Step 4.2: Hostname of ParaStation accounting node (opt.)
    --#     Note: To provide accounting information, grant the
    --#           user 'pscd' read access to all ParaStation
    --#           accounting files on the accounting host:
    --#           chmod g+rx /var/account;
    --#           chmod g+r /var/account/*;
    --#           chgrp -R pscd /var/account
    ps4_acc_host("localhost")                                          (1)
      Example 4.6. Configuring ParaStation accounting host name
        The entry ps4_acc_host (1)
        tells the collector to connect to the host
        localhost to read ParaStation job accounting
        information.
        Using localhost is ok, as this host name
        will not show up in the graphical user interface.
        This entry is optional and is disabled by default.
      
          The user pscd must be able to read the
          ParaStation accounting files located in the directory
          /var/account.
        
Within this section, the nodes providing batch system information are defined.
Currently, only Torque is supported.
    --# ###########################################################
    --# Step 5.1: Hostname of TORQUE server node (optional)
    pbs_host("localhost")                                              (1)
      Example 4.7. Configuring batch server name
        The entry pbs_host (1)
        tells the collector to connect to the Torque server
        on localhost to
        gather batch job information.
        Using localhost is ok, as this host name
        will not show up in the graphical user interface.
        This entry is optional and is disabled by default.
      
    --# ###########################################################
    --# Step 5.2: Hostname of TORQUE accounting node (optional)
    pbs_acc_host("localhost")                                          (1)
      Example 4.8. Configuring batch accounting host name
        The entry pbs_acc_host (1)
        tells the collector to read job accounting information collected
        by Torque from host localhost.
        Using localhost is ok, as this host name
        will not show up in the graphical user interface.
        This entry is optional and is disabled by default.
      
Within this section, the virtual sensors subsystem is configured. These virtual sensors read real sensors on a node using either IPMI or the lmsensors package. At least one of these two sensor sources must be configured.
    --# ###########################################################
    --# Step 6: Define hardware sensor sources (required)
    --#         Either IPMI (6.1) or lmsensors (6.2) must be used!
    --# ###########################################################
    --# Step 6.1: Define IPMI sensor sources (optional)
    --# ###########################################################
    --# Step 6.1.1: Define IPMI user access information in file
    --#             /etc/pscollect/ipmiuser (file is required)
    --#             File format: 'user:password'                       (1)
    --#       Note: This file should be readable only for user
    --#             'pscd'.
      Example 4.9. Setting up the IPMI authentication
        Within this step, the authentication information used to
        connect to a baseboard management controller (BMC) via IPMI is
        defined.
        This information is stored within a separate file
        /etc/pscollect/ipmiuser.
        The file contains only a single line with username and
        password, separated by a colon.
      
          This file should only be readable by the user
          pscd.
        
    --# ###########################################################
    --# Step 6.1.2: Map IPMI hosts (BMCs) to hosts (nodes)
    --#             (required, if IPMI), e.g.
    --#             ipmi_host("hostname","ipmi_host_addr") or
    --#             ipmi_host_p("hostname","ipmi_host_name",
    --#                         "username","password")
    --#             If no username/password is provided, the file
    --#             /etc/pscollect/ipmiuser will be consulted
    --#             (see 6.1.1)
    --#       Note: If your local BMC does not respond to requests
    --#             from the local host, e.g. ping from master to
    --#             master-bmc does not resolve the BMC address,
    --#             use the special IPMI host name "localhost".
    --#             Using this name, the ipmitool uses the 'open'
    --#             interface, which requires proper kernel
    --#             module support. Try
    --#               chkconfig -a ipmi
    --#               /etc/init.d/ipmi start
    --#             In addition, the user 'pscd' must be able to
    --#             run ipmitool as user root. Add the next two
    --#             lines to the file /etc/sudoers:
    --#             ------
    --#             Defaults:pscd   passwd_tries=0, !syslog
    --#             pscd    ALL=NOPASSWD: /usr/bin/ipmitool -A none \
    --#               -I open -H localhost [a-zA-Z/]*, \
    --#               /usr/bin/ipmitool -A none -I open \
    --#               -H localhost -S [a-zA-Z /]*
    --#             ------
    --# Single IPMI host
    --ipmi_host("node01","192.168.44.1")                               (1)
    --ipmi_host("node02","node02-ipmi")
    --ipmi_host("master","localhost")      -- see Note above
    --# 50 IPMI hosts in one loop
    for n = 1, 50 do                                                   (2)
      ipmi_host("node"..n,"192.168.44."..n)
    end
      Example 4.10. Configuring IPMI host mapping
        The ipmi_host entry (1)
        tells the collector to read IPMI information for host
        node01 from the BMC with address
        192.168.44.1.
        Names like node01-bmc may also be used
        instead of the BMC IP address.
      
        The entry (2) shows an example of how to map a
        number of nodes (node1 up to
        node50) using a Lua loop.
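        If a particular BMC needs its own credentials instead of the
        shared file /etc/pscollect/ipmiuser, the
        ipmi_host_p variant documented in the comment of
        step 6.1.2 may be used. The following line is a minimal sketch;
        the node name, BMC address, user name and password are
        placeholders only.
    --# Sketch: announce a single BMC with explicit credentials
    --# (placeholder values, adapt to your setup)
    --ipmi_host_p("node03","node03-ipmi","admin","secret")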
      
    --# ###########################################################
    --# Step 6.1.3: Define IPMI chassis (optional), e.g.
    --#             ipmi_chassis("hostname","ipmi_chassis_addr")
    ipmi_chassis("chassis1","192.168.20.1")                            (1)
      Example 4.11. Configuring IPMI chassis mapping
        The ipmi_chassis entry (1)
        maps a Chassis BMC controller managing multiple blade servers
        to a chassis name.
      
This is currently not supported in the GridMonitor GUI!
    --# ###########################################################
    --# Step 6.1.4: Map IPMI server sensor data (required, if IPMI),
    --#             e.g. map_ipmihost("hostname","virtualsensor",
    --#                              "realsensor")
    --#             Required virtual sensors are:
    --#               TempCPU1,
    --#               TempCPU2,
    --#               TempNode,
    --#               FAN1,
    --#               FAN2,
    --#               FAN3,
    --#               FAN4
    --# To map 16 nodes called node01 up to node16 at once, use:
    for n = 1, 16 do
        local host = "node"..string.format("%0.2d",n)
        map_ipmihost(host,"TempNode","Ambient_Temp")                   (1)
        map_ipmihost(host,"TempCPU1","Temp1")
        map_ipmihost(host,"TempCPU2","Temp2")
        map_ipmihost(host,"FAN1","fan1")
        map_ipmihost(host,"FAN2","fan2")
        map_ipmihost(host,"FAN3","fan3")
        map_ipmihost(host,"FAN4","fan4")
    end
    --# Mapping suitable for 16 Dell server SC1435 called node01
    --# up to node16:                                                  (2)
    --for n = 1, 16 do
    --    host = "node"..string.format("%0.2d",n)
    --    map_ipmihost(host,"TempNode","Ambient_Temp")
    --    map_ipmihost(host,"TempCPU1","Temp1")
    --    map_ipmihost(host,"TempCPU1","Temp2")
    --    for i = 1, 2 do
    --        map_ipmihost(host,"FAN"..i,    "FAN_MOD_"..i.."A_RPM")
    --        map_ipmihost(host,"FAN"..i +2, "FAN_MOD_"..i.."B_RPM")
    --        map_ipmihost(host,"FAN"..i +4, "FAN_MOD_"..i.."C_RPM")
    --        map_ipmihost(host,"FAN"..i +6, "FAN_MOD_"..i.."D_RPM")
    --    end
    --end
    --# Mapping suitable for a Dell server PE1950:                     (3)
    --map_ipmihost("node01","TempNode","Ambient_Temp")
    --# Note: Temp3 and Temp4 seem to be constant (40), so
    --#       ignore them for now!
    --for i = 1, 2 do
    --   map_ipmihost("node01","TempCPU"..i,"Temp"..i)
    --end
    --for i = 1, 4 do
    --    map_ipmihost("node01","FAN"..i,     "FAN_MOD_"..i.."A_RPM")
    --    map_ipmihost("node01","FAN"..i +4 , "FAN_MOD_"..i.."B_RPM")
    --    map_ipmihost("node01","FAN"..i +8 , "FAN_MOD_"..i.."C_RPM")
    --    map_ipmihost("node01","FAN"..i +12, "FAN_MOD_"..i.."D_RPM")
    --end
      Example 4.12. Configuring IPMI host sensor mappings
        Within this section, the mappings from IPMI sensor values to
        virtual sensor values are defined.
        For example, entry (1) maps the IPMI sensor called
        Ambient_Temp to the virtual sensor name
        TempNode.
        Using Lua loops is a convenient way to map a group of BMCs at
        once.
      
        Use the parameter browser
        (ipmi->sdr->list)
        to list all available sensor names and values.
      
The entry (2) shows an example of how to map the IPMI sensors of a Dell SC1435 server. Similarly, entry (3) shows the mapping for a Dell PE1950 server.
    --# ###########################################################
    --# Step 6.1.5: Map IPMI chassis sensor data (optional), e.g.
    --#             map_ipmichassis("hostname","virtualsensor",
    --#                             "realsensor")
    map_ipmichassis("chassis1","TempChassis","Temp1")                  (1)
      Example 4.13. Configuring IPMI chassis sensor mappings
        The map_ipmichassis entry (1)
        maps a Chassis BMC controller sensor value called
        Temp1 to the virtual sensor
        TempChassis.
      
This is currently not supported in the GridMonitor GUI!
    --# ###########################################################
    --# Step 6.2: Define lmsensors sources (optional), e.g.
    --#           map_lmhost("hostname","Virtualsensor",
    --#                      "Realsensor")
    --#           Required virtual sensors are:
    --#               TempCPU1,
    --#               TempCPU2,
    --#               TempNode,
    --#               FAN1,
    --#               FAN2,
    --#               FAN3,
    --#               FAN4
    --#           To list your available sensors, use sensors
    --for n = 1, 16 do
    --    host = "node"..string.format("%0.2d",n)
    --    map_lmhost(host,"TempCPU1","temp1")                          (1)
    --    map_lmhost(host,"TempCPU2","temp2")
    --    map_lmhost(host,"TempNode","temp3")
    --    map_lmhost(host,"FAN1","fan1")
    --    map_lmhost(host,"FAN2","fan2")
    --    map_lmhost(host,"FAN3","fan3")
    --    map_lmhost(host,"FAN4","fan4")
    --end
      Example 4.14. Configuring lmsensors sensor mappings
        The map_lmhost entry (1)
        maps the sensor called temp1 read using
        lmsensors on node
        node01 to the virtual sensor called
        TempCPU1 for node
        node01.
      
As with mapping IPMI values, using a Lua for-loop is handy to map a group of identical nodes at once.
        Use the parameter browser
        (hosts->sensors)
        to list all available sensor names.
      
This section describes how to configure SNMP managed network switches.
    --# ###########################################################
    --# Step 7: Define SNMP managed switches (optional), e.g.
    --#         snmp("addr")
    snmp("switch1")                                                    (1)
    --# Switch with non-default arguments:
    snmp("switch2", { host="sw4", version = "1" })                     (2)
    --# Switch with all arguments:
    snmp("switch3",                                                    (3)
        {
            host = "s",           -- snmp source (defaults to name)
            community = "public", -- snmp community
            version = "2c",       -- version ("1","2c" or "3")
            table_expire = 15,    -- cache expire in s
            value_expire = 5,     -- cache expire in s
            timeout = 500,        -- connection timeout in ms
            retries = 3           -- connection retries
        }
    )
      Example 4.15. Setting up SNMP devices
        The entry snmp (1)
        announces an SNMP manageable device using default connection
        values.
        Likewise, entry (2)
        announces an SNMP device called switch2
        using the address sw4 and the SNMP
        protocol version 1.
        Entry (3)
        shows all available options to the snmp
        mapping call.
      
This section describes how to configure an additional cluster within the collector.
    --# ###########################################################
    --# Step 8: Define second cluster (optional)
    --#         Repeat steps 2 to 7
    --cluster "Cluster2"
    --host "front2"
    --host "c2node1"
    --host "c2node2"
    --host "c2node3"
    --ps4_host "front2"
      Example 4.16. Configuring an additional cluster
        These entries show how to configure an additional cluster,
        managed by this collector. Just define a new
        cluster entry and repeat all required and
        optional configuration steps from step 2 up to step 7.
      
This section describes how to configure monitoring of parameter limits and saving parameters into the database.
    --# ###########################################################
    --# Step 9: Define monitoring limits and parameters stored
    --#         into database (required)
    ...
    --# ###########################################################
    --# Step 9.1: Save load1 values to DB (required) and define
    --#           monitor limit (optional)
    --#           Monitor all clusters and all hosts
    parameter("cluster/*/hosts/*",                                     (1)
        {
            monitor = { intern = true, group = "crit" },
            poll = 30,                                                 (2)
            load1 = {
                save_history = compress_load, -- required              (3)
    --#         Enable overload warnings: (max > nbr of cores)
    --          monitor = { max = 2.1, group = "warn" },               (4)
                poll = 300                                             (5)
            },
            memfree = {                                                (6)
                save_history = compress_min, -- required
    --#         Enable memory shortage warnings:
    --          monitor = { min = 20000, group = "warn" },
                poll = 600
            },
            swapfree = {                                               (7)
                save_history = compress_min, -- required
    --#         Enable swap shortage warnings:
    --          monitor = { min = 20000, group = "warn" },
                poll = 600
            }
        }
    )
      Example 4.17. Defining general monitors
        The entry (1) defines a monitor for the
        parameters load1,
        memfree and swapfree
        for all hosts on all clusters.
        The connection to each host is checked every 30 secs (2).
        Lost connections will be reported using the critical level
        ("crit").
      
        The parameter load1 is
        stored to the database (3) and may be compared to an upper
        limit of 2.1
        (4).
        Exceeding the maximum value would be reported using the
        event group ("warn").
        Monitoring the upper limit is currently disabled.
        This parameter is read, compared and stored every 300 secs
        (5).
        The entry for parameter load1 is
        required.
      
        Analogous to load1, monitors for the
        parameters memfree (6)
        and swapfree (7)
        are pre-defined.
        Every 10 minutes, both values are stored to the database.
        Monitoring of the minimum values is disabled within this
        example.
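        To actually enable the overload warning, the
        monitor line (4) has to be uncommented and
        the maximum set to a value slightly above the number of cores
        per node. The following sketch is meant as a replacement for
        the parameter call shown in Example 4.17;
        the limit 4.5 is an assumed value for nodes with 4 cores and
        has to be adapted to the actual hardware.
    --# Sketch: Example 4.17 with the load1 overload warning enabled
    --# (max = 4.5 assumes 4-core nodes, adapt to your hardware)
    parameter("cluster/*/hosts/*",
        {
            monitor = { intern = true, group = "crit" },
            poll = 30,
            load1 = {
                save_history = compress_load, -- required
                monitor = { max = 4.5, group = "warn" },
                poll = 300
            },
            memfree = {
                save_history = compress_min, -- required
                poll = 600
            },
            swapfree = {
                save_history = compress_min, -- required
                poll = 600
            }
        }
    )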
      
    --# ###########################################################
    --# Step 9.2: Monitor and save all required sensor limits
    --#           (required)
    --#     Note: This configures all nodes identically
    --#           using '.../hosts/*'
    parameter("cluster/*/sensors/hosts/*",
        {
            monitor = { intern = true, group = "crit" },
            poll = 30,
            TempCPU1 = {
    --#         save parameters to DB (for diagrams)
                save_history = compress_max_hi,
    --#         warn if temperature exceeds 60 (Celsius?)
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            TempCPU2 = {
                save_history = compress_max_hi,
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            TempNode = {
                save_history = compress_max_hi,
                monitor = { max = 60, group = "warn" },
                poll = 120
            },
            FAN1 = {
                save_history = compress_max,
    --#         warn if fan speed drops below 3000 rpms
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN2 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN3 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            },
            FAN4 = {
                save_history = compress_max,
                monitor = { min = 3000, group = "warn" },
                poll = 300
            }
        }
    )
      Example 4.18. Defining virtual sensor monitors
Example 4.18 shows how to monitor and store the virtual sensor parameters. This entry is highly recommended. If configured, it will inform the administrator about fan or thermal problems.
Virtual sensor parameters may be configured using either IPMI or lmsensors data. For details on how to map these entries, refer to the previous section.
    --# ###########################################################
    --# Step 9.3: Define required load status (required)
    parameter("cluster/*/stat",
        {
            load1 = {
                max = {
                   save_history = compress_load,
                   poll = 60
                },
                min = {
                   save_history = compress_load,
                   poll = 60
                },
                avg = {
                   save_history = compress_load,
                   poll = 60
                }
            }
        }
    )
      Example 4.19. Defining load1 minimum, maximum and average
        Every 60 secs, this monitor calculates the minimum, maximum and
        average of the load1 values of all hosts
        and saves them to the database.
        This monitor is required by the GridMonitor GUI and must not be
        modified!
      
This section describes how to configure the event notification system of the ParaStation GridMonitor.
        Each parameter within the collector holds an internal state, e.g.
        unavailable or high.
        When transitioning from one state to another, events will be
        generated, which may be added to event groups.
      
        Currently, only two event groups (warn,
        crit) are used.
        Refer to the previous section for details on how to configure
        monitors and assign them to event groups.
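        As an illustration, the following sketch (using the
        parameter syntax from Example 4.18)
        assigns a CPU temperature limit to the crit
        group instead of warn, so that exceeding it
        triggers the critical notification configured in step 10.1.
        The limit of 75 is an assumed value; in a real configuration
        this setting would be integrated into the sensor parameter
        block of step 9.2 rather than added as a separate call.
    --# Sketch: report CPU temperatures above 75 as critical events
    --# (75 is an assumed limit, adapt to your hardware)
    parameter("cluster/*/sensors/hosts/*",
        {
            TempCPU1 = {
                save_history = compress_max_hi,
                monitor = { max = 75, group = "crit" },
                poll = 120
            }
        }
    )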
      
    --# ###########################################################
    --# Step 10: Define event notification (required)
    --# ###########################################################
    --# Step 10.1: Define event notification for critical events
    --#            (required)
    lput("parameter/event/crit",                                       (1)
        {
            collect_time = 60,      -- collect events for 60 sec       (2)
            exec_time = 30*60,      -- max. 1 mail per 30 min          (3)
            exclude_states = {                                         (4)
                "ok"                -- dont send mails for state "ok"
            },
            unavailable = 3,        -- Warn after 3 read failures      (5)
    --        exec = event_system_call( \                              (6)
    --            "cat >> /tmp/pscollect.events",  \
    --             "Warnings:", "")
    --        exec = event_system_call( \                              (7)
    --            "env DISPLAY=:0 xmessage -file -",  \
    --            "Warnings:", "")
            exec = event_system_call( \                                (8)
                "mail root -s \"Cluster Cluster1 Critical Events\"",  \
                "Critical events:", "")
        }
    )
    --# ###########################################################
    --# Step 10.2: Define event notification for warning events
    --#            (required)
    lput("parameter/event/warn",                                       (9)
        {
            collect_time = 120,     -- collect events for 120 sec
            exec_time = 60*60,      -- max. 1 mail per 60 min
            exclude_states = {
                "ok"                -- dont send mails for state "ok"
            },
            unavailable = 3,        -- Warn after 3 read failures
    --        exec = event_system_call( \
    --            "cat >> /tmp/pscollect.events",  \
    --            "Warnings:", "")
    --        exec = event_system_call( \
    --            "env DISPLAY=:0 xmessage -file -",  \
    --            "Warnings:", "")
            exec = event_system_call( \
                "mail root -s \"Cluster Cluster1 Warnings\"",  \
                "Warnings:", "")
        }
    )
      Example 4.20. Defining event notification
        The entry (1) defines the configuration for
        events within the group crit.
        Initially, they will be collected for 60 secs
        (2)
        before an event handling call
        will be executed.
        After this initial collect time, more events of this type will
        be collected for 1800 secs (30 min)
        (3)
        before the next event handling call will be issued.
        This ensures that the email system (see below) and the
        administrator will not be flooded in case of catastrophic
        errors.
      
        The exclude_states list
        (4)
        defines a list of states which will not be reported, e.g. all
        ok states.
        The next entry
        (5)
        defines how many consecutive read failures may occur before
        a connection is declared dead.
        The entries
        (6) and
        (7)
        give examples on how to act on crit
        events.
        The entry
        (8)
        defines the actual action taken when a
        crit event handling call is issued.
        In this example, an email will be sent notifying the
        administrator.
        The command will be executed as user
        pscd.
      
        Similar to the entry in line
        (1), the entry
        (9) defines the timeout and action
        taken for the event group warn.