The collector reads its initial configuration from the default configuration file cluster.conf, located in the directory /etc/pscollect. The collector runs as the non-privileged user pscd; therefore, this file should be owned and readable by this user.
Comments start with -- and are ignored up to the end of the line.
The complete configuration is defined using Lua, a powerful scripting language. Refer to http://www.lua.org for more details.
The configuration is divided into individual steps, each describing a particular configuration aspect. The following chapters describe these steps in detail.
The collector must be restarted to activate a newly created or modified configuration. To do so, run the command
/etc/init.d/pscollect restart
The first section within the collector configuration file defines global configuration entries required by the collector. These entries typically do not have to be modified. Note the final call of the init() method; it is important to bring the collector into a well-defined state.
--# -*- lua -*-
--# pscollect configuration
--#
--# Lua 5.1 Reference Manual: http://www.lua.org/manual/5.1/
--#
--# Set root password to abc321 via:
--# echo "passwd root abc321" | psget /var/run/pscollect/socket   (1)
debug=false

--# Load defaults
include "/opt/parastation/config/sys_cluster.pscc"

--# Overwrite defaults?
-- put('config/db_root', '/var/lib/pscollect')
-- put('config/bin_dir', '/opt/parastation/bin')
-- put('config/rcmd', 'ssh')

--# Initialize
init()
Example 4.1. Basic collector configurations
This section defines whether the collector accepts connections from any node or only connections initiated by clients on the local node.
--# ###########################################################
--# Step 1: Should any node be able to read information from
--#         collector (required, security)?

--# Accept connections only from localhost:
listen {host="localhost", port=4000}                          (1)

--# Or accept connections from any host:
--# (required if your webserver does not run on the same node)
--listen {host="0.0.0.0", port=4000}

--# Accept local socket connections:
listen {socket="/var/run/pscollect/socket"}                   (2)
Example 4.2. Configuring collector accessibility
The configuration entry (1) above tells the collector to accept connections from localhost via TCP port 4000, and entry (2) configures it to open a socket in the local file system.
When connected through a local socket, the collector checks the user ID of the peer process. If the owner is either root or the same user that runs the collector (normally pscd), the client automatically gains administrator privileges without needing to provide a password. This can, for example, be used to change the password, as shown at the beginning (1) of the configuration file.
See also the section called “Configuring basic GridMonitor GUI parameters”.
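For example, with the local socket configured as in Example 4.2, the root password for the GridMonitor can be set from the collector node (run as root or as the user pscd; abc321 is just the placeholder password taken from the comment in Example 4.1):

echo "passwd root abc321" | psget /var/run/pscollect/socket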
Within this section, the cluster name is defined:
--# ###########################################################
--# Step 2: define first cluster (required)
cluster("Cluster1")                                           (1)
Example 4.3. Configuring cluster name
The entry cluster (1) tells the collector about a cluster called Cluster1. All further configuration steps refer to this cluster. The entry has to be modified to match your actual cluster name.
Within this section, all nodes belonging to a particular cluster are defined.
--# ###########################################################
--# Step 3: Define host list for first cluster (required)
--# Note: To enable reading S.M.A.R.T data from disk drives,
--#       add the next two lines to the file /etc/sudoers
--#       (on each node):
--#       ------
--#       Defaults:pscd passwd_tries=0, !syslog
--#       pscd ALL=NOPASSWD: /usr/sbin/smartctl \
--#           -[iAH] /dev/[sh]d[a-z], /usr/sbin/smartctl \
--#           -d ata -[iAH] /dev/[sh]d[a-z]
--#       ------
--#       Using S.M.A.R.T data is optional
--# Warning: With some systems/controllers/disks, reading
--#          S.M.A.R.T data may hang your system.
--#          Test it!!!

--# Define host list, one by one:
host("localhost")   -- replace with real hostname             (1)
host("master")                                                (2)
host("node01")
host("node02")
host("node03")

--# Or use a "lua for loop"?
--# (defining nodes 'cnode1' up to 'cnode16')
for n = 1, 16 do                                              (3)
    host("cnode"..n)
end

--# Use node names with leading '0'
--# (defining nodes 'node-01' up to 'node-16')
for n = 1, 16 do
    host("node-"..string.format("%0.2d",n))                   (4)
end
Example 4.4. Configuring host list
The entry host (1) announces a new host to the cluster. All hosts announced using this function show up as cluster nodes within the graphical user interface.
The name localhost should be replaced by the actual node name, e.g. master. It is generally not a good idea to list a node called localhost, as this name is ambiguous within a cluster. Use the cluster-internal real name of the node, like master or node01.
To enable the pscollect command to remotely run the required agent script, the ssh auto-login for the user pscd must be configured on the newly announced host; see the section called “Configuring the collector – enable pscd auto-login” for more information.
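As a rough sketch only (the referenced section describes the authoritative procedure), such an auto-login is typically set up by creating a password-less SSH key for pscd on the collector node and copying the public key to every announced host:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa     # run as user pscd on the collector node
ssh-copy-id pscd@node01                      # repeat for every announced host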
To announce a bunch of nodes at once, a Lua loop may be used (3). Quite often, the numbering part of the node names includes leading zeros, like node01. Line (4) shows how to generate appropriate node names.
At least one host entry is required.
Within this section, the nodes providing ParaStation process management information are defined.
--# ###########################################################
--# Step 4.1: Hostname of ParaStation master node (opt.)
ps4_host("localhost")                                         (1)
Example 4.5. Configuring ParaStation host name
The entry ps4_host (1) tells the collector to connect to the ParaStation daemon psid on localhost to gather ParaStation process management information, e.g. which jobs are currently active. Using localhost is ok here, as this host name will not show up in the graphical user interface.
This entry is optional and defaults to localhost.
--# ###########################################################
--# Step 4.2: Hostname of ParaStation accounting node (opt.)
--# Note: To provide accounting information, grant the
--#       user 'pscd' read access to all ParaStation
--#       accounting files on the accounting host:
--#       chmod g+rx /var/account;
--#       chmod g+r /var/account/*;
--#       chgrp -R pscd /var/account
ps4_acc_host("localhost")                                     (1)
Example 4.6. Configuring ParaStation accounting host name
The entry ps4_acc_host (1) tells the collector to connect to the host localhost to read ParaStation job accounting information. Using localhost is ok here, as this host name will not show up in the graphical user interface.
This entry is optional and is disabled by default.
The user pscd must be able to read the ParaStation accounting files located in the directory /var/account.
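The configuration comments in Example 4.6 show how to grant this access on the accounting host:

chmod g+rx /var/account
chmod g+r /var/account/*
chgrp -R pscd /var/account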
Within this section, the nodes providing batch system information are defined.
Currently, only Torque is supported.
--# ###########################################################
--# Step 5.1: Hostname of TORQUE server node (optional)
pbs_host("localhost")                                         (1)
Example 4.7. Configuring batch server name
The entry pbs_host (1) tells the collector to connect to the Torque server on localhost to gather batch job information. Using localhost is ok here, as this host name will not show up in the graphical user interface.
This entry is optional and is disabled by default.
--# ###########################################################
--# Step 5.2: Hostname of TORQUE accounting node (optional)
pbs_acc_host("localhost")                                     (1)
Example 4.8. Configuring batch accounting host name
The entry pbs_acc_host (1) tells the collector to read job accounting information collected by Torque from the host localhost. Using localhost is ok here, as this host name will not show up in the graphical user interface.
This entry is optional and is disabled by default.
Within this section, the virtual sensors subsystem is configured. These virtual sensors read real sensors on a node either via IPMI or via the lmsensors package. At least one of these two sensor sources must be configured.
--# ###########################################################
--# Step 6: Define hardware sensor sources (required)
--#         Either IPMI (6.1) or lmsensors (6.2) must be used!

--# ###########################################################
--# Step 6.1: Define IPMI sensor sources (optional)

--# ###########################################################
--# Step 6.1.1: Define IPMI user access information in file
--#             /etc/pscollect/ipmiuser (file is required)
--#             File format: 'user:password'                  (1)
--# Note: This file should be readable only for user
--#       'pscd'.
Example 4.9. Setting up the IPMI authentication
Within this step, the authentication information used to connect to a baseboard management controller (BMC) via IPMI is defined. This information is stored within a separate file /etc/pscollect/ipmiuser. The file contains only a single line with username and password, separated by a colon. This file should only be readable by the user pscd.
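A simple way to create this file is shown below; gmadmin and secret are placeholder credentials that have to be replaced by the account configured on your BMCs, and the ownership and permission commands are just one way to restrict access to the user pscd:

echo "gmadmin:secret" > /etc/pscollect/ipmiuser
chown pscd /etc/pscollect/ipmiuser
chmod 600 /etc/pscollect/ipmiuser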
--# ###########################################################
--# Step 6.1.2: Map IPMI hosts (BMCs) to hosts (nodes)
--#             (required, if IPMI), e.g.
--#             ipmi_host("hostname","ipmi_host_addr") or
--#             ipmi_host_p("hostname","ipmi_host_name",
--#                         "username","password")
--#             If no username/password is provided, the file
--#             /etc/pscollect/ipmiuser will be consulted
--#             (see 6.1.1)
--# Note: If your local BMC does not respond to requests
--#       from the local host, e.g. ping from master to
--#       master-bmc does not resolve the BMC address,
--#       use the special IPMI host name "localhost".
--#       Using this name, the ipmitool uses the 'open'
--#       interface, which requires proper kernel
--#       module support. Try
--#           chkconfig -a ipmi
--#           /etc/init.d/ipmi start
--#       In addition, the user 'pscd' must be able to
--#       run ipmitool as user root. Add the next two
--#       lines to the file /etc/sudoers:
--#       ------
--#       Defaults:pscd passwd_tries=0, !syslog
--#       pscd ALL=NOPASSWD: /usr/bin/ipmitool -A none \
--#           -I open -H localhost [a-zA-Z/]*, \
--#           /usr/bin/ipmitool -A none -I open \
--#           -H localhost -S [a-zA-Z /]*
--#       ------

--# Single IPMI host
--ipmi_host("node01","192.168.44.1")                          (1)
--ipmi_host("node02","node02-ipmi")
--ipmi_host("master","localhost")  -- see Note above

--# 50 IPMI hosts in one loop
for n = 1, 50 do                                              (2)
    ipmi_host("node"..n,"192.168.44."..n)
end
Example 4.10. Configuring IPMI host mapping
The ipmi_host entry (1) tells the collector to read IPMI information for host node01 from the BMC with address 192.168.44.1. Names like node01-bmc may also be used instead of the BMC IP address.
Entry (2) shows an example of how to map a number of nodes (node1 up to node50) using a Lua loop.
--# ###########################################################
--# Step 6.1.3: Define IPMI chassis (optional), e.g.
--#             ipmi_chassis("hostname","ipmi_chassis_addr")
ipmi_chassis("chassis1","192.168.20.1")                       (1)
Example 4.11. Configuring IPMI chassis mapping
The ipmi_chassis entry (1) maps a chassis BMC controller managing multiple blade servers to a chassis name.
This is currently not supported in the GridMonitor GUI!
--# ###########################################################
--# Step 6.1.4: Map IPMI server sensor data (required, if IPMI),
--#             e.g. map_ipmihost("hostname","virtualsensor",
--#                               "realsensor")
--#             Required virtual sensors are:
--#                 TempCPU1,
--#                 TempCPU2,
--#                 TempNode,
--#                 FAN1,
--#                 FAN2,
--#                 FAN3,
--#                 FAN4

--# To map 16 nodes called node01 up to node16 at once, use:
for n = 1, 16 do
    host = "node"..string.format("%0.2d",n)
    map_ipmihost(host,"TempNode","Ambient_Temp")              (1)
    map_ipmihost(host,"TempCPU1","Temp1")
    map_ipmihost(host,"TempCPU2","Temp2")
    --# map FAN1 to FAN4 to the BMC sensors fan1 to fan4
    for i = 1, 4 do
        map_ipmihost(host,"FAN"..i, "fan"..i)
    end
end

--# Mapping suitable for 16 Dell SC1435 servers called node01
--# up to node16:                                             (2)
--for n = 1, 16 do
--    host = "node"..string.format("%0.2d",n)
--    map_ipmihost(host,"TempNode","Ambient_Temp")
--    map_ipmihost(host,"TempCPU1","Temp1")
--    map_ipmihost(host,"TempCPU2","Temp2")
--    for i = 1, 2 do
--        map_ipmihost(host,"FAN"..i,    "FAN_MOD_"..i.."A_RPM")
--        map_ipmihost(host,"FAN"..i +2, "FAN_MOD_"..i.."B_RPM")
--        map_ipmihost(host,"FAN"..i +4, "FAN_MOD_"..i.."C_RPM")
--        map_ipmihost(host,"FAN"..i +6, "FAN_MOD_"..i.."D_RPM")
--    end
--end

--# Mapping suitable for a Dell PE1950 server:                (3)
--map_ipmihost("node01","TempNode","Ambient_Temp")
--# Note: Temp3 and Temp4 seem to be constant (40), so
--#       ignore them for now!
--for i = 1, 2 do
--    map_ipmihost("node01","TempCPU"..i,"Temp"..i)
--end
--for i = 1, 4 do
--    map_ipmihost("node01","FAN"..i,     "FAN_MOD_"..i.."A_RPM")
--    map_ipmihost("node01","FAN"..i +4,  "FAN_MOD_"..i.."B_RPM")
--    map_ipmihost("node01","FAN"..i +8,  "FAN_MOD_"..i.."C_RPM")
--    map_ipmihost("node01","FAN"..i +12, "FAN_MOD_"..i.."D_RPM")
--end
Example 4.12. Configuring IPMI host sensor mappings
Within this section, the mappings from IPMI sensor values to virtual sensor values are defined. For example, entry (1) maps the IPMI sensor called Ambient_Temp to the virtual sensor name TempNode. Using Lua loops is a very convenient way to map a group of BMCs at once.
Use the parameter browser (ipmi->sdr->list) to list all available sensor names and values.
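Alternatively, the raw sensor names can also be read directly with ipmitool, either locally via the 'open' interface (assuming the kernel IPMI modules and the sudo rules from Step 6.1.2 are in place) or over the LAN against the BMC address; the user name and password below are placeholders:

sudo /usr/bin/ipmitool -A none -I open -H localhost sdr
ipmitool -I lan -H 192.168.44.1 -U gmadmin -P secret sdr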
Entry (2) shows an example of how to map the IPMI sensors of a Dell SC1435 server. Similarly, entry (3) shows the mapping for a Dell PE1950 server.
--# ###########################################################
--# Step 6.1.5: Map IPMI chassis sensor data (optional), e.g.
--#             map_ipmichassis("hostname","virtualsensor",
--#                             "realsensor")
map_ipmichassis("chassis1","TempChassis","Temp1")             (1)
Example 4.13. Configuring IPMI chassis sensor mappings
The map_ipmichassis entry (1) maps a chassis BMC controller sensor value called Temp1 to the virtual sensor TempChassis.
This is currently not supported in the GridMonitor GUI!
--# ###########################################################
--# Step 6.2: Define lmsensors sources (optional), e.g.
--#           map_lmhost("hostname","Virtualsensor",
--#                      "Realsensor")
--#           Required virtual sensors are:
--#               TempCPU1,
--#               TempCPU2,
--#               TempNode,
--#               FAN1,
--#               FAN2,
--#               FAN3,
--#               FAN4
--# To list your available sensors, use sensors
--for n = 1, 16 do
--    host = "node"..string.format("%0.2d",n)
--    map_lmhost(host,"TempCPU1","temp1")                     (1)
--    map_lmhost(host,"TempCPU2","temp2")
--    map_lmhost(host,"TempNode","temp3")
--    map_lmhost(host,"FAN1","fan1")
--    map_lmhost(host,"FAN2","fan2")
--    map_lmhost(host,"FAN3","fan3")
--    map_lmhost(host,"FAN4","fan4")
--end
Example 4.14. Configuring lmsensors sensor mappings
The map_lmhost entry (1) maps the sensor called temp1, read using lmsensors on node node01, to the virtual sensor called TempCPU1 for node node01.
As with the IPMI mappings, a Lua for-loop is handy to map a group of identical nodes at once.
Use the parameter browser (hosts->sensors) to list all available sensor names.
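As noted in the configuration comments of Example 4.14, the available lmsensors names can also be listed directly on a node with the sensors utility from the lm_sensors package:

sensors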
This section describes how to configure SNMP managed network switches.
--# ###########################################################
--# Step 7: Define SNMP managed switches (optional), e.g.
--#         snmp("addr")
snmp("switch1")                                               (1)

--# Switch with non-default arguments:
snmp("switch2", { host="sw4", version = "1" })                (2)

--# Switch with all arguments:
snmp("switch3",                                               (3)
     { host = "s",            -- snmp source (defaults to name)
       community = "public",  -- snmp community
       version = "2c",        -- version ("1","2c" or "3")
       table_expire = 15,     -- cache expire in s
       value_expire = 5,      -- cache expire in s
       timeout = 500,         -- connection timeout in ms
       retries = 3            -- connection retries
     }
)
Example 4.15. Setting up SNMP devices
The entry snmp (1) announces an SNMP-manageable device using default connection values. Likewise, entry (2) announces an SNMP device called switch2 using the address sw4 and SNMP protocol version 1. Entry (3) shows all available options of the snmp mapping call.
This section describes how to configure an additional cluster within the collector.
--# ###########################################################
--# Step 8: Define second cluster (optional)
--#         Repeat steps 2 to 7
--cluster "Cluster2"
--host "front2"
--host "c2node1"
--host "c2node2"
--host "c2node3"
--ps4_host "front2"
Example 4.16. Configuring an additional cluster
These entries show how to configure an additional cluster managed by this collector. Just define a new cluster entry and repeat all required and optional configuration steps from step 2 up to step 7.
This section describes how to configure monitoring of parameter limits and saving parameters into the database.
--# ###########################################################
--# Step 9: Define monitoring limits and parameters stored
--#         into database (required) ...

--# ###########################################################
--# Step 9.1: Save load1 values to DB (required) and define
--#           monitor limit (optional)
--# Monitor all clusters and all hosts
parameter("cluster/*/hosts/*",                                (1)
    { monitor = { intern = true, group = "crit" },
      poll = 30,                                              (2)
      load1 = {
          save_history = compress_load,  -- required          (3)
          --# Enable overload warnings: (max > nbr of cores)
          -- monitor = { max = 2.1, group = "warn" },         (4)
          poll = 300                                          (5)
      },
      memfree = {                                             (6)
          save_history = compress_min,   -- required
          --# Enable memory shortage warnings:
          -- monitor = { min = 20000, group = "warn" },
          poll = 600
      },
      swapfree = {                                            (7)
          save_history = compress_min,   -- required
          --# Enable swap shortage warnings:
          -- monitor = { min = 20000, group = "warn" },
          poll = 600
      }
    }
)
Example 4.17. Defining general monitors
Entry (1) defines a monitor for the parameters load1, memfree and swapfree for all hosts on all clusters. The connection to each host is checked every 30 secs (2). Lost connections will be reported using the critical level ("crit").
The parameter load1 is stored to the database (3) and may be compared to an upper limit of 2.1 (4). Exceeding this maximum value would be reported using the event group "warn". Monitoring the upper limit is currently disabled. This parameter is read, compared and stored every 300 secs (5).
The entry for parameter load1 is required.
Analogous to load1, monitors for the parameters memfree (6) and swapfree (7) are pre-defined. Every 10 minutes, both values are stored to the database. Monitoring of the minimum values is disabled within this example.
--# ###########################################################
--# Step 9.2: Monitor and save all required sensor limits
--#           (required)
--# Note: This configures all nodes identically
--#       using '.../hosts/*'
parameter("cluster/*/sensors/hosts/*",
    { monitor = { intern = true, group = "crit" },
      poll = 30,
      TempCPU1 = {
          --# save parameters to DB (for diagrams)
          save_history = compress_max_hi,
          --# warn if temperature exceeds 60 (Celsius?)
          monitor = { max = 60, group = "warn" },
          poll = 120
      },
      TempCPU2 = {
          save_history = compress_max_hi,
          monitor = { max = 60, group = "warn" },
          poll = 120
      },
      TempNode = {
          save_history = compress_max_hi,
          monitor = { max = 60, group = "warn" },
          poll = 120
      },
      FAN1 = {
          save_history = compress_max,
          --# warn if fan speed drops below 3000 rpms
          monitor = { min = 3000, group = "warn" },
          poll = 300
      },
      FAN2 = {
          save_history = compress_max,
          monitor = { min = 3000, group = "warn" },
          poll = 300
      },
      FAN3 = {
          save_history = compress_max,
          monitor = { min = 3000, group = "warn" },
          poll = 300
      },
      FAN4 = {
          save_history = compress_max,
          monitor = { min = 3000, group = "warn" },
          poll = 300
      }
    }
)
Example 4.18. Defining virtual sensor monitors
Example 4.18 shows how to monitor and store the virtual sensor parameters. This entry is highly recommended. If configured, it will inform the administrator about fan or thermal problems.
Virtual sensor parameters may be configured using either IPMI or lmsensors data. For details on how to map these entries, refer to the previous section.
--# ###########################################################
--# Step 9.3: Define required load status (required)
parameter("cluster/*/stat",
    { load1 = {
          max = { save_history = compress_load, poll = 60 },
          min = { save_history = compress_load, poll = 60 },
          avg = { save_history = compress_load, poll = 60 }
      }
    }
)
Example 4.19. Defining load1 minimum, maximum and average
Every 60 secs, this monitor calculates the minimum, maximum and average of the load1 values of all hosts and saves them to the database.
This monitor is required by the GridMonitor GUI and must not be modified!
This section describes how to configure the event notification system of the ParaStation GridMonitor.
Each parameter within the collector holds an internal state, e.g. unavailable or high. When transitioning from one state to another, events are generated, which may be added to event groups. Currently, only two event groups (warn, crit) are used. Refer to the previous section on how to configure monitors and assign them to event groups.
--# ###########################################################
--# Step 10: Define event notification (required)

--# ###########################################################
--# Step 10.1: Define event notification for critical events
--#            (required)
lput("parameter/event/crit",                                  (1)
    { collect_time = 60,       -- collect events for 60 sec   (2)
      exec_time = 30*60,       -- max. 1 mail per 30 min      (3)
      exclude_states = {                                      (4)
          "ok"                 -- don't send mails for state "ok"
      },
      unavailable = 3,         -- Warn after 3 read failures  (5)
      -- exec = event_system_call( \                          (6)
      --     "cat >> /tmp/pscollect.events", \
      --     "Warnings:", "")
      -- exec = event_system_call( \                          (7)
      --     "env DISPLAY=:0 xmessage -file -", \
      --     "Warnings:", "")
      exec = event_system_call( \                             (8)
          "mail root -s \"Cluster Cluster1 Critical Events\"", \
          "Critical events:", "")
    }
)

--# ###########################################################
--# Step 10.2: Define event notification for warning events
--#            (required)
lput("parameter/event/warn",                                  (9)
    { collect_time = 120,      -- collect events for 120 sec
      exec_time = 60*60,       -- max. 1 mail per 60 min
      exclude_states = {
          "ok"                 -- don't send mails for state "ok"
      },
      unavailable = 3,         -- Warn after 3 read failures
      -- exec = event_system_call( \
      --     "cat >> /tmp/pscollect.events", \
      --     "Warnings:", "")
      -- exec = event_system_call( \
      --     "env DISPLAY=:0 xmessage -file -", \
      --     "Warnings:", "")
      exec = event_system_call( \
          "mail root -s \"Cluster Cluster1 Warnings\"", \
          "Warnings:", "")
    }
)
Example 4.20. Defining event notification
Entry (1) defines the configuration for events within the group crit. Initially, events will be collected for 60 secs (2) before an event handling call is executed. After this initial collect time, further events of this type will be collected for 1800 secs (30 min) (3) before the next event handling call is issued. This ensures that the email system (see below) and the administrator will not be flooded in case of catastrophic errors.
The exclude_states list (4) defines a list of states which will not be reported, e.g. all ok states. The next entry (5) defines how many consecutive read failures may occur before a connection is declared dead.
Entries (6) and (7) give examples of how to act on crit events.
Entry (8) defines the actual action taken when a crit event handling call is issued. In this example, an email is sent to notify the administrator. The command will be executed as the user pscd.
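Since this notification relies on the local mail command, it may be worth checking once that mail sent by the user pscd actually reaches the administrator, for example (assuming a local MTA is configured):

su -s /bin/sh -c 'echo "pscollect mail test" | mail root -s "pscollect mail test"' pscd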
Similar to entry (1), entry (9) defines the timeout and the action taken for the event group warn.