Data collection process: collector


All available data is retrieved and managed by a central process called the collector, which talks to remote agents using various protocols such as SNMP. To minimize overall system and network load, the collector reads only the data requested by a client application from the agents. If no client is interested, no data is transferred, so no compute cycles or network bandwidth are wasted. In addition, the collector is specifically designed to cope with dead or overloaded nodes, broken network connections and limited network bandwidth.

The collector gathers data from (almost) all information sources available within a cluster, describing:

compute nodes,
fileservers, frontend nodes,
baseboard management controllers (BMCs),
network switches,
disks and storage devices,
environment monitoring devices,
runtime systems,
batch queuing systems.

Dedicated collector plug-ins are provided for all types of data sources; each plug-in controls a local or remote agent. It reads the requested data through the appropriate agent and formats it so that the collector can use it.
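The division of labour between plug-in and agent can be sketched as follows. This is only an illustration: the class names, method names and record layout are hypothetical and do not reflect the actual GridMonitor plug-in interface.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Hypothetical plug-in interface; not the real GridMonitor API."""

    @abstractmethod
    def read(self, node: str, parameter: str) -> object:
        """Ask the underlying agent for one raw value."""

    def fetch(self, node: str, parameter: str) -> dict:
        # Wrap the raw agent reply in a uniform record for the collector.
        return {
            "node": node,
            "parameter": parameter,
            "value": self.read(node, parameter),
        }


class StaticPlugin(Plugin):
    """Stand-in for a real agent: serves values from a dictionary."""

    def __init__(self, values):
        self._values = values

    def read(self, node, parameter):
        return self._values[(node, parameter)]


plugin = StaticPlugin({("node01", "load1"): 0.42})
record = plugin.fetch("node01", "load1")
```

A real plug-in would replace `StaticPlugin.read` with a call to its agent, for example an SNMP query, while the collector would only ever see the uniform record.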

Parameters include not only operating system values like system load, network counters or temperatures, but also parameters describing active jobs provided by the ParaStation process management, queued jobs provided by a batch queuing system, etc. If available, they may even include information provided by rack/room environment monitoring devices, uninterruptible power supplies or similar devices.

Each parameter is cached within the collector for a certain period of time. Clients reading a parameter within this cache timeout will be provided with the cached value and are therefore not able to cause excessive network traffic.
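The effect of the cache timeout can be illustrated with a small sketch. The class `ParameterCache`, the reader callback and the injectable clock are assumptions made for this example; the collector's internal implementation is not documented here.

```python
import time


class ParameterCache:
    """Serve values from the cache until they are older than `timeout` seconds."""

    def __init__(self, reader, timeout, clock=time.monotonic):
        self._reader = reader      # callable that actually queries an agent
        self._timeout = timeout    # cache lifetime in seconds
        self._clock = clock        # injectable so the demo needs no sleeping
        self._entries = {}         # key -> (timestamp, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._timeout:
            return entry[1]        # still fresh: no agent traffic
        value = self._reader(key)  # missing or expired: read from the agent
        self._entries[key] = (now, value)
        return value


agent_reads = []


def fake_reader(key):
    # Hypothetical agent: records each read and returns the read count.
    agent_reads.append(key)
    return len(agent_reads)


now = [0.0]
cache = ParameterCache(fake_reader, timeout=5.0, clock=lambda: now[0])
first = cache.get("node01.load1")    # triggers an agent read
second = cache.get("node01.load1")   # answered from the cache
now[0] = 6.0                         # move past the cache timeout
third = cache.get("node01.load1")    # triggers a second agent read
```

However often clients ask within the timeout, the agent is queried only once per cache lifetime, and a parameter that nobody requests is never read at all.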

Data can be stored to and retrieved from a database; therefore a data history is available, e.g. for plotting diagrams. Parameters and sample frequencies can be configured independently.
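As a rough illustration of storing samples and reading back a history for plotting, the sketch below uses an in-memory SQLite database. The database engine, table layout and function names are assumptions chosen for the example; the source does not specify which database or schema the collector uses.

```python
import sqlite3

# In-memory database with a hypothetical one-table schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts REAL, parameter TEXT, value REAL)")


def store(conn, ts, parameter, value):
    """Append one sample to the history."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (ts, parameter, value))


def history(conn, parameter):
    """Return all (timestamp, value) pairs of one parameter, oldest first."""
    cursor = conn.execute(
        "SELECT ts, value FROM samples WHERE parameter = ? ORDER BY ts",
        (parameter,),
    )
    return cursor.fetchall()


# Two parameters sampled at different frequencies.
store(conn, 0.0, "node01.load1", 0.5)
store(conn, 60.0, "node01.load1", 0.75)
store(conn, 30.0, "node01.cpu_temp", 48.0)

series = history(conn, "node01.load1")
```

The resulting list of pairs is the kind of time series a client would need to plot a diagram.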

'Virtual' parameters, derived from data actually read, can be computed, monitored and stored to the database, e.g. the total system load as the sum of all node load values.
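The total system load mentioned above could be derived as in this minimal sketch. The function name and the sample values are made up for illustration.

```python
def total_load(node_loads):
    """Virtual parameter: sum of the load values read from all nodes."""
    return sum(node_loads.values())


# Hypothetical load values as read from three nodes.
loads = {"node01": 0.5, "node02": 1.25, "node03": 0.25}
total = total_load(loads)
```

The derived value depends only on values that have already been read, so computing it causes no additional agent traffic.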

Each known numerical parameter can be compared to an upper and a lower limit. If a value falls below the lower limit or exceeds the upper limit, actions can be triggered, e.g. generating an event. String parameters can likewise be compared to constant strings. To monitor these parameters continuously, reading cycles can be defined. All of this is fully configurable with respect to parameter name, upper and lower limit, cycle time, etc.
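A limit check of this kind might look like the following sketch. The `Limit` record and the `check` function are hypothetical and only mirror the configurable items named above (parameter name, upper and lower limit, cycle time); they are not the collector's configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limit:
    parameter: str        # hypothetical parameter name
    lower: float
    upper: float
    cycle_seconds: float  # how often the parameter would be read


def check(limit: Limit, value: float) -> Optional[str]:
    """Return an event message if the value violates the limit, else None."""
    if value < limit.lower:
        return f"{limit.parameter}: {value} is below lower limit {limit.lower}"
    if value > limit.upper:
        return f"{limit.parameter}: {value} is above upper limit {limit.upper}"
    return None


cpu_temp = Limit("node01.cpu_temp", lower=10.0, upper=80.0, cycle_seconds=30.0)
ok_event = check(cpu_temp, 45.0)    # within limits, no event
hot_event = check(cpu_temp, 91.5)   # exceeds the upper limit
```

In a running system, a scheduler would call `check` once per reading cycle and pass any returned message on to the event handling.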

Events describing abnormal situations within a cluster can be generated by monitoring parameter limits, node availability, etc. Events will be stored within the database and reported by email.

Besides the actual data, the collector also provides information about the type of available data. This is called the parameter type system. Using this system, the GridMonitor GUI can easily construct dynamic selection boxes without actually reading the data, which would waste network bandwidth and compute cycles. The parameter type system also enables a GridMonitor GUI to dynamically include new parameters without modifying the scripts or page layout.
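How a GUI could build a selection box from type information alone is sketched below. The dictionary layout, field names and parameter names are assumptions for illustration; the actual format of the parameter type system is not described here.

```python
# Hypothetical type descriptions as a collector might publish them.
PARAMETER_TYPES = {
    "load1": {"type": "float", "description": "1-minute load average"},
    "cpu_temp": {"type": "float", "description": "CPU temperature"},
    "hostname": {"type": "string", "description": "Node name"},
}


def numeric_choices(types):
    """Names of all numeric parameters, e.g. for a 'plot this value' box."""
    return sorted(
        name for name, info in types.items() if info["type"] in ("int", "float")
    )


choices = numeric_choices(PARAMETER_TYPES)

# A newly announced parameter shows up without changing the GUI code.
PARAMETER_TYPES["fan_speed"] = {"type": "int", "description": "Fan speed"}
updated_choices = numeric_choices(PARAMETER_TYPES)
```

Only metadata is consulted here, so no parameter values have to be read from any agent to populate the selection box.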