Server Monitoring
Ganglia
Ganglia is a system for measuring, recording, and graphing certain metrics about hosts in a cluster. Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc. These metrics are saved and graphed periodically by RRDtool. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster.
You can see the ganglia installation here: http://hpc.ilri.cgiar.org/ganglia (if it's working)
The graphs sometimes stop drawing, for reasons not yet identified. After a cluster reboot the graphs work for a few days and then appear blank. However, if you call the script responsible for drawing the graphs with no arguments, you can see that graph generation itself still works: http://hpc.ilri.cgiar.org/ganglia/graph.php
Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia
Notes
It appears that other CGIAR clusters have been configured to query our gmond daemon. I'm not sure why, as their respective ganglia pages do not show ILRI HPC stats. In any case, the relevant configuration settings are in /etc/gmond.conf on the head node; note especially the trusted_hosts entries:
#
# Gmond config file for Cluster Cluster.
# Generated by ganglia.xml node without aid from the database.
#
name "ILRI"
owner "Cgiar"
url "http://hpc.ilri.cgiar.org/"
latlong "N32.87 W117.22"
mcast_channel "237.170.26.97"
#
# Increase size of gmond user (gmetric) hash table.
#
num_custom_metrics 2048
# Uncomment the next line for monitoring by the Rocks Cluster Network.
trusted_hosts 220.227.242.214 # hpc.icrisat.cgiar.org
trusted_hosts 202.123.56.187 # hpc.irri.cgiar.org
trusted_hosts 216.244.151.133 # ?
trusted_hosts 200.62.229.37 # hpc.cip.cgiar.org
# Listen only on the private cluster interface.
mcast_if eth0
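To review which remote hosts are currently trusted without reading the whole file, something like the following may help. It is a minimal sketch that assumes the one-directive-per-line format shown above; the function name list_trusted_hosts is made up for this page and is not part of Ganglia.

```shell
# Print every address listed on a trusted_hosts line of a gmond config file.
# Assumption: the old one-directive-per-line format shown above, where
# anything after "#" is a comment. list_trusted_hosts is a hypothetical helper.
list_trusted_hosts() {
    conf="${1:-/etc/gmond.conf}"
    # Drop comments first, then print each field after the keyword.
    sed 's/#.*//' "$conf" |
        awk '$1 == "trusted_hosts" { for (i = 2; i <= NF; i++) print $i }'
}
```

Calling list_trusted_hosts with no argument reads /etc/gmond.conf; pass another path to inspect a copy of the file.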
Connections refused
I kept seeing this error in /var/log/messages on the head node:
Aug 28 12:44:24 hpc-ilri /usr/sbin/gmond[3453]: server_thread() Host 200.62.229.37 tried to connect and was refused
Apparently that is the Potato Center's (CIP) ganglia trying to talk to our ganglia. Adding trusted_hosts 200.62.229.37 to /etc/gmond.conf on the head node and restarting gmond fixed it.
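To find out whether any other hosts are being refused, the log can be summarised by address. This is a rough sketch that assumes the message format of the log line above (the address is the word after "Host"); refused_hosts is a hypothetical helper name, not a Ganglia tool.

```shell
# Count gmond "was refused" messages per remote host, most frequent first.
# Assumption: log lines look like the example above, with the address
# directly after the word "Host". refused_hosts is a hypothetical helper.
refused_hosts() {
    log="${1:-/var/log/messages}"
    awk '/gmond/ && /tried to connect and was refused/ {
        for (i = 1; i < NF; i++)
            if ($i == "Host") { print $(i + 1); break }
    }' "$log" | sort | uniq -c | sort -rn
}
```

Each output line is a count followed by an address. Any address that shows up here and belongs to a cluster you want to allow is a candidate for a trusted_hosts entry.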
Monit
Monit is a free, open source utility for managing and monitoring processes, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations. Monit can start a process if it is not running, restart a process if it does not respond, and stop a process if it uses too many resources. It logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alerts. Monit provides a built-in HTTP(S) interface, so you can use a browser to access the Monit server.
M/Monit expands upon Monit's capabilities to provide monitoring and management of all Monit-enabled hosts from one easy-to-use web interface. Status and events from each monitored system are updated in real time and displayed in charts, graphs and tables.
Get the latest version at: http://mmonit.com/monit/download
$ wget http://mmonit.com/monit/dist/monit-5.0.3.tar.gz
$ tar xfz monit-5.0.3.tar.gz
$ cd monit-5.0.3
$ ./configure && make && make install
Accessing monit: http://hpc.ilri.cgiar.org:2812
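Monit only watches services that are declared in its control file. As an illustration of the "start a process if it does not run" behaviour described above, the sketch below prints a "check process" stanza for a service controlled by an init script. The helper name monit_process_stanza and the example paths (/var/run/gmond.pid, /etc/init.d/gmond) are assumptions for illustration and may not match this cluster.

```shell
# Print a Monit "check process" stanza for a service run from an init script.
# Assumptions: the service writes a pidfile and its init script accepts
# "start" and "stop". monit_process_stanza is a hypothetical helper.
monit_process_stanza() {
    name="$1"
    pidfile="$2"
    initscript="$3"
    cat <<EOF
check process $name with pidfile $pidfile
    start program = "$initscript start"
    stop program = "$initscript stop"
EOF
}
```

For example, monit_process_stanza gmond /var/run/gmond.pid /etc/init.d/gmond prints a stanza that could be appended to the Monit control file; running monit -t afterwards checks the file's syntax.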
Nagios
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. http://www.nagios.org/about
Installation
Download the latest version of Nagios from http://www.nagios.org/download
$ wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.0.tar.gz
$ tar xzf nagios-3.2.0.tar.gz
$ cd nagios-3.2.0
$ ./configure
$ make all
$ useradd nagios
$ make install
$ make install-init
$ make install-commandmode
$ make install-config
$ make install-webconf
Configuration
Running the following command creates a new file called htpasswd.users in the /usr/local/nagios/etc directory and adds a username/password entry for nagiosadmin. You will be asked to provide the password that nagiosadmin will use to authenticate to the web server.
htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
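To confirm that the entry was written, the file can be checked for the user name. This is a small sketch that relies only on the "user:hash" line layout of htpasswd files; htpasswd_has_user is a hypothetical helper, not part of Apache or Nagios.

```shell
# Succeed if the htpasswd file contains an entry for the given user.
# Relies only on the "user:hash" layout of each line.
# htpasswd_has_user is a hypothetical helper.
htpasswd_has_user() {
    file="$1"
    user="$2"
    # Compare the first colon-separated field exactly, not as a pattern.
    awk -F: -v u="$user" '$1 == u { found = 1 } END { exit !found }' "$file"
}
```

For example, htpasswd_has_user /usr/local/nagios/etc/htpasswd.users nagiosadmin && echo "entry present" prints the message only when the entry exists.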