server_monitoring

Differences

This shows you the differences between two versions of the page.

--- server_monitoring [2010/06/16 09:58] 172.26.0.166
+++ server_monitoring [2010/06/21 22:41] 172.26.14.218
Line 1:
 ===== Server Monitoring =====
-  * [[#ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage
-  * [[#monit|Monit]] - Monitors specific services
-  * [[#nagios|Nagios]] - Monitors servers,hosts and services
-  * [[#Zabbix|Zabbix]] -Monitor servers,host and services
+  * [[server_monitoring:ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage
+  * [[server_monitoring:monit|Monit]] - Monitors specific services
+  * [[server_monitoring:nagios|Nagios]] - Monitors servers,hosts and services
+  * [[server_monitoring:zabbix|Zabbix]] -Monitor servers,host and services
-
-
-===== Ganglia =====
-
-[[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster.
-
-You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia
-
-{{:ganglia_diagram_smaller.gif|}}
-
-Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia
-==== Troubleshooting ====
-From time to time there are problems with Ganglia's web interface.  You can restart the needed services following this basic procedure:
-
-  - Stop data collection daemon on HPC: ''service gmetad stop''
-  - Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%''
-  - Start data collection daemon on HPC: ''service gmetad start''
-  - Wait a minute or two
-  - Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%''
-
-Now go check the Ganglia web interface and see if the nodes have returned.
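
The restart procedure listed in the removed Troubleshooting section can be sketched as a small shell function. This is only an illustration, not a script taken from the wiki: the function names and the ''DRY_RUN'' and ''WAIT_SECONDS'' variables are hypothetical, and the 120-second default is an assumed reading of "a minute or two". The four commands themselves are the ones quoted in the steps. Dry-run mode is the default, so nothing is restarted unless it is switched off explicitly.

```shell
#!/bin/sh
# Sketch of the Ganglia restart order from the Troubleshooting steps.
# DRY_RUN and WAIT_SECONDS are hypothetical knobs for this example only;
# they are not part of Ganglia or Rocks.

run() {
    # In dry-run mode (the default) print the command instead of executing it.
    # Note: the printed form drops the quoting around multi-word arguments.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_ganglia() {
    # 1. Stop the data collection daemon on the head node.
    run service gmetad stop
    # 2. Stop the monitoring daemon on the compute nodes.
    run rocks run host compute 'service gmond stop'
    # 3. Start the data collection daemon on the head node.
    run service gmetad start
    # 4. Wait a minute or two (skipped in dry-run mode).
    if [ "${DRY_RUN:-1}" != "1" ]; then
        sleep "${WAIT_SECONDS:-120}"
    fi
    # 5. Start the monitoring daemon on the compute nodes.
    run rocks run host compute 'service gmond start'
}
```

After sourcing the file, calling ''restart_ganglia'' prints the four commands in order. Running ''DRY_RUN=0 restart_ganglia'' on the head node would execute them, presumably with the privileges that ''service'' requires.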
  
 ===== Monit ===== ===== Monit =====