server_monitoring

Differences

This shows you the differences between two versions of the page.

server_monitoring [2009/11/27 08:38] alan
server_monitoring [2010/05/22 14:19] – external edit 127.0.0.1
Line 8: Line 8:
 [[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster. [[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster.
  
-You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia (if it's working) +You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia
- +
-For some reason sometimes the graphs do not draw.  If you reboot the cluster the graphs work for some days and then seem to stop drawing (the graphs appear blank).  However, if you call the script responsible for drawing the graphs with no arguments, you'll see graphs are actually working: http://hpc.ilri.cgiar.org/ganglia/graph.php+
  
 {{:ganglia_diagram_smaller.gif|}} {{:ganglia_diagram_smaller.gif|}}
  
 Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia
 +==== Troubleshooting ====
 +From time to time there are problems with Ganglia's web interface.  You can restart the needed services following this basic procedure:
  
-==== Troubleshooting ==== +  - Stop data collection daemon on HPC: ''service gmetad stop'' 
-When Ganglia has problems displaying nodes it may need to be restarted.  Restart the daemons in this order: +  - Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%'' 
-  * Stop data collection daemon on HPC: ''service gmetad stop'' +  - Start data collection daemon on HPC: ''service gmetad start'' 
-  * Stop monitoring daemon on HPC: ''service gmond stop'' +  - Wait a minute or two 
-  * Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%'' +  - Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%''
-  * Start data collection daemon on HPC: ''service gmetad start'' +
-  * Star monitoring daemon on HPC: ''service gmond start'' +
-  * Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%''+
  
 Now go check the Ganglia web interface and see if the nodes have returned. Now go check the Ganglia web interface and see if the nodes have returned.
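
The restart order in the newer revision could be wrapped in a small script along these lines. This is only a sketch: the ''restart_ganglia'' and ''run'' function names and the ''DRY_RUN'' and ''WAIT_SECONDS'' variables are made up for illustration, and the 90-second default is just one reading of "a minute or two". By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the Ganglia restart order from the newer revision of this page.
# DRY_RUN and WAIT_SECONDS are hypothetical knobs, not Ganglia or Rocks settings.
DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands (default), 0 = execute them
WAIT_SECONDS="${WAIT_SECONDS:-90}" # assumed value for "a minute or two"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_ganglia() {
    run service gmetad stop                           # stop data collection on the head node
    run rocks run host compute 'service gmond stop'   # stop monitoring on the compute nodes
    run service gmetad start                          # start data collection on the head node
    run sleep "$WAIT_SECONDS"                         # wait before bringing gmond back
    run rocks run host compute 'service gmond start'  # start monitoring on the compute nodes
}
```

To use it, source the file on the head node and call ''restart_ganglia''. Once the printed commands look right, set ''DRY_RUN=0'' and run it as root to actually restart the daemons.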
Line 87: Line 84:
  
 with  username = "nagiosadmin" and password = "nagios" with  username = "nagiosadmin" and password = "nagios"
- 
-