Ganglia

Ganglia is a system for measuring, recording, and graphing metrics about the hosts in a cluster, such as CPU load, network traffic, disk space, and RAM utilization. The metrics are periodically saved and graphed by RRDtool. Rocks automatically configures the head node to query the compute nodes and sets up a web-based interface where you can monitor the health of the cluster.

You can see the Ganglia installation here: http://hpc.ilri.cgiar.org/ganglia

Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia

Troubleshooting

From time to time there are problems with Ganglia's web interface. You can restart the needed services following this basic procedure:

  1. Stop data collection daemon on HPC: service gmetad stop
  2. Stop monitoring daemon on compute nodes: rocks run host compute 'service gmond stop'
  3. Start data collection daemon on HPC: service gmetad start
  4. Wait a minute or two
  5. Start monitoring daemon on compute nodes: rocks run host compute 'service gmond start'
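
The steps above can be collected into a small script. This is only a sketch: it assumes it is run as root on the head node, and the function names (run, restart_ganglia) and the DRY_RUN and WAIT_SECONDS variables are made up for this example, not part of Ganglia or Rocks. The 90-second default is just one reading of "a minute or two".

```shell
#!/bin/bash
# Sketch of the Ganglia restart procedure. Run as root on the head node.
# DRY_RUN and WAIT_SECONDS are hypothetical knobs added for this example.

# Print a command, then execute it unless DRY_RUN=1.
# (The printed form drops the original quoting; it is for logging only.)
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

restart_ganglia() {
    # 1. Stop the data collection daemon on the head node.
    run service gmetad stop
    # 2. Stop the monitoring daemon on the compute nodes.
    run rocks run host compute 'service gmond stop'
    # 3. Start the data collection daemon again.
    run service gmetad start
    # 4. Wait before bringing the monitoring daemons back (default: 90 seconds).
    sleep "${WAIT_SECONDS:-90}"
    # 5. Start the monitoring daemon on the compute nodes.
    run rocks run host compute 'service gmond start'
}
```

After loading the function into a shell, "DRY_RUN=1 WAIT_SECONDS=0 restart_ganglia" prints the commands without running them, which is a safe way to review the sequence first. Plain "restart_ganglia" performs the restart.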

Now go check the Ganglia web interface and see if the nodes have returned.

server_monitoring/ganglia.1277160091.txt.gz · Last modified: 2010/06/21 22:41 by 172.26.14.218