server_monitoring

Differences

This shows you the differences between two versions of the page.

server_monitoring [2010/05/22 14:19] – external edit 127.0.0.1
server_monitoring [2024/01/16 09:21] (current) – removed aorth
Line 1: Line 1:
-===== Server Monitoring ===== 
-  * [[#ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage 
-  * [[#monit|Monit]] - Monitors specific services 
-  * [[#nagios|Nagios]] - Monitors hosts and services 
  
-===== Ganglia ===== 
- 
-[[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster. 
- 
-You can see the Ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia 
- 
-{{:ganglia_diagram_smaller.gif|}} 
- 
-Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia 
-==== Troubleshooting ==== 
-From time to time there are problems with Ganglia's web interface.  You can restart the needed services following this basic procedure: 
- 
-  - Stop data collection daemon on HPC: ''service gmetad stop'' 
-  - Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%'' 
-  - Start data collection daemon on HPC: ''service gmetad start'' 
-  - Wait a minute or two 
-  - Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%'' 
- 
-Now go check the Ganglia web interface and see if the nodes have returned. 
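The restart procedure above can be sketched as a small script. This is a minimal sketch, assuming it is run as root on the head node; the ''DRY_RUN'' and ''WAIT_SECONDS'' variables and the ''run'' helper are hypothetical additions for illustration and are not part of Ganglia or Rocks. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: restart the Ganglia daemons in the order listed above.
# DRY_RUN and WAIT_SECONDS are hypothetical knobs for this example only.
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands, 0 = execute them
WAIT_SECONDS="${WAIT_SECONDS:-90}" # "a minute or two" between the start steps

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_ganglia() {
    run service gmetad stop                           # stop data collection on the head node
    run rocks run host compute 'service gmond stop'   # stop monitoring on compute nodes
    run service gmetad start                          # start data collection again
    run sleep "$WAIT_SECONDS"                         # give gmetad time to settle
    run rocks run host compute 'service gmond start'  # bring the compute nodes back
}

restart_ganglia
```

Set ''DRY_RUN=0'' to actually execute the commands once the printed sequence looks right.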
- 
-===== Monit ===== 
- 
-Monit is a free, open source utility for managing and monitoring processes, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations. 
-Monit can start a process if it is not running, restart a process if it does not respond, and stop a process if it uses too many resources. It logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alerts. 
-Monit provides a built-in HTTP(S) interface, so you can use a browser to access the Monit server. 
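As an illustration of the behaviour described above, the following sketch writes a minimal Monit control file and, if ''monit'' is installed, asks it to check the syntax. The ''sshd'' service, its pidfile and init script paths, and the 120-second poll interval are assumptions chosen for the example; they do not come from this page's actual setup. Only the port 2812 matches the HTTP interface mentioned in this section.

```shell
#!/bin/sh
# Sketch: generate an example Monit control file in a temporary location.
# The sshd paths and the poll interval below are assumptions for illustration.
MONITRC="$(mktemp)"
chmod 600 "$MONITRC"   # Monit refuses control files readable by other users

cat > "$MONITRC" <<'EOF'
# Poll services every 120 seconds
set daemon 120

# Built-in web interface, reachable from localhost only
set httpd port 2812 and
    allow localhost

# Restart sshd if it stops answering on port 22
check process sshd with pidfile /var/run/sshd.pid
    start program = "/etc/init.d/sshd start"
    stop program = "/etc/init.d/sshd stop"
    if failed port 22 protocol ssh then restart
EOF

# Syntax-check the file only when monit is available on this machine.
if command -v monit >/dev/null 2>&1; then
    monit -t -c "$MONITRC"
else
    echo "monit not installed; wrote example control file to $MONITRC"
fi
```

The file is written to a temporary path so that it does not overwrite an existing configuration; copy it into place only after adapting it to the services you want to watch.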
- 
-M/Monit expands upon Monit's capabilities to provide monitoring and management of all Monit-enabled hosts from one easy-to-use web interface. Status and events from each monitored system are updated in real time and displayed in charts, graphs and tables. 
- 
-Get the latest version at: http://mmonit.com/monit/download 
- 
-<code>$ wget http://mmonit.com/monit/dist/monit-5.0.3.tar.gz 
-$ tar xfz monit-5.0.3.tar.gz 
-$ cd monit-5.0.3 
-$ ./configure && make && make install</code> 
-Accessing Monit: 
-http://hpc.ilri.cgiar.org:2812 
- 
-===== Nagios ===== 
- 
-Nagios is a monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes (see http://www.nagios.org/about). 
- 
-=== Installation === 
- 
----- 
- 
-Download the latest version of Nagios from http://www.nagios.org/download 
-<file>$ wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.0.tar.gz 
-$ tar xfz nagios-3.2.0.tar.gz 
-$ cd nagios-3.2.0 
-$ ./configure 
-$ make all 
-$ useradd nagios 
-$ make install 
-$ make install-init 
-$ make install-commandmode 
-$ make install-config 
-$ make install-webconf 
-</file> 
-=== Configuration === 
- 
----- 
-Running the following command creates a new file called htpasswd.users in the /usr/local/nagios/etc directory and adds a username/password entry for nagiosadmin. You will be asked to provide the password that nagiosadmin will use to authenticate to the web server. 
-<code>htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin </code> 
- 
-Download and install the plugins: 
-<file> 
-$ wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.14.tar.gz 
-$ tar xfz nagios-plugins-1.4.14.tar.gz 
-$ cd nagios-plugins-1.4.14 
-$ ./configure && make && make install  
-</file> 
-Edit the configuration files to add the hosts and services to be monitored: 
-<code>vim /usr/local/nagios/etc/objects/localhost.cfg </code> 
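After editing the object files it can help to validate the configuration before restarting Nagios. The sketch below assumes the default source-install paths under /usr/local/nagios and the init script installed by ''make install-init''; the ''verify_nagios_config'' function and the ''NAGIOS_BIN'' and ''NAGIOS_CFG'' variables are hypothetical names used only for this example.

```shell
#!/bin/sh
# Sketch: validate the Nagios configuration before restarting the daemon.
# Paths assume the default /usr/local/nagios prefix; adjust if yours differs.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Returns 2 if the binary is missing, otherwise the exit status of "nagios -v".
verify_nagios_config() {
    if [ ! -x "$NAGIOS_BIN" ]; then
        echo "nagios binary not found at $NAGIOS_BIN" >&2
        return 2
    fi
    "$NAGIOS_BIN" -v "$NAGIOS_CFG"
}

if verify_nagios_config; then
    echo "configuration OK; restart with: service nagios restart"
else
    echo "configuration check did not pass; not restarting" >&2
fi
```

The script deliberately only prints the restart command instead of running it, so a broken configuration never takes monitoring down.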
- 
-To check remote services, see http://wiki.nagios.org/index.php/Howtos:checkbyssh_RedHat 
-=== Accessing Nagios === 
- 
----- 
-http://172.26.0.205:4020/nagios/ 
- 
-Log in with username "nagiosadmin" and password "nagios". 
server_monitoring.1274537972.txt.gz · Last modified: 2010/06/16 08:22 (external edit)