server_monitoring

Differences

This shows you the differences between two versions of the page.

server_monitoring [2009/11/27 12:10]
alan
server_monitoring [2010/10/19 17:18] (current)
aorth
Line 1: Line 1:
 ===== Server Monitoring ===== ===== Server Monitoring =====
-  * [[#ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage +  * [[server_monitoring:ganglia|Ganglia]] -- Monitors cluster CPU, disk, network usage 
-  * [[#monit|Monit]] - Monitors specific services +  * [[server_monitoring:monit|Monit]] -- Monitors specific services 
-  * [[#nagios|Nagios]] - Monitors services services +  * [[server_monitoring:nagios|Nagios]] -- Monitors servers, hosts and services 
- +  * [[server_monitoring:zabbix|Zabbix]] -- Monitors servers, hosts and services
-===== Ganglia ===== +
- +
-[[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster. +
- +
-You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia +
- +
-{{:ganglia_diagram_smaller.gif|}} +
- +
-Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia +
- +
-==== Troubleshooting ==== +
-From time to time there are problems with Ganglia's web interface.  You can restart the needed services following this basic procedure: +
- +
-  - Stop data collection daemon on HPC: ''service gmetad stop'' +
-  - Stop monitoring daemon on HPC: ''service gmond stop'' +
-  - Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%'' +
-  - Start data collection daemon on HPC: ''service gmetad start'' +
-  - Start monitoring daemon on HPC: ''service gmond start'' +
-  - Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%'' +
- +
-Now go check the Ganglia web interface and see if the nodes have returned. +
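The restart procedure above can be sketched as a small shell function. This is only an illustration, assuming the same ''service'' and ''rocks run host'' commands listed in the steps; the ''run'' helper, the ''restart_ganglia'' name and the ''DRY_RUN'' variable are hypothetical and not part of Ganglia or Rocks.

```shell
#!/bin/sh
# Sketch: restart the Ganglia daemons in the order given in the steps above.
# DRY_RUN is a hypothetical variable: when set to 1, commands are printed
# instead of executed, so the sequence can be reviewed first.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_ganglia() {
    # Stop everything first: collector, then monitors on head and compute nodes.
    run service gmetad stop
    run service gmond stop
    run rocks run host compute 'service gmond stop'
    # Start again in the same order.
    run service gmetad start
    run service gmond start
    run rocks run host compute 'service gmond start'
}
```

Setting ''DRY_RUN=1'' before calling ''restart_ganglia'' prints the six commands without touching any service; calling it without ''DRY_RUN'' as root on the head node runs them.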
- +
-===== Monit ===== +
- +
-Monit is a free, open source utility for managing and monitoring processes, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations.  +
-Monit can start a process if it is not running, restart a process if it does not respond, and stop a process if it uses too many resources. It logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alerts. +
-Monit provides a built-in HTTP(S) interface and you can use a browser to access the Monit server.  +
- +
-M/Monit expands upon Monit's capabilities to provide monitoring and management of all Monit-enabled hosts from one easy-to-use web interface. Status and events from each monitored system are updated in real time and displayed in charts, graphs and tables. +
- +
-Get the latest version at: http://mmonit.com/monit/download +
- +
-<code>$ wget http://mmonit.com/monit/dist/monit-5.0.3.tar.gz +
-$ tar xfz monit-5.0.3.tar.gz +
-$ cd monit-5.0.3 +
-$ ./configure && make && make install</code> +
-Accessing monit: +
-http://hpc.ilri.cgiar.org:2812 +
- +
-===== Nagios ===== +
- +
-Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. http://www.nagios.org/about +
- +
-=== Installation === +
- +
----- +
- +
-Download the latest version of Nagios from http://www.nagios.org/download +
-<file>$ wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.0.tar.gz  +
-$ tar xfz nagios-3.2.0.tar.gz +
-$ cd nagios-3.2.0 +
-$ ./configure +
-$ make all  +
-$ useradd nagios +
-$ make install +
-$ make install-init +
-$ make install-commandmode +
-$ make install-config +
-$ make install-webconf +
-</file> +
-=== Configuration === +
- +
----- +
-Running the following command will create a new file called htpasswd.users in the /usr/local/nagios/etc directory. It will also create a username/password entry for nagiosadmin. You will be asked to provide a password that will be used when nagiosadmin authenticates to the web server. +
-<code>htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin </code> +
- +
-Download and install plugins  +
-<file> +
-$ wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.14.tar.gz +
-$ tar xfz nagios-plugins-1.4.14.tar.gz +
-$ cd nagios-plugins-1.4.14 +
-$ ./configure && make && make install  +
-</file> +
-Edit the configuration files to add hosts and services to be monitored: +
-<code>vim /usr/local/nagios/etc/objects/localhost.cfg </code> +
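After editing the object files it is worth validating the configuration before restarting Nagios. The sketch below wraps the ''nagios -v'' pre-flight check; it assumes the default source-install prefix /usr/local/nagios used elsewhere on this page. The ''check_nagios_config'' function and the ''NAGIOS_BIN'' and ''NAGIOS_CFG'' variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: validate the Nagios configuration after editing it.
# Paths assume the default /usr/local/nagios prefix; override the
# (hypothetical) variables below if the installation differs.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

check_nagios_config() {
    # "nagios -v <main config>" parses all configuration files and
    # exits non-zero if it finds errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        echo "configuration OK"
    else
        echo "configuration has errors; do not restart Nagios yet" >&2
        return 1
    fi
}
```

Run ''check_nagios_config'' after each change to localhost.cfg and restart the nagios service only if it reports success.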
- +
-Check remote services http://wiki.nagios.org/index.php/Howtos:checkbyssh_RedHat +
-=== Accessing Nagios === +
- +
----- +
-http://172.26.0.205:4020/nagios/ +
- +
-with  username = "nagiosadmin" and password = "nagios" +
- +
server_monitoring.1259323806.txt.gz · Last modified: 2010/05/22 14:19 (external edit)