server_monitoring

Differences

This shows you the differences between two versions of the page.

server_monitoring [2009/11/17 11:52] 172.26.0.166
server_monitoring [2010/06/16 09:41] 172.26.0.166
Line 2: Line 2:
   * [[#ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage
   * [[#monit|Monit]] - Monitors specific services
-  * [[#nagios|Nagios]] - Monitors services services
 +  * [[#nagios|Nagios]] - Monitors servers, hosts and services
 +  * [[#zabbix|Zabbix]] - Monitors servers, hosts and services
  
 ===== Ganglia =====
Line 8: Line 10:
 [[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster.  Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster.
  
-You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia (if it's working)
-
-For some reason sometimes the graphs do not draw.  If you reboot the cluster the graphs work for some days and then seem to stop drawing (the graphs appear blank).  However, if you call the script responsible for drawing the graphs with no arguments, you'll see graphs are actually working: http://hpc.ilri.cgiar.org/ganglia/graph.php
 +You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia
  
 {{:ganglia_diagram_smaller.gif|}}
  
 Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia
 +==== Troubleshooting ====
 +From time to time there are problems with Ganglia's web interface.  You can restart the needed services following this basic procedure:
  
-==== Notes ====
 +  - Stop data collection daemon on HPC: ''service gmetad stop''
 +  - Stop monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond stop%%'%%''
 +  - Start data collection daemon on HPC: ''service gmetad start''
 +  - Wait a minute or two
 +  - Start monitoring daemon on compute nodes: ''rocks run host compute %%'%%service gmond start%%'%%''
  
-It appears as if other CGIAR clusters have been configured to query our ''gmond'' daemon.  I'm not sure why, as their respective ganglia pages do not show ILRI HPC stats.  In any case, the relevant configuration settings are in ''/etc/gmond.conf'' on the head node, note especially the ''trusted_hosts'' entries:
-<file>#
-# Gmond config file for Cluster Cluster.
-# Generated by ganglia.xml node without aid from the database.
-
-name "ILRI"
-owner "Cgiar"
-url "http://hpc.ilri.cgiar.org/"
-latlong "N32.87 W117.22"
-mcast_channel "237.170.26.97"
-
-
-
-# Increase size of gmond user (gmetric) hash table.
-
-num_custom_metrics 2048
-
-# Uncomment the next line for monitoring by the Rocks Cluster Network.
-trusted_hosts 220.227.242.214 # hpc.icrisat.cgiar.org
-trusted_hosts 202.123.56.187  # hpc.irri.cgiar.org
-trusted_hosts 216.244.151.133 # ?
-trusted_hosts 200.62.229.37   # hpc.cip.cgiar.org
-
-# Listen only on the private cluster interface
-mcast_if eth0</file>
-
-==== Connections refused ====
-I kept seeing this error in ''/var/log/messages'' on the head node:
-<code>Aug 28 12:44:24 hpc-ilri /usr/sbin/gmond[3453]: server_thread() Host 200.62.229.37 tried to connect and was refused</code>
-Apparently that is the Potato Center's (CIP) ganglia trying to talk to our ganglia.  Adding ''trusted_hosts 200.62.229.37'' to ''/etc/gmond.conf'' on the head node and restarting gmond fixed it.
 +Now go check the Ganglia web interface and see if the nodes have returned.
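The Ganglia restart procedure from the Troubleshooting section can be wrapped in a small script. This is only a sketch: the ''run'' helper, the ''DRY_RUN'' switch, and the 90-second wait are assumptions added for illustration, and only the ''service'' and ''rocks run host'' commands come from the steps themselves. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the Ganglia restart procedure.
# DRY_RUN defaults to 1, so commands are printed and not executed.
# Set DRY_RUN=0 and run as root on the head node to execute them.
DRY_RUN="${DRY_RUN:-1}"
# "A minute or two" in the procedure; 90 seconds is an assumed value.
WAIT_SECONDS="${WAIT_SECONDS:-90}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_ganglia() {
    run service gmetad stop                            # stop collection on the head node
    run rocks run host compute 'service gmond stop'    # stop gmond on the compute nodes
    run service gmetad start                           # start collection again
    run sleep "$WAIT_SECONDS"                          # give gmetad time to settle
    run rocks run host compute 'service gmond start'   # start gmond on the compute nodes
}

restart_ganglia
```

In dry-run mode the printed commands lose their quoting, so treat the output as a checklist and not as something to paste back into a shell.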
  
 ===== Monit =====
Line 78: Line 55:
 $ cd nagios-3.2.0
 $ ./configure
-$ make all </file>
 +$ make all
 +$ useradd nagios 
 +$ make install 
 +$ make install-init 
 +$ make install-commandmode 
 +$ make install-config 
 +$ make install-webconf 
 +</file>
 === Configuration ===
  
 ----
 +Running the following command creates a new file called ''htpasswd.users'' in the ''/usr/local/nagios/etc'' directory and adds a username/password entry for ''nagiosadmin''. You will be asked for the password that ''nagiosadmin'' will use to authenticate to the web server.
 +<code>htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin </code>
 +
 +Download and install the plugins:
 +<file>
 +$ wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.14.tar.gz
 +$ tar xfz nagios-plugins-1.4.14.tar.gz
 +$ cd nagios-plugins-1.4.14
 +$ ./configure && make && make install 
 +</file>
 +Edit the configuration files to add the hosts and services to be monitored:
 +<code>vim /usr/local/nagios/etc/objects/localhost.cfg </code>
 +
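As an illustration of what a host and service entry might look like, the sketch below writes one host and one ping check to a scratch file. The file name, the host name ''node01'', and the address ''192.0.2.10'' are placeholders, not values from this cluster. The ''linux-server'' and ''generic-service'' templates and the ''check_ping'' command are assumed to be the ones shipped in the sample configuration installed by ''make install-config''.

```shell
#!/bin/sh
# Write an example Nagios host and service definition to a scratch file.
# CFG_FILE, node01 and 192.0.2.10 are placeholders for illustration only.
CFG_FILE="${CFG_FILE:-./example-host.cfg}"

cat > "$CFG_FILE" <<'EOF'
define host {
    use        linux-server          ; template from the sample templates.cfg
    host_name  node01
    alias      Example compute node
    address    192.0.2.10
}

define service {
    use                  generic-service   ; template from the sample templates.cfg
    host_name            node01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}
EOF

echo "wrote $CFG_FILE"
```

After merging definitions like these into the object files, the configuration can be checked with ''/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg'' before restarting Nagios (paths assume the default source install prefix).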
 +To check services on remote hosts, see http://wiki.nagios.org/index.php/Howtos:checkbyssh_RedHat
 +=== Accessing Nagios ===
 +
 +----
 +http://172.26.0.205:4020/nagios/
 +
 +Log in with username "nagiosadmin" and password "nagios".
 +
 +===== Zabbix =====
 +----
 +Installation:
 +
 +RHEL-compatible Linux:
 +<code>echo '[andrewfarley]
 +name=Andrew Farley RPM Repository
 +baseurl=http://repo.andrewfarley.com/centos/$releasever/$basearch/
 +enabled=1
 +gpgcheck=0' | sudo tee /etc/yum.repos.d/andrewfarley.com.repo</code>
 +
 +
 +
 +And then you can install zabbix agent, zabbix server, zabbix get, or zabbix proxy with…
 +<file>
 +    sudo yum install zabbix-agent
 +    sudo yum install zabbix-server
 +    sudo yum install zabbix-get
 +    sudo yum install zabbix-proxy </file>
 +
 +If it fails to install, you might need to clean the metadata with the following command and try again…
 +
 +    sudo yum clean metadata
 +
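The install-then-retry advice above can be sketched as a pair of shell functions. This is a hedged example, not a tested installer: ''install_pkg'', ''install_with_retry'', and the ''YUM'' override variable are names invented for illustration, and only the ''yum install'' and ''yum clean metadata'' commands come from the text.

```shell
#!/bin/sh
# Sketch: install a package with yum and, if that fails, clean the
# metadata and try once more. YUM can be overridden for testing;
# by default it expands to "sudo yum".

install_pkg() {
    # Unquoted on purpose so that "sudo yum" splits into two words.
    ${YUM:-sudo yum} install -y "$1"
}

install_with_retry() {
    pkg="$1"
    if install_pkg "$pkg"; then
        return 0
    fi
    echo "install of $pkg failed, cleaning metadata and retrying" >&2
    ${YUM:-sudo yum} clean metadata
    install_pkg "$pkg"
}

# Example usage (commented out so that sourcing this file changes nothing):
# install_with_retry zabbix-agent
```

Only one retry is attempted; if the second install also fails, the function returns yum's exit status so the caller can decide what to do.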
 +
 +Debian-Based Linux:
 +
 +
 +=== Accessing Zabbix ===
 +
 +http://172.26.12.29/zabbix
 +username: Admin
 +password: zabbix
 +