server_monitoring

Differences

This shows you the differences between two versions of the page.

server_monitoring [2009/11/17 11:07] 172.26.0.166
server_monitoring [2010/06/21 22:41] 172.26.14.218
Line 1: Line 1:
 ===== Server Monitoring =====
-  * [[#ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage
-  * [[#monit|Monit]] - Monitors specific services
-  * [[#nagios|Nagios]] - Monitors services services
+  * [[server_monitoring:ganglia|Ganglia]] - Monitors cluster CPU, disk, network usage
+  * [[server_monitoring:monit|Monit]] - Monitors specific services
+  * [[server_monitoring:nagios|Nagios]] - Monitors servers, hosts and services
+  * [[server_monitoring:zabbix|Zabbix]] - Monitors servers, hosts and services
  
-===== Ganglia =====
+===== Monit =====
  
-[[http://ganglia.info/|Ganglia]] is a system for measuring, recording, and graphing certain metrics about hosts in a cluster. Metrics include things like CPU load, network traffic, disk space, RAM utilization, etc.  These metrics are saved and graphed periodically by [[http://oss.oetiker.ch/rrdtool|RRDtool]]. Rocks automatically configures the head node to query the compute nodes and has a web-based interface where you can monitor the health of the cluster.
+Monit is a free, open source utility for managing and monitoring processes, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations.
+Monit can start a process if it is not running, restart a process if it does not respond, and stop a process if it uses too many resources. It logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alerts.
+Monit provides a built-in HTTP(S) interface, and you can use a browser to access the Monit server.
  
-You can see the ganglia installation here:  http://hpc.ilri.cgiar.org/ganglia (if it's working)
+M/Monit expands upon Monit's capabilities to provide monitoring and management of all Monit-enabled hosts from one easy-to-use web interface. Status and events from each monitored system are updated in real time and displayed in charts, graphs and tables.

-For some reason sometimes the graphs do not draw.  If you reboot the cluster the graphs work for some days and then seem to stop drawing (the graphs appear blank).  However, if you call the script responsible for drawing the graphs with no arguments, you'll see graphs are actually working: http://hpc.ilri.cgiar.org/ganglia/graph.php
+Get the latest version at: http://mmonit.com/monit/download

-{{:ganglia_diagram_smaller.gif|}}
+<code>$ wget http://mmonit.com/monit/dist/monit-5.0.3.tar.gz
 +$ tar xfz monit-5.0.3.tar.gz 
 +$ cd monit-5.0.3 
 +$ ./configure && make && make install</code> 
 +Accessing monit: 
 +http://hpc.ilri.cgiar.org:2812
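
The page describes what Monit can do but not how it is told to do it. As a minimal sketch, assuming a service named sshd with an init script and pidfile at the paths shown (both illustrative, not taken from this cluster), a control file that enables the web interface on port 2812 and watches one process might look like the following. The output path ''MONITRC'' and the CPU threshold are assumptions as well.

```shell
#!/bin/sh
# Sketch: write a minimal Monit control file and syntax-check it if monit is installed.
# MONITRC, the sshd paths and the thresholds are illustrative assumptions.
MONITRC="${MONITRC:-./monitrc.example}"

cat > "$MONITRC" <<'EOF'
# Poll services every 120 seconds and log to syslog
set daemon 120
set logfile syslog

# Built-in web interface on port 2812, reachable from localhost only
set httpd port 2812
    allow localhost

# Start sshd if it is not running, restart it if it stops responding,
# and stop it if it uses too much CPU
check process sshd with pidfile /var/run/sshd.pid
    start program = "/etc/init.d/sshd start"
    stop program  = "/etc/init.d/sshd stop"
    if failed port 22 protocol ssh then restart
    if cpu > 80% for 5 cycles then stop
EOF

# Monit refuses control files that are accessible to group or others
chmod 600 "$MONITRC"

# -t only parses the control file; it does not start the daemon
if command -v monit >/dev/null 2>&1; then
    monit -t -c "$MONITRC" || echo "monit reported a syntax problem" >&2
fi
```

To reach the interface from another machine, as the URL above implies, the ''allow'' line would need to list that client's address or a user:password pair instead of localhost.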
  
-Interesting documentation: http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia
+===== Nagios =====

-==== Notes ====
+Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes. http://www.nagios.org/about

-It appears as if other CGIAR clusters have been configured to query our ''gmond'' daemon.  I'm not sure why, as their respective ganglia pages do not show ILRI HPC stats.  In any case, the relevant configuration settings are in ''/etc/gmond.conf'' on the head node, note especially the ''trusted_hosts'' entries:
+=== Installation ===
-<file># +
-# Gmond config file for Cluster Cluster. +
-# Generated by ganglia.xml node without aid from the database. +
-+
-name "ILRI" +
-owner "Cgiar"  +
-url "http://hpc.ilri.cgiar.org/" +
-latlong "N32.87 W117.22" +
-mcast_channel "237.170.26.97"+
  
 +----
  
-
-# Increase size of gmond user (gmetric) hash table
-#
-num_custom_metrics 2048
+Download the latest version of Nagios from http://www.nagios.org/download
+<file>$ wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.0.tar.gz
+$ tar xfz nagios-3.2.0.tar.gz
+$ cd nagios-3.2.0
+$ ./configure
 +$ make all  
 +$ useradd nagios 
 +$ make install 
 +$ make install-init 
 +$ make install-commandmode 
 +$ make install-config 
 +$ make install-webconf 
 +</file> 
 +=== Configuration ===
  
-# Uncomment the next line for monitoring by the Rocks Cluster Network.
-trusted_hosts 220.227.242.214 # hpc.icrisat.cgiar.org
-trusted_hosts 202.123.56.187  # hpc.irri.cgiar.org
-trusted_hosts 216.244.151.133 # ?
-trusted_hosts 200.62.229.37   # hpc.cip.cgiar.org
+----
+Running the following command creates a new file called htpasswd.users in the /usr/local/nagios/etc directory. It also creates a username/password entry for nagiosadmin. You will be asked to provide a password that nagiosadmin will use to authenticate to the web server.
+<code>htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin </code>
+
+

-# Listen only on the private cluster interface
-mcast_if eth0</file>
+Download and install plugins:
+<file>
 +$ wget http://prdownloads.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.14.tar.gz 
 +$ tar xfz nagios-plugins-1.4.14.tar.gz 
 +$ cd nagios-plugins-1.4.14 
 +$ ./configure && make && make install  
 +</file>
 +Edit the configuration files to add the hosts and services to be monitored:
 +<code>vim /usr/local/nagios/etc/objects/localhost.cfg </code>
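
As an illustration of what such an entry looks like, the sketch below writes one host and one service definition to a scratch file. The host name ''node01'', the address ''192.0.2.10'' and the output path ''CFG'' are hypothetical. The ''linux-server'' and ''generic-service'' templates and the ''check_ping'' command are assumed to come from the sample configuration that ''make install-config'' installs.

```shell
#!/bin/sh
# Sketch: write one host and one service definition to a Nagios object file.
# node01, 192.0.2.10 and the default output path are hypothetical.
CFG="${CFG:-./node01.cfg}"

cat > "$CFG" <<'EOF'
define host {
    use        linux-server      ; template from the sample configuration
    host_name  node01
    alias      Compute node 01
    address    192.0.2.10
}

define service {
    use                  generic-service
    host_name            node01
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
}
EOF

# Validate the whole configuration before restarting Nagios, if it is installed
NAGIOS_BIN=/usr/local/nagios/bin/nagios
if [ -x "$NAGIOS_BIN" ]; then
    "$NAGIOS_BIN" -v /usr/local/nagios/etc/nagios.cfg || echo "config check failed" >&2
fi
```

Nagios only reads the definitions once they are pasted into ''localhost.cfg'' or the new file is referenced by a ''cfg_file='' line in ''nagios.cfg''. The scratch file on its own changes nothing.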
  
-==== Connections refused ====
-I kept seeing this error in ''/var/log/messages'' on the head node:
-<code>Aug 28 12:44:24 hpc-ilri /usr/sbin/gmond[3453]: server_thread() Host 200.62.229.37 tried to connect and was refused</code>
-Apparently that is the Potato Center's (CIP) ganglia trying to talk to our ganglia.  Adding ''trusted_hosts 200.62.229.37'' to ''/etc/gmond.conf'' on the head node and restarting gmond fixed it.
+Check remote services: http://wiki.nagios.org/index.php/Howtos:checkbyssh_RedHat
+=== Accessing Nagios ===
+
  
-===== Monit =====
+----
 +http://172.26.0.205:4020/nagios/
  
-Monit is a free open source utility for managing and monitoring, processes, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations.
-Monit can start a process if it does not run, restart a process if it does not respond and stop a process if it uses too much resources. it logs to syslog or to its own log file and notifies you about error conditions and recovery status via customizable alert.
-Monit provides a built-in HTTP(S) interface and you can use a browser to access the Monit server
+with username = "nagiosadmin" and password = "nagios"
+===== Zabbix =====
+----
 +Installation: Ref : http://www.zabbix.com/documentation/1.8/start
  
-M/Monit expand upon Monit's capabilities to provide monitoring and management of all Monit enabled hosts from one easy to use web-interface. Status and events from each monitored system is updated in real-time and displayed in charts, graphs and tables.
+RHEL-compatible Linux: Ref: http://andrewfarley.com/sysadmin/rpm-repository-online
 +<code>echo '[andrewfarley]
 +name=Andrew Farley RPM Repository
 +baseurl=http://repo.andrewfarley.com/centos/$releasever/$basearch/
 +enabled=1
 +gpgcheck=0' | sudo tee /etc/yum.repos.d/andrewfarley.com.repo</code>
  
-Get the latest version at: http://mmonit.com/monit/download 
  
-<code>$ wget http://mmonit.com/monit/dist/monit-5.0.3.tar.gz 
-$ tar xfz monit-5.0.3.tar.gz 
-$ cd monit-5.0.3 
-$ ./configure && make && make install</code> 
-Accessing monit: 
-http://hpc.ilri.cgiar.org:2812 
  
-==== Nagios ====
+You can then install the Zabbix agent, server, get utility, or proxy with:
 +<file> 
 +    sudo yum install zabbix-agent 
 +    sudo yum install zabbix-server 
 +    sudo yum install zabbix-get 
 +    sudo yum install zabbix-proxy </file> 
 + 
 +If a package fails to install, you might need to clean the yum metadata with the following command and try again:
 + 
 +    sudo yum clean metadata 
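
The install, clean, retry sequence described above can be wrapped in a small helper. This is only a sketch: ''install_with_retry'' and the ''YUM'' variable are hypothetical names introduced here, and the helper retries exactly once.

```shell
#!/bin/sh
# Sketch: retry a package install once after clearing yum's cached metadata.
# install_with_retry and the YUM variable are hypothetical names, not part of yum.
YUM="${YUM:-sudo yum}"

install_with_retry() {
    pkg="$1"
    # First attempt
    if $YUM install -y "$pkg"; then
        return 0
    fi
    # The install failed: clear the cached metadata and try once more
    $YUM clean metadata
    $YUM install -y "$pkg"
}
```

For example, ''install_with_retry zabbix-agent'' would run the first ''yum install'' and fall back to ''yum clean metadata'' plus a second attempt only if that fails.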
 + 
 + 
 +Debian-Based Linux: 
 +---- 
 +<code> 
 +root@simple:~# apt-cache search zabbix  
 +zabbix-agent - network monitoring solution - agent 
 +zabbix-frontend-php - network monitoring solution - PHP front-end 
 +zabbix-proxy-mysql - network monitoring solution - proxy (using MySQL) 
 +zabbix-proxy-pgsql - network monitoring solution - proxy (using PostgreSQL) 
 +zabbix-server-mysql - network monitoring solution - server (using MySQL) 
 +zabbix-server-pgsql - network monitoring solution - server (using PostgreSQL) 
 +root@simple:~# aptitude install zabbix-proxy-mysql zabbix-agent zabbix-server-mysql zabbix-frontend-php 
 +</code> 
 + 
 +=== Accessing Zabbix === 
 + 
 +http://172.26.12.29/zabbix 
 +username: Admin 
 +password: zabbix 
 +