Configuring Nagios to Monitor an HA Cluster (Part 2)

火星人 @ 2014-03-09

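At last we get to monitoring the cluster. First, the lab environment: the HA cluster built earlier consists of 192.168.10.101 and 192.168.10.102 and runs an HTTP service. Nagios is installed on 192.168.10.1 and uses NRPE to monitor 101 and 102.

Let's begin. As usual, I start with the documentation: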

First off, we need to define what we mean by a "cluster". The simplest way to understand this is with an example. Let's say that your organization has five hosts which provide redundant DNS services to your organization. If one of them fails, it's not a major catastrophe because the remaining servers will continue to provide name resolution services. If you're concerned with monitoring the availability of DNS service to your organization, you will want to monitor five DNS servers. This is what I consider to be a service cluster. The service cluster consists of five separate DNS services that you are monitoring. Although you do want to monitor each individual service, your main concern is with the overall status of the DNS service cluster, rather than the availability of any one particular service.

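The key point of this passage is that we build a cluster precisely so that the service keeps running when one of its nodes goes down. The crux of monitoring a cluster with Nagios is therefore to treat the cluster itself as a Nagios service whose purpose is to watch the state of the cluster as a whole, not to track which particular machine has failed.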

There are several ways you could potentially monitor service or host clusters. I'll describe the method that I believe to be the easiest. Monitoring service or host clusters involves two things:

*Monitoring individual cluster elements

*Monitoring the cluster as a collective entity

Monitoring individual host or service cluster elements is easier than you think. In fact, you're probably already doing it. For service clusters, just make sure that you are monitoring each service element of the cluster. If you've got a cluster of five DNS servers, make sure you have five separate service definitions (probably using the check_dns plugin). For host clusters, make sure you have configured appropriate host definitions for each member of the cluster (you'll also have to define at least one service to be monitored for each of the hosts). Important: You're going to want to disable notifications for the individual cluster elements (host or service definitions). Even though no notifications will be sent about the individual elements, you'll still get a visual display of the individual host or service status in the status CGI. This will be useful for pinpointing the source of problems within the cluster in the future.

Monitoring the overall cluster can be done by using the previously cached results of cluster elements. Although you could re-check all elements of the cluster to determine the cluster's status, why waste bandwidth and resources when you already have the results cached? Where are the results cached? Cached results for cluster elements can be found in the status file (assuming you are monitoring each element). The check_cluster plugin is designed specifically for checking cached host and service states in the status file. Important: Although you didn't enable notifications for individual elements of the cluster, you will want them enabled for the overall cluster status check.

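This passage spells out the two aspects of monitoring a cluster: monitoring the nodes (the specific checks we experimented with in the previous part) and monitoring the cluster itself. For node monitoring, the example says that a five-node cluster running DNS needs five separate service definitions in localhost.cfg, each using the check_dns plugin. It also stresses that notifications should be turned off for these checks; their status is still visible in the CGI interface. So what should Nagios notify about? Think about it: an administrator cannot camp in the machine room 24 hours a day waiting for something to break. If one or a few nodes go down but the service keeps running normally, there is no need to rush off and repair them right away, and if those nodes were simply rebooted by the fence device you would have made a long trip for nothing. Only when the cluster or the service itself is really in trouble, or running unstably, do we need to pay attention. There is no point letting a minor fault on a single server turn your life upside down: the administrator should be in control of the system, not controlled by it!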

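The passage also mentions check_cluster, the plugin used to monitor the cluster. Node monitoring has already collected plenty of data, so does cluster monitoring have to check everything a second time? No: check_cluster works from the results Nagios has already cached, so it analyses existing data instead of fetching it again, which saves a lot of resources. Where is the cached data stored? In a file called status.dat under /usr/local/nagios/var/.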
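To give a feel for what those cached entries look like, here is a minimal Python sketch that reads status.dat-style blocks and pulls out the cached state of one service per host. The sample text only imitates the block layout of the file; the host names, the service name and the states are invented for illustration, and the real file location is whatever the status_file directive in nagios.cfg points to. This is not a reimplementation of check_cluster, just a way to inspect the cache by hand.

```python
# Illustrative excerpt in the status.dat block layout. Host names, the
# service name and the states are made up for this sketch.
SAMPLE_STATUS = """
hoststatus {
	host_name=node1
	current_state=0
	}

servicestatus {
	host_name=node1
	service_description=httpd
	current_state=0
	}

servicestatus {
	host_name=node2
	service_description=httpd
	current_state=2
	}
"""


def parse_status(text):
    """Return a list of (block_type, {key: value}) tuples."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None and line.endswith("{"):
            # Start of a block such as "servicestatus {".
            current = (line[:-1].strip(), {})
        elif current is not None and line == "}":
            blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[1][key] = value
    return blocks


def service_states(text, service_description):
    """Map host_name -> cached current_state for one service."""
    return {
        fields["host_name"]: int(fields["current_state"])
        for block_type, fields in parse_status(text)
        if block_type == "servicestatus"
        and fields.get("service_description") == service_description
    }


print(service_states(SAMPLE_STATUS, "httpd"))
```

For service checks the state IDs are 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN); for hosts they are 0 (UP), 1 (DOWN) and 2 (UNREACHABLE).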

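The file holds a large amount of data, and check_cluster picks out the relevant entries and analyses them. Enough talk; let's define the check_cluster command (command.cfg does not define it by default, so it has to be written in by hand).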

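The meaning of the options can be looked up with --help. Four arguments are set here. I was not sure what the first one means; it seems to be a name label (it corresponds to the plugin's -l/--label option), so I just wrote something. The second is the warning threshold, the third is the critical threshold, and the fourth will be explained in a moment. Next, another command is defined for monitoring services.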
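As a rough sketch of how those four arguments map onto the plugin's options, the Python helper below assembles a check_cluster command line. It follows the pattern in the Nagios documentation's example (`check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$`, with `--host` for host clusters). The install path, the helper name and the sample values are assumptions made for illustration, not the configuration used in this post.

```python
import shlex

# Assumed install path; adjust to wherever the plugins actually live.
PLUGIN = "/usr/local/nagios/libexec/check_cluster"


def cluster_command(mode, label, warning, critical, data):
    """Build the argv for check_cluster (hypothetical helper).

    mode     -- "host" or "service" (selects --host / --service)
    label    -- $ARG1$: free-form label shown in the plugin output
    warning  -- $ARG2$: warning threshold
    critical -- $ARG3$: critical threshold
    data     -- $ARG4$: comma-separated list of state IDs
    """
    if mode not in ("host", "service"):
        raise ValueError("mode must be 'host' or 'service'")
    return [PLUGIN, "--" + mode, "-l", label,
            "-w", str(warning), "-c", str(critical), "-d", data]


# Example: two cached service states, one OK (0) and one CRITICAL (2).
argv = cluster_command("service", "cluster_service", 1, 2, "0,2")
print(shlex.join(argv))
```

Running the printed command by hand is a quick way to try out thresholds before touching the Nagios configuration.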

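Then add the definitions to localhost.cfg.

This is the service-monitoring definition. The warning level is 1 and the critical level is 2, which I took to mean that one node down gives a warning and two give a critical (the test further down shows this reading is off by one). The trailing $SERVICESTATEID and $HOSTSTATEID are macros produced by the checks; they hold the numeric return value of the service (or host) check. For $SERVICESTATEID, the macro name is followed by a colon and the host name defined earlier, then another colon and the service name. Note that each entry must end with $, and entries are separated by commas. $HOSTSTATEID works the same way, except that no service name is needed.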
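The macro syntax is easy to get wrong by hand, so here is a small Python sketch that assembles the check_command string from a list of cluster members. The command names check_service_cluster and check_host_cluster follow the Nagios documentation's example; the host names node1/node2, the service name httpd and the labels are hypothetical stand-ins for whatever is defined in your own configuration.

```python
def service_state_macro(host, service):
    """On-demand macro for one service's state ID: $SERVICESTATEID:host:service$."""
    return f"$SERVICESTATEID:{host}:{service}$"


def host_state_macro(host):
    """On-demand macro for one host's state ID: $HOSTSTATEID:host$."""
    return f"$HOSTSTATEID:{host}$"


def check_command(command_name, label, warning, critical, macros):
    """Join command name and arguments with '!', macros with ','."""
    return "!".join([
        command_name,
        f'"{label}"',
        str(warning),
        str(critical),
        ",".join(macros),
    ])


nodes = ["node1", "node2"]  # hypothetical host names

print(check_command("check_service_cluster", "cluster_service", 1, 2,
                    [service_state_macro(n, "httpd") for n in nodes]))
print(check_command("check_host_cluster", "cluster_host", 1, 2,
                    [host_state_macro(n) for n in nodes]))
```

The resulting string is what goes on the check_command line of the cluster's service definition. Nagios expands each macro to the cached state ID before the plugin runs.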

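Restart Nagios and look at the status.

One of the services is reported as critical. Looking at it more closely:

Of course it is critical: in an HA cluster only one node runs the service at a time. Now I stop httpd on node1 and look again.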

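Node2 has taken over. I have not tuned the parameters that control how often Nagios collects data, so the delay is quite long. I could not wait and ran the checks by hand.

Next, I shut down node1. With our settings, cluster_host should report a warning while cluster_service stays OK, right? Let's see.

A sea of red: node1 is down. Now look at cluster_host and cluster_service.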

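Strange. cluster_host should be in warning, so why is the state still OK when the output says 1 up, 1 down? After checking the documentation I found my mistake: a warning is raised only when the value is greater than the configured threshold. Since I set 1, more than one host has to be down before it alerts. So I shut down node2 as well and look at the status again.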

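Both checks alert at the same time, as warning rather than critical, which shows that critical also means strictly greater than the configured value. And with that, we are finally done. This is only the tip of the iceberg, though: Nagios has many more configuration parameters and plugins, and its notification mechanism in particular is well worth studying. It really is an essential tool for company operations teams!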
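The threshold behaviour observed above can be captured in a few lines of Python. This is a simplified model assuming plain integer thresholds (not the plugin's actual source): it counts the members whose state ID is non-zero and compares that count with the thresholds using a strict greater-than.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def cluster_status(states, warning, critical):
    """Simplified model of the cluster check.

    states   -- state IDs of the members (0 means UP / OK)
    warning  -- WARNING when the problem count is greater than this
    critical -- CRITICAL when the problem count is greater than this
    """
    problems = sum(1 for s in states if s != 0)
    if problems > critical:
        return CRITICAL
    if problems > warning:
        return WARNING
    return OK


# Thresholds used in this post: warning=1, critical=2, two nodes.
print(cluster_status([1, 0], 1, 2))  # node1 down, node2 up
print(cluster_status([1, 1], 1, 2))  # both nodes down

# Thresholds 0 and 1, as in the Nagios documentation's example.
print(cluster_status([1, 0], 0, 1))
print(cluster_status([1, 1], 0, 1))
```

With only two nodes and thresholds of 1 and 2, the model can never reach critical, which matches what the experiment showed. To be warned when one node is down and get a critical when both are down, thresholds of 0 and 1 would be the values to try.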

References: http://blog.chinaunix.net/u2/89218/showart_2019588.html

http://vincentwen.blog.51cto.com/824577/385231

This article comes from the 「no2實驗室」 blog; please keep this attribution: http://linuxfan.blog.51cto.com/1842325/427900


Article URL: http://coctec.com/docs/linux/show-post-49723.html
