A strange GFS failure; NNTP, please come in
Hardware environment:
Two PowerLeader PG9251G2 servers, acting as the GFS nodes:
gfs-1: 192.168.11.226
gfs-2: 192.168.11.227
The disk array is a Haowei SB-3163SA fully populated with 500 GB SATA drives, configured as a single shared LUN of roughly 7 TB; both servers are attached directly to the array with SCSI cables.
Software environment:
RHEL 4 AS U2, kernel 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 i686 i386 GNU/Linux
RHCS packages: rhel-4-u2-rhcs-i386.iso, containing the following RPMs:
ccs-1.0.2-0.i686.rpm
ccs-devel-1.0.2-0.i686.rpm
cman-1.0.2-0.i686.rpm
cman-devel-1.0.2-0.i686.rpm
cman-kernel-2.6.9-39.5.i686.rpm
cman-kernel-hugemem-2.6.9-39.5.i686.rpm
cman-kernel-smp-2.6.9-39.5.i686.rpm
cman-kernheaders-2.6.9-39.5.i686.rpm
dlm-1.0.0-5.i686.rpm
dlm-devel-1.0.0-5.i686.rpm
dlm-kernel-2.6.9-37.7.i686.rpm
dlm-kernel-hugemem-2.6.9-37.7.i686.rpm
dlm-kernel-smp-2.6.9-37.7.i686.rpm
dlm-kernheaders-2.6.9-37.7.i686.rpm
fence-1.32.6-0.i686.rpm
gulm-1.0.4-0.i686.rpm
gulm-devel-1.0.4-0.i686.rpm
iddev-2.0.0-3.i686.rpm
iddev-devel-2.0.0-3.i686.rpm
ipvsadm-1.24-6.i386.rpm
magma-1.0.1-4.i686.rpm
magma-devel-1.0.1-4.i686.rpm
magma-plugins-1.0.2-0.i386.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
piranha-0.8.1-1.i386.rpm
rgmanager-1.9.38-0.i386.rpm
system-config-cluster-1.0.16-1.0.noarch.rpm
GFS itself was compiled from source; the source RPMs are:
GFS-6.1.0-0.src.rpm
GFS-kernel-2.6.9-35.5.src.rpm
lvm2-cluster-2.01.09-5.0.RHEL4.src.rpm
During the build, the following line in the GFS-kernel spec file was changed from
%define kernel_version 2.6.9-11.EL
to
%define kernel_version 2.6.9-22.EL
and the --nodeps option was used.
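A minimal sketch of that spec edit, applied to a throwaway copy of the file (the real spec lives inside GFS-kernel-2.6.9-35.5.src.rpm; the rebuild itself would then be something like `rpmbuild --rebuild --nodeps GFS-kernel-2.6.9-35.5.src.rpm`):

```shell
# Create a stand-in spec file carrying the macro to be changed
# (illustrative only; the real spec has many more lines).
cat > gfs-kernel.spec <<'EOF'
%define kernel_version 2.6.9-11.EL
EOF

# Point the macro at the running 2.6.9-22.EL kernel instead.
sed -i 's/kernel_version 2\.6\.9-11\.EL/kernel_version 2.6.9-22.EL/' gfs-kernel.spec

# Confirm the change took effect.
grep '%define kernel_version' gfs-kernel.spec
```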
The resulting RPMs are:
GFS-kernel-2.6.9-35.5.i686.rpm
GFS-kernel-debuginfo-2.6.9-35.5.i686.rpm
GFS-kernel-hugemem-2.6.9-35.5.i686.rpm
GFS-kernel-smp-2.6.9-35.5.i686.rpm
GFS-kernheaders-2.6.9-35.5.i686.rpm
GFS-6.1.0-0.i386.rpm
GFS-debuginfo-6.1.0-0.i386.rpm
lvm2-cluster-2.01.09-5.0.RHEL4.i386.rpm
lvm2-cluster-debuginfo-2.01.09-5.0.RHEL4.i386.rpm
All of the RPMs above were installed.
cluster.conf is configured as follows:
<?xml version="1.0"?>
<cluster config_version="4" name="gfs_pc">
    <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="gfs-1" votes="1">
            <fence>
                <method name="1">
                    <device name="MAN-FEN" nodename="gfs-1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="gfs-2" votes="1">
            <fence>
                <method name="1">
                    <device name="MAN-FEN" nodename="gfs-2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_manual" name="MAN-FEN"/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources>
            <clusterfs device="/dev/VolGFS01/Data01" force_unmount="0" fstype="gfs" mountpoint="/gfs" name="gfs01" options="-t gfs"/>
        </resources>
    </rm>
</cluster>
gfs-1 starts normally: the array mounts and I/O works. Starting the services on gfs-2, however, fails.
Failure symptoms:
1. When gfs-2 starts fenced, it sometimes crashes gfs-1, and the clvmd process on gfs-2 cannot start, sitting in a wait state after launch.
gfs-1 prints the following error on its console:
SM: Assertion failed on line 106 of /usr/src/build/XXXXX/sm_membership.c
2. If fenced on gfs-2 does get through startup, clvmd and gfs both start, and vgdisplay -v shows the VG information; but a mount attempt then hangs on gfs-2, and gfs-1 can no longer perform any I/O either. Both machines stay controllable from other terminals; it is simply impossible to touch the array.
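For reference, the daemon start order being exercised on each node is the usual RHEL 4 RHCS sequence (a sketch; the service names below are the stock init scripts shipped by the RPMs listed earlier):

```shell
#!/bin/sh
# The usual order: config daemon, cluster manager, fencing,
# clustered LVM, then GFS itself. Printed here rather than executed.
START_ORDER="ccsd cman fenced clvmd gfs"
for svc in $START_ORDER; do
    echo "service $svc start"   # on a real node, run: service "$svc" start
done
```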
These problems persist across repeated tests. The identical configuration previously ran fine both under VMware and on an EMC CX500 array, so I cannot pinpoint where the fault lies. Any advice would be appreciated.
《Solution》
Additional failure symptoms
I also noticed another odd phenomenon: after a fenced start crashes gfs-1, pinging gfs-2 shows that every few packets one reply is extremely slow, and sometimes a packet is lost outright:
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time=2756ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time=1045ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
Reply from 192.168.11.227: bytes=32 time<1ms TTL=64
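One quick way to quantify those stalls is to scan a captured ping log for slow replies (a sketch; `ping.log` is a hypothetical capture of output like the lines above, and the 100 ms threshold is arbitrary):

```shell
# Build a small sample log in the same format as the replies above.
printf '%s\n' \
  'Reply from 192.168.11.227: bytes=32 time<1ms TTL=64' \
  'Reply from 192.168.11.227: bytes=32 time=2756ms TTL=64' \
  'Reply from 192.168.11.227: bytes=32 time<1ms TTL=64' \
  'Reply from 192.168.11.227: bytes=32 time=1045ms TTL=64' \
  > ping.log

# Print every reply whose round-trip time exceeds 100 ms; "time<1ms"
# lines never match the time= pattern and are skipped.
SLOW=$(awk -F'time=' '/time=/ { split($2, a, "ms"); if (a[1] + 0 > 100) print }' ping.log)
echo "$SLOW"
```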
《Solution》
See whether this is what you're looking for:
http://www.spinics.net/lists/cluster/msg06021.html
[ This post was last edited by wysilly on 2007-10-17 20:20 ]