RHEL 5.5 two-node cluster: node powers off instead of rebooting in the network-failure test

火星人 @ 2014-03-04, replies: 0



Problem: two IBM servers run as a two-node cluster. In the network-disconnect test, the fenced node simply powers off and never comes back up; the reboot does not happen. The logs show no fence failures on either node.
Hardware: two IBM x3850 X5 servers running Red Hat Enterprise Linux 5.5 and two Cisco 3560 switches; each server has four NICs and two fiber NICs. The fence device is the IBM IMM.
eth0/eth1 each connect to one switch and are bonded as bond0. eth4/eth5 are the two directly connected heartbeat links, bonded as bond1. The two fiber NICs are eth2/eth3.
After a server reboot we previously saw MAC addresses drifting among the six NICs; that was fixed by pinning the MAC in each NIC's configuration file. I don't know whether that is related to this cluster problem.
Another eight x3650 servers with the same configuration as the x3850s have already passed the same failover test.
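The MAC-pinning fix mentioned above is normally an HWADDR line in each interface's ifcfg file. A minimal sketch for one bond slave (the MAC shown is a placeholder, not taken from these servers):

```
# /etc/sysconfig/network-scripts/ifcfg-eth0  (sketch)
# HWADDR ties the name eth0 to one physical NIC so the names cannot
# drift across reboots. 00:1A:2B:3C:4D:5E is a placeholder MAC.
DEVICE=eth0
HWADDR=00:1A:2B:3C:4D:5E
ONBOOT=yes
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
```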
My configuration:
Hostnames:
ynrhzf-db1  bond0: 192.168.141.11  bond1: 192.168.142.11
ynrhzf-db2  bond0: 192.168.141.12  bond1: 192.168.142.12



# cat /etc/hosts

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
192.168.141.11  db1.anypay.yn   ynrhzf-db1
192.168.141.12  db2.anypay.yn   ynrhzf-db2
192.168.141.10  ynrhzf-db                  # floating IP
192.168.142.11  pri-db1
192.168.142.12  pri-db2
192.168.141.103 imm-db1
192.168.141.104 imm-db2


# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
[  OK  ]

The cluster.conf configuration file:
[root@ynrhzf-db1 network-scripts]# cat /etc/cluster/cluster.conf
<?xml version="1.0" ?>
<cluster config_version="3" name="db-cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="db1.anypay.yn" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="imm-db1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="db2.anypay.yn" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="imm-db2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="227.0.0.10"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="192.168.141.103" login="USERID" name="imm-db1" passwd="PASSW0RD"/>
                <fencedevice agent="fence_rsa" ipaddr="192.168.141.104" login="USERID" name="imm-db2" passwd="PASSW0RD"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="db-failover" ordered="1" restricted="1">
                                <failoverdomainnode name="db1.anypay.yn" priority="1"/>
                                <failoverdomainnode name="db2.anypay.yn" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.141.10/24" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="db-failover" name="db-services">
                        <ip ref="192.168.141.10/24"/>
                </service>
        </rm>
</cluster>

Test results:
[root@ynrhzf-db1 network-scripts]# fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o status
Status: ON
# fence_rsa -a 192.168.141.104 -l USERID -p PASSW0RD -o status
Status: ON

Telnet to the remote management ports also works: after logging in, running the reset command on either IMM reboots the corresponding server.






Log from the network-disconnect test on db1 (captured on db2):

Mar 12 05:32:04 db2 openais: Members Joined:
Mar 12 05:32:04 db2 openais:      r(0) ip(192.168.141.11)  
Mar 12 05:32:04 db2 openais: This node is within the primary component and will provide service.
Mar 12 05:32:04 db2 openais: entering OPERATIONAL state.
Mar 12 05:32:04 db2 openais: got nodejoin message 192.168.141.11
Mar 12 05:32:04 db2 openais: got nodejoin message 192.168.141.12
Mar 12 05:32:04 db2 openais: got joinlist message from node 1
Mar 12 05:32:21 db2 kernel: dlm: Using TCP for communications
Mar 12 05:32:21 db2 kernel: dlm: got connection from 1
Mar 12 05:32:22 db2 clurgmgrd: <notice> Resource Group Manager Starting
Mar 12 05:36:50 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:36:51 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:36:51 db2 dhclient: bound to 169.254.95.120 -- renewal in 294 seconds.
Mar 12 05:36:57 db2 openais: The token was lost in the OPERATIONAL state.
Mar 12 05:36:57 db2 openais: Receive multicast socket recv buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais: Transmit multicast socket send buffer size (320000 bytes).
Mar 12 05:36:57 db2 openais: entering GATHER state from 2.
Mar 12 05:37:17 db2 openais: entering GATHER state from 0.
Mar 12 05:37:17 db2 openais: Creating commit token because I am the rep.
Mar 12 05:37:17 db2 openais: Saving state aru 3b high seq received 3b
Mar 12 05:37:17 db2 openais: Storing new sequence id for ring 18
Mar 12 05:37:17 db2 openais: entering COMMIT state.
Mar 12 05:37:17 db2 openais: entering RECOVERY state.
Mar 12 05:37:17 db2 openais: position member 192.168.141.12:
Mar 12 05:37:17 db2 openais: previous ring seq 20 rep 192.168.141.11
Mar 12 05:37:17 db2 openais: aru 3b high delivered 3b received flag 1
Mar 12 05:37:17 db2 openais: Did not need to originate any messages in recovery.
Mar 12 05:37:17 db2 openais: Sending initial ORF token
Mar 12 05:37:17 db2 openais: CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais: New Configuration:
Mar 12 05:37:17 db2 kernel: dlm: closing connection to node 1
Mar 12 05:37:17 db2 fenced: db1.anypay.yn not a cluster member after 0 sec post_fail_delay
Mar 12 05:37:17 db2 openais:      r(0) ip(192.168.141.12)  
Mar 12 05:37:17 db2 fenced: fencing node "db1.anypay.yn"
Mar 12 05:37:17 db2 openais: Members Left:
Mar 12 05:37:17 db2 openais:      r(0) ip(192.168.141.11)  
Mar 12 05:37:17 db2 openais: Members Joined:
Mar 12 05:37:17 db2 openais: CLM CONFIGURATION CHANGE
Mar 12 05:37:17 db2 openais: New Configuration:
Mar 12 05:37:17 db2 openais:      r(0) ip(192.168.141.12)  
Mar 12 05:37:17 db2 openais: Members Left:
Mar 12 05:37:17 db2 openais: Members Joined:
Mar 12 05:37:17 db2 openais: This node is within the primary component and will provide service.
Mar 12 05:37:17 db2 openais: entering OPERATIONAL state.
Mar 12 05:37:17 db2 openais: got nodejoin message 192.168.141.12
Mar 12 05:37:17 db2 openais: got joinlist message from node 2
Mar 12 05:37:28 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:37:28 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:37:28 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:37:29 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:37:29 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:37:29 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:38:03 db2 ccsd: Attempt to close an unopened CCS descriptor (3180).
Mar 12 05:38:03 db2 ccsd: Error while processing disconnect: Invalid request descriptor
Mar 12 05:38:03 db2 fenced: fence "db1.anypay.yn" success
Mar 12 05:38:04 db2 clurgmgrd: <notice> Taking over service service:db-services from down member db1.anypay.yn
Mar 12 05:38:06 db2 avahi-daemon: Registering new address record for 192.168.141.10 on bond0.
Mar 12 05:38:07 db2 clurgmgrd: <notice> Service service:db-services started
Mar 12 05:41:11 db2 kernel: usb 8-1: new low speed USB device using uhci_hcd and address 2
Mar 12 05:41:12 db2 kernel: usb 8-1: configuration #1 chosen from 1 choice
Mar 12 05:41:12 db2 kernel: input:   USB Keyboard as /class/input/input1
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Keyboard [  USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:12 db2 kernel: input:   USB Keyboard as /class/input/input2
Mar 12 05:41:12 db2 kernel: input: USB HID v1.10 Device [  USB Keyboard] on usb-0000:00:1d.2-1
Mar 12 05:41:32 db2 kernel: usb 8-1: USB disconnect, address 2
Mar 12 05:41:44 db2 dhclient: DHCPREQUEST on usb0 to 169.254.95.118 port 67
Mar 12 05:41:45 db2 dhclient: DHCPACK from 169.254.95.118
Mar 12 05:41:45 db2 dhclient: bound to 169.254.95.120 -- renewal in 252 seconds.
Mar 12 05:42:04 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:04 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:04 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:04 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:04 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:05 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:06 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:06 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:06 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:06 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:08 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:08 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:08 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:08 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:10 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:10 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:10 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:10 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:11 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:11 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:11 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:11 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:11 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:42:48 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:42:48 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:42:48 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:42:48 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:42:49 db2 kernel: igb: eth5 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth5.
Mar 12 05:42:49 db2 kernel: bonding: bond1: making interface eth5 the new active one.
Mar 12 05:42:49 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:42:49 db2 kernel: igb: eth4 NIC Link is Up 10 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:42:49 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:27 db2 kernel: igb: eth4 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Mar 12 05:43:27 db2 kernel: igb: eth5 NIC Link is Down
Mar 12 05:43:27 db2 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
Mar 12 05:43:27 db2 kernel: bonding: bond1: now running without any active interface !
Mar 12 05:43:29 db2 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:29 db2 kernel: bonding: bond1: link status definitely up for interface eth4.
Mar 12 05:43:29 db2 kernel: bonding: bond1: making interface eth4 the new active one.
Mar 12 05:43:29 db2 kernel: bonding: bond1: first active interface up!
Mar 12 05:43:30 db2 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 12 05:43:30 db2 kernel: bonding: bond1: link status definitely up for interface eth5.



I am not sure how much this configuration file matters:
# cat /etc/modprobe.conf
alias scsi_hostadapter megaraid_sas
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 lpfc
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding
options bond0 miimon=100 mode=1
alias bond1 bonding
options bond1 miimon=100 mode=0

### BEGIN UltraPath Driver Comments ###
remove upUpper if [ -d /proc/mpp ] && [ `ls -a /proc/mpp | wc -l` -gt 2 ]; then echo -e "Please Unload Physical HBA Driver prior to unloading upUpper."; else /sbin/modprobe -r --ignore-remove upUpper; fi
# Additional config info can be found in /opt/mpp/modprobe.conf.mppappend.
# The Above config info is needed if you want to make mkinitrd manually.
# Edit the '/etc/modprobe.conf' file and run 'upUpdate' to create Ramdisk dynamically.
### END UltraPath Driver Comments ###
options qla2xxx qlport_down_retry=5
options lpfc lpfc_nodev_tmo=30
alias eth2 e1000e
alias eth3 e1000e
alias eth4 igb
alias eth5 igb

This problem has been bugging me for many days now. Experts, please help me analyze it!
《Solution》

Neither the configuration nor the logs show any obvious problem.

But in /etc/hosts, change this part:
192.168.141.11  db1.anypay.yn   ynrhzf-db1
192.168.141.12  db2.anypay.yn   ynrhzf-db2
to:
192.168.141.11  db1.anypay.yn
192.168.141.12  db2.anypay.yn

This avoids fence errors after the network is cut. Also, the heartbeat links must not be directly connected; they have to go through a switch. As for the node powering off instead of rebooting when fenced, you will probably have to look into the server's BIOS settings.
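The reasoning behind the /etc/hosts change can be sanity-checked mechanically: the node name used in cluster.conf should sit on a hosts line with no extra aliases, so that cman's idea of the node and the system hostname cannot diverge. A sketch with a hypothetical helper name:

```shell
#!/bin/sh
# names_on_node_line: print how many hostnames share the hosts-file
# line containing the given cluster node name. A result of 1 means the
# entry is clean; more than 1 means extra aliases (like the ynrhzf-db1
# alias above) share the line.
names_on_node_line() {
    # $1 = hosts file, $2 = node name from cluster.conf
    awk -v n="$2" '$0 !~ /^#/ {
        for (i = 2; i <= NF; i++)
            if ($i == n) { print NF - 1; exit }
    }' "$1"
}
```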
《Solution》

Thanks. I'll try it tomorrow.
《Solution》

I looked through the BIOS but found nothing that needs changing. Frustrating.
《Solution》

The cluster's default action when it calls fence_rsa is reboot, so this is probably not something you can configure in the operating system. Keep digging.
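To rule the agent in or out, the reboot cycle can be driven by hand with the same tool used for the status check above, e.g. `fence_rsa -a 192.168.141.103 -l USERID -p PASSW0RD -o reboot`. Some fence agents also accept an explicit action per device in cluster.conf; whether the RHEL 5.5 fence_rsa honors it should be confirmed against `man fence_rsa` first. A sketch under that assumption:

```xml
<!-- Sketch only: older RHEL 5 agents read "option", newer ones read
     "action"; verify with `man fence_rsa` before relying on this. -->
<fencedevice agent="fence_rsa" ipaddr="192.168.141.103"
             login="USERID" name="imm-db1" passwd="PASSW0RD"
             option="reboot"/>
```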


