When starting the cman service in RHCS, the nodes do not seem to detect each other

火星人 @ 2014-03-04, reply: 0

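When the cman service is started under RHCS, the two nodes do not seem to detect each other. Each node reports itself online and the other offline: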

# clustat
Cluster Status for gfscluster @ Fri Mar 16 15:36:01 2012
Member Status: Quorate

Member Name ID Status
------ ---- ---- ------
GDAPQE16 1 Online, Local
GDAPQE15 2 Offline
# clustat
Cluster Status for gfscluster @ Fri Mar 16 15:36:10 2012
Member Status: Quorate

Member Name ID Status
------ ---- ---- ------
GDAPQE16 1 Offline
GDAPQE15 2 Online, Local
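
The two reports above disagree about which node is online. A minimal sketch of how that comparison could be automated, assuming the plain-text clustat layout shown in this post (the helper names `parse_clustat` and `views_disagree` are made up for illustration and are not part of RHCS):

```python
import re

def parse_clustat(text):
    """Return {member_name: status} from one plain-text clustat report."""
    members = {}
    for line in text.splitlines():
        # Member rows look like "<name> <id> <status...>"; header rows have no numeric ID.
        m = re.match(r"^\s*(\S+)\s+(\d+)\s+(Online|Offline)\b(.*)$", line)
        if m:
            members[m.group(1)] = m.group(3) + m.group(4).rstrip()
    return members

def views_disagree(view_a, view_b):
    """True if some node is Online in one report and Offline in the other."""
    shared = view_a.keys() & view_b.keys()
    return any(
        view_a[n].startswith("Online") != view_b[n].startswith("Online")
        for n in shared
    )

# The two reports from this post, one taken on each node.
on_node16 = parse_clustat("GDAPQE16 1 Online, Local\nGDAPQE15 2 Offline\n")
on_node15 = parse_clustat("GDAPQE16 1 Offline\nGDAPQE15 2 Online, Local\n")
print(views_disagree(on_node16, on_node15))
```

When the reports disagree like this while both say "Quorate", each node has formed its own one-member cluster, which is consistent with the membership messages not reaching the other side.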

Configuration file (identical on both nodes):
<?xml version="1.0"?>
<cluster alias="gfscluster" config_version="3" name="gfscluster">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="GDAPQE16" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="manualfence" nodename="GDAPQE16"/>
</method>
</fence>
</clusternode>
<clusternode name="GDAPQE15" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="manualfence" nodename="GDAPQE15"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_manual" name="manualfence"/>
</fencedevices>
<rm>
<failoverdomains/>
<resources/>
</rm>
</cluster>
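
As a sanity check on the two-node settings, here is a small sketch that reads a cluster.conf string and reports inconsistencies. It assumes the usual cman convention that `two_node="1"` goes together with exactly two nodes and `expected_votes="1"`. The function names are hypothetical. The posted file passes these checks, so they do not explain the problem by themselves. Note that the log further down has ccsd reporting "version = 1" while the file shown here has `config_version="3"`, which may only mean the log predates later edits.

```python
import xml.etree.ElementTree as ET

def check_two_node(conf_xml):
    """Return a list of warnings about two-node settings in a cluster.conf string."""
    root = ET.fromstring(conf_xml)
    nodes = root.findall("./clusternodes/clusternode")
    cman = root.find("cman")
    warnings = []
    if cman is not None and cman.get("two_node") == "1":
        if len(nodes) != 2:
            warnings.append("two_node=1 but %d nodes are defined" % len(nodes))
        if cman.get("expected_votes") != "1":
            warnings.append("two_node=1 but expected_votes is not 1")
        if any(n.get("votes", "1") != "1" for n in nodes):
            warnings.append("two_node=1 but a node has votes other than 1")
    return warnings

def config_version(conf_xml):
    """Return the config_version attribute of the <cluster> element as an int."""
    return int(ET.fromstring(conf_xml).get("config_version"))
```

To compare the two nodes, read `/etc/cluster/cluster.conf` on each one, pass the contents to both functions, and check that the versions match.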



log:
Mar 16 14:08:01 GDAPQE15 kernel: DLM (built Mar 16 2010 21:53:04) installed
Mar 16 14:08:01 GDAPQE15 kernel: GFS2 (built Mar 16 2010 21:53:24) installed
Mar 16 14:08:01 GDAPQE15 kernel: Lock_DLM (built Mar 16 2010 21:53:29) installed
Mar 16 14:08:01 GDAPQE15 ccsd: Starting ccsd 2.0.115:
Mar 16 14:08:01 GDAPQE15 ccsd: Built: Mar 16 2010 10:28:57
Mar 16 14:08:01 GDAPQE15 ccsd: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Mar 16 14:08:01 GDAPQE15 ccsd: cluster.conf (cluster name = gfscluster, version = 1) found.
Mar 16 14:08:01 GDAPQE15 ccsd: Unable to sendto broadcast ipv6 socket, but inet_ntop returned NULL pointer: Cannot assign requested address
Mar 16 14:08:29 GDAPQE15 last message repeated 16 times
Mar 16 14:08:30 GDAPQE15 ccsd: Unable to connect to cluster infrastructure after 30 seconds.
Mar 16 14:08:31 GDAPQE15 ccsd: Unable to sendto broadcast ipv6 socket, but inet_ntop returned NULL pointer: Cannot assign requested address
Mar 16 14:08:41 GDAPQE15 last message repeated 6 times
Mar 16 14:08:41 GDAPQE15 openais: AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
Mar 16 14:08:41 GDAPQE15 openais: Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
Mar 16 14:08:41 GDAPQE15 openais: Copyright (C) 2006 Red Hat, Inc.
Mar 16 14:08:41 GDAPQE15 openais: AIS Executive Service: started and ready to provide service.
Mar 16 14:08:41 GDAPQE15 openais: Using default multicast address of 239.192.161.86
Mar 16 14:08:41 GDAPQE15 openais: Token Timeout (10000 ms) retransmit timeout (495 ms)
Mar 16 14:08:41 GDAPQE15 openais: token hold (386 ms) retransmits before loss (20 retrans)
Mar 16 14:08:41 GDAPQE15 openais: join (60 ms) send_join (0 ms) consensus (20000 ms) merge (200 ms)
Mar 16 14:08:41 GDAPQE15 openais: downcheck (1000 ms) fail to recv const (50 msgs)
Mar 16 14:08:41 GDAPQE15 openais: seqno unchanged const (30 rotations) Maximum network MTU 1402
Mar 16 14:08:41 GDAPQE15 openais: window size per rotation (50 messages) maximum messages per rotation (17 messages)
Mar 16 14:08:41 GDAPQE15 openais: send threads (0 threads)
Mar 16 14:08:41 GDAPQE15 openais: RRP token expired timeout (495 ms)
Mar 16 14:08:41 GDAPQE15 openais: RRP token problem counter (2000 ms)
Mar 16 14:08:41 GDAPQE15 openais: RRP threshold (10 problem count)
Mar 16 14:08:41 GDAPQE15 openais: RRP mode set to none.
Mar 16 14:08:41 GDAPQE15 openais: heartbeat_failures_allowed (0)
Mar 16 14:08:41 GDAPQE15 openais: max_network_delay (50 ms)
Mar 16 14:08:41 GDAPQE15 openais: HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
Mar 16 14:08:41 GDAPQE15 openais: Receive multicast socket recv buffer size (262142 bytes).
Mar 16 14:08:41 GDAPQE15 openais: Transmit multicast socket send buffer size (262142 bytes).
Mar 16 14:08:41 GDAPQE15 openais: The network interface is now up.
Mar 16 14:08:41 GDAPQE15 openais: Created or loaded sequence id 52.56.0.186.47 for this ring.
Mar 16 14:08:41 GDAPQE15 openais: entering GATHER state from 15.
Mar 16 14:08:41 GDAPQE15 openais: CMAN 2.0.115 (built Mar 16 2010 10:29:01) started
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais CMAN membership service 2.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais extended virtual synchrony service'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais cluster membership service B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais availability management framework B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais checkpoint service B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais event service B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais distributed locking service B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais message service B.01.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais configuration service'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais cluster closed process group service v1.01'
Mar 16 14:08:41 GDAPQE15 openais: Service initialized 'openais cluster config database access v1.01'
Mar 16 14:08:41 GDAPQE15 openais: Not using a virtual synchrony filter.
Mar 16 14:08:41 GDAPQE15 openais: Creating commit token because I am the rep.
Mar 16 14:08:41 GDAPQE15 openais: Saving state aru 0 high seq received 0
Mar 16 14:08:41 GDAPQE15 openais: Storing new sequence id for ring 38
Mar 16 14:08:41 GDAPQE15 openais: entering COMMIT state.
Mar 16 14:08:41 GDAPQE15 openais: entering RECOVERY state.
Mar 16 14:08:41 GDAPQE15 openais: position member 56.0.186.47:
Mar 16 14:08:41 GDAPQE15 openais: previous ring seq 52 rep 56.0.186.47
Mar 16 14:08:41 GDAPQE15 openais: aru 0 high delivered 0 received flag 1
Mar 16 14:08:41 GDAPQE15 openais: Did not need to originate any messages in recovery.
Mar 16 14:08:41 GDAPQE15 openais: Sending initial ORF token
Mar 16 14:08:41 GDAPQE15 openais: CLM CONFIGURATION CHANGE
Mar 16 14:08:41 GDAPQE15 openais: New Configuration:
Mar 16 14:08:41 GDAPQE15 openais: Members Left:
Mar 16 14:08:41 GDAPQE15 openais: Members Joined:
Mar 16 14:08:41 GDAPQE15 openais: CLM CONFIGURATION CHANGE
Mar 16 14:08:41 GDAPQE15 openais: New Configuration:
Mar 16 14:08:41 GDAPQE15 openais: r(0) ip(56.0.186.47)
Mar 16 14:08:41 GDAPQE15 openais: Members Left:
Mar 16 14:08:41 GDAPQE15 openais: Members Joined:
Mar 16 14:08:41 GDAPQE15 openais: r(0) ip(56.0.186.47)
Mar 16 14:08:41 GDAPQE15 openais: This node is within the primary component and will provide service.
Mar 16 14:08:41 GDAPQE15 openais: entering OPERATIONAL state.
Mar 16 14:08:41 GDAPQE15 openais: quorum regained, resuming activity
Mar 16 14:08:41 GDAPQE15 openais: got nodejoin message 56.0.186.47
Mar 16 14:08:42 GDAPQE15 ccsd: Initial status:: Quorate
Mar 16 14:09:31 GDAPQE15 fenced: GDAPQE16 not a cluster member after 3 sec post_join_delay
Mar 16 14:09:31 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:31 GDAPQE15 fenced: fence "GDAPQE16" failed
Mar 16 14:09:36 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:36 GDAPQE15 fenced: fence "GDAPQE16" failed
Mar 16 14:09:41 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:41 GDAPQE15 fenced: fence "GDAPQE16" failed
Mar 16 14:09:46 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:46 GDAPQE15 fenced: fence "GDAPQE16" failed
Mar 16 14:09:51 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:51 GDAPQE15 fenced: fence "GDAPQE16" failed
Mar 16 14:09:56 GDAPQE15 fenced: fencing node "GDAPQE16"
Mar 16 14:09:56 GDAPQE15 fenced: fence "GDAPQE16" failed
After that, it just keeps trying to fence the other node.
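
To pull the relevant facts out of a long log like the one above, a sketch along these lines may help. It only matches message formats that appear in this post (the multicast address line, the `r(0) ip(...)` membership lines and the fenced failure lines). The function name `summarize_cman_log` is made up for illustration.

```python
import re

def summarize_cman_log(lines):
    """Collect the multicast address, ring members and fence failures from syslog lines."""
    summary = {"multicast": None, "members": set(), "fence_failures": {}}
    for line in lines:
        m = re.search(r"Using default multicast address of (\S+)", line)
        if m:
            summary["multicast"] = m.group(1)
        m = re.search(r"r\(0\) ip\(([^)]+)\)", line)
        if m:
            summary["members"].add(m.group(1))
        m = re.search(r'fenced: fence "([^"]+)" failed', line)
        if m:
            node = m.group(1)
            summary["fence_failures"][node] = summary["fence_failures"].get(node, 0) + 1
    return summary
```

Run over the log in this post, the member set contains only 56.0.186.47, the local node. The ring was formed without ever hearing from GDAPQE16, and fenced then fails repeatedly against it.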
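《Solution》

Has the original poster solved this? I recently hit a similar problem: after starting a two-node cluster, the nodes keep fencing each other. What is going on?
《Solution》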

Mar 16 14:08:41 GDAPQE15 openais: Token Timeout (10000 ms) retransmit timeout (495 ms)
Mar 16 14:08:41 GDAPQE15 openais: token hold (386 ms) retransmits before loss (20 retrans)
Mar 16 14:08:41 GDAPQE15 openais: join (60 ms) send_join (0 ms) consensus (20000 ms) merge (200 ms)
Mar 16 14:08:41 GDAPQE15 openais: downcheck (1000 ms) fail to recv const (50 msgs)
Mar 16 14:08:41 GDAPQE15 openais: seqno unchanged const (30 rotations) Maximum network MTU 1402
Mar 16 14:08:41 GDAPQE15 openais: window size per rotation (50 messages) maximum messages per rotation (17 messages)
Mar 16 14:08:41 GDAPQE15 openais: send threads (0 threads)
Mar 16 14:08:41 GDAPQE15 openais: RRP token expired timeout (495 ms)
Mar 16 14:08:41 GDAPQE15 openais: RRP token problem counter (2000 ms)
Mar 16 14:08:41 GDAPQE15 openais: RRP threshold (10 problem count)
Mar 16 14:08:41 GDAPQE15 openais: RRP mode set to none.

Could you describe your network environment and topology?

《Solution》

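Last edited by xwzh2009 on 2012-05-05 19:05

My network environment is as follows: a two-node cluster whose heartbeat runs through a layer-3 switch, with a fence device configured.
When both nodes start at the same time, the cluster comes up successfully. But after one node is rebooted abnormally, it can no longer join the cluster, and it then fences the other node. When that node comes back up, it cannot join the cluster either, so it fences the first node in turn, and the two keep fencing each other like this.

I have now found something new. I removed the fence device so that the cluster runs without fencing, and then rebooted one node. At first it could not join the cluster, and since there was no fence device, the fence operation failed. After about four and a half minutes, it was able to discover the other node again. I captured packets with tcpdump: during those four and a half minutes the node kept sending multicast messages, then it issued an ARP request and found the heartbeat. I rebooted twice and saw the same behavior both times. What puzzles me is what this four-and-a-half-minute interval is and what the cluster is doing during it. Is the cause the switch or the cluster itself?
《Solution》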

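"During those four and a half minutes the node kept sending multicast messages"

How well does that switch support multicast? Has any multicast-related configuration been done on it? The safest option is to find a crossover cable and connect the heartbeat NICs directly.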
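
One way to check whether multicast actually gets through the switch, independently of the cluster stack, is a small probe like the sketch below. The group address is the one openais printed in the log earlier in this thread. The port is an arbitrary test port chosen here to avoid the running cluster, and the probe format and function names are made up for illustration. A TTL of 1 keeps the probes on the local subnet, so raise it if the layer-3 switch routes between the two nodes.

```python
import socket
import struct
import time

GROUP = "239.192.161.86"  # multicast address from the openais log in this thread
PORT = 15405              # assumption: arbitrary test port, not the one cman uses

def make_probe(seq, sender):
    """Build one probe datagram carrying a sequence number and sender name."""
    return ("probe %d from %s" % (seq, sender)).encode("ascii")

def parse_probe(data):
    """Return (seq, sender) from a datagram built by make_probe."""
    parts = data.decode("ascii").split()
    return int(parts[1]), parts[3]

def open_receiver(group=GROUP, port=PORT, iface_ip="0.0.0.0"):
    """Open a UDP socket that has joined the multicast group on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def send_probes(count, sender, group=GROUP, port=PORT, ttl=1):
    """Send one probe per second to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    for seq in range(count):
        sock.sendto(make_probe(seq, sender), (group, port))
        time.sleep(1)
    sock.close()
```

On one node, call `open_receiver()` and loop on `parse_probe(sock.recvfrom(1024)[0])`. On the other, call `send_probes(600, "GDAPQE15")` right after a reboot. If the receiver only starts seeing probes after several minutes, the delay is in the network path and not in cman.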

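《Solution》

sleepcat wrote on 2012-05-06 05:12
How well does that switch support multicast? Has any multicast-related configuration been done on it? The safest option is to find a crossover cable and connect the heartbeat NICs directly.

I'm not sure about the switch settings; everything is at its defaults. What do you mean by the switch's multicast support? Does the switch really need special configuration?
Also, the cluster heartbeat has to go through a switch so that the cluster can be expanded later.
Thanks.
《Solution》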

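xwzh2009 wrote on 2012-05-07 21:54
I'm not sure about the switch settings; everything is at its defaults. What do you mean by the switch's multicast support? Does the switch really need special ...

Some switches do not allow multicast by default. I'm not very familiar with networking either; this is what someone else told me.

《Solution》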

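sleepcat wrote on 2012-05-07 22:12
Some switches do not allow multicast by default. I'm not very familiar with networking either; this is what someone else told me.

But after a little over four minutes the nodes could detect each other's heartbeat again. That suggests the switch is not blocking multicast outright; otherwise the nodes could never establish a heartbeat at all.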

http://coctec.com/docs/service/show-post-4571.html