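Disabling HAIP on Oracle 11.2.0.4
Symptom:
After a node went down, it could not restart on its own; the interconnect (heartbeat) NIC had to be unplugged and replugged several times before the node would start automatically. The preliminary diagnosis was that an unexplained HAIP failure was preventing CRS from starting on one node.
1 Check the network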
[grid@gmdb1 trace]$ oifcfg iflist -p -n
bond0 22.1.32.0 UNKNOWN 255.255.254.0
bond1 1.255.255.0 UNKNOWN 255.255.255.0
bond1 169.254.0.0 UNKNOWN 255.255.0.0
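The 169.254.0.0 subnet on bond1 in the listing above is the link-local range that HAIP uses, so it marks the interface carrying the private interconnect. As a minimal sketch, the filter below prints such interfaces. The helper name haip_ifaces is hypothetical, and the four-column layout (interface, subnet, type, netmask) is assumed from the output shown here.

```shell
# Hypothetical helper: print interfaces that carry a 169.254.x.x (HAIP) subnet.
# Reads `oifcfg iflist -p -n` style lines on stdin:
#   <interface> <subnet> <type> <netmask>
haip_ifaces() {
    awk '$2 ~ /^169\.254\./ { print $1 }'
}

# Demonstration with the three lines from the listing above.
printf '%s\n' \
    'bond0 22.1.32.0 UNKNOWN 255.255.254.0' \
    'bond1 1.255.255.0 UNKNOWN 255.255.255.0' \
    'bond1 169.254.0.0 UNKNOWN 255.255.0.0' | haip_ifaces
# prints: bond1
```

On a cluster node the same filter could be fed directly, for example `oifcfg iflist -p -n | haip_ifaces`.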
2 Check CRS
[root@gmdb2 tmp]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
3 Check resources: ASM and HAIP fail to start:
[root@gmdb2 tmp]# crsctl stat res -t -init
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
Cluster Resources
ora.asm
      1        ONLINE  OFFLINE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
4 Check multicast with mcasttest.pl; no problems found:
[grid@gmdb2 mcasttest]$ perl mcasttest.pl -n gmdb2,gmdb1 -i bond0,bond1
########### Setup for node gmdb2 ##########
Checking node access 'gmdb2'
Checking node login 'gmdb2'
Checking/Creating Directory /tmp/mcasttest for binary on node 'gmdb2'
Distributing mcast2 binary to node 'gmdb2'
########### Setup for node gmdb1 ##########
Checking node access 'gmdb1'
Checking node login 'gmdb1'
Checking/Creating Directory /tmp/mcasttest for binary on node 'gmdb1'
Distributing mcast2 binary to node 'gmdb1'
########### testing Multicast on all nodes ##########
Test for Multicast address 230.0.1.0
11月 28 16:42:02 | Multicast Succeeded for bond0 using address 230.0.1.0:42000
11月 28 16:42:03 | Multicast Succeeded for bond1 using address 230.0.1.0:42001
Test for Multicast address 224.0.0.251
11月 28 16:42:04 | Multicast Succeeded for bond0 using address 224.0.0.251:42002
11月 28 16:42:05 | Multicast Succeeded for bond1 using address 224.0.0.251:42003
5 Check the CSSD log
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: begin on node(2), waittime 193000
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: set curtime (1040905644) for my node
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: scanning 32 nodes
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: Node gmdb1, number 1, is in an existing cluster with disk state 3
2017-11-28 11:48:02.797: [ CSSD][2139567872]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2017-11-28 11:48:02.808: [ CSSD][2358462208]clssnmvDHBValidateNcopy: node 1, gmdb1, has a disk HB, but no network HB, DHB has rcfg 405549564, wrtcnt, 39931581, LATS 1040905654, lastSeqNo 39931578, uniqueness 1510056501, timestamp 1511840882/1783220964
2017-11-28 11:48:03.287: [ CSSD][2144298752]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2017-11-28 11:48:03.782: [ CSSD][2363209472]clssnmvDHBValidateNcopy: node 1, gmdb1, has a disk HB, but no network HB, DHB has rcfg 405549564, wrtcnt, 39931583, LATS 1040906624,
The log contains a large number of entries reporting a disk heartbeat but no network heartbeat.
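To put a number on "a large number of entries", the messages can be counted per node. The following is only a sketch: the helper name count_missing_network_hb is hypothetical, and it assumes the message format seen in the excerpt above (`node <number>, <name>, has a disk HB, but no network HB`).

```shell
# Hypothetical helper: count "disk HB, but no network HB" entries per node.
# Reads CSSD log lines on stdin and prints "<node name> <count>".
count_missing_network_hb() {
    awk '/has a disk HB, but no network HB/ {
        # The node name follows "node <number>, " and ends at the next comma.
        if (match($0, /node [0-9]+, [^,]+,/)) {
            name = substr($0, RSTART, RLENGTH)
            sub(/^node [0-9]+, /, "", name)
            sub(/,$/, "", name)
            count[name]++
        }
    }
    END { for (name in count) print name, count[name] }'
}
```

It reads from standard input, so it could be run as `count_missing_network_hb < /path/to/cssd.log`, where the path is whatever location the CSSD log has on your system.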
Check which interconnect the database is using:
SQL> select * from v$cluster_interconnects;
NAME IP_ADDRESS IS_ SOURCE
eth2:1 169.254.134.65 NO
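The interconnect is running over HAIP (a 169.254.x.x address), but the local HAIP resource cannot start, which keeps CSSD from coming up. Check the HAIP resource and its dependencies: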
[root@12crac2 ~]# crsctl stat res ora.cluster_interconnect.haip -init -f
NAME=ora.cluster_interconnect.haip
TYPE=ora.haip.type
STATE=OFFLINE
TARGET=ONLINE
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:grid:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
ACTIVE_PLACEMENT=0
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
AUTO_START=always
CARDINALITY=1
CARDINALITY_ID=0
CHECK_INTERVAL=30
CREATION_SEED=15
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION="Resource type for a Highly Available network IP"
ENABLED=0
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
ID=ora.cluster_interconnect.haip
LOAD=1
LOGGING_LEVEL=1
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=0
PLACEMENT=balanced
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
SERVER_POOLS=
START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)
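The START_DEPENDENCIES attribute is the part of this listing that matters for the workaround: its hard(...) clause names the resources that must be running before this one can start. As a rough sketch (the helper name hard_start_deps is hypothetical, and only the first hard(...) clause is handled), the names can be pulled out of the `-f` output like this:

```shell
# Hypothetical helper: list the resources named in the hard(...) clause of
# START_DEPENDENCIES, one per line. Reads `crsctl stat res ... -f` output on stdin.
hard_start_deps() {
    sed -n 's/^START_DEPENDENCIES=.*hard(\([^)]*\)).*/\1/p' | tr ',' '\n'
}

# Demonstration with the attribute line from the listing above.
printf '%s\n' 'START_DEPENDENCIES=hard(ora.gpnpd,ora.cssd)pullup(ora.cssd)' | hard_start_deps
# prints:
# ora.gpnpd
# ora.cssd
```

The same filter can be applied to the ora.asm resource to see whether ora.cluster_interconnect.haip appears among its hard dependencies.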
Temporary workaround:
Once the interconnect network itself has been confirmed to be healthy,
disable HAIP:
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)', STOP_DEPENDENCIES='hard(intermediate:ora.cssd)' " -init
After the change, check again:
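The original does not show the commands used for this re-check. One possible way to verify, sketched under that assumption, is to reuse the `crsctl stat res <resource> -init -f` form shown earlier and test its output. Both helper names below are hypothetical.

```shell
# Hypothetical helpers for the re-check. Each reads the output of
# `crsctl stat res <resource> -init -f` on stdin.

# Succeeds when the resource is disabled (ENABLED=0).
haip_is_disabled() {
    grep -q '^ENABLED=0$'
}

# Succeeds when START_DEPENDENCIES still mentions the HAIP resource.
asm_depends_on_haip() {
    grep '^START_DEPENDENCIES=' | grep -qF 'ora.cluster_interconnect.haip'
}

# Intended use on a cluster node, as root:
#   crsctl stat res ora.cluster_interconnect.haip -init -f | haip_is_disabled \
#       && echo "HAIP is disabled"
#   crsctl stat res ora.asm -init -f | asm_depends_on_haip \
#       || echo "ora.asm no longer depends on HAIP"
```

If both messages appear, the two crsctl modify commands above have taken effect; running `crsctl stat res -t -init` again then shows whether ora.asm comes ONLINE.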
Related articles on MOS:
Known Issues: Grid Infrastructure Redundant Interconnect and ora.cluster_interconnect.haip (Doc ID 1640865.1)
HAIP-related bugs listed on MOS