Does standalone ASM need an ASM diskgroup for OCR and voting files?

Note: the recovery procedure below also works when only the OCR or only the votedisk stored in ASM is lost. In 11.2 the OCR and votedisk are commonly placed in ASM, and ASM startup in turn depends on them, so losing either one leaves the cluster unable to start normally. Here we only discuss how to get CRS running again; if the lost diskgroup also held database data, recovering that data is outside the scope of this article.
Prerequisites: recovery requires that you still have as many ASM LUN disks as before the failure, and that an automatic OCR backup exists. By default the OCR is backed up automatically every 4 hours, so as long as you have not deleted $GI_HOME a usable backup is normally available. No votedisk backup is required.
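Before any risky maintenance you can also force a manual OCR backup rather than relying on the 4-hour cycle. A minimal sketch (run as root; Clusterware chooses the backup name and location itself):
[root@vrh1 ~]# ocrconfig -manualbackup
[root@vrh1 ~]# ocrconfig -showbackup manual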
Recovery scenario: use dd to wipe the headers of the diskgroup holding the OCR and votedisk, simulating diskgroup corruption.
1. Check the votedisk and the OCR backups
[root@vrh1 ~]# crsctl query css votedisk
File Universal Id                    File Name          Disk group
-----------------                    ---------          ----------
a853d6204bbc4feabfd8c73d4c3b3001 (/dev/asm-diskh) [SYSTEMDG]
a5bf0fbf21d1d9f58c4a6b (/dev/asm-diskg) [SYSTEMDG]
36e5c51ffa (/dev/asm-diski) [SYSTEMDG]
af337dbf6adaaa (/dev/asm-diskj) [SYSTEMDG]
3c4a349e2e304ff6bf64b2b1c9d9cf5d (/dev/asm-diskk) [SYSTEMDG]
Located 5 voting disk(s).
[grid@vrh1 ~]$ ocrconfig -showbackup
PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
/g01/11.2.0/maclean/grid/cdata/vrh-cluster/backup00.ocr
/g01/11.2.0/maclean/grid/cdata/vrh-cluster/backup01.ocr
/g01/11.2.0/maclean/grid/cdata/vrh-cluster/backup02.ocr
/g01/11.2.0/grid/cdata/vrh-cluster/day.ocr
/g01/11.2.0/grid/cdata/vrh-cluster/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available
2. Completely shut down the clusterware stack (OHASD) on all nodes:
crsctl stop has -f
3. GetAsmDH.sh ==> GetAsmDH.sh is a script that backs up the ASM disk headers.
Make it a habit: always back up the ASM disk headers before doing anything dangerous.
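If GetAsmDH.sh is not at hand, plain dd achieves the same thing. A minimal sketch that saves the first 1 MB (which contains the ASM disk header) of every candidate disk; the /tmp/HC directory simply mirrors where the script puts its dumps:
[grid@vrh1 ~]$ mkdir -p /tmp/HC
[grid@vrh1 ~]$ for d in /dev/asm-disk*; do dd if=$d of=/tmp/HC/$(basename $d).hdr bs=1024k count=1; done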
[grid@vrh1 ~]$ ./GetAsmDH.sh
############################################
1) Collecting Information About the Disks:
############################################
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 9 03:28:13 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
SQL> Connected.
SQL> SQL> SQL> SQL> SQL> SQL> SQL>
0 /dev/asm-diske
1 /dev/asm-diskd
0 /dev/asm-diskb
1 /dev/asm-diskc
2 /dev/asm-diskf
0 /dev/asm-diskh
1 /dev/asm-diskg
2 /dev/asm-diski
3 /dev/asm-diskj
4 /dev/asm-diskk
SQL> SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
-rw-r--r-- 1 grid oinstall 1048 Aug  9 03:28 /tmp/HC/asmdisks.lst
############################################
2) Generating asm_diskh.sh script.
############################################
-rwx------ 1 grid oinstall 666 Aug  9 03:28 /tmp/HC/asm_diskh.sh
############################################
3) Executing asm_diskh.sh script to generate dd dumps:
############################################
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_1_0.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_1_1.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_2_0.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_2_1.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_2_2.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_3_0.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_3_1.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_3_2.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_3_3.dd
-rw-r--r-- 1 grid oinstall 1048576 Aug  9 03:28 /tmp/HC/dsk_3_4.dd
############################################
4) Compressing dd dumps in the next format:
(asm_dd_header_all_.tar)
############################################
/tmp/HC/dsk_1_0.dd
/tmp/HC/dsk_1_1.dd
/tmp/HC/dsk_2_0.dd
/tmp/HC/dsk_2_1.dd
/tmp/HC/dsk_2_2.dd
/tmp/HC/dsk_3_0.dd
/tmp/HC/dsk_3_1.dd
/tmp/HC/dsk_3_2.dd
/tmp/HC/dsk_3_3.dd
/tmp/HC/dsk_3_4.dd
./GetAsmDH.sh: line 81: compress: command not found
ls: /tmp/HC/*.Z: No such file or directory
[grid@vrh1 ~]$
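If a header is later damaged, these 1 MB images give you a way back. A hedged sketch restoring dsk_3_0.dd, which per the listing above maps to /dev/asm-diskh; always double-check the disk-to-file mapping in /tmp/HC/asmdisks.lst before writing to a device:
[root@vrh1 ~]# dd if=/tmp/HC/dsk_3_0.dd of=/dev/asm-diskh bs=1024k count=1
On 11.x, kfed repair <disk> can also rebuild a corrupt header from the backup copy ASM keeps on disk, but the dd image works regardless of what is still readable.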
4. Use dd to corrupt the diskgroup containing the OCR and votedisk
[root@vrh1 ~]# dd if=/dev/zero of=/dev/asm-diskh bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0. seconds, 247 MB/s
[root@vrh1 ~]# dd if=/dev/zero of=/dev/asm-diskg bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0045179 seconds, 232 MB/s
[root@vrh1 ~]# dd if=/dev/zero of=/dev/asm-diski bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0. seconds, 223 MB/s
[root@vrh1 ~]# dd if=/dev/zero of=/dev/asm-diskj bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0. seconds, 305 MB/s
[root@vrh1 ~]# dd if=/dev/zero of=/dev/asm-diskk bs=1024k count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0053518 seconds, 196 MB/s
5. Try to restart HAS on one node
[root@vrh1 ~]# crsctl start has
CRS-4123: Oracle High Availability Services has been started.
But because the diskgroup holding the OCR and votedisk is gone, CSS cannot start, as the following logs show:
alertvrh1.log
03:35:41.207
[cssd(5162)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /g01/11.2.0/grid/log/vrh1/cssd/ocssd.log
03:35:56.240
[cssd(5162)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /g01/11.2.0/grid/log/vrh1/cssd/ocssd.log
03:36:11.284
[cssd(5162)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /g01/11.2.0/grid/log/vrh1/cssd/ocssd.log
03:36:26.305
[cssd(5162)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /g01/11.2.0/grid/log/vrh1/cssd/ocssd.log
03:36:41.328
[cssd(5162)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /g01/11.2.0/grid/log/vrh1/cssd/ocssd.log
03:40:26.662: [    CSSD][]clssnmReadDiscoveryProfile: voting file discovery string(/dev/asm*)
03:40:26.662: [    CSSD][]clssnmvDDiscThread: using discovery string /dev/asm* for initial discovery
03:40:26.662: [   SKGFD][]Discovery with str:/dev/asm*:
03:40:26.662: [   SKGFD][]UFS discovery with :/dev/asm*:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskf:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskb:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskj:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskh:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskc:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskd:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diske:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskg:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diski:
03:40:26.665: [   SKGFD][]Fetching UFS disk :/dev/asm-diskk:
03:40:26.665: [   SKGFD][]OSS discovery with :/dev/asm*:
03:40:26.665: [   SKGFD][]Handle 0xdf22a0 from lib :UFS:: for disk :/dev/asm-diskf:
03:40:26.665: [   SKGFD][]Handle 0xf412a0 from lib :UFS:: for disk :/dev/asm-diskb:
03:40:26.666: [   SKGFD][]Handle 0xf3a680 from lib :UFS:: for disk :/dev/asm-diskj:
03:40:26.666: [   SKGFD][]Handle 0xf93da0 from lib :UFS:: for disk :/dev/asm-diskh:
03:40:26.667: [    CSSD][]clssnmvDiskVerify: Successful discovery of 0 disks
03:40:26.667: [    CSSD][]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
03:40:26.667: [    CSSD][]clssnmvFindInitialConfigs: No voting files found
03:40:26.667: [    CSSD][](:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds
The formal procedure for recovering the diskgroup holding the OCR and votedisk is as follows:
1. Start the cluster with -excl -nocrs, which brings up the ASM instance without starting CRS:
[root@vrh1 vrh1]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'vrh1'
CRS-2676: Start of 'ora.mdnsd' on 'vrh1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'vrh1'
CRS-2676: Start of 'ora.gpnpd' on 'vrh1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'vrh1'
CRS-2672: Attempting to start 'ora.gipcd' on 'vrh1'
CRS-2676: Start of 'ora.cssdmonitor' on 'vrh1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'vrh1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'vrh1'
CRS-2672: Attempting to start 'ora.diskmon' on 'vrh1'
CRS-2676: Start of 'ora.diskmon' on 'vrh1' succeeded
CRS-2676: Start of 'ora.cssd' on 'vrh1' succeeded
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'vrh1'
CRS-2672: Attempting to start 'ora.ctssd' on 'vrh1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'vrh1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'vrh1'
CRS-2676: Start of 'ora.ctssd' on 'vrh1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'vrh1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'vrh1'
CRS-2676: Start of 'ora.asm' on 'vrh1' succeeded
2. Recreate the diskgroup that originally held the OCR and votedisk; note that compatible.asm must be 11.2:
[root@vrh1 vrh1]# su - grid
[grid@vrh1 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 9 04:16:58 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> create diskgroup systemdg high redundancy disk '/dev/asm-diskh','/dev/asm-diskg','/dev/asm-diski','/dev/asm-diskj','/dev/asm-diskk'
ATTRIBUTE 'compatible.rdbms' = '11.2', 'compatible.asm' = '11.2';
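A quick sanity check that the new diskgroup really carries the required compatibility attributes (a sketch against the standard ASM view):
SQL> select name, compatibility, database_compatibility from v$asm_diskgroup where name='SYSTEMDG';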
3. Restore the OCR from the OCR backup and verify with ocrcheck:
[root@vrh1 ~]# ocrconfig -restore /g01/11.2.0/grid/cdata/vrh-cluster/backup00.ocr
[root@vrh1 ~]# ocrcheck
Status of Oracle Cluster Registry is as follows :
         Total space (kbytes)     :
         Used space (kbytes)      :
         Available space (kbytes) :
         Device/File Name         : +SYSTEMDG
                                    Device/File integrity check succeeded
         Device/File not configured
         Device/File not configured
         Device/File not configured
         Device/File not configured
         Cluster registry integrity check succeeded
         Logical corruption check succeeded
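As an additional check, cluvfy can verify OCR integrity across nodes once the full stack is back up (a hedged suggestion; at this point only the local exclusive-mode stack is running, so save it for after step 5):
[grid@vrh1 ~]$ cluvfy comp ocr -n all -verbose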
4. Prepare to restore the votedisk; you may hit the following error:
[grid@vrh1 ~]$ crsctl replace votedisk +SYSTEMDG
CRS-4602: Failed 27 to add voting file 2e4e0febf5473d00dcc0388.
CRS-4602: Failed 27 to add voting file 4fa54bb0cc5c4fafbf1a9be.
CRS-4602: Failed 27 to add voting file a109ead9ea4e4f28bfea.
CRS-4602: Failed 27 to add voting file 042c9fbd71b54f5abfcd3ab.
CRS-4602: Failed 27 to add voting file 7b5a8cd24f954fafbf835adf.
Failed to replace voting disk group with +SYSTEMDG.
CRS-4000: Command Replace failed, or completed with errors.
You need to reset the ASM instance's asm_diskstring parameter, persist it, and restart ASM:
SQL> alter system set asm_diskstring='/dev/asm*';
System altered.
SQL> create spfile from memory;
File created.
SQL> startup force;
ORA-32004: obsolete or deprecated parameter(s) specified for ASM instance
ASM instance started
Total System Global Area
Fixed Size                  2227664 bytes
Variable Size
ASM diskgroups mounted
SQL> show parameter spfile
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      /g01/11.2.0/grid/dbs/spfile+AS
[grid@vrh1 trace]$ crsctl replace votedisk +SYSTEMDG
CRS-4256: Updating the profile
Successful addition of voting disk 85edc0e82d274f78bfc58cdc73b8c68a.
Successful addition of voting disk 201ffffc8ba44faabfe2efec2aa75840.
Successful addition of voting disk 6f2a25c589964faabff621ce.
Successful addition of voting disk 93ebbfc73d5.
Successful addition of voting disk 4f88bfbfbd31d8b3829f.
Successfully replaced voting disk group with +SYSTEMDG.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
5. Restart the HAS service and verify that the cluster is healthy:
[root@vrh1 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@vrh1 ~]# crsctl query css votedisk
File Universal Id                    File Name          Disk group
-----------------                    ---------          ----------
85edc0e82d274f78bfc58cdc73b8c68a (/dev/asm-diskh) [SYSTEMDG]
201ffffc8ba44faabfe2efec2aa75840 (/dev/asm-diskg) [SYSTEMDG]
6f2a25c589964faabff621ce (/dev/asm-diski) [SYSTEMDG]
93ebbfc73d5 (/dev/asm-diskj) [SYSTEMDG]
4f88bfbfbd31d8b3829f (/dev/asm-diskk) [SYSTEMDG]
Located 5 voting disk(s).
[root@vrh1 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.BACKUPDG.dg
ora.DATA.dg
ora.LISTENER.lsnr
ora.LSN_MACLEAN.lsnr
ora.SYSTEMDG.dg
OFFLINE OFFLINE
ora.net1.network
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
OFFLINE OFFLINE
OFFLINE OFFLINE
ora.scan1.vip
ora.vprod.db
ora.vrh1.vip
ora.vrh2.vip
INTERMEDIATE vrh1
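Resources still shown OFFLINE above (the SCAN listener, the vprod database) can be brought up individually. A hedged sketch using the resource names visible in the output:
[grid@vrh1 ~]$ srvctl start scan_listener
[oracle@vrh1 ~]$ srvctl start database -d vprod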
Case analysis: a physical-path failure on RAC shared storage makes the ASM diskgroup holding the OCR and votedisk inaccessible
Source: it-home
The customer's environment: two IBM x3850 servers running Oracle Linux 6.x x86_64 with an Oracle 11.2.0.4.0 RAC database. The shared storage is EMC, mirrored through EMC VPLEX storage virtualization, and the OS runs EMC's native multipathing software. The symptom: when a failover occurred inside VPLEX, the diskgroup holding the OCR and votedisk became inaccessible on one RAC node, the ora.crsd resource went offline, and the Grid Infrastructure cluster stack on that node went down. The database instance on that node kept running but accepted no new connections; the other node was completely unaffected. The relevant logs follow.
1. Operating system log:
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 4 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 2 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 3 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 1 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 0 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 11 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 12 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 10 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 9 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 8 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 7 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 5 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 3 Lun 6 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM port CL2-00 is dead.
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 1 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 12 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 11 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 10 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 7 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 4 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 8 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 9 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 5 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 3 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 6 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 2 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Path Bus 3 Tgt 2 Lun 0 to CKM
Mar 18 08:25:48 dzqddb01 kernel: Error:Mpx:Bus 3 to VPLEX CKM port CL2-04 is dead.
The OS log shows that at Mar 18 08:25:48 the two paths through port CL2-00 and port CL2-04 went dead.
2. ASM log:
Fri Mar 18 08:25:59 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.
Fri Mar 18 08:25:59 2016
NOTE: process _b000_+asm1 (66994) initiating offline of disk 0. (OCRVDISK_0000) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (66994) initiating offline of disk 1. (OCRVDISK_0001) with mask 0x7e in group 3
NOTE: process _b000_+asm1 (66994) initiating offline of disk 2. (OCRVDISK_0002) with mask 0x7e in group 3
NOTE: checking PST: grp = 3
GMON checking disk modes for group 3 at 10 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: checking PST for grp 3 done.
NOTE: initiating PST update: grp = 3, dsk = 0/0xbe3119c4, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 1/0xbe3119c3, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 3, dsk = 2/0xbe3119c2, mask = 0x6a, op = clear
GMON updating disk modes for group 3 at 11 for pid 48, osid 66994
ERROR: no read quorum in group: required 2, found 0 disks
Fri Mar 18 08:25:59 2016
NOTE: cache dismounting (not clean) group 3/0x3D81E95D (OCRVDISK)
WARNING: Offline for disk OCRVDISK_0000 in mode 0x7f failed.
WARNING: Offline for disk OCRVDISK_0001 in mode 0x7f failed.
WARNING: Offline for disk OCRVDISK_0002 in mode 0x7f failed.
NOTE: messaging CKPT to quiesce pins Unix process pid: 66996, image: oracle@dzqddb01 (B001)
Fri Mar 18 08:25:59 2016
NOTE: halting all I/Os to diskgroup 3 (OCRVDISK)
Fri Mar 18 08:25:59 2016
NOTE: LGWR doing non-clean dismount of group 3 (OCRVDISK)
NOTE: LGWR sync ABA=11.69 last written ABA 11.69
Fri Mar 18 08:25:59 2016
kjbdomdet send to inst 2
detach from dom 3, sending detach message to inst 2
Fri Mar 18 08:25:59 2016
List of instances:
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 96)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 3 invalid = TRUE
Fri Mar 18 08:25:59 2016
NOTE: No asm libraries found in the system
2 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Mar 18 08:25:59 2016
WARNING: dirty detached from domain 3
NOTE: cache dismounted group 3/0x3D81E95D (OCRVDISK)
SQL> alter diskgroup OCRVDISK dismount force /* ASM SERVER: */
Fri Mar 18 08:25:59 2016
NOTE: cache deleting context for group OCRVDISK 3/0x3d81e95d
GMON dismounting group 3 at 12 for pid 51, osid 66996
NOTE: Disk OCRVDISK_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCRVDISK_0002 in mode 0x7f marked for de-assignment
NOTE:Waiting for all pending writes to complete before de-registering: grpnum 3
ASM Health Checker found 1 new failures
3. Clusterware alert log:
11:53:19.394:
[crsd(47973)]CRS-1006:The OCR location +OCRVDISK is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.
11:53:38.437:
[/u01/app/11.2.0/grid/bin/oraagent.bin(48283)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:121} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
11:53:38.437:
[/u01/app/11.2.0/grid/bin/scriptagent.bin(80385)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/scriptagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:9:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.
11:53:38.437:
[/u01/app/11.2.0/grid/bin/orarootagent.bin(48177)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:5:3303} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/orarootagent_root/orarootagent_root.log.
11:53:38.437:
[/u01/app/11.2.0/grid/bin/oraagent.bin(48168)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:7} in /u01/app/11.2.0/grid/log/dzqddb01/agent/crsd/oraagent_grid/oraagent_grid.log.
11:53:38.442:
[ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.
11:53:39.773:
[crsd(45323)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.
11:53:39.779:
[crsd(45323)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/dzqddb01/crsd/crsd.log.
11:53:40.470:
[ohasd(47343)]CRS-2765:Resource 'ora.crsd' has failed on server 'dzqddb01'.
An obvious question arises: why did ora.crsd die while ora.cssd stayed up (crsctl stat res -t -init confirms that ora.cssd never went offline, the database instance kept running, and the node was not evicted)? The reason is that the disks behind OCRVDISK were only briefly inaccessible: the cssd process reads the three ASM disks underlying OCRVDISK directly and does not depend on the OCRVDISK diskgroup being mounted, and the default Clusterware disk-heartbeat timeout is 200 seconds, so cssd rode out the outage.
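Both facts are easy to confirm on a live system (a sketch; these are standard crsctl options):
[root@dzqddb01 ~]# crsctl get css disktimeout
[root@dzqddb01 ~]# crsctl stat res -t -init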
This raises further questions: why did the other RAC node see no failure at all, and why was only the OCRVDISK diskgroup dismounted while every other diskgroup stayed healthy?
After the problem occurred, restarting the HAS stack brought the node straight back to normal; together with the fact that the other diskgroups and the other node were unaffected, this suggests the shared storage itself was basically fine and only became briefly inaccessible after the paths dropped. The key clue is this line in the ASM instance log: "WARNING: Waited 15 secs for write IO to PST disk". Was the 15-second limit too short, causing OCRVDISK to be taken offline? Here is the explanation from MOS:
Generally this kind of message appears in the ASM alert log in the following situations:
Delayed ASM PST heartbeats on ASM disks in a normal or high redundancy diskgroup cause the ASM instance to dismount the diskgroup; by default the limit is 15 seconds.
Heartbeat delays are essentially ignored for an external redundancy diskgroup: the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, but the delays do not directly dismount an external redundancy diskgroup.
The ASM disk could go into unresponsiveness, normally in the following scenarios:
Some of the physical paths of the multipath device are offline or lost
During path 'failover' in a multipath set up
Server load, or any sort of storage/multipath/OS maintenance
This description broadly explains what happened: two storage paths went down (likely triggering a failover), the aggregated multipath device became briefly inaccessible, and because OCRVDISK is a normal-redundancy diskgroup ASM enforces the PST heartbeat check. The disks behind OCRVDISK stayed unresponsive for more than 15 seconds, so ASM force-dismounted OCRVDISK, the OCR file became inaccessible, and the crs service went OFFLINE. Because cssd's disk-heartbeat timeout is 200 seconds and it reads the ASM disks directly rather than through the mounted diskgroup, the css service was unaffected, the ohasd high-availability stack kept working, the node was not evicted, and the database instance kept running.
Oracle's suggested workaround at the database layer:
If you cannot keep the disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the cluster):
_asm_hbeatiowait
As per an internal bug, based on internal testing the value should be increased to 120 secs; that larger default is what ships in 12.1.0.2.
Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';
Then restart the ASM instance / CRS for the new parameter value to take effect.
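To confirm the value actually in effect after the restart, the usual hidden-parameter query works (a sketch; run as SYSASM, since X$ views require SYSDBA/SYSASM privileges):
SQL> select x.ksppinm name, y.ksppstvl value
       from x$ksppi x, x$ksppcv y
      where x.indx = y.indx and x.ksppinm = '_asm_hbeatiowait';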
To reduce exposure to this kind of problem, the OCR can additionally be mirrored into a different ASM diskgroup, which further improves the availability of the ora.crsd service.
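Adding an OCR mirror is a one-line operation as root (a hedged sketch; +DATA stands in for whatever second diskgroup is available in your environment):
[root@dzqddb01 ~]# ocrconfig -add +DATA
[root@dzqddb01 ~]# ocrcheck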
For more detail, see the MOS note "ASM diskgroup dismount with 'Waited 15 secs for write IO to PST' (Doc ID )".