Notes on Recovering a 1.4 TB ASM (RAC) Database After Disk Corruption

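This week I spent two days helping a customer recover a 10.2.0.5 RAC database on ASM, nearly 1.4 TB in size. The database crashed outright on March 4.
As the database alert log shows, the instance first reported errors accessing the redo logs and the control file. The underlying cause was that the diskgroup had been dismounted: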

tue mar 04 18:09:59 cst 2014
errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lgwr_15943.trc:
ora-00345: redo log write error block 68145 count 5
ora-00312: online log 6 thread 2:'+data/xxxx/onlinelog/o2_t2_redo3.log'
ora-15078: asm diskgroup was forcibly dismounted
tue mar 04 18:09:59 cst 2014
success: diskgroup data was dismounted
success: diskgroup data was dismounted
tue mar 04 18:10:00 cst 2014
errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_lmon_15892.trc:
ora-00202: control file:'+data/xxxx/controlfile/o1_mf_4g1zr1yo_.ctl'
ora-15078: asm diskgroup was forcibly dismounted
tue mar 04 18:10:00 cst 2014
kcf: write/open error block=0x1f41e online=1
file=31 +data/xxxx/datafile/apps_ts_queues.310.692585175
error=15078 txt:''
tue mar 04 18:10:00 cst 2014
kcf: write/open error block=0x47d5d online=1
file=51 +data/xxx/datafile/apps_ts_tx_data.353.692593409
error=15078 txt:''
tue mar 04 18:10:00 cst 2014
errors in file /home/oraprod/10.2.0/db/admin/xxxx/bdump/xxxx_dbw2_15939.trc:
ora-00202: control file:'+data/prod/controlfile/o1_mf_4g1zr1yo_.ctl'
ora-15078: asm diskgroup was forcibly dismounted
tue mar 04 18:10:00 cst 2014
kcf: write/open error block=0x47d5b online=1
file=51 +data/prod/datafile/apps_ts_tx_data.353.692593409
error=15078 txt:''
tue mar 04 18:10:00 cst 2014

After the database instance went down, we looked at the alert log of the ASM instance:

tue mar 04 18:10:04 cst 2014
note: smon starting instance recovery for group 1 (mounted)
tue mar 04 18:10:04 cst 2014
warning: io failed. au:0 diskname:/dev/raw/raw5
rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
status:2
warning: io failed. au:0 diskname:/dev/raw/raw5
rq:0x200000000207b518 buffer:0x200000000235c600 au_offset(bytes):0 iosz:4096 operation:0
status:2
note: f1x0 found on disk 0 fcn 0.160230519
warning: io failed. au:33 diskname:/dev/raw/raw5
rq:0x60000000002d64f0 buffer:0x400405df000 au_offset(bytes):0 iosz:4096 operation:0
status:2
warning: cache failed to read gn 1 fn 3 blk 10752 count 1 from disk 2
error: cache failed to read fn=3 blk=10752 from disk(s): 2
ora-15081: failed to submit an i/o operation to a disk
note: cache initiating offline of disk 2 group 1
warning: process 12863 initiating offline of disk 2.2526420198 (data_0002) with mask 0x3 in group 1
note: pst update: grp = 1, dsk = 2, mode = 0x6
tue mar 04 18:10:04 cst 2014
error: too many offline disks in pst (grp 1)
tue mar 04 18:10:04 cst 2014
error: pst-initiated mandatory dismount of group data
tue mar 04 18:10:04 cst 2014
warning: disk 2 in group 1 in mode: 0x7,state: 0x2 was taken offline
tue mar 04 18:10:05 cst 2014
note: halting all i/os to diskgroup data
note: active pin found: 0x0x40045bb0fd0
tue mar 04 18:10:05 cst 2014
abort recovery for domain 1
tue mar 04 18:10:05 cst 2014
note: cache dismounting group 1/0xd916ec16 (data)
tue mar 04 18:10:06 cst 2014

As you can see, ASM raised ORA-15081, and just before that error it reported failed I/O operations against one of the disks, /dev/raw/raw5.
Attentive readers will notice that after the I/O failures the disk (disk 2, data_0002) was taken offline. ASM then found too many offline disks in the PST and forcibly dismounted the diskgroup, which afterwards could no longer be mounted.
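
When an alert log is long, it can help to pull out the failed I/Os per disk before reading it line by line. The following is only a minimal sketch in Python: the sample lines are copied from the ASM alert log above, while the function names `failed_ios` and `failures_per_disk` are made up for this illustration and are not part of any Oracle tool. It assumes the disk name is the last token on the warning line, as in the log above.

```python
import re
from collections import Counter

# Sample lines copied from the ASM alert log quoted in this post.
SAMPLE_LOG = """\
warning: io failed. au:0 diskname:/dev/raw/raw5
warning: io failed. au:0 diskname:/dev/raw/raw5
note: f1x0 found on disk 0 fcn 0.160230519
warning: io failed. au:33 diskname:/dev/raw/raw5
ora-15081: failed to submit an i/o operation to a disk
"""

# Matches "io failed. au:<n> diskname:<path>"; case-insensitive because
# real alert logs use mixed case.
IO_FAILED = re.compile(r"io failed\. au:(\d+) diskname:(\S+)", re.IGNORECASE)


def failed_ios(log_text):
    """Return a list of (diskname, au) tuples, one per 'io failed' warning."""
    return [(m.group(2), int(m.group(1))) for m in IO_FAILED.finditer(log_text)]


def failures_per_disk(log_text):
    """Count 'io failed' warnings per disk name."""
    return Counter(disk for disk, _ in failed_ios(log_text))


if __name__ == "__main__":
    for disk, count in failures_per_disk(SAMPLE_LOG).items():
        print(disk, count)
```

In this incident every failed I/O points at the same device, which matches the disk that ASM subsequently took offline.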
In our tests, kfed read could not read this disk, and dd against it failed as well. dd against the physical disk underlying it, however, worked fine.
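
Because the physical disk was still readable with dd, its leading blocks can be copied to a file and inspected offline. Below is a rough Python sketch of such a check, not the procedure we actually ran. It decodes the generic block header (the kfbh fields) at the offsets printed in the trace dump quoted in this post, and lists which leading blocks have type 0 (kfbtyp_invalid). The 4096-byte block size is taken from iosz:4096 in the alert log; reading kfbh.endian == 1 as little-endian is my assumption, and `parse_kfbh` and `invalid_blocks` are hypothetical helper names.

```python
import struct

BLOCK_SIZE = 4096  # metadata block size; matches iosz:4096 in the ASM alert log


def parse_kfbh(block):
    """Decode the kfbh header fields at the start of one metadata block.

    Offsets follow the trace dump: endian@0x00, hard@0x01, type@0x02,
    datfmt@0x03, block.blk@0x04, block.obj@0x08, check@0x0c,
    fcn.base@0x10, fcn.wrap@0x14.
    """
    endian, hard, btype, datfmt = struct.unpack_from("4B", block, 0)
    # Assumption: kfbh.endian == 1 means little-endian, otherwise big-endian.
    order = "<" if endian == 1 else ">"
    blk, obj, check, fcn_base, fcn_wrap = struct.unpack_from(order + "5I", block, 4)
    return {
        "endian": endian, "hard": hard, "type": btype, "datfmt": datfmt,
        "blk": blk, "obj": obj, "check": check,
        "fcn_base": fcn_base, "fcn_wrap": fcn_wrap,
    }


def invalid_blocks(image, count):
    """Return indexes of the first `count` blocks whose kfbh.type is 0."""
    bad = []
    for i in range(count):
        block = image[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        if len(block) < BLOCK_SIZE:
            break  # image is shorter than requested
        if parse_kfbh(block)["type"] == 0:
            bad.append(i)
    return bad


if __name__ == "__main__":
    # Synthetic image: two zeroed blocks followed by one block with a
    # non-zero type byte. Real data would come from a dd copy of the disk.
    good = bytes([1, 0, 1, 0]) + bytes(BLOCK_SIZE - 4)
    image = bytes(2 * BLOCK_SIZE) + good
    print(invalid_blocks(image, 3))
```

A type of 0 only says that the header is not a recognizable ASM block; it does not tell you what overwrote it.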
The diskgroup could not be mounted, and the trace file clearly points to a corrupted disk header:

warning: cache read a corrupted block gn=1 dsk=2 blk=1 from disk 2
osm metadata block dump:
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: kfbtyp_invalid
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: t=0 numb=0x0
kfbh.block.obj: 0 ; 0x008: type=0x0 numb=0x0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
ce: (0x0x400417ee4e0) group=1 (data) obj=2 (disk) blk=1
hashflags=0x0002 lid=0x0002 lruflags=0x0000 bastcount=1
redundancy=0x11 fileextent=-2147483648 auindex=0 blockindex=1
copy#0: disk=2 au=0
bh: (0x0x40041795000) bnum=4586 type=reading state=reading chgst=not modifying
flags=0x00000000 pinmode=excl lockmode=share bf=0x0x40041400000
kfbh_kfcbh.fcn_kfbh = 0.0 lowaba=655.8572 highaba=0.0
last kfcbinitslot return code=null cpkt lnk is null

As is well known, starting with 10.2.0.5 Oracle ASM automatically keeps a backup copy of the ASM disk header, so if only the disk header were damaged, recovery would be easy. It was not that simple here, though: checking with dd showed that the first several blocks of the disk were in fact damaged.
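
For reference, the automatic backup of the disk header is, as far as I know, kept in the second-to-last metadata block of allocation unit 1. The small Python sketch below just works out where that is. The 1 MB AU size and the 4096-byte block size are assumptions (the post does not state this diskgroup's AU size), and `backup_header_location` is a name invented for the example.

```python
def backup_header_location(au_size=1024 * 1024, block_size=4096):
    """Return (au, block_in_au, byte_offset) of the backup disk header.

    Assumes the backup sits in the second-to-last block of AU 1.
    """
    blocks_per_au = au_size // block_size
    block_in_au = blocks_per_au - 2           # second-to-last block of the AU
    byte_offset = 1 * au_size + block_in_au * block_size
    return 1, block_in_au, byte_offset


if __name__ == "__main__":
    au, blkn, offset = backup_header_location()
    # With the assumed defaults: AU 1, block 254, i.e. block 510 from the
    # start of the disk when counting in 4096-byte blocks.
    print(au, blkn, offset, offset // 4096)
```

Under these assumptions the backup lives well past the first few blocks, which is why a header-only corruption is normally easy to repair. In this case more than the header block was lost, so restoring the header alone would not have been enough.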
In the end we used ODU to copy the datafiles straight from the disks to a file system, then started the database, which completed the recovery.
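Note: during the recovery we found that ODU could not directly copy files with names like test201402.dbf. Checking the ASM alias directory showed that it was actually intact, so ODU may still have a small issue in how it handles this case. We read the AU holding that metadata by hand, matched up the entries, and extracted all the remaining files, including the redo logs and control file, after which the database opened without any trouble.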
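
Reading an AU by hand comes down to seeking to au_number * au_size on the device (or on a dd image of it) and reading one AU's worth of bytes. A minimal Python sketch, assuming a 1 MB AU size and using a hypothetical helper name `read_au`; it does not parse the alias directory itself, which is the harder part:

```python
import io


def read_au(dev, au_number, au_size=1024 * 1024):
    """Read one whole allocation unit from a seekable binary file object."""
    dev.seek(au_number * au_size)
    data = dev.read(au_size)
    if len(data) != au_size:
        raise ValueError("short read: AU %d is not fully present" % au_number)
    return data


if __name__ == "__main__":
    # Tiny in-memory stand-in for a disk image, using an 8-byte "AU" so the
    # example stays small. A real call would pass open(path, "rb").
    fake_disk = io.BytesIO(bytes(range(24)))
    print(read_au(fake_disk, 1, au_size=8))
```

Which AU to read has to come from the ASM metadata itself; the sketch only covers the final read.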
I have to say, Brother Xiong's ODU is remarkably powerful; it makes short work of all kinds of Oracle ASM database recovery cases!
