Symptom:
The customer's database (RAC, version 11.1.0.6) crashed unexpectedly, and the crash was accompanied by an ORA-07445 error:
sun jun 23 01:00:06 2013
exception [type: sigsegv, address not mapped to object] [addr:0xf] [pc:0x755773d, kcbw_get_bh()+67]
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_mman_2015.trc (incident=298938):
ora-07445: exception encountered: core dump [kcbw_get_bh()+67] [sigsegv] [addr:0xf] [pc:0x755773d] [address not mapped to object] []
incident details in: /oracle/app/11gr1/diag/rdbms/xij/xij1/incident/incdir_298938/xij1_mman_2015_i298938.trc
sun jun 23 01:00:07 2013
trace dumping is performing id=[cdmp_20130623010007]
sun jun 23 01:00:09 2013
sweep incident[298938]: completed
sun jun 23 01:00:09 2013
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_pmon_1981.trc:
ora-00822: mman process terminated with error
pmon (ospid: 1981): terminating the instance due to error 822
sun jun 23 01:00:09 2013
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
ora-00822: mman process terminated with error
sun jun 23 01:00:09 2013
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_m000_22430.trc:
ora-00822: mman process terminated with error
system state dump is made for local instance
system state dumped to trace file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_diag_1987.trc
sun jun 23 01:00:09 2013
ora-1092 : opiodr aborting process unknown ospid (11096_47524616916112)
sun jun 23 01:00:09 2013
ora-1092 : opitsk aborting process
sun jun 23 01:00:09 2013
ora-1092 : opiodr aborting process unknown ospid (6317_47353365785744)
sun jun 23 01:00:09 2013
ora-1092 : opitsk aborting process
sun jun 23 01:00:09 2013
ora-1092 : opiodr aborting process unknown ospid (28698_47056912551056)
sun jun 23 01:00:09 2013
ora-1092 : opitsk aborting process
sun jun 23 01:00:09 2013
ora-1092 : opiodr aborting process unknown ospid (18927_47567504653456)
sun jun 23 01:00:10 2013
ora-1092 : opitsk aborting process
sun jun 23 01:00:10 2013
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_q001_3487.trc:
ora-00822: mman process terminated with error
ora-1092 : opidrv aborting process q001 ospid (3487_47252506410128)
sun jun 23 01:00:11 2013
ora-1092 : opitsk aborting process
sun jun 23 01:00:11 2013
license high water mark = 510
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_m000_22430.trc:
ora-00822: mman process terminated with error
ora-00822: mman process terminated with error
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
ora-00449: background process 'lgwr' unexpectedly terminated with error 822
ora-00822: mman process terminated with error
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
ora-00449: background process 'lgwr' unexpectedly terminated with error 822
ora-00822: mman process terminated with error
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_j000_22268.trc:
ora-00604: error occurred at recursive sql level 1
ora-00822: mman process terminated with error
ora-06512: at wksys.wk_job, line 442
ora-00449: background process 'mmon' unexpectedly terminated with error 822
ora-00822: mman process terminated with error
ora-06512: at line 1
ora-1092 : opidrv aborting process j000 ospid (22268_47357930925200)
sun jun 23 01:00:20 2013
instance terminated by pmon, pid = 1981
sun jun 23 01:00:21 2013
user (ospid: 22527): terminating the instance
instance terminated by user, pid = 22527
sun jun 23 01:00:26 2013
starting oracle instance (normal)
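In the log above, the MMAN process hits ORA-07445 in kcbw_get_bh() first (trace file xij1_mman_2015.trc), and PMON then terminates the instance "due to error 822". The later ORA-00822, ORA-00449 and ORA-1092 messages are fallout from that first failure. For longer logs, a small script can pick out the first error and the line that terminated the instance. The Python sketch below is a hypothetical triage helper (`triage`, `SAMPLE` and the regular expressions are illustrative, not an Oracle tool). It assumes the alert log layout shown above, where each timestamp sits on its own line and an "errors in file" line names the trace file of the failing process.

```python
import re

# Timestamp lines look like "sun jun 23 01:00:06 2013". Real alert logs are
# capitalized ("Sun Jun 23 ..."), so every pattern is case-insensitive.
TS_RE = re.compile(r"^[a-z]{3} [a-z]{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$", re.I)
# Matches "ora-07445:" as well as "ora-1092 :"; the code is returned as an int.
ORA_RE = re.compile(r"\bora-(\d+)\b", re.I)
# Extracts the process name from a trace file name such as xij1_mman_2015.trc.
TRC_RE = re.compile(r"_([a-z][a-z0-9]*)_\d+\.trc", re.I)

# Excerpt of the alert log shown above.
SAMPLE = """\
sun jun 23 01:00:06 2013
exception [type: sigsegv, address not mapped to object] [addr:0xf] [pc:0x755773d, kcbw_get_bh()+67]
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_mman_2015.trc (incident=298938):
ora-07445: exception encountered: core dump [kcbw_get_bh()+67] [sigsegv] [addr:0xf] [pc:0x755773d] [address not mapped to object] []
sun jun 23 01:00:09 2013
errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/xij1_pmon_1981.trc:
ora-00822: mman process terminated with error
pmon (ospid: 1981): terminating the instance due to error 822
"""


def triage(lines):
    """Return (first_error, terminator) for an alert log excerpt.

    first_error: (timestamp, process, ora_code) of the first ORA- error,
                 where process comes from the preceding "errors in file" line.
    terminator:  (timestamp, line) of the first "terminating the instance" line.
    """
    ts = process = None
    first_error = terminator = None
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            ts = line
            continue
        if line.lower().startswith("errors in file"):
            m = TRC_RE.search(line)
            process = m.group(1).lower() if m else None
            continue
        m = ORA_RE.search(line)
        if m and first_error is None:
            first_error = (ts, process, int(m.group(1)))
        if "terminating the instance" in line.lower() and terminator is None:
            terminator = (ts, line)
    return first_error, terminator


if __name__ == "__main__":
    first, term = triage(SAMPLE.splitlines())
    print(first)  # ('sun jun 23 01:00:06 2013', 'mman', 7445)
    print(term)
```

Only the first hit is kept on purpose: in a crash cascade like this one, everything after the first internal error is usually a consequence rather than a cause.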
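Analysis:
ORA-07445 errors are usually caused by bugs in Oracle itself.
First, I used IPS to collect the errors reported in the alert log (for how to use IPS, see my other article, "A Simple Guide to Using IPS").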
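The IPS steps themselves are covered in the other article. As a rough illustration only, the Python sketch below collects the incident numbers that the alert log references (such as `incident=298938`) and builds one `adrci exec="..."` command per incident, using the ADRCI `SET HOMEPATH` and `IPS PACK INCIDENT ... IN ...` commands. The function name `ips_commands` and the output directory are hypothetical; check the commands against the ADRCI documentation for your release before running them.

```python
import re

# Alert log lines reference incidents as "(incident=298938)".
INCIDENT_RE = re.compile(r"\(incident=(\d+)\)", re.I)


def ips_commands(alert_lines, adr_homepath, out_dir):
    """Build one adrci command string per distinct incident in the alert log.

    adr_homepath is relative to the ADR base, e.g. "diag/rdbms/xij/xij1".
    The commands are only returned, never executed.
    """
    seen = []
    for line in alert_lines:
        for m in INCIDENT_RE.finditer(line):
            if m.group(1) not in seen:  # keep first-seen order, drop repeats
                seen.append(m.group(1))
    return [
        f'adrci exec="set homepath {adr_homepath}; ips pack incident {inc} in {out_dir}"'
        for inc in seen
    ]


if __name__ == "__main__":
    alert = [
        "errors in file /oracle/app/11gr1/diag/rdbms/xij/xij1/trace/"
        "xij1_mman_2015.trc (incident=298938):",
    ]
    # /tmp is a placeholder output directory.
    for cmd in ips_commands(alert, "diag/rdbms/xij/xij1", "/tmp"):
        print(cmd)
```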
A search on Metalink showed that the customer's problem resembles the bugs described in the following three notes:
ora-7445 (kcbw_get_bh) [id 1341402.1]
bug 9728912 [https://bug.oraclecorp.com/pls/bug/webbug_edit.edit_info_top?rptno=9728912] - pmon terminates instance due to ora-7445 [kcbw_numperchunk] / ora-7445 [kcbw_get_bh] [id 9728912.8]
instance crashed on ora-7445 kcbw_numperchunk [id 1364264.1]
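According to these notes, however, the related bugs have already been fixed in 11.1.0.6.
Next, let's look at the other serious errors recorded in the customer's database: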
node1:
adrci> show problem
adr home = /oracle/app/11gr1/diag/rdbms/xij/xij1:
*************************************************************************
problem_id  problem_key                                               last_incident  lastinc_time
----------  --------------------------------------------------------  -------------  ---------------------------------
5           ora 7445 [kcbw_get_bh()+67]                               298938         2013-06-23 01:00:06.373716 +08:00
11          ora 600                                                   276161         2013-06-04 18:12:12.709933 +08:00
10          ora 600 [729]                                             276160         2013-06-04 18:09:27.857128 +08:00
7           ora 7445 [kgghash()+367]                                  253234         2013-06-03 15:27:04.349337 +08:00
9           ora 7445 [kksmapcursor()+323]                             256538         2013-05-27 09:54:58.684956 +08:00
8           ora 7445 [qkabxo()+22]                                    251194         2013-05-01 22:03:37.715416 +08:00
2           ora 600 [kghfrh:ds]                                       238818         2013-01-28 11:35:23.755034 +08:00
6           ora 7445 [eoa_pm_push()+31]                               239218         2013-01-28 11:24:42.835685 +08:00
3           ora 7445 [ioei_get_method_counts()+39]                    71129          2012-10-17 11:17:39.735719 +08:00
4           ora 7445 [jol_calculate_transitive_interface_set()+1165]  74233          2012-10-17 11:05:51.570021 +08:00
1           ora 600 [kghfru:ds]                                       6369           2012-09-07 17:35:55.001585 +08:00
11 rows fetched
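This list is worth a second look: besides the ORA-07445 in kcbw_get_bh(), node1 has logged ORA-7445 and ORA-600 problems in several other functions since September 2012. A quick way to tally such a list is sketched below in Python. `parse_problems`, `summarize`, `NODE1` and the row pattern are illustrative and assume the four-column `show problem` layout shown above (problem_id, problem_key, last_incident, lastinc_time).

```python
import re
from collections import Counter
from datetime import datetime

# One data row of "adrci> show problem", for example:
# 5 ora 7445 [kcbw_get_bh()+67] 298938 2013-06-23 01:00:06.373716 +08:00
ROW_RE = re.compile(
    r"^(?P<pid>\d+)\s+"
    r"(?P<key>ora \d+(?: \[[^\]]*\])?)\s+"
    r"(?P<incident>\d+)\s+"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ [+-]\d{2}:\d{2})$",
    re.I,
)

# The node1 output shown above.
NODE1 = """\
problem_id problem_key last_incident lastinc_time
5 ora 7445 [kcbw_get_bh()+67] 298938 2013-06-23 01:00:06.373716 +08:00
11 ora 600 276161 2013-06-04 18:12:12.709933 +08:00
10 ora 600 [729] 276160 2013-06-04 18:09:27.857128 +08:00
7 ora 7445 [kgghash()+367] 253234 2013-06-03 15:27:04.349337 +08:00
9 ora 7445 [kksmapcursor()+323] 256538 2013-05-27 09:54:58.684956 +08:00
8 ora 7445 [qkabxo()+22] 251194 2013-05-01 22:03:37.715416 +08:00
2 ora 600 [kghfrh:ds] 238818 2013-01-28 11:35:23.755034 +08:00
6 ora 7445 [eoa_pm_push()+31] 239218 2013-01-28 11:24:42.835685 +08:00
3 ora 7445 [ioei_get_method_counts()+39] 71129 2012-10-17 11:17:39.735719 +08:00
4 ora 7445 [jol_calculate_transitive_interface_set()+1165] 74233 2012-10-17 11:05:51.570021 +08:00
1 ora 600 [kghfru:ds] 6369 2012-09-07 17:35:55.001585 +08:00
11 rows fetched
"""


def parse_problems(text):
    """Parse the data rows of "show problem"; header and footer lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            rows.append({
                "problem_id": int(m["pid"]),
                "problem_key": m["key"],
                "last_incident": int(m["incident"]),
                # %z accepts offsets written as "+08:00" on Python 3.7+.
                "lastinc_time": datetime.strptime(m["time"], "%Y-%m-%d %H:%M:%S.%f %z"),
            })
    return rows


def summarize(rows):
    """Count problems per error family, e.g. "ora 7445 [kgghash()+367]" -> "ora 7445"."""
    return Counter(" ".join(r["problem_key"].lower().split()[:2]) for r in rows)


if __name__ == "__main__":
    rows = parse_problems(NODE1)
    print(summarize(rows))  # Counter({'ora 7445': 7, 'ora 600': 4})
    for r in sorted(rows, key=lambda r: r["lastinc_time"]):
        print(r["lastinc_time"].date(), r["problem_key"])
```

Sorting by lastinc_time makes the pattern easier to see than the default ordering, which mixes older and newer problems.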
node2:
[oracle@xij02 ~]$ adrci
adrci: release 11.1.0.6.0 - beta on mon jun 24 14:59:37 2013