DB2 - Problem description
Problem IC99381 | Status: Closed
Database hang during forced shutdown or HADR takeover
Product:
DB2 FOR LUW / DB2FORLUW / A10 - DB2
Problem description:
Due to a defect in the page cleaning code path, a page cleaner might still hold page latches during a forced database shutdown. Other EDUs waiting for these page latches are not interrupted properly. As a result, the database hangs and cannot shut down properly.

The problem might occur in the following scenarios:
1. An error causes the database to be marked bad, which results in a forced database shutdown.
2. An HADR takeover by force is performed, and the primary hangs as a result.

If the problem occurs during a forced HADR takeover, the standby is still able to take over properly, but the primary hangs. The primary cannot then switch to the standby role or perform any further takeover.

A sample call stack of an EDU waiting for a page latch (actual stacks may vary; the important frame is "sqlbVerifyAndLatchPage"):

    SQLO_SLATCH_CAS64::getConflictComplex
    SQLO_SLATCH_CAS64::getConflict
    sqlo_latch_ns::get
    sqloSXULatch::get
    sqloSXUltch_notrack
    sqloSXUltch_track_page
    sqlbGetAndMonitorPageLatch
    sqlbVerifyAndLatchPage
    sqlbFindPageInBPOrSim
    sqlbfix
    sqlbFixPage
    sqlifix
    sqliaddk
    sqldUpdateIndexes
    sqldRowUpdate
    sqlriupd

The following excerpt from db2diag.log shows the database being forced during an HADR takeover and a page cleaner (db2pclnr) being terminated forcibly. If the database starts to hang after similar messages appear and there are EDUs waiting for page latches, the problem has been reproduced.

    2013-11-26-04.24.48.924238-300 I143262171A574 LEVEL: Severe
    PID     : 16187616   TID : 49367   KTID : 74383395
    PROC    : db2sysc 0
    INSTANCE: db2inst1   NODE : 000   DB : SAMPLE
    APPHDL  : 0-8575     APPID: *N0.DB2.131126101026
    HOSTNAME: myhostname
    EDUID   : 49367      EDUNAME: db2agent (SAMPLE) 0
    FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrPoisonLocalMember, probe:41180
    DATA #1 : <preformatted>
    HADR marking logs bad; database should shut down to avoid split brain; standby is taking over.
    ...
    2013-11-26-04.24.49.194218-300 E143268528A535 LEVEL: Error
    PID     : 16187616   TID : 42427   KTID : 68288521
    PROC    : db2sysc 0
    INSTANCE: db2inst1   NODE : 000   DB : SAMPLE
    HOSTNAME: myhostname
    EDUID   : 42427      EDUNAME: db2pclnr (SAMPLE) 0
    FUNCTION: DB2 UDB, data protection services, sqlpflog, probe:480
    MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM "The waitpost area has been terminated"
    ...
    2013-11-26-04.24.49.195988-300 E143269064A686 LEVEL: Severe
    PID     : 16187616   TID : 42427   KTID : 68288521
    PROC    : db2sysc 0
    INSTANCE: db2inst1   NODE : 000   DB : SAMPLE
    HOSTNAME: myhostname
    EDUID   : 42427      EDUNAME: db2pclnr (SAMPLE) 0
    FUNCTION: DB2 UDB, buffer pool services, sqlbgbWAR, probe:5933
    MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM "The waitpost area has been terminated"
    ...
    2013-11-26-04.24.49.437225-300 I143296129A1261 LEVEL: Info
    PID     : 16187616   TID : 47824   KTID : 71827519
    PROC    : db2sysc 0
    INSTANCE: db2inst1   NODE : 000   DB : SAMPLE
    APPHDL  : 0-8577     APPID: *N0.DB2.131126101028
    HOSTNAME: myhostname
    EDUID   : 47824      EDUNAME: db2agent (SAMPLE ) 0
    FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::ForceDBShutdown, probe:15056
    MESSAGE : Regular agent EDU doing ForceDBShutdown. Force DB shutdown agent ID
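As a rough triage aid, the symptom check described above (force or takeover messages in db2diag.log, plus EDUs whose stacks contain sqlbVerifyAndLatchPage) can be sketched in Python. This is a hypothetical helper, not an IBM tool: the names Findings, split_records, scan_diag_log, scan_stacks and looks_like_ic99381 are illustrative, it assumes the db2diag.log text and the EDU call stacks have already been collected as plain text (however they were obtained), and it only matches the function names and ZRC string shown in the excerpts. Text matching cannot detect the hang itself, which must still be confirmed separately.

```python
import re
from dataclasses import dataclass, field

# A db2diag.log record starts with a timestamp such as
# "2013-11-26-04.24.48.924238-300" at the beginning of a line.
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d+\s", re.M
)


@dataclass
class Findings:
    hadr_poison: bool = False       # hdrPoisonLocalMember record seen
    force_shutdown: bool = False    # ForceDBShutdown record seen
    pclnr_terminated: bool = False  # a db2pclnr record reported SQLO_WP_TERM
    latch_waiters: list[str] = field(default_factory=list)  # EDUs in sqlbVerifyAndLatchPage

    def looks_like_ic99381(self) -> bool:
        """True when a forced shutdown/takeover coincides with page-latch waiters.

        Hypothetical heuristic: the database hang itself is not checked here.
        """
        forced = self.hadr_poison or self.force_shutdown
        return forced and bool(self.latch_waiters)


def split_records(log_text: str) -> list[str]:
    """Split raw db2diag.log text into records at each timestamp header."""
    starts = [m.start() for m in RECORD_START.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[a:b] for a, b in zip(starts, ends)]


def scan_diag_log(log_text: str, findings: Findings) -> None:
    """Record which of the signature messages appear in the log."""
    for rec in split_records(log_text):
        if "hdrPoisonLocalMember" in rec:
            findings.hadr_poison = True
        if "ForceDBShutdown" in rec:
            findings.force_shutdown = True
        if "db2pclnr" in rec and "SQLO_WP_TERM" in rec:
            findings.pclnr_terminated = True


def scan_stacks(stacks: dict[str, str], findings: Findings) -> None:
    """stacks maps an EDU ID to the text of that EDU's call stack."""
    for edu_id, stack in stacks.items():
        if "sqlbVerifyAndLatchPage" in stack:
            findings.latch_waiters.append(edu_id)


if __name__ == "__main__":
    sample_log = (
        "2013-11-26-04.24.48.924238-300 I143262171A574 LEVEL: Severe\n"
        "FUNCTION: DB2 UDB, High Availability Disaster Recovery, "
        "hdrPoisonLocalMember, probe:41180\n"
        "2013-11-26-04.24.49.194218-300 E143268528A535 LEVEL: Error\n"
        "EDUNAME: db2pclnr (SAMPLE) 0\n"
        'MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM "The waitpost area has been terminated"\n'
    )
    # "12345" is a hypothetical EDU ID used only for this example.
    sample_stacks = {
        "12345": "sqlbGetAndMonitorPageLatch\nsqlbVerifyAndLatchPage\nsqlbFindPageInBPOrSim\n"
    }
    f = Findings()
    scan_diag_log(sample_log, f)
    scan_stacks(sample_stacks, f)
    print(f.looks_like_ic99381())  # True
```

Plain substring checks are used deliberately: the stack sample in this report notes that actual stacks vary, so the sketch keys only on the one frame the report calls out as important rather than on a full stack signature.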
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* See Error Description                                        *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to DB2 10.1 for Linux, UNIX, and Windows Fix Pack 4  *
****************************************************************
Local Fix:
Kill and restart the hanging database.
Available fix packs:
DB2 Version 10.1 Fix Pack 4 for Linux, UNIX, and Windows
Solution | |
Problem first fixed in DB2 10.1 for Linux, UNIX, and Windows Fix Pack 4 | |
Workaround | |
Not known; see Local Fix.
Timestamps | |
Date - problem reported: 14.02.2014
Date - problem closed: 16.06.2014
Date - last modified: 16.06.2014
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)
10.1.0.4