DB2 - Problem description
Problem IC66800 | Status: Closed |
DB2STOP TAKES A LONG TIME ON HADR SYSTEM IF STANDBY IS OFFLINE AND DATABASE NOT ACTIVATED | |
product: | |
DB2 FOR LUW / DB2FORLUW / 910 - DB2 | |
Problem description: | |
In an HADR configuration where the standby is offline and the database on the primary is not activated, the first connection to the database is the one that activates it. If db2 connect to <database name> or db2 start hadr on database <database name> as primary is run without the "by force" option, the connection tries to start HADR and connect to the standby. It eventually times out after the HADR_TIMEOUT interval and fails with SQL1768N Unable to start HADR. Reason code = "7". Diagnostic log entries such as the following are written:

2010-02-03-04.29.45.699578+000 I2838967A388       LEVEL: Warning
PID     : 606706               TID  : 1           PROC : db2agent (FRH) 0
INSTANCE: db2frh               NODE : 000
APPHDL  : 0-171                APPID: *LOCAL.db2frh.100203042357
AUTHID  : DB2FRH
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21151
MESSAGE : Info: HADR Startup has begun.

2010-02-03-04.30.16.734635+000 I2842396A552       LEVEL: Error
PID     : 999878               TID  : 1           PROC : db2hadrp (FRH) 0
INSTANCE: db2frh               NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20390
MESSAGE : HADR primary did not establish connection with standby within timeout
          and will shut down. BY FORCE option required to start primary without
          standby. Timeout seconds =
DATA #1 : Hexdump, 4 bytes
0x07800001D52CD008 : 0000 001E

2010-01-14-06.35.07.374445+000 I5172086A471       LEVEL: Error
PID     : 1077874              TID  : 1           PROC : db2agent (AB7) 0
INSTANCE: db2ab7               NODE : 000
APPHDL  : 0-8                  APPID: *LOCAL.db2ab7.100114063322
AUTHID  : DB2AB7
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21300
MESSAGE : HADR EDU sqlcode:
DATA #1 : Hexdump, 4 bytes
0x000000011121526C : FFFF F918                                  ....

2010-01-14-06.35.07.374514+000 I5172558A419       LEVEL: Severe
PID     : 1077874              TID  : 1           PROC : db2agent (AB7) 0
INSTANCE: db2ab7               NODE : 000
APPHDL  : 0-8                  APPID: *LOCAL.db2ab7.100114063322
AUTHID  : DB2AB7
FUNCTION: DB2 UDB, base sys utilities, sqledint, probe:230
DATA #1 : Hexdump, 4 bytes
0x000000011121526C : FFFF F918                                  ....
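The hexdump values in the entries above can be read as 4-byte integers. The following is a minimal sketch, assuming the words are stored big-endian (an assumption made here from the byte order shown, not something the APAR states). The helper name decode_hexdump_word is hypothetical and introduced only for illustration.

```python
import struct


def decode_hexdump_word(dump: str, signed: bool = True) -> int:
    """Decode a 4-byte hexdump such as '0000 001E' into an integer.

    Assumption: the bytes are big-endian, as they appear in the log.
    """
    raw = bytes.fromhex(dump.replace(" ", ""))
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes, got %d" % len(raw))
    # '>i' is a big-endian signed 32-bit integer, '>I' the unsigned form.
    return struct.unpack(">i" if signed else ">I", raw)[0]


# Timeout value from the db2hadrp entry (probe:20390).
print(decode_hexdump_word("0000 001E"))
# sqlcode value from the hdrEduStartup entry (probe:21300).
print(decode_hexdump_word("FFFF F918"))
```

Under that assumption, 0000 001E decodes to a timeout of 30 seconds, and FFFF F918 decodes to -1768, which corresponds to the SQL1768N error reported to the application.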
If many connection attempts are issued, they are all serialized until the database is activated:

1. The currently active connection that is trying to start HADR holds the database latch. The application is waiting to reach the HADR timeout.
2. All other connections that are trying to start HADR are queued behind the database latch in a serialized fashion.

If db2stop force is run in this scenario, it might take a long time, depending on how many connections have been queued to activate the database (they will all fail with the HADR timeout error SQL1768N).

When "db2stop force" starts, it detects the number of applications that need to be forced:

FUNCTION: DB2 UDB, base sys utilities, sqeAppServices::ExecuteStopForce, probe:1000
DATA #1 : String, 47 bytes
[Force]->Number of applications to be forced :
DATA #2 : Hexdump, 4 bytes
0x0FFFFFFFFFFFD698 : 0000 0004                                  ....

It then waits until all queued applications respond, and only then is the database actually stopped. This might take a long time and could be perceived as db2stop force being hung.
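The serialization described above suggests a rough way to estimate how long db2stop force may appear to hang. The sketch below is a simplified model and not DB2's actual algorithm: it assumes each queued connection takes the database latch in turn and waits the full HADR_TIMEOUT before failing. The function name and parameters are hypothetical.

```python
def estimate_stop_delay_seconds(queued_connections: int,
                                hadr_timeout_seconds: int) -> int:
    """Rough upper bound on the wait before db2stop force completes.

    Assumption: connections are strictly serialized and each one waits
    the full HADR_TIMEOUT before failing with SQL1768N.
    """
    if queued_connections < 0 or hadr_timeout_seconds < 0:
        raise ValueError("both arguments must be non-negative")
    return queued_connections * hadr_timeout_seconds


# Illustration only: 4 applications to force, 30-second timeout,
# the values that appear in the sample log entries.
print(estimate_stop_delay_seconds(4, 30))
```

With the sample values this gives an estimate of about two minutes. The real delay may be shorter, for example if the active connection has already spent part of its timeout waiting.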
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* ALL                                                          *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* DB2STOP TAKES A LONG TIME ON HADR SYSTEM IF STANDBY IS       *
* OFFLINE AND DATABASE NOT ACTIVATED                           *
* Details are identical to "Problem description" above.        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to DB2 Version 9.1 Fixpack 10                        *
****************************************************************
Local Fix: | |
When the standby is offline, issue db2 start hadr on database <database name> as primary by force to activate the database on the primary. This avoids the timeout waits, so a db2stop force, if needed, responds more quickly. | |
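The local fix can be scripted. The following is a minimal sketch that only builds the command line and does not run it. The helper name build_start_hadr_primary and the database alias SAMPLE are hypothetical placeholders, and the argument order follows the START HADR command as used in this APAR.

```python
import shlex
from typing import List


def build_start_hadr_primary(db_alias: str, by_force: bool = True) -> List[str]:
    """Build the argument list for starting HADR on the primary.

    With by_force=True the primary is started without waiting for the
    standby, which is the local fix described in this APAR.
    """
    args = ["db2", "start", "hadr", "on", "database", db_alias,
            "as", "primary"]
    if by_force:
        args += ["by", "force"]
    return args


# 'SAMPLE' is a placeholder alias, not taken from the APAR.
cmd = build_start_hadr_primary("SAMPLE")
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run from a session where the DB2 instance environment is set up. That step is omitted here because it depends on the local installation.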
available fix packs: | |
DB2 Version 9.1 Fix Pack 10 for Linux, UNIX and Windows | |
Solution | |
Problem was first fixed in DB2 Version 9.1 Fixpack 10 | |
Workaround | |
not known / see Local fix | |
BUG-Tracking | |
forerunner : APAR is sysrouted TO one or more of the following: IC67509 IC67511 IC67514 IC67515 IC68048 follow-up : | |
Timestamps | |
Date - problem reported : 02.03.2010
Date - problem closed : 04.05.2010
Date - last modified : 08.03.2012
Problem solved at the following versions (IBM BugInfos) | |
9.1., 9.1.FP10 | |
Problem solved according to the fixlist(s) of the following version(s) | |
9.1.0.10 |