DB2 - Problem description
Problem IC75196 | Status: Closed |
A CONTROLLED TAKEOVER IN A TSA / HADR ENV MAY BE FOLLOWED BY AN IMMEDIATE FAILBACK IF THERE IS LATENCY IN CHANGES TO RESOURCES | |
product: | |
DB2 FOR LUW / DB2FORLUW / 970 - DB2 | |
Problem description: | |
In a TSA / HADR environment, the state of the database is monitored by scripts located under /usr/sbin/rsct/sapolicies/db2/. During a user-initiated HADR takeover, DB2 issues requests to TSA to create, lock, and unlock resources. If those operations are not propagated quickly enough, there is a chance the monitor scripts will report the database as down on both servers at a moment when no locks or flags are in place. If that happens, TSA will issue a second takeover, this time by force. A manual reintegration may be required.

Note: this issue only affects controlled takeovers and is expected to happen only in rare cases.

Node 1 syslogs:

## The database is reporting as online (return code 1):
Jan 25 12:23:42 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3735742]: Returning 1 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:04 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3735792]: Returning 1 : db2inst1 db2inst1 SAMPLE

## The takeover was issued several seconds ago but has not finished when the monitor script runs and reports offline (return code 2):
Jan 25 12:24:26 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[983346]: Returning 2 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:48 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2752700]: Returning 2 : db2inst1 db2inst1 SAMPLE
Jan 25 12:25:10 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3866888]: Returning 2 : db2inst1 db2inst1 SAMPLE
Jan 25 12:25:32 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2228742]: Returning 2 : db2inst1 db2inst1 SAMPLE

Node 2 syslogs:

## This was the standby, so reporting offline is expected:
Jan 25 12:23:45 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2294478]: Returning 2 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:07 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[1508116]: Returning 2 : db2inst1 db2inst1 SAMPLE

## The takeover starts and resource changes are made:
Jan 25 12:24:07 server-b user:debug root[2687334]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock
Jan 25 12:24:13 server-b user:debug root[2294526]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
Jan 25 12:24:23 server-b user:debug root[1769768]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock
Jan 25 12:24:23 server-b user:debug root[3211824]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
Jan 25 12:24:26 server-b user:debug root[2687360]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock

## The monitor on node 1 ran at this point and returned offline. The last status for node 2 is also offline, and the resources are still being modified:
Jan 25 12:24:29 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2294596]: Returning 2 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:32 server-b user:debug root[3473642]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
Jan 25 12:24:32 server-b user:debug root[3146076]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock
Jan 25 12:24:32 server-b user:notice /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[3211836]: Entering : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:32 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[2949514]: su - db2inst1 -c db2gcf -t 3600 -u -i db2inst1 -i db2inst1 -h SAMPLE -L
Jan 25 12:24:32 server-b user:debug root[2163396]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
Jan 25 12:24:33 server-b user:notice /usr/sbin/rsct/sapolicies/db2/hadrV95_start.ksh[2163400]: Returning 0 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:33 server-b user:debug root[3211870]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock
Jan 25 12:24:34 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2687424]: Returning 1 : db2inst1 db2inst1 SAMPLE
Jan 25 12:24:39 server-b user:debug root[2687428]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg lock: 1
Jan 25 12:24:40 server-b user:debug root[3211884]: Entering /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock
Jan 25 12:24:40 server-b user:debug root[2753468]: Exiting /usr/sbin/rsct/sapolicies/db2/lockreqprocessed db2_db2inst1_db2inst1_SAMPLE-rg unlock: 0
Jan 25 12:24:56 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[6553680]: Returning 1 : db2inst1 db2inst1 SAMPLE | |
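To check whether a system went through the same window, the monitor entries from both nodes can be merged and scanned for moments when the most recent return code on every node was 2 (offline). The following Python sketch illustrates one way to do that; it is not part of the TSA policy scripts. The function names and the `year` default are hypothetical (syslog lines carry no year), and the regular expression only assumes the line layout shown in the excerpts above.

```python
import re
from datetime import datetime

# Matches hadrV95_monitor.ksh entries in the layout seen in the syslog excerpts.
MONITOR_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) \S+ "
    r"\S*hadrV95_monitor\.ksh\[\d+\]: Returning (\d+) :"
)

def parse_monitor_lines(lines, year=2011):
    """Return sorted (timestamp, host, return_code) tuples.

    The year is an assumption supplied by the caller; lines that are not
    monitor entries are skipped.
    """
    events = []
    for line in lines:
        m = MONITOR_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), int(m.group(3))))
    return sorted(events)

def both_offline_times(events, offline_rc=2):
    """Timestamps at which the latest monitor result on every host is offline."""
    hosts = {host for _, host, _ in events}
    last = {}
    hits = []
    for ts, host, rc in events:
        last[host] = rc
        if len(last) == len(hosts) and all(v == offline_rc for v in last.values()):
            hits.append(ts)
    return hits

# A few monitor lines copied from the node 1 and node 2 excerpts.
SAMPLE_LINES = [
    "Jan 25 12:24:04 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[3735792]: Returning 1 : db2inst1 db2inst1 SAMPLE",
    "Jan 25 12:24:07 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[1508116]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    "Jan 25 12:24:26 server-a user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[983346]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    "Jan 25 12:24:29 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2294596]: Returning 2 : db2inst1 db2inst1 SAMPLE",
    "Jan 25 12:24:34 server-b user:debug /usr/sbin/rsct/sapolicies/db2/hadrV95_monitor.ksh[2687424]: Returning 1 : db2inst1 db2inst1 SAMPLE",
]

if __name__ == "__main__":
    for ts in both_offline_times(parse_monitor_lines(SAMPLE_LINES)):
        print("both nodes last reported offline at", ts.strftime("%H:%M:%S"))
```

For the sample lines, the sketch flags 12:24:26 and 12:24:29, which is the interval in which the lockreqprocessed lock request on node 2 was still in progress.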
Problem Summary: | |
****************************************************************
* USERS AFFECTED:
* Users of TSAMP HA solutions.
****************************************************************
* PROBLEM DESCRIPTION:
* See Problem Description above.
****************************************************************
* RECOMMENDATION:
* Upgrade to DB2 Version 9.7 Fix Pack 5.
**************************************************************** | |
Local Fix: | |
Verify the current OpState of the TSA resources and the actual DB2 HADR roles. If manual reintegration is required, issue "db2 start hadr on <dbname> as standby". If the issue is readily reproducible, there may be an underlying latency problem. Resolve the latency problem to reduce the chance that the monitor scripts run in the middle of resource management. | |
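The local fix can be written down as a short command checklist. The Python sketch below only builds the command lines and does not execute anything. Only the "db2 start hadr ... as standby" command comes from this record; using `lssam` to display the TSA resource OpState and `db2pd -db <dbname> -hadr` to display the HADR role is an assumption about how the verification would typically be done, and the function names and the alphanumeric check are hypothetical.

```python
import shlex

def _check_dbname(dbname):
    # Conservative guard for this sketch only, not a DB2 naming rule.
    if not dbname.isalnum():
        raise ValueError(f"unexpected database name: {dbname!r}")
    return dbname

def verification_commands(dbname):
    """Commands assumed suitable for checking OpState and HADR roles."""
    dbname = _check_dbname(dbname)
    return [
        ["lssam"],                          # assumption: shows TSA resource OpState
        ["db2pd", "-db", dbname, "-hadr"],  # assumption: shows the HADR role per node
    ]

def reintegration_command(dbname):
    """Command from the local fix, to be issued only if reintegration is required."""
    dbname = _check_dbname(dbname)
    return ["db2", "start", "hadr", "on", dbname, "as", "standby"]

def render(commands):
    """Turn argument lists into shell-quoted strings for a runbook."""
    return [shlex.join(cmd) for cmd in commands]

if __name__ == "__main__":
    for line in render(verification_commands("SAMPLE")):
        print(line)
    print(shlex.join(reintegration_command("SAMPLE")))
```

Keeping verification and reintegration separate reflects the order given in the local fix: the roles are checked first, and the standby is restarted only when the check shows that reintegration is needed.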
available fix packs: | |
DB2 Version 9.7 Fix Pack 5 for Linux, UNIX, and Windows | |
Solution | |
First Fixed in Version 9.7 Fix Pack 5. | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : | 23.03.2011 |
Date - problem closed : | 23.12.2011 |
Date - last modified : | 23.12.2011 |
Problem solved at the following versions (IBM BugInfos) | |
9.7.FP5 | |
Problem solved according to the fixlist(s) of the following version(s) | |
9.7.0.5 |