DB2 - Problem description
Problem IC90996 | Status: Closed |
SQL0952N : INCORRECT TIMEOUT VALUE OF -1 LEADS TO NODE FAILURES AND INTERMITTENT "LOG STATE MARKED BAD" ERRORS | |
product: | |
DB2 FOR LUW / DB2FORLUW / A10 - DB2 | |
Problem description: | |
- This problem happens intermittently in DPF (multi-partition) environments.
- You will notice INTERRUPTS (SQLCODE -952) on non-catalog nodes and ROLLBACKs (SQLCODE -1229) on the catalog node, accompanied by the following db2diag.log messages.

On non-catalog nodes:

    2013-02-27-19.42.XXX XXXX LEVEL: Error
    PID     : 23330818   TID : 140509   PROC : db2sysc 22
    INSTANCE: db2inst1   NODE : 015   DB : SAMPLE
    APPHDL  : 0-22   APPID: xxx.xxx.xxx.xxx.xxxxx.xxxxxxxx
    AUTHID  : user   HOSTNAME: AAAAAA
    EDUID   : 140509   EDUNAME: db2agntp (SAMPLE) 15
    FUNCTION: DB2 UDB, data protection services, SQLP_DBCB::setLogState, probe:5005
    DATA #1 : <preformatted>
    Error detected during initialization. As a result, for precautionary reasons the database log state has been marked bad.

    2013-02-27-19.42.XXX XXXX LEVEL: Severe
    PID     : 23330818   TID : 140509   PROC : db2sysc 22
    INSTANCE: db2inst1   NODE : 015   DB : SAMPLE
    APPHDL  : 0-22   APPID: xxx.xxx.xxx.xxx.xxxxx.xxxxxxxx
    AUTHID  : user   HOSTNAME: AAAAAA
    EDUID   : 140509   EDUNAME: db2agntp (SAMPLE) 15
    FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::FirstConnect, probe:8721
    DATA #1 : SQLCA, PD_DB2_TYPE_SQLCA, 136 bytes
    sqlcaid : SQLCA   sqlcabc: 136   sqlcode: -952   sqlerrml: 0
    sqlerrmc:
    sqlerrp : SQLEDINT
    sqlerrd : (1) 0x00000000 (2) 0x00000000 (3) 0x00000000 (4) 0x00000000 (5) 0x00000000 (6) 0x00000016
    sqlwarn : (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
    sqlstate:

- The first trigger of the problem can be found in db2diag.log, where the catalog node detects an FCM connection failure due to a timeout while trying to communicate with the non-catalog node:

    2013-02-27-19.42.XXX XXXX LEVEL: Error
    PID     : 23330818   TID : 140509   PROC : db2sysc 22
    INSTANCE: db2inst1   NODE : 0   DB : SAMPLE
    APPHDL  : 0-22   APPID: xxx.xxx.xxx.xxx.xxxxx.xxxxxxxx
    AUTHID  : user   HOSTNAME: AAAAAA
    EDUID   : 1800   EDUNAME: db2fcms 0
    FUNCTION: DB2 UDB, fast comm manager, sqkfNetworkServices::detectNodeFailure, probe:15
    DATA #1 : <preformatted>
    Detected failure for node 15 - time elapsed: 4294967295; max timeout: 500; link state: 4

The max timeout is 500 by default (the default values of 10 seconds for CONN_ELAPSE and 5 for MAX_CONNRETRIES convert to 500 seconds). So in the above example, node 0 reports that it could not reach node 15 for more than 500 seconds. However, the time elapsed is reported as 4294967295, which is 0xFFFFFFFF in hexadecimal, that is, -1 when interpreted as a signed 32-bit integer. This incorrect value is the trigger of the FCM failures, which result in INTERRUPTS on non-catalog nodes, -1229 errors on the catalog node, and the log state being marked bad. In this way the node becomes unreachable because of a timing problem in DB2.
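The conversion described above can be checked with a short script. This is only a minimal sketch for scanning db2diag.log text, not an IBM tool: the helper names `as_signed32` and `parse_failure` and the "suspect" flag are hypothetical, and the regular expression assumes the message text has exactly the form quoted in this APAR.

```python
import re

# Assumed message layout, copied from the detectNodeFailure entry quoted in this APAR.
FAILURE_RE = re.compile(
    r"Detected failure for node (\d+) - time elapsed: (\d+); "
    r"max timeout: (\d+); link state: (\d+)"
)


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def parse_failure(message):
    """Extract the fields of a detectNodeFailure message, or return None."""
    match = FAILURE_RE.search(message)
    if match is None:
        return None
    node, elapsed, max_timeout, link_state = (int(g) for g in match.groups())
    signed = as_signed32(elapsed)
    return {
        "node": node,
        "elapsed": elapsed,
        "elapsed_signed": signed,
        "max_timeout": max_timeout,
        "link_state": link_state,
        # Hypothetical heuristic: a negative signed value suggests the
        # elapsed time is bogus rather than a genuine timeout.
        "suspect": signed < 0,
    }


if __name__ == "__main__":
    sample = ("Detected failure for node 15 - time elapsed: 4294967295; "
              "max timeout: 500; link state: 4")
    print(hex(4294967295))
    print(parse_failure(sample))
```

Run against the entry quoted above, the elapsed value 4294967295 reinterprets to -1, so the entry would be flagged as suspect instead of being read as a real timeout longer than the configured maximum.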
Problem Summary: | |
USERS AFFECTED: ALL
PROBLEM DESCRIPTION: See Problem Description above.
RECOMMENDATION: Upgrade to DB2 Version 10.1 Fix Pack 3.
Local Fix: | |
N/A. | |
available fix packs: | |
DB2 Version 10.1 Fix Pack 3 for Linux, UNIX, and Windows | |
Solution | |
First fixed in DB2 Version 10.1 Fix Pack 3. | |
Workaround | |
not known / see Local fix | |
BUG-Tracking | |
forerunner : APAR is sysrouted TO one or more of the following: IC95228 follow-up : | |
Timestamps | |
Date - problem reported : 20.03.2013
Date - problem closed : 19.11.2013
Date - last modified : 19.11.2013
Problem solved at the following versions (IBM BugInfos): 10.1.0.3
Problem solved according to the fixlist(s) of the following version(s): 10.1.0.3