DB2 - Problem description
Problem IT05398 | Status: Closed |
RSCT WILL NOT NOTIFY DB2 THAT THE PORT IS DOWN WHEN WE MOVE THAT PORT TO ANOTHER DIFFERENT VLAN | |
product: | |
DB2 FOR LUW / DB2FORLUW / A50 - DB2 | |
Problem description: | |
Scenario: All ports of the servers belong to the same VLAN (e.g. VLAN10). If the RoCE0 port of one member (e.g. member0) is moved to a different VLAN (e.g. VLAN11), then after about 5 minutes database connections hang on the remaining members (e.g. member1 and member2), while member0 works normally.

EDUs on member1 and member2 are waiting for lock 000E0000000000000000000076 SQLP_VALLOCK, which is held by member0. This causes the hang on member1 and member2.

In the db2diag.log file for member0, db2CFConnPoolMgr 0 repeatedly logs sqleCaCeConnect, probe:720 and sqleSingleCaCreateNewConnec, probe:2135 while connecting to the PRIMARY CF from device hba0. The entries report that PsConnect failed and that the port state was detected by RSCT to be online, but an error was encountered establishing a uDAPL connection:

2014-10-17-10.20.53.827735+480 I422919A2148 LEVEL: Severe
PID : 16580738 TID : 24461 PROC : db2sysc 0
INSTANCE: instance NODE : 000
HOSTNAME: host
EDUID : 24461 EDUNAME: db2CFConnPoolMgr 0
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:720
MESSAGE : CA RC= 2148073473
DATA #1 : String, 17 bytes
PsConnect failed.
DATA #2 : PsToken_t, PD_TYPE_SD_PSTOKEN, 152 bytes
Eye Catcher = CATOKEN
CF Server Info :
- Unique Sequence Number = 187 (0xbb)
- Port Number = 56001
- Node Identifier = 1
- Instance Identifier = 0
- Netname = netname-ib0
Local Member Info :
- uDAPL Device = ib0
Transport Type = UDAPL (0x1)
Cmd Connection Use Types = NORMAL (0x0)
DATA #3 : SAL CF Server Name, PD_TYPE_SAL_CF_SERVER_NAME, 13 bytes
host
DATA #4 : SAL Member Device Name, PD_TYPE_SAL_MEMBER_DEVICE_NAME, 4 bytes
ib0
DATA #5 : CF Retry Position, PD_TYPE_SAL_RETRY_COUNTER, 8 bytes
10
DATA #6 : unsigned integer, 8 bytes
1
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x09000000063B9D84 sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLEFCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1 + 0x42C
[1] 0x09000000063B9E04 sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLEFCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1 + 0x4AC
[2] 0x0900000006339C7C sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLEFCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1 + 0xB70
[3] 0x090000000502CB64 sqleSingleCaGrowPool__21SQLE_SINGLE_CA_HANDLEFCUlT1C17SAL_ADAPTER_INDEX + 0x6CC
[4] 0x0900000007AD9654 sqleCFConnPoolMgrEntry__FPUcUi + 0x5C8
[5] 0x0900000007ACEC90 sqleCFConnPoolMgrEntry__FPUcUi + 0x1B4
[6] 0x0900000007ACE678 sqleCFConnPoolMgrEntry__FPUcUi + 0x110
[7] 0x090000000644F9F0 sqloEDUEntry + 0x4B8
[8] 0x0900000000782E10 _pthread_body + 0xF0
[9] 0xFFFFFFFFFFFFFFFC ?unknown + 0xFFFFFFFF

2014-10-17-10.20.53.830449+480 I425068A1808 LEVEL: Warning
PID : 16580738 TID : 24461 PROC : db2sysc 0
INSTANCE: instance NODE : 000
HOSTNAME: host
EDUID : 24461 EDUNAME: db2CFConnPoolMgr 0
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, SQLE_SINGLE_CA_HANDLE::sqleSingleCaCreateNewConnec, probe:2135
MESSAGE : Port state detected by RSCT to be online, but encountered error establishing a uDAPL connection. Netname, m_whichCa, numOfflineAdapters, numConsecutiveFailures, CF node num, numConnections, bInitialConnections
DATA #1 : SAL CF Server Name, PD_TYPE_SAL_CF_SERVER_NAME, 13 bytes
host
DATA #2 : SAL Member Device Name, PD_TYPE_SAL_MEMBER_DEVICE_NAME, 4 bytes
ib0
DATA #3 : SAL CF Index, PD_TYPE_SAL_CF_INDEX, 8 bytes
2
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 8 bytes
0
DATA #6 : SAL CF Node Number, PD_TYPE_SAL_CF_NODE_NUM, 2 bytes
129
DATA #7 : unsigned integer, 8 bytes
1
DATA #8 : Boolean, 8 bytes
false
DATA #9 : Codepath, 8 bytes
6:14:16
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x090000000633AAA8 sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLEFCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1 + 0x199C
[1] 0x090000000502CB64 sqleSingleCaGrowPool__21SQLE_SINGLE_CA_HANDLEFCUlT1C17SAL_ADAPTER_INDEX + 0x6CC
[2] 0x0900000007AD9654 sqleCFConnPoolMgrEntry__FPUcUi + 0x5C8
[3] 0x0900000007ACEC90 sqleCFConnPoolMgrEntry__FPUcUi + 0x1B4
[4] 0x0900000007ACE678 sqleCFConnPoolMgrEntry__FPUcUi + 0x110
[5] 0x090000000644F9F0 sqloEDUEntry + 0x4B8
[6] 0x0900000000782E10 _pthread_body + 0xF0
[7] 0xFFFFFFFFFFFFFFFC ?unknown + 0xFFFFFFFF

Indeed, the ibstat output shows the port state as "UP":

----------------------------------------------------------------
ETHERNET PORT 1 INFORMATION (roce0)
----------------------------------------------------------------
Link State: UP
Link Speed: 10G XFI
Link MTU: 9600
Hardware Address: f4:52:14:cf:4a:da
GIDS (up to 3 GIDs):
GID0 :00:00:00:00:00:00:00:00:00:00:f4:52:14:cf:4a:da
GID1 :00:00:00:00:00:00:00:00:00:00:ff:ff:0a:de:01:65
GID2 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00

All EDUs kept trying to reconnect to the CF using hba0 and did not try hba1.

DB2 relies on RSCT to detect network adapter status, so if the link state of the port is UP, RSCT considers the port up and notifies DB2 that the port is "UP". In this case, however, because of the VLAN isolation the port should be reported as INACTIVE, and the expected behavior is for all EDUs to use hba1 to reconnect to the CF. This scenario was not covered in lab testing and was not considered in the original design, which led to the current problem. | |
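The expected behavior described above (stop retrying an adapter whose link is reported UP but whose connections keep failing, and fall back to the other adapter) can be illustrated with a minimal Python sketch. This is not DB2's implementation and not the fix shipped in 10.5.0.7, which the report does not describe. The `Adapter` class, the `MAX_CONSECUTIVE_FAILURES` threshold, and the `try_connect` callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Adapter:
    name: str                      # e.g. "hba0" (taken from the report)
    link_up: bool                  # what a link-level monitor would report
    consecutive_failures: int = 0  # failed connection attempts in a row


# Hypothetical threshold: after this many failed attempts in a row the
# adapter is treated as unusable even though its link state is UP.
MAX_CONSECUTIVE_FAILURES = 3


def is_usable(adapter):
    """Link state alone is not enough; connection failures count too."""
    return adapter.link_up and adapter.consecutive_failures < MAX_CONSECUTIVE_FAILURES


def connect_with_failover(adapters, try_connect):
    """Return the name of the first adapter that connects, or None.

    try_connect is a caller-supplied function (hypothetical) that takes an
    adapter name and returns True when a connection could be established.
    """
    for adapter in adapters:
        while is_usable(adapter):
            if try_connect(adapter.name):
                adapter.consecutive_failures = 0
                return adapter.name
            adapter.consecutive_failures += 1
    return None
```

With `hba0` reported UP but isolated in another VLAN and `hba1` reachable, calling `connect_with_failover(adapters, lambda name: name == "hba1")` gives up on `hba0` after three failed attempts and returns `"hba1"`, instead of retrying `hba0` indefinitely as in the reported hang.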
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* Members hang                                                 *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to V10.5fp7                                          *
**************************************************************** | |
Local Fix: | |
Solution | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : | 06.11.2014 |
Date - problem closed : | 07.01.2016 |
Date - last modified : | 07.01.2016 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) | |
10.5.0.7 |
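To check whether an installed level already contains this fix, the dotted level can be compared numerically against 10.5.0.7. The sketch below is only an illustration: the function names are hypothetical, the installed level string is assumed to be supplied by the caller in the same dotted form, and only the 10.5 release stream is considered because the fixlist above names no other release.

```python
def parse_level(level):
    """Turn a dotted level such as '10.5.0.7' into a tuple of integers."""
    return tuple(int(part) for part in level.split("."))


def includes_fix(installed, fixed_in="10.5.0.7"):
    """Return True if 'installed' is in the same version.release stream as
    'fixed_in' and at or above it. Other release streams return False,
    since the fixlist in this report does not mention them."""
    inst = parse_level(installed)
    fix = parse_level(fixed_in)
    return inst[:2] == fix[:2] and inst >= fix
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "10.5.0.10" as lower than "10.5.0.7".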