Informix - Problem description
Problem IT27709 | Status: Closed
PRIMARY AND SECONDARY UNABLE TO RECONNECT AFTER NETWORK FAILURE
Product:
INFORMIX SERVER / 5725A3900 / C10 - IDS 12.10
Problem description:
In some cases, a network interruption can leave the primary and the HDR secondary unable to reconnect until the HDR secondary is restarted. The problem might be encountered only on HDR pairs where the secondary is an UPDATABLE secondary, or where SMX_PING_INTERVAL/SMX_PING_RETRY are configured differently on the primary and secondary servers.

In this specific case, the issue appears to be that HDR cannot shut itself down properly after it detects the network problem. Because it cannot shut down properly, it never reaches the code that attempts to reconnect.

The symptoms of this problem can be identified by checking the state and stack of both the dr_prsend thread and the dr_prping thread. At the point where the teardown appears to be stuck, onstat -g ath shows the two threads in the following states:

    Threads:
     tid       tcb        rstcb      prty  status               vp-class  name
     159       112258d48  10feee060  3     join wait 32846355   14cpu     dr_prsend
     ...
     32846355  1d22fdc58  2c9555520  3     yield time           1cpu      dr_prping

In this output, dr_prsend is in a join wait on thread 32846355, which is the dr_prping thread.

The stacks look like this:

    Stack for thread: 159 dr_prsend
     ...
     0x000000001118a62c (oninit)mt_join
     0x0000000010ea5030 (oninit)dr_session_thread
     0x00000000111ca69c (oninit)startup

    Stack for thread: 32846355 dr_prping
     ...
     0x00000000111831a0 (oninit)mt_yield
     0x00000000112ed520 (oninit)smx_recv
     0x0000000010e9b7ec (oninit)dr_isSecondaryInCheckpoint
     0x0000000010e86e90 (oninit)dr_primary_ping
     0x00000000111ca69c (oninit)startup

Another key element is the following sequence of events, based on the errors in the MSGPATH files:

1. The PRIMARY server reports SMX messages about connections being closed because the other server was unresponsive.
2. The PRIMARY server then reports that SMX created a new transport to the HDR secondary.
3. The HDR secondary then reports that its SMX connections were closed because the other server was unresponsive.

It is important that the message in step 3 occurs at some point in time after the primary has reported its SMX connections being closed and the new transport being created.

Here are sample error sequences.

PRIMARY MSGPATH file:

    23:40:37 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:40:46 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:40:56 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:41:00 smx creates 1 transports to server allende3
    23:42:55 WARNING: Detected slow or failing DNS service response 101 time(s).
    23:54:30 DR: Receive error
    23:54:30 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    23:54:30 DR_ERR set to -1

SECONDARY MSGPATH file:

    23:43:22 DR: ping timeout
    23:43:22 DR: Receive error
    23:43:22 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    23:43:22 DR_ERR set to -1
    23:43:23 DR: Terminating redirected write subsystem due to server disconnect. All open redirected transactions will be rolled back.
    23:43:24 Updates from secondary currently not allowed
    23:43:24 ERROR: Mach11 proxyWritePostPBlobCmdSync failed
    23:43:24 DR: Turned off on secondary server
    23:45:16 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).
    23:45:18 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).
    23:45:25 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).

The reported timings are what matter: in this sample, the primary creates the new transport at 23:41:00, and the secondary does not report its SMX connections as closed until 23:45:16. The two servers also report different timeout periods (120 seconds on the primary, 360 seconds on the secondary).
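The ordering check described above can be sketched in a few lines of Python. This is only an illustrative aid, not an IBM-provided tool: the function names `parse_events` and `matches_reported_sequence` are hypothetical, and the sketch assumes that each message occupies a single line beginning with an HH:MM:SS timestamp and that all messages fall on the same day (no rollover past midnight).

```python
import re
from datetime import time

# Assumption: each message line starts with an HH:MM:SS timestamp.
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.*)$")
SMX_CLOSED = "The SMX connection between high availability servers was closed"
SMX_CREATED = re.compile(r"smx creates \d+ transports? to server")


def parse_events(lines):
    """Return (time, message) pairs for lines that carry a timestamp."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            hh, mm, ss, message = match.groups()
            events.append((time(int(hh), int(mm), int(ss)), message))
    return events


def matches_reported_sequence(primary_lines, secondary_lines):
    """Check for the ordering described in this report:
    primary SMX close -> primary new transport -> secondary SMX close.
    Hypothetical helper; assumes both logs cover the same day."""
    primary = parse_events(primary_lines)
    secondary = parse_events(secondary_lines)

    primary_closed = [t for t, msg in primary if msg.startswith(SMX_CLOSED)]
    if not primary_closed:
        return False
    first_closed = min(primary_closed)

    # New transport created after the primary first reported a close.
    created = [t for t, msg in primary
               if SMX_CREATED.search(msg) and t > first_closed]
    if not created:
        return False
    new_transport = min(created)

    # The secondary must report its SMX close after that new transport.
    return any(t > new_transport
               for t, msg in secondary if msg.startswith(SMX_CLOSED))
```

Called with the lines of the two MSGPATH files, the function returns True only when the secondary reports a closed SMX connection later than the primary's new transport. A match is a hint that the logs resemble this report; the thread states and stacks from onstat still need to be checked.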
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* Users of IDS prior to 12.10.xC13.                            *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* Primary and Secondary unable to reconnect after network      *
* failure.                                                     *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
Local Fix:
Solution
Workaround
not known / see Local fix
Timestamps
Date - problem reported: 09.01.2019
Date - problem closed: 24.09.2019
Date - last modified: 24.09.2019
Problem solved at the following versions (IBM BugInfos)
12.10.xC13
Problem solved according to the fixlist(s) of the following version(s)