Informix - Problem description
Problem IT30240 | Status: Closed |
LOGICAL RECOVERY ERROR TRYING TO RESTART SERVER IF RTREE INDEX SENT TO SECONDARY BUT NO CHECKPOINT HAPPENS BEFORE FAILURE | |
product: | |
INFORMIX SERVER / 5725A3900 / B70 - IDS 11.70 | |
Problem description: | |
So to see the problem, you would have to see something like the following in the secondary servers MSGPATH file: 08:47:57 Checkpoint Completed: duration was 0 seconds. 08:47:57 Tue Sep 10 - loguniq 23, logpos 0x1086080, timestamp: 0x14f597 Interval: 523 08:47:57 Maximum server connections 0 08:47:57 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 21, Llog used 0 08:48:07 DR: Receiving index db1:"jrenaut".t2#t2_geo : Started 08:48:08 DR: Receiving index db1:"jrenaut".t2#t2_geo : Completed. 08:48:15 The Master Daemon Died So from this the sequence looks like a checkpoint happens and then HDR receives an index and then the server aborts. However, if you look at onlog output, you can see the CINDEX log record happens in a transaction prior to the checkpoint @ 8:47:57 logpos 0x1086080. Here's the onlog: 107f018 56 BEGIN 15 23 0 09/10/2019 08:47:41 1522 jrenaut ... 10841fc 136 CINDEX 15 0 1084034 300002 1 1084284 136 CINDEX 15 0 10841fc 300002 1 ... 1085284 56 COMMIT 15 0 10851cc 09/10/2019 08:47:41 ... 1086080 68 CKPOINT 1 23 0 1 The above CINDEX is for the db1:"jrenaut".t2#t2_geo index that says it was successfully shipped to the secondary. The transaction that creates the index commits prior to the checkpoint, but the index isn't shipped until after that checkpoint. So now when the server is brought online, we see the following messages and failure in the MSGPATH file: 08:48:30 IBM Informix Dynamic Server Started. ... 08:48:37 Physical Recovery Started at Page (1:12901). 08:48:37 Physical Recovery Complete: 1 Pages Examined, 1 Pages Restored. 08:48:37 DR: Trying to connect to primary server = rtpri 08:48:37 Dataskip is now OFF for all dbspaces 08:48:37 Restartable Restore has been ENABLED 08:48:37 Recovery Mode 08:48:38 DR: Secondary server connected 08:48:39 DR: Using default behavior of failure-recovering Secondary server 08:48:39 DR: Failure recovery from disk in progress ... 08:48:39 Logical Recovery Started. 08:48:39 10 recovery worker threads will be started. 08:48:39 Start Logical Recovery - Start Log 23, End Log ? 08:48:39 Starting Log Position - 23 0x1086080 08:48:43 Started processing open transactions on secondary during startup 08:48:43 Finished processing open transactions on secondary during startup. 08:48:43 Rollforward of log record failed. iserrno = 126 08:48:43 Log Record: log = 23, pos = 0x10891cc, type = OLDRSAM:HUPDATE(73), trans = 15 So when the server tries to recovery from the checkpoint after the index creation was committed, it hits a log record for that rtree index that unfortunately now that we are in recovery, hasn't been reshipped yet and it fails because it seems not all rtree operations are flagged as index operations that would normally be allowed to fail without then aborting recovery and bringing the server down. Here is a portion of the contents of the af file: 08:48:43 Log record (OLDRSAM:HUPDATE) failed, partnum 0x300002 rowid 0x801 iserrno 126 Log Record: log = 23, pos = 0x10891cc, type = OLDRSAM:HUPDATE(73), trans = 15 0000000058709060: 64000000 00004900 1e200000 00000000 d.....I. . ...... 0000000058709070: 00000000 00000000 0f000000 3c910801 ........ .... | |
Problem Summary: | |
**************************************************************** * USERS AFFECTED: * * Users of Informix 11.70.xC8 and earlier versions. * **************************************************************** * PROBLEM DESCRIPTION: * * Txns blocked 0, Plog used 21, Llog used 0 * * * * 08:48:07 DR: Receiving index db1:"jrenaut".t2#t2_geo : * * Started * * 08:48:08 DR: Receiving index db1:"jrenaut".t2#t2_geo : * * Completed. * * 08:48:15 The Master Daemon Died * * * * So from this the sequence looks like a checkpoint happens * * and * * then HDR receives an index and then the server aborts. * * However, * * if you look at onlog output, you can see the CINDEX log * * record * * happens in a transaction prior to the checkpoint @ 8:47:57 * * logpos 0x1086080. * * * * Here's the onlog: * * * * 107f018 56 BEGIN 15 23 0 09/10/2019 * * 08:47:41 * * 1522 jrenaut * * ... * * 10841fc 136 CINDEX 15 0 1084034 300002 1 * * 1084284 136 CINDEX 15 0 10841fc 300002 1 * * ... * * 1085284 56 COMMIT 15 0 10851cc 09/10/2019 * * 08:47:41 * * * * ... * * * * 1086080 68 CKPOINT 1 23 0 1 * * * * The above CINDEX is for the db1:"jrenaut".t2#t2_geo index * * that * * says it was successfully shipped to the secondary. The * * transaction that creates the index commits prior to the * * checkpoint, but the index isn't shipped until after that * * checkpoint. * * * * So now when the server is brought online, we see the * * following * * messages and failure in the MSGPATH file: * * * * 08:48:30 IBM Informix Dynamic Server Started. * * ... * * 08:48:37 Physical Recovery Started at Page (1:12901). * * 08:48:37 Physical Recovery Complete: 1 Pages Examined, 1 * * Pages * * Restored. * * 08:48:37 DR: Trying to connect to primary server = rtpri * * 08:48:37 Dataskip is now OFF for all dbspaces * * 08:48:37 Restartable Restore has been ENABLED * * 08:48:37 Recovery Mode * * 08:48:38 DR: Secondary server connected * * 08:48:39 DR: Using default behavior of failure-recovering * * Secondary server * * * * 08:48:39 DR: Failure recovery from disk in progress ... * * 08:48:39 Logical Recovery Started. * * 08:48:39 10 recovery worker threads will be started. * * 08:48:39 Start Logical Recovery - Start Log 23, End Log ? * * 08:48:39 Starting Log Position - 23 0x1086080 * * 08:48:43 Started processing open transactions on secondary * * during startup * * 08:48:43 Finished processing open transactions on secondary * * during startup. * * 08:48:43 Rollforward of log record failed. iserrno = 126 * * 08:48:43 Log Record: log = 23, pos = 0x10891cc, type = * * OLDRSAM:HUPDATE(73), trans = 15 * * * * So when the server tries to recovery from the checkpoint * * after * * the index creation was committed, it hits a log record for * * that * * rtree index that unfortunately now that we are in recovery, * * hasn't been reshipped yet and it fails because it seems not * * all * * rtree operations are flagged as index operations that would * * normally be allowed to fail without then aborting recovery * * and * * bringing the server down. * * * * Here is a portion of the contents of the af file: * * * * 08:48:43 Log record (OLDRSAM:HUPDATE) failed, partnum * * 0x300002 * * rowid 0x801 iserrno 126 * * Log Record: log = 23, pos = 0x10891cc, type = * * OLDRSAM:HUPDATE(73), trans = 15 * * 0000000058709060: 64000000 00004900 1e200000 00000000 * * d.....I. * * . ...... * * 0000000058709070: 00000000 00000000 0f000000 3c910801 * * ........ * * .... | |
Local Fix: | |
Solution | |
Workaround | |
**************************************************************** * USERS AFFECTED: * * Users of Informix 11.70.xC8 and earlier versions. * **************************************************************** * PROBLEM DESCRIPTION: * * Txns blocked 0, Plog used 21, Llog used 0 * * * * 08:48:07 DR: Receiving index db1:"jrenaut".t2#t2_geo : * * Started * * 08:48:08 DR: Receiving index db1:"jrenaut".t2#t2_geo : * * Completed. * * 08:48:15 The Master Daemon Died * * * * So from this the sequence looks like a checkpoint happens * * and * * then HDR receives an index and then the server aborts. * * However, * * if you look at onlog output, you can see the CINDEX log * * record * * happens in a transaction prior to the checkpoint @ 8:47:57 * * logpos 0x1086080. * * * * Here's the onlog: * * * * 107f018 56 BEGIN 15 23 0 09/10/2019 * * 08:47:41 * * 1522 jrenaut * * ... * * 10841fc 136 CINDEX 15 0 1084034 300002 1 * * 1084284 136 CINDEX 15 0 10841fc 300002 1 * * ... * * 1085284 56 COMMIT 15 0 10851cc 09/10/2019 * * 08:47:41 * * * * ... * * * * 1086080 68 CKPOINT 1 23 0 1 * * * * The above CINDEX is for the db1:"jrenaut".t2#t2_geo index * * that * * says it was successfully shipped to the secondary. The * * transaction that creates the index commits prior to the * * checkpoint, but the index isn't shipped until after that * * checkpoint. * * * * So now when the server is brought online, we see the * * following * * messages and failure in the MSGPATH file: * * * * 08:48:30 IBM Informix Dynamic Server Started. * * ... * * 08:48:37 Physical Recovery Started at Page (1:12901). * * 08:48:37 Physical Recovery Complete: 1 Pages Examined, 1 * * Pages * * Restored. * * 08:48:37 DR: Trying to connect to primary server = rtpri * * 08:48:37 Dataskip is now OFF for all dbspaces * * 08:48:37 Restartable Restore has been ENABLED * * 08:48:37 Recovery Mode * * 08:48:38 DR: Secondary server connected * * 08:48:39 DR: Using default behavior of failure-recovering * * Secondary server * * * * 08:48:39 DR: Failure recovery from disk in progress ... * * 08:48:39 Logical Recovery Started. * * 08:48:39 10 recovery worker threads will be started. * * 08:48:39 Start Logical Recovery - Start Log 23, End Log ? * * 08:48:39 Starting Log Position - 23 0x1086080 * * 08:48:43 Started processing open transactions on secondary * * during startup * * 08:48:43 Finished processing open transactions on secondary * * during startup. * * 08:48:43 Rollforward of log record failed. iserrno = 126 * * 08:48:43 Log Record: log = 23, pos = 0x10891cc, type = * * OLDRSAM:HUPDATE(73), trans = 15 * * * * So when the server tries to recovery from the checkpoint * * after * * the index creation was committed, it hits a log record for * * that * * rtree index that unfortunately now that we are in recovery, * * hasn't been reshipped yet and it fails because it seems not * * all * * rtree operations are flagged as index operations that would * * normally be allowed to fail without then aborting recovery * * and * * bringing the server down. * * * * Here is a portion of the contents of the af file: * * * * 08:48:43 Log record (OLDRSAM:HUPDATE) failed, partnum * * 0x300002 * * rowid 0x801 iserrno 126 * * Log Record: log = 23, pos = 0x10891cc, type = * * OLDRSAM:HUPDATE(73), trans = 15 * * 0000000058709060: 64000000 00004900 1e200000 00000000 * * d.....I. * * . ...... * * 0000000058709070: 00000000 00000000 0f000000 3c910801 * * ........ * * .... | |
Comment | |
Upgrade to Informix 11.70.xC9W2. | |
Timestamps | |
Date - problem reported : Date - problem closed : Date - last modified : | 10.09.2019 27.08.2020 31.08.2020 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |