DB2 - Problem description
Problem IC66646 | Status: Closed |
HADR PRIMARY REINTEGRATION WILL FAIL WITH PRIMARY/STANDBY MISMATCH AFTER THE PAIR REACHES PEER STATE | |
product: | |
DB2 FOR LUW / DB2FORLUW / 970 - DB2 | |
Problem description: | |
The problem can be seen after a takeover by force is issued and either
a) the old primary is deactivated and brought up as a standby, or
b) the old primary is killed and is first brought up as a primary instead of as a standby (which will fail).
In both cases, trying to reintegrate the old primary as a standby then causes a primary/standby LSN mismatch.

The reason is that when the old primary is deactivated, or is first brought up as a primary (which eventually fails due to timeout), the last (current) log file is truncated, and the minbufflsn, lowtranlsn and remote catchup start LSN are moved to the start of the next file. The same log file that is truncated on the old primary is NOT truncated on the new primary, which keeps using it to write more log records.

If no log writes are done on the new primary before the old primary is reintegrated as a standby, a peer connection is established between the primary and the standby. Once peer state is established and the new primary writes log records and sends them to the standby, a primary/standby LSN mismatch is detected on the standby server, which brings the standby down.

The error message "SQL1768N unable to start HADR. Reason code='7'" will be given. You may see the following log entries in the db2diag.log file.

2010-02-10-10.36.47.166177-360 E121063953A371 LEVEL: Event
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-Peer (was S-NearlyPeer)

2010-02-10-10.36.51.574186-360 I121079812A498 LEVEL: Error
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAddDataBlock, probe:40012
MESSAGE : Primary/standby mismatch. RCUStartLSN 0000000224D4000C not on record boundary.
RCU first page bytecount 4080, firstindex 16, pagelsn 0002230BCFFB.

2010-02-10-10.36.51.574321-360 I121080311A438 LEVEL: Severe
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAddDataBlock, probe:40012
RETCODE : ZRC=0x87800145=-2021654203=HDR_ZRC_VALIDATION_REJECT "HADR shuts down due to validation rejection" | |
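To check whether a standby's db2diag.log shows the symptom quoted in this description, the two distinctive strings from the excerpt can be searched for. The following is a minimal Python sketch, not an IBM tool: the function name `find_hadr_mismatch` and the returned dictionary layout are assumptions made for illustration, and the patterns are taken only from the log excerpt quoted in this APAR.

```python
import re

# Patterns copied from the db2diag.log excerpt quoted in this APAR.
MISMATCH_RE = re.compile(
    r"Primary/standby mismatch\.\s+RCUStartLSN\s+([0-9A-Fa-f]+)"
)
REJECT_TOKEN = "HDR_ZRC_VALIDATION_REJECT"


def find_hadr_mismatch(diag_text):
    """Summarize mismatch symptoms found in db2diag.log text.

    Hypothetical helper: returns the RCUStartLSN values reported in
    "Primary/standby mismatch" messages and whether the validation
    rejection return code appears anywhere in the text.
    """
    return {
        "mismatch_lsns": MISMATCH_RE.findall(diag_text),
        "validation_reject": REJECT_TOKEN in diag_text,
    }


if __name__ == "__main__":
    sample = (
        "MESSAGE : Primary/standby mismatch. RCUStartLSN 0000000224D4000C "
        "not on record boundary.\n"
        "RETCODE : ZRC=0x87800145=-2021654203=HDR_ZRC_VALIDATION_REJECT\n"
    )
    print(find_hadr_mismatch(sample))
```

A match only indicates that the log contains the same messages as the excerpt; it does not by itself prove that this APAR is the cause.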
Local Fix: | |
Back up the new primary database, restore it on the standby machine, and start HADR on the restored database as a standby. If the system is in an HA (TSA) environment, applying the fix for APAR IC65836 may avoid hitting this APAR. | |
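The backup and restore sequence above can be written out as DB2 command line processor commands. The Python sketch below only builds the command strings; it does not run them. The helper name `reinit_standby_commands`, the database name and the backup directory are placeholders chosen for illustration, and the choice of an online backup is an assumption, since the APAR text does not say which kind of backup to take. The backup image also has to be made available on the standby machine before the restore.

```python
def reinit_standby_commands(db_name, backup_dir, online=True):
    """Build DB2 CLP commands that re-create the standby from a backup.

    Hypothetical helper: db_name and backup_dir are caller-supplied
    placeholders. Returns (commands for the new primary,
    commands for the standby machine).
    """
    mode = "online " if online else ""
    on_new_primary = [
        # Take a backup of the current (new) primary database.
        f"db2 backup database {db_name} {mode}to {backup_dir}",
    ]
    on_standby = [
        # Restore the backup image on the standby machine.
        f"db2 restore database {db_name} from {backup_dir}",
        # Bring the restored database up as the HADR standby.
        f"db2 start hadr on database {db_name} as standby",
    ]
    return on_new_primary, on_standby


if __name__ == "__main__":
    primary_cmds, standby_cmds = reinit_standby_commands("sample", "/backups")
    for cmd in primary_cmds + standby_cmds:
        print(cmd)
```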
available fix packs: | |
DB2 Version 9.7 Fix Pack 3 for Linux, UNIX, and Windows | |
Solution | |
This issue is first fixed in DB2 Version 9.7 Fix Pack 3. | |
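As a rough way to tell whether an installed DB2 9.7 level already includes this fix, a four-part version string can be compared against 9.7.0.3, the level named in the fix lists of this record. This is a minimal Python sketch: the helper name `has_ic66646_fix` is made up for illustration, and it deliberately returns `None` for releases other than 9.7, because this record says nothing about them.

```python
def has_ic66646_fix(version):
    """Return True if a DB2 9.7 level is at or above 9.7.0.3.

    Hypothetical helper. `version` is a dotted string such as "9.7.0.2".
    Returns None for releases other than 9.7, which this APAR record
    does not cover.
    """
    parts = tuple(int(p) for p in version.split("."))
    if parts[:2] != (9, 7):
        return None
    # Tuples compare element by element, so 9.7.0.10 sorts after 9.7.0.3.
    return parts >= (9, 7, 0, 3)


if __name__ == "__main__":
    for level in ("9.7.0.2", "9.7.0.3", "9.5.0.5"):
        print(level, has_ic66646_fix(level))
```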
Workaround | |
Back up the new primary database, restore it on the standby machine, and start HADR on the restored database as a standby. If the system is in an HA (TSA) environment, applying the fix for APAR IC65836 may avoid hitting this APAR. | |
Timestamps | |
Date - problem reported : | 25.02.2010 |
Date - problem closed : | 23.09.2010 |
Date - last modified : | 23.09.2010 |
Problem solved at the following versions (IBM BugInfos) | |
9.7.FP3 | |
Problem solved according to the fixlist(s) of the following version(s) | |
9.7.0.3 | |