DB2 - Problem description
Problem IT03035 | Status: Closed |
RELAX CLUSTER COMMUNICATION GROUP SETTINGS IN A TSA/HADR ENVIRONMENT CONFIGURED USING DB2HAICU | |
product: | |
DB2 FOR LUW / DB2FORLUW / A10 - DB2 | |
Problem description: | |
In a TSA/HADR environment configured via db2haicu, the default tolerable network latency between the two nodes is 8 seconds. After 8 seconds without communication between the nodes, RSCT declares a loss of communication and recovery actions follow. The default of 8 seconds was found to be too restrictive, and raising it to 30 seconds is recommended. This relaxed value ensures that unnecessary recovery actions do not take place during periods of high latency between the cluster nodes. Bullet 1 of the following technote has more details on this: http://www-01.ibm.com/support/docview.wss?uid=swg21624179

"1. Relaxing Heartbeat Sensitivity settings

The default values of 4 (sensitivity) and 1 (period) allow for 8 seconds of network latency before RSCT decides that the heartbeat attempt between two nodes is unsuccessful and thus recovery actions are necessary. We have found that in clusters where the servers are heavily utilized the default heartbeat values are too stringent and need to be relaxed. Relaxing these settings can prevent unwanted behavior such as an unexpected reboot. We recommend changing the Sensitivity to 5 and the Period to 3, which will allow for 30 seconds before RSCT declares a problem.

To determine your cluster's "CommGroup Name", issue the "lscomg" command. To modify the settings to our recommended values, issue the following from any node:

    chcomg -s 5 -p 3 <CommGroup_Name>

Apply the change to all configured communication groups listed in the "lscomg" output."

In addition to relaxing the cluster communication group settings, the CritRsrcProtMethod is being updated from 1 to 3 in order to allow a sync to disk from memory before a machine is rebooted for critical resource protection reasons. Bullet 3 of the following technote has more details on this: http://www-01.ibm.com/support/docview.wss?uid=swg21624179

"3. Change CritRsrcProtMethod setting from 1 to 3

By default, whenever RSCT invokes CritRsrcProtMethod it issues a kernel panic that causes a hard reset and reboot of the OS. Often, with DB2 clusters, this happens when an extreme load on a server causes heartbeats to be missed, making RSCT think that it is no longer communicating with the rest of the cluster and ending in a reboot. When this happens, any in-memory log/trace data is lost because there is no opportunity to flush it to disk with the default CritRsrcProtMethod setting of 1. Changing this value to 3 allows what is in memory to be synced to disk before the reboot occurs ... this means that valuable syslog, error report, trace and db2diag.log messages will be saved.

    chrsrc -c IBM.PeerNode CritRsrcProtMethod=3" | |
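The two latency windows quoted from the technote (8 seconds and 30 seconds) are both consistent with sensitivity x period x 2. The sketch below simply reproduces that arithmetic in shell. It assumes this relationship holds in general; the factor of two is inferred from the technote's two data points, and `detection_window` is a hypothetical helper name, not an RSCT command.

```shell
#!/bin/sh
# Hypothetical helper: estimate the heartbeat failure-detection window in seconds.
# Assumption: window = sensitivity * period * 2, which matches both figures
# quoted in the technote (4,1 -> 8 and 5,3 -> 30).
detection_window() {
  sensitivity=$1
  period=$2
  echo $(( sensitivity * period * 2 ))
}

detection_window 4 1   # default settings
detection_window 5 3   # recommended settings
```

Run as written, the two calls print 8 and 30, matching the default and recommended windows described above.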
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* Users using db2haicu in TSA/HADR setup                       *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Users can upgrade to DB2 Version 10.1 fix pack 5 or higher   *
* to avoid this defect                                         *
**************************************************************** | |
Local Fix: | |
Refer to bullets 1 and 3 of this technote: http://www-01.ibm.com/support/docview.wss?uid=swg21624179 | |
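On a system below the fixed level, the technote's two bullets amount to running `chcomg -s 5 -p 3` once per communication group and `chrsrc` once. The following is a minimal dry-run sketch that only prints the `chcomg` commands rather than executing them. It assumes that `lscomg` output starts with one header line and carries the group name in the first column; `print_chcomg_cmds` is a hypothetical helper name. Check the printed commands against the actual `lscomg` output before running anything.

```shell
#!/bin/sh
# Hypothetical dry-run helper: read lscomg-style output on stdin and print
# (not execute) one chcomg command per communication group.
# Assumptions: line 1 is a header; column 1 of each later line is the group name.
print_chcomg_cmds() {
  awk 'NR > 1 && NF > 0 { print "chcomg -s 5 -p 3 " $1 }'
}

# Usage on a cluster node (assumes lscomg is on PATH):
#   lscomg | print_chcomg_cmds
# Followed once, per bullet 3 of the technote, by:
#   chrsrc -c IBM.PeerNode CritRsrcProtMethod=3
```

Printing the commands first keeps the change reviewable, since the technote asks for the setting to be applied to every group listed.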
Solution | |
First fixed in DB2 Version 10.1 fix pack 5 | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : | 08.07.2014 |
Date - problem closed : | 12.08.2015 |
Date - last modified : | 12.08.2015 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) | |
10.1.0.5 |