DB2 - Problem description
Problem IC65460 | Status: Closed
DB2 HA FAILS MONITORING FILESYSTEMS WHEN I/O ERRORS PRESENT.
product:
DB2 FOR LUW / DB2FORLUW / 970 - DB2
Problem description:
In an integrated HA solution environment, when an I/O problem occurs in the system, the mount may remain in the Unknown state and no failover occurs. The scenario is as follows:
1. An I/O problem occurs in the system.
2. TSA calls the monitor script for the filesystem (registered as an IBM.Application).
3. The monitor script (provided by DB2 HA) attempts to touch a file on the filesystem (after verifying the fs is mounted).
4. The touch generates an I/O error.
5. In the event of an I/O error, the monitor script calls the stop script to make sure the fs is unmounted.
6. The stop script attempts to umount the fs. In this case, however, a PID is accessing the filesystem, which prevents the umount.
7. The stop script retries up to 9 more times (sleeping for 10 seconds between tries).
8. After the third try (29 seconds after TSA kicked off the monitor script), TSA kills the monitor script for exceeding the monitor script timeout period registered for the resource. This also kills the child process (the stop script) before it can get through its 10 tries to umount.
9. Since TSA killed the monitor script, the resource state is 'Unknown'.
10. TSA takes no action on a resource with an unknown state. Instead it starts the cycle again by calling the monitor script.
11. On the affected node, this continues until the machine is rebooted.
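The timing in steps 7 and 8 can be checked with a little arithmetic. The sketch below assumes that each umount attempt fails almost immediately, so attempts start at 0, 10, 20, ... seconds; the function names are hypothetical and are not part of DB2 or TSA.

```shell
#!/bin/sh
# Illustrative arithmetic only; not taken from the DB2 HA scripts.

# Number of umount attempts that start strictly before the monitor timeout.
# Assumption: attempt k (1-based) starts at (k-1)*sleep_s seconds.
attempts_before_timeout() {
    timeout_s=$1   # monitor script timeout registered with TSA, in seconds
    sleep_s=$2     # pause between umount attempts, in seconds
    max_tries=$3   # total attempts the stop script would like to make

    started=$(( (timeout_s - 1) / sleep_s + 1 ))
    if [ "$started" -gt "$max_tries" ]; then
        started=$max_tries
    fi
    echo "$started"
}

# Second at which the last attempt starts; the monitor timeout has to be
# larger than this for every attempt to get a chance to run.
last_attempt_start() {
    sleep_s=$1
    max_tries=$2
    echo $(( (max_tries - 1) * sleep_s ))
}

attempts_before_timeout 30 10 10   # prints 3
last_attempt_start 10 10           # prints 90
```

Under these assumptions only 3 of the 10 attempts start within a 30 second timeout, which matches step 8, and the full retry sequence would need a timeout above 90 seconds.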
Problem Summary:
USERS AFFECTED: DB2/TSA user
PROBLEM DESCRIPTION: In an integrated HA solution environment, when an I/O problem occurs in the system, the mount may remain in the Unknown state and no failover occurs. The scenario is the one listed under "Problem description" above.
RECOMMENDATION: Upgrade to v97fp2.
Local Fix:
available fix packs:
DB2 Version 9.7 Fix Pack 2 for Linux, UNIX, and Windows
Solution
The monitor timeout is only 30 seconds. There are two possible fixes: either increase the mount monitor timeout to a value larger than 30 so that the soft unmount can complete, or add a force option to mountV95_stop.ksh that bypasses the soft unmount when the mount monitor hits an I/O error and calls the mount stop script. The fix in v97fp2 contains this new "force" option.
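The difference between the soft retry path and a force path can be sketched as follows. This is a minimal illustration and not the contents of mountV95_stop.ksh: the function name, the force argument, and the UMOUNT/SLEEP override variables are all assumptions made for the example, and the exact behaviour of `umount -f` differs between platforms.

```shell
#!/bin/sh
# Sketch only; NOT the shipped DB2 HA stop script.
# UMOUNT and SLEEP can be overridden so the logic can be dry-run safely.
UMOUNT=${UMOUNT:-umount}
SLEEP=${SLEEP:-sleep}

# stop_mount MOUNT_POINT [FORCE]
# FORCE is a hypothetical flag: 1 skips the soft retries entirely.
stop_mount() {
    mount_point=$1
    force=${2:-0}

    if [ "$force" -eq 1 ]; then
        # I/O error path: go straight to a forced unmount so the caller
        # (the monitor script) is not held up by the retry loop.
        "$UMOUNT" -f "$mount_point"
        return $?
    fi

    # Soft path: up to 10 attempts, 10 seconds apart.
    tries=0
    while [ "$tries" -lt 10 ]; do
        if "$UMOUNT" "$mount_point"; then
            return 0
        fi
        tries=$((tries + 1))
        if [ "$tries" -lt 10 ]; then
            "$SLEEP" 10
        fi
    done
    return 1
}
```

With a busy filesystem the soft path can take roughly 90 seconds before giving up, while a call such as `stop_mount /some/mount 1` (path hypothetical) issues a single forced unmount and returns at once, which is what lets the monitor script finish inside its timeout.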
Workaround
not known / see Local fix
Timestamps
Date - problem reported : 07.01.2010
Date - problem closed : 25.05.2010
Date - last modified : 25.05.2010
Problem solved at the following versions (IBM BugInfos)
9.7.FP2
Problem solved according to the fixlist(s) of the following version(s)
9.7.0.2