DB2 - Problem description
Problem IC87826 | Status: Closed
THE PARALLEL MODE OF DB2_ALL MAY HANG IN A VERY LARGE DPF ENVIRONMENT
Product:
DB2 FOR LUW / DB2FORLUW / A10 - DB2
Problem description:
When db2_all runs in parallel mode (that is, with the ';' option), it sends the user's command to the target partitions and spawns waiter processes on the partition where the command was issued (the sender partition). After the target partitions complete the user's command, they send a remote shell command back to the sender partition to report that the command has completed.

Prior to the fix for this APAR, if this remote shell command failed for any reason, the sender partition appeared to hang, because it assumed that the user's command had not completed on the target partitions when in fact it had. The hang is caused by the lack of error handling for the failed remote shell command. This APAR addresses the error handling.

The cause of the remote shell command failure varies, but a common known cause is an excessive number of remote shells running at the same time. For example, starting too many ssh sessions at the same time may cause some of them to fail with the following error:

ssh_exchange_identification: Connection closed by remote host

Running too many rsh sessions at the same time can fail with the following errors:

socket: protocol failure in circuit setup.
socket: All ports in use

These capacity-related failures may occur in a very large DPF environment, for example one with several hundred partitions.
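The APAR text does not say how the fix handles the failed remote shell command. As a generic illustration of the failure mode only, the following minimal Python sketch shows one common pattern: retry the completion notification with backoff, and return an explicit failure when every attempt fails so that a waiter is never left without an answer. All names here (`NotifyError`, `notify_with_retry`, `make_flaky_sender`) are hypothetical and are not part of db2_all.

```python
import time
from typing import Callable, List


class NotifyError(Exception):
    """Stand-in for a failed remote shell call (hypothetical, not a DB2 error type)."""


def notify_with_retry(send: Callable[[], None],
                      attempts: int = 5,
                      base_delay: float = 0.01,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds or the attempts run out.

    Returns True on success and False if every attempt failed, so the
    caller can report the failure instead of silently dropping it.
    """
    for attempt in range(attempts):
        try:
            send()
            return True
        except NotifyError:
            # Back off before the next try; skip the delay after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False


def make_flaky_sender(failures: int, log: List[str]) -> Callable[[], None]:
    """Build a hypothetical sender that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def send() -> None:
        if state["left"] > 0:
            state["left"] -= 1
            log.append("failed")
            raise NotifyError(
                "ssh_exchange_identification: Connection closed by remote host")
        log.append("delivered")

    return send


if __name__ == "__main__":
    events: List[str] = []
    ok = notify_with_retry(make_flaky_sender(2, events))
    print(ok, events)
```

In this sketch a transient failure, such as a refused ssh connection during a burst of simultaneous sessions, is absorbed by the retries, while a persistent failure surfaces as a `False` return value that the caller can act on.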
Problem Summary:
USERS AFFECTED:
Very large DPF environment
PROBLEM DESCRIPTION:
See the problem description above.
RECOMMENDATION:
Upgrade to DB2 10.1 Fixpack 1
Local Fix:
Running db2_all in serial mode (that is, without the ';' option) avoids this problem.
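As a rough illustration of the local fix, the following Python sketch builds the argument list for a db2_all call in either mode. It assumes that the command is passed to db2_all as a single argument and that a leading ';' is what selects parallel mode, as described above. The helper name `db2_all_args` is hypothetical.

```python
from typing import List


def db2_all_args(command: str, parallel: bool = False) -> List[str]:
    """Return the argv list for a db2_all call (hypothetical helper).

    Serial mode passes the command unchanged. Parallel mode prepends ';',
    which is the mode affected by this APAR.
    """
    command = command.lstrip()
    if command.startswith(";"):
        # Refuse a command that would select parallel mode by accident.
        raise ValueError("command must not start with ';'; use parallel=True")
    prefix = ";" if parallel else ""
    return ["db2_all", prefix + command]


if __name__ == "__main__":
    print(db2_all_args("db2 list applications"))
    print(db2_all_args("db2 list applications", parallel=True))
```

On a system where db2_all is on the PATH, the serial form could be run with `subprocess.run(db2_all_args("db2 list applications"), check=True)`. Serial mode takes longer on a large number of partitions, which is the trade-off of this workaround.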
Solution
First fixed in DB2 10.1 Fixpack 1.
Workaround
Not known; see Local Fix.
Timestamps
Date - problem reported : 04.11.2012
Date - problem closed : 21.11.2012
Date - last modified : 21.11.2012
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)