DB2 - Problem description
Problem: IC81467
Status: Closed
Title: WITH FILE SYSTEM CACHING ENABLED, SYSTEM OUTAGE DURING LOAD PROCESSING MIGHT RESULT IN CORRUPTION
Product: DB2 FOR LUW / DB2FORLUW / 980 - DB2
Problem description:

(1) With file system caching enabled, IBM DB2 for Linux, UNIX, and Windows uses buffered disk writes for index rebuilds during LOAD operations. Buffered disk writes go to the file system cache first. When the buffered data must be physically written to disk, which is typically at commit time, a sync operation must be called.

Because of an issue in tracking which files need to be synchronized, DB2 mistakenly skips synchronizing some or all of the required files. If a machine or file system outage occurs, any writes that still reside in the disk buffer and have not yet been written to disk are lost. How long these writes remain vulnerable depends on how aggressively the operating system and hardware flush the file system cache. Under normal conditions, all writes are eventually sent to disk. If an outage happens after the writes have been flushed from the file system cache to disk, no problem occurs.

For LOAD operations where the index creation phase is done in 'REBUILD' mode, an outage that happens after commit time (which marks the LOAD as successful) but before the writes are physically written to disk might lead to index corruption.

This risk of index corruption applies only to DB2 running on AIX platforms.

Note: In the above description, 'synchronizing' means calling the operating system function sync().

(2) LOAD operations write important information to binary load control files while they are running. If a LOAD operation is interrupted or fails for any reason, the load terminate operation relies on the information stored in the load control files to restore the load target table to its previous state. The load restart operation also relies on the information in the load control files to restart the load operation from the last consistency point.

With file system caching enabled, the LOAD command also uses buffered disk writes for the load control files. If a machine or file system outage occurs during LOAD processing, that is, before the load operation completes successfully, any writes to the load control files that still reside in the disk buffer and have not yet been written to disk are lost. After the system comes back up, the load target table is in load pending state. Running a LOAD TERMINATE or LOAD RESTART command on the table might result in one of two erroneous behaviors:

(a) LOAD TERMINATE or LOAD RESTART fails because it detects missing information in the load control files.

(b) LOAD TERMINATE or LOAD RESTART is successful, but it fails to detect the problem with the load control files and restores incorrect information to the table. In this case, there is data corruption in the table.

The risk of running into these two erroneous behaviors applies to DB2 running on all Linux, UNIX, and Windows platforms.
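The tracking requirement in issue (1) can be illustrated with a minimal sketch. This is not DB2 code: the class name `DirtyFileTracker` and its methods are hypothetical, and the sketch uses a per-file `os.fsync()` rather than the `sync()` call named in the note above. It only shows the general pattern of recording every file that received buffered writes and forcing all of them to disk before a commit is reported as successful. Skipping a file in the commit step would leave its data exposed to loss in an outage.

```python
import os


class DirtyFileTracker:
    """Hypothetical helper: remembers files with buffered writes so a commit can sync them all."""

    def __init__(self):
        self._dirty = set()

    def write(self, path, data):
        # Buffered write: after this returns, the data may still sit
        # only in the file system cache, not on disk.
        with open(path, "ab") as f:
            f.write(data)
        self._dirty.add(path)

    def commit(self):
        # Force every tracked file to stable storage before declaring success.
        synced = []
        for path in sorted(self._dirty):
            fd = os.open(path, os.O_RDWR)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            synced.append(path)
        self._dirty.clear()
        return synced
```

The `commit()` method returns the list of paths it synchronized, which makes it easy to check that no file written since the last commit was left out.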
Problem Summary:

USERS AFFECTED: All LOAD users running on a system with file system caching enabled.

PROBLEM DESCRIPTION: See Problem description above.

RECOMMENDATION: Upgrade to IBM DB2 for Linux, UNIX, and Windows version 9.8 Fix Pack 5.
Local Fix:

Disable file system caching to prevent both issues from occurring. If you already have file system caching enabled and have hit a system outage during load processing, perform the following steps:

For the first issue (1), mark the invalid indexes as bad using the db2dart command, and rebuild them.

For the second issue (2), if the LOAD TERMINATE or LOAD RESTART command fails as described in (a), or if you have not yet issued a LOAD TERMINATE or LOAD RESTART command, the table can be restored to its previous state (the state before the start of the LOAD operation that failed due to the system outage) by deleting the corrupted load control files and then issuing a LOAD TERMINATE command.

Notes for the second issue (2):

(i) Some disk space that was used to store LOB or LF table objects might become orphaned, which means the space will not store any data and cannot be reused.

(ii) Load control files reside in the [db_dir]/load/DB2xxxxx.PID/DB2yyyyy.OID directory, where:
- [db_dir] is the database path, which typically ends in .../NODEmmmm/SQLnnnnn
- xxxxx is the pool id (tablespace id) of the load target table, in hexadecimal
- yyyyy is the object id of the load target table, in hexadecimal
The load control file is loadmmmm.CT1, where mmmm is the partition number in a partitioned database environment.

Before deleting the corrupted load control files, copy or move all files in the [db_dir]/load/DB2xxxxx.PID/DB2yyyyy.OID directory to a backup location. Then delete all files in that directory and issue a LOAD TERMINATE command. In a partitioned database environment, you must do this on all database partitions before issuing the LOAD TERMINATE command. You need to issue the LOAD TERMINATE command only once in a partitioned database environment.
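The backup-then-delete step for the load control files could be scripted along the following lines. This is a sketch under stated assumptions, not an IBM-provided tool: the function name `backup_and_clear` is hypothetical, and both directory paths must be supplied by the caller. The script does not attempt to derive [db_dir] or the hexadecimal pool and object ids, and it does not issue the LOAD TERMINATE command.

```python
import shutil
from pathlib import Path


def backup_and_clear(control_dir, backup_dir):
    """Copy every file in control_dir to backup_dir, then delete the originals.

    Hypothetical helper; returns the sorted names of the files that were
    backed up and removed.
    """
    control_dir = Path(control_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    files = sorted(p for p in control_dir.iterdir() if p.is_file())

    # Copy everything first, so nothing is deleted unless the full backup succeeded.
    for src in files:
        shutil.copy2(src, backup_dir / src.name)

    # Only now remove the originals from the load control directory.
    for src in files:
        src.unlink()

    return [p.name for p in files]
```

In a partitioned database environment, the same call would be repeated for the load control directory of each database partition, with a separate backup location per partition, before the single LOAD TERMINATE command is issued.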
Solution
Problem first fixed in IBM DB2 for Linux, UNIX, and Windows version 9.8 Fix Pack 5.
Workaround
See Local Fix
Timestamps
Date - problem reported: 15.02.2012
Date - problem closed: 13.06.2012
Date - last modified: 13.06.2012
Problem solved at the following versions (IBM BugInfos)
9.8., 9.8.FP5
Problem solved according to the fixlist(s) of the following version(s)
9.8.0.5