
DB2 - Problem description

Problem IT40847 Status: Closed

ENDLESS ITERATION OF DB2 CLEANUP AND KILL PROCESSES MAKE DB2 PURESCALE
CLUSTER HANG

product:
DB2 FOR LUW / DB2FORLUW / B50 - DB2
Problem description:
On the AIX operating system, a process can become stuck in the
"EXITING" state in the kernel.
In this state, it cannot be killed with a kill signal.

If the db2sysc process cannot be terminated by the SIGKILL signal,
the db2rocm CLEANUP and KILL processes are interrupted by the
SIGALRM signal (Time expired).
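
The timeout behavior described above can be illustrated with a minimal sketch. This is not db2rocm code; it only assumes that a cleanup routine waits for a process to disappear and is cut short by SIGALRM when a timer expires. The helper name wait_until_gone and the is_alive callback are hypothetical, and the return code 12 is borrowed from the CLEANUP message in the db2diag.log excerpt.

```python
import signal
import time

CLEANUP_TIMEOUT_RC = 12  # value taken from the CLEANUP message in db2diag.log


class CleanupTimeout(Exception):
    """Raised by the SIGALRM handler when the timer expires."""


def _on_alarm(signum, frame):
    raise CleanupTimeout()


def wait_until_gone(is_alive, timeout_s, poll_s=0.05):
    """Return 0 if is_alive() becomes False within timeout_s, else 12.

    Hypothetical helper: must run in the main thread on a Unix-like OS,
    because Python only delivers signals there.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        while is_alive():
            time.sleep(poll_s)
        return 0
    except CleanupTimeout:
        # The target never went away: this models the "Time expired" case.
        return CLEANUP_TIMEOUT_RC
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, old_handler)


if __name__ == "__main__":
    # A process stuck in EXITING state behaves like "always alive".
    print(wait_until_gone(lambda: True, 0.2))
```

A process that SIGKILL cannot remove makes every attempt end in the timeout branch, which is why the cleanup is reissued without ever succeeding.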

  In such a situation, the TSA CLEANUP task is issued repeatedly
until the system is rebooted, and the member is not started on
another host in restart light mode.

  Meanwhile, all applications become stuck waiting for database
objects that have not been cleaned up by member crash recovery
during restart light.

  In this situation, messages similar to the following are logged
in db2diag.log.

2019-05-05-20.00.56.369398+540 I58987522A827        LEVEL: Event
PID     : 19136798             TID : 1              PROC :
db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: member00
EDUID   : 1                    EDUNAME: db2rocm 0 [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
-11337922
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
  [0] 0x090000000E0D5FE0 sqlossig + 0xA0
  [1] 0x00000001000203C0
sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
  [2] 0x00000001000144DC sqlhaDB2KillNode + 0xE3C
  [3] 0x000000010000C120 rocmDB2Cleanup + 0x10A0
  [4] 0x0000000100004080 main + 0x1820
  [5] 0x00000001000002F8 __start + 0x70

2019-05-05-20.03.26.369026+540 I58998026A1507       LEVEL:
Warning
PID     : 19136798             TID : 1              PROC :
db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: member00
EDUID   : 1                    EDUNAME: db2rocm 0 [db2inst1]
FUNCTION: DB2 UDB, high avail services,
rocmSignalsForTimeoutOffline, probe:411
MESSAGE : Received signal during CLEANUP - exiting with return
code 12.
DATA #1 : String, 7 bytes
SIGALRM
DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
action->version: 1
action->actor->actorType: DB2
action->actor->actorID: 0
action->actor->instName: db2inst1
action->actor->hostname: NOT_POPULATED
action->actor->options: NONE
action->command: CLEANUP
DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS,
3224 bytes
pgrpFile->iPgrpFileVersion : 2225
pgrpFile->iPgrpId : 11337922
pgrpFile->iWdogPgrpId : 12517570
pgrpFile->iSubPgrpId : NOT_INITIALIZED
pgrpFile->iIndex : 0
pgrpFile->iNumber : 0
pgrpFile->iMonitorOverride : 0
pgrpFile->crashCounter : 0
pgrpFile->firstCrashTimeSeconds : 1970-01-01 09:00:00.000000
pgrpFile->monitorTimeoutCounter : 0
pgrpFile->firstMonitorTimeoutSeconds : 1970-01-01
09:00:00.000000
pgrpFile->lastMonitorTimeoutSeconds : 1970-01-01 09:00:00.000000
pgrpFile->hostname : member00
pgrpFile->iNumHCAs : 0
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
  [0] 0x0000000100006EB4 rocmSignalsForTimeoutOffline + 0xAF4
  [1] 0x0000000000000000 ?unknown + 0x0

2019-05-05-20.03.26.623617+540 I59000696A890        LEVEL: Event
PID     : 46924020             TID : 1              PROC :
db2rocme 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: member00
EDUID   : 1                    EDUNAME: db2rocme 0 [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
-11337922
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
  [0] 0x090000000E0D5FE0 sqlossig + 0xA0
  [1] 0x00000001001002C0
sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
  [2] 0x00000001000FC6CC sqlhaDB2KillNode + 0xE4C
  [3] 0x000000010000FAD8 rocmDB2Notify + 0x2F8
  [4] 0x000000010010322C rocmCommandRetryUntilFailure + 0x162C
  [5] 0x0000000100003F00 main + 0x16A0
  [6] 0x00000001000002F8 __start + 0x70

2019-05-05-20.03.56.620065+540 I59003951A1646       LEVEL:
Warning
PID     : 46924020             TID : 1              PROC :
db2rocme 0 [db2inst1]
INSTANCE: db2inst1               NODE : 000
HOSTNAME: member00
EDUID   : 1                    EDUNAME: db2rocme 0 [db2inst1]
FUNCTION: DB2 UDB, high avail services,
rocmSignalsForTimeoutOffline, probe:426
MESSAGE : Received signal during KILL event - exiting with
return code 13.
DATA #1 : String, 7 bytes
SIGALRM
DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
action->version: 1
action->actor->actorType: DB2
action->actor->actorID: 0
action->actor->instName: db2inst1
action->actor->hostname: NOT_POPULATED
action->actor->options: NONE
action->command: NOTIFY
action->notification->version: 1111
action->notification->eventType: KILL
action->notification->actor->actorType: DB2
action->notification->actor->actorID: 0
action->notification->actor->instName: db2inst1
action->notification->actor->hostname: member01
action->notification->actor->options: NONE
action->notification->sequenceNumber: 214 (0x00000000000000d6)
action->notification->eventWhitelistFlags: NONE
action->notification->bNotifSent: false
action->notification->retryNum: 0
action->notification->eventWhitelistFlagsToChange: 0
action->notification->options: FORCE
DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS,
3224 bytes
Object not dumped: Address: 0x0000000000000000 Size: 3224
Reason: Address is NULL
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
  [0] 0x00000001000084EC rocmSignalsForTimeoutOffline + 0xA2C
  [1] 0x0000000000000000 ?unknown + 0x0
...
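
To check a db2diag.log for this pattern, the records can be scanned for repeated SIGKILL attempts against the same process group and for CLEANUP or KILL actions ending with a signal. The following is a rough sketch, assuming the log text is wrapped as in the excerpt above; the function names are illustrative and not part of any Db2 tool.

```python
import re
from collections import Counter

# A record starts with a timestamp such as 2019-05-05-20.00.56.369398+540
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}\s", re.M
)
SIGKILL_MSG = re.compile(
    r"Sending SIGKILL to the following process id "
    r"DATA #1 : signed integer, 4 bytes (-?\d+)"
)
TIMEOUT_MSG = re.compile(
    r"Received signal during (CLEANUP|KILL) (?:event )?"
    r"- exiting with return code (\d+)"
)


def split_records(text):
    """Split the log into records and collapse wrapped lines to one line."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [" ".join(text[s:e].split()) for s, e in zip(starts, ends)]


def summarize(text):
    """Count SIGKILL targets and interrupted CLEANUP/KILL actions."""
    kills, timeouts = Counter(), Counter()
    for record in split_records(text):
        m = SIGKILL_MSG.search(record)
        if m:
            kills[int(m.group(1))] += 1
        m = TIMEOUT_MSG.search(record)
        if m:
            timeouts[(m.group(1), int(m.group(2)))] += 1
    return kills, timeouts


SAMPLE = """\
2019-05-05-20.00.56.369398+540 I58987522A827        LEVEL: Event
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
-11337922

2019-05-05-20.03.26.369026+540 I58998026A1507       LEVEL:
Warning
MESSAGE : Received signal during CLEANUP - exiting with return
code 12.
DATA #1 : String, 7 bytes
SIGALRM
"""

if __name__ == "__main__":
    kills, timeouts = summarize(SAMPLE)
    print(kills, timeouts)
```

A steadily growing count for the same negative process group ID, together with repeated return codes 12 and 13, would be consistent with the loop described in this APAR.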


If one of the event recorders is formatted using the db2fdump
command, the following message indicates that the process is
stuck in the EXITING state:

7445    Event sequence number: 0      Time:
2019-05-05-13.03.26.350648433
        sqlhaVerifyProcessExists (3.115.49.0.748)
        PID:            TID:                      EDUID:
APPHDL:

        Data1        (PD_TYPE_SQLHA_ER_PDINFO,80) SQLHA Event
Recorder header data (struct sqlhaErPdInfo):
          m_pTimeStamp: N/A
          m_LogDestination: 0
          m_PdFlags: 1
          m_FunctionId: 462946353 (sqlhaVerifyProcessExists)
          m_ErrorCode: 0 = 0
          m_Probe: 748
          m_Level: 4

        Data2        (PD_TYPE_MESSAGE,46) Message String:
        Process is in EXITING state - returning ONLINE

        Data3        (PD_TYPE_PROCESS_ID,4) Process ID:
        11337922

        Data4        (PD_TYPE_STRING,9) String:
        db2sysc 0

        Data5        (PD_TYPE_UINT,8) unsigned integer:
        0

        Data6        (PD_TYPE_MESSAGE,39) Message String:
        Setting ROCM_ACTION_FLAGS_DUMP_HA_EVENT
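
The process ID reported in such an event can be extracted from the formatted text. The sketch below assumes the formatted event recorder output has the layout shown above (events introduced by a line containing "Event sequence number:" and a PD_TYPE_PROCESS_ID field); the function name exiting_pids is illustrative only.

```python
import re

PROCESS_ID = re.compile(r"\(PD_TYPE_PROCESS_ID,\d+\)\s*Process ID:\s*(\d+)")


def exiting_pids(formatted_text):
    """Return sorted PIDs of events that report the EXITING state."""
    pids = set()
    # Split per event so a PID is only taken from the event that reports it.
    for event in re.split(r"(?m)^.*Event sequence number:", formatted_text):
        if "Process is in EXITING state" not in event:
            continue
        m = PROCESS_ID.search(event)
        if m:
            pids.add(int(m.group(1)))
    return sorted(pids)


SAMPLE_EVENT = """\
7445    Event sequence number: 0      Time:
2019-05-05-13.03.26.350648433
        sqlhaVerifyProcessExists (3.115.49.0.748)

        Data2        (PD_TYPE_MESSAGE,46) Message String:
        Process is in EXITING state - returning ONLINE

        Data3        (PD_TYPE_PROCESS_ID,4) Process ID:
        11337922
"""

if __name__ == "__main__":
    print(exiting_pids(SAMPLE_EVENT))
```

The PID found this way matches the absolute value of the process group ID that receives SIGKILL in the db2diag.log excerpt (11337922).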
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* AIX pureScale users                                          *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 11.5m4fp0 or higher                           *
****************************************************************
Local Fix:
Reboot the system on which the processes that never terminate
exist, as identified by such messages in db2diag.log.
Solution
Workaround
Comment
First fixed in Db2 11.5m4fp0
Timestamps
Date  - problem reported    : 05.05.2022
Date  - problem closed      : 19.05.2022
Date  - last modified       : 19.05.2022
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)