Chains 'stuck in the past'

From: Ruan Linehan <ruandav_at_gmail.com>
Date: Fri, 12 Jul 2019 23:20:25 +0100
Message-ID: <CAP0kZ-0DavbDR+d8_OYjXFfH_9bEAEuC8CHv+z9W7-D8pbScSg_at_mail.gmail.com>



All, I am hoping someone can assist me in troubleshooting our CJQ coordinator background process.
Multitenant environment with ~200 PDBs on Exadata Oracle RAC (3 x node) - 12.1.0.2 EE with job_queue_processes = 100.

An issue cropped up a few months ago whereby arbitrary jobs (Chains) across PDB's would cease starting/running.
To be clear, the job (chains) are NOT just being delayed. They do not eventually execute after minutes, hours, days, weeks. The issue is cropping up for random job chains across different pluggable at completely varying times; we have chains that run every 10 minutes, every hour, daily, weekly etc; and many different PDB's will run the same chain code, but the problem will only show up in a subset of (Varying) PDBs. Its scarily inconsistent thus far.

The symptoms are that a job chain will run successfully for a period of time (Possibly weeks) but eventually, at completely inconsistent timeframes, without any change to the chain and for random PDB's, the chain will complete a run and reschedule for the 'next_run_date'. i.e. state=SCHEDULED. The job chain however then never again invokes so becomes "stuck in the past" so to speak, with a run date which has been bypassed.

For an example, the below hourly chain was scheduled for 03:30 on July 10th but just never kicked off. This is what we observe... SQL> list
  1* select owner, state, enabled, job_name, job_weight, job_priority, last_start_date, next_run_date, job_type, job_style from dba_scheduler_jobs SQL> /

OWNER     STATE     ENABL JOB_NAME  JOB_WEIGHT JOB_PRIORITY LAST_START_DATE
  NEXT_RUN_DATE    JOB_TYPE     JOB_STYLE
------------ --------------- ----- ------------------------------
---------- ------------ ----------------------------------------
------------------ -----------------
APP1     SCHEDULED     TRUE  JUSTATESTCHAIN   1 3 10-JUL-19 02.30 AM
PST8PDT   10-JUL-19 03.30 AM PST8PDT     CHAIN      REGULAR

What I know...
It is happening for completely different chains, with varying frequencies and with multiples of programs or very little programs. No discernible pattern here.
When a chain completes (But has the problem), there is nothing still
'running' from the perspective of

dba_scheduler_running_chains/dba_scheduler_running_jobs and it sets the next_run_date as expected so the last run of the chain looks to have fully completed.
Recreating the chain, program, steps and jobs (With the very same creation code) causes the chain to get created from scratch and it will run again for a period of time, and many iterations. This is what I have been doing over the past few months.
Once the job chains get "stuck in the past" - they never eventually run - I have left chains sit in this state for over 60 days and they do not eventually get picked up.
This leads me to believe that it is unlikely to be a resource capacity issue that is causing this, i.e. with job_queue_processes or slaves not getting an opportunity to pick the job up and initiate it, or I assume eventually, they would get picked up and run. Manipulating job chain weight or priority does not appear to have an impact.
I have turned on event tracing via event 27402 but nothing has stood out in terms of the trace file.
'Normal' regular type jobs are not being impacted notably, just job chains!
Within a PDB, when a "stuck in the past" chain occurs, other jobs and chains continue to invoke and run as usual in the PDB.

Here is the interesting bit though, for a chain that gets stuck like this... If I force that job chain to run right now, i.e. exec dbms_scheduler.run_job('JUSTATESTCHAIN');, then the chain runs but what is quite noticeable is that the REQ_START_DATE of that run from within DBA_SCHEDULER_JOB_RUN_DETAILS coincides to the originally defined run date.

i.e. for the JUSTATESTCHAIN job above, it would show REQ_START_DATE as being "10-JUL-19 03.30 AM". So I wonder, if the metadata of the job looks to be recording the real attempted submission request time of the job chain to run at a point in time, but that the CJQ is just not picking it up or allowing it to go?

This is annoying me with months and have an SR open and ongoing, little assistance coming through though.
Would anyone have any insight into examining how or tracing how the CJQ picks up from its queue of jobs to invoke?

Any and all help appreciated.
Regards,
Ruan

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Jul 13 2019 - 00:20:25 CEST

Original text of this message