RE: Scheduler Jobs are not distributed according to OS-load on RAC nodes

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Wed, 12 Dec 2018 16:07:23 -0500
Message-ID: <018401d4925e$c5dea170$519be450$_at_rsiz.com>



BECAUSE YOU DO THIS:
> Every hour we are scheduling a lot of one-time jobs to run a lot of data loads. The jobs are scheduled by a master which takes care of dependencies - so a job is only scheduled when all its dependencies are met, and it should run as soon as resources (job processes) are available. (No dependencies are defined in the dbms-scheduler framework.)
> The jobs use a JOB_CLASS which has a dedicated SERVICE - this SERVICE is available on all 4 instances. Stop&Start of the service on the "idle" instance does not help.
 

you have an opportunity to work around what appears to me to be a bug. Here is my proposed flyswatter:  

Create four SERVICES, each with a different node as primary and the other three as secondary, so that the primary rotates round robin across the four service definitions.  

IF round robin is good enough, just do that by rotating the service name. Unless the primary is down, it should get the job, but you’ve lost no redundancy because the others are listed as secondary. (IF the primary doesn’t get the job when it IS up, that is a serious bug.)  
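
A minimal sketch of the rotation, in Python, for whatever master process releases the jobs. The service names (LOAD_SVC_1 to LOAD_SVC_4) and the function name are hypothetical, not taken from the original setup. Since the jobs reach a service through their JOB_CLASS, this would presumably mean one job class per service; the sketch only shows the rotation itself.

```python
from itertools import cycle

# Hypothetical service names: one per node, each with a different primary instance.
SERVICES = ["LOAD_SVC_1", "LOAD_SVC_2", "LOAD_SVC_3", "LOAD_SVC_4"]


def assign_services(job_names, services=SERVICES):
    """Pair each job with the next service name in a round-robin rotation."""
    rotation = cycle(services)  # endless iterator over the service names
    return [(job, next(rotation)) for job in job_names]


if __name__ == "__main__":
    # Five jobs over four services: the fifth wraps around to the first service.
    for job, service in assign_services(["J1", "J2", "J3", "J4", "J5"]):
        print(job, "->", service)
```

The master would then submit each job under the job class that maps to the chosen service.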

IF round robin is NOT good enough (which can easily be true if a few longer running jobs get allocated to one node), then, just before you schedule each job, ping a daemon that you have running on each node to report its rolling-window average CPU busy, and allocate the next job to the node where a) the ping answers and b) the CPU busy is lowest.  
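
A rough sketch of that selection logic in Python. How the daemon is pinged is left out; the sketch assumes (hypothetically) that the master ends up with a mapping from node name to the reported rolling average, with None standing for a node whose ping did not answer. The class and function names are illustrative only.

```python
from collections import deque
from typing import Dict, Optional


class RollingCpu:
    """Rolling-window average of CPU-busy samples (percent), as a per-node daemon might keep."""

    def __init__(self, window: int = 5):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def add(self, busy_pct: float) -> None:
        self.samples.append(busy_pct)

    def average(self) -> Optional[float]:
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


def pick_node(reports: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the answering node with the lowest CPU busy, or None if no node answered.

    reports maps node name -> rolling average CPU busy; None means the ping failed.
    """
    alive = {node: busy for node, busy in reports.items() if busy is not None}
    if not alive:
        return None
    return min(alive, key=alive.get)


if __name__ == "__main__":
    reports = {"node1": 80.0, "node2": None, "node3": 35.5, "node4": 60.0}
    print(pick_node(reports))
```

If no node answers, the caller gets None and can fall back to plain round robin or hold the job.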

Now that is a bit of programming work you shouldn’t have to do, but it is a way to manage round robin and/or cpu load balanced job allocation, given that you are managing releasing these jobs already.  

mwf  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Martin Berger
Sent: Wednesday, December 12, 2018 2:25 PM
To: Mladen Gogala
Cc: Niall Litchfield; Oracle-L oracle-l
Subject: Re: Scheduler Jobs are not distributed according to OS-load on RAC nodes  

Thank you Mladen & Niall,  

I set the service's CLB goal to SHORT, then restarted the service on one instance that refuses to run jobs; it didn't help. Setting job_queue_processes to 0 (scope=memory) and then back to its original value didn't help either.  

If I find anything additional or useful, I'll share it.  

Martin  

On Wed, Dec 12, 2018 at 14:34, Mladen Gogala <gogala.mladen_at_gmail.com> wrote:

You may want to try setting CLB to short

Mladen Gogala  

On Tue, Dec 11, 2018, 4:27 AM Martin Berger <martin.a.berger_at_gmail.com> wrote:

Hi Niall,  

the service has these properties:  

Service name: OUR_SERVICE_NAME
Server pool:
Cardinality: 4
Service role: PRIMARY
Management policy: AUTOMATIC
DTP transaction: false
AQ HA notifications: false
Global: false
Commit Outcome: false
Failover type:
Failover method:
TAF failover retries:
TAF failover delay:
Failover restore: NONE
Connection Load Balancing Goal: LONG
Runtime Load Balancing Goal: NONE
TAF policy specification: NONE
Edition:
Pluggable database name:
Maximum lag time: ANY
SQL Translation Profile:
Retention: 86400 seconds
Replay Initiation Time: 300 seconds
Drain timeout:
Stop option:
Session State Consistency:
GSM Flags: 0
Service is enabled
Preferred instances: INST1,INST2,INST3,INST4
Available instances:
CSS critical: no

It's worth mentioning: the connections through which the jobs are scheduled come from another DB via a DB link.  

thank you,

 Martin    

On Tue, Dec 11, 2018 at 09:47, <niall.litchfield_at_gmail.com> wrote:

Hi Martin

What are the load balancing properties of the service set to?

On Tue, Dec 11, 2018 at 8:45 AM Martin Berger <martin.a.berger_at_gmail.com> wrote:
>
> Hi List,
>
> I have a strange situation with a 4-node RAC - 12.2 (July 2018) Oracle Linux 6.10:
>
> After some time, one (or several) instances stop executing jobs.
>
> Every hour we are scheduling a lot of one-time jobs to run a lot of data loads. The jobs are scheduled by a master which takes care of dependencies - so a job is only scheduled when all its dependencies are met, and it should run as soon as resources (job processes) are available. (No dependencies are defined in the dbms-scheduler framework.)
> The jobs use a JOB_CLASS which has a dedicated SERVICE - this SERVICE is available on all 4 instances. Stop&Start of the service on the "idle" instance does not help.
> NTP is fine according to cluvfy comp clocksync -n all .
> instance_stickiness is TRUE (the default) - but I don't think this will change anything as our jobs run one-time only.
>
> Does anyone know how to identify why some instances sometimes refuse to run scheduled jobs?
> Who makes this decision, and can it be traced somehow to identify which numbers the decision is based on?
> Any other suggestions?
>
> An SR with MOS is open, but without any progress.
>
> related documents found so far:
>
> DBMS_SCHEDULER job doesn't fail-over across RAC instance ( Doc ID 2365434.1 )
> RAC Node X Is Seeing A Higher Session Load Than The Other Nodes For Scheduler Jobs ( Doc ID 1602581.1 )
> ENH 28592547 - REAL-TIME LOAD BALANCING FOR JOBS ACROSS RAC INSTANCES
>
> --
> Martin Berger Oracle ♠
> martin.a.berger_at_gmail.com _at_martinberx
> ^∆x http://berxblog.blogspot.com
>

-- 
Niall Litchfield
Oracle DBA
http://www.orawin.info



--
http://www.freelists.org/webpage/oracle-l
Received on Wed Dec 12 2018 - 22:07:23 CET
