Re: Why has the Oracle RMAN full backup not finished on the last two Sundays?

From: Tanel Poder <tanel_at_tanelpoder.com>
Date: Mon, 28 Nov 2022 14:19:23 -0500
Message-ID: <CAMHX9J+G09u-KFbsSnCzsnO+cWX7wSdWvNL2KCorqci2sXgdNg_at_mail.gmail.com>



Hello,

(For people who don't follow twitter - there's a conversation about this topic there too https://twitter.com/quanwen_zhao/status/1596850332910637056 ).

It looks to me like the large backup causes very long delays in your I/O subsystem. By I/O subsystem I mean most likely something happening in your storage array, or less likely at the Linux kernel level.

The likely chain of events:

  1. Large amounts of writes due to the full backup cause some (or all) I/Os to take a long time
  2. Many threads end up in D state in Linux (causing the load to go up - Linux system load includes threads in synchronous disk I/O wait states <https://tanelpoder.com/posts/high-system-load-low-cpu-utilization-on-linux/> too)
  3. The backup not only reads/writes large data/backup files, but also does lots of controlfile updates
  4. Controlfile I/O, both reads and writes, is a serialized bottleneck (no concurrent reads when any writes are happening)
  5. Control file enqueue (enq: CF) protects/serializes access to the controlfile
  6. When a controlfile write/read is slow, the reader/writer holds the controlfile enqueue for a long time (see the ASH sketch after this list)
  7. No one else can read/write the controlfile if an incompatible operation (read->write or write->read) is going on
  8. This means that even LGWR can't do a log file switch, etc - all kinds of weird things will start (or stop) happening
  9. To prevent hangs, controlfile enqueue gets (requests) can time out (if that happens to LGWR, your instance may crash, depending on the DB version).
  10. If the controlfile enqueue holder is not a critical/fatal background process, the holder can be killed by whoever else waits for the enqueue and times out (there are the *_controlfile_enqueue_timeout* and *_kill_controlfile_enqueue_blocker* parameters, for example)
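
If you want to confirm this from the database side, here's a rough sketch (it assumes SYSDBA access and a Diagnostics Pack license for ASH, and the 24h window is just an example - narrow it down to your backup hours):

sqlplus -s / as sysdba <<'EOF'
SET LINESIZE 200 PAGESIZE 100
COLUMN event FORMAT A40
COLUMN program FORMAT A40
-- controlfile enqueue + controlfile I/O waits sampled around the backup window
SELECT event,
       program,
       COUNT(*) AS samples   -- each ASH sample is roughly 1 second of wait time
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 1
AND    (event LIKE 'enq: CF%' OR event LIKE 'control file%')
GROUP  BY event, program
ORDER  BY samples DESC;
EOF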

It is possible that someone just kills your RMAN job as it's holding the controlfile enqueue for too long when the storage stops responding fast enough.
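
A quick way to check whether that's what happened is to grep the alert log for controlfile enqueue timeouts (just a sketch - the alert log path below is an assumption, adjust it to your diagnostic_dest):

# ORA-00494 is the "enqueue held for too long" error, and the instance also
# logs a message when it kills an enqueue blocker
grep -iE 'ORA-00494|enqueue blocker' \
     /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log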

You need to monitor your storage I/O completion latencies and throughput:

  1. iostat
  2. biolatency (if you have bcc-tools and your kernel supports eBPF - which the stock RHEL 6.9 kernel doesn't, I think)
  3. In that case you could use blktrace & blkiomon <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/ch06s03> (blktrace has been around for a long time)
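
For the pre-eBPF route on RHEL 6.x, something like the example from the guide linked above should do (a sketch - /dev/sdb is just a placeholder, point it at the device(s) backing your datafiles and backup destination):

# trace block I/O on one device for 10 minutes, summarizing latency every 10 seconds
blktrace /dev/sdb -a issue -a complete -w 600 -o - | blkiomon -I 10 -h -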

Does the I/O latency go through the roof, and does the number of I/Os completed (r/s, w/s) drop, during the backup?
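
The simplest option is to just leave iostat running through the backup window, for example:

# extended per-device stats (-x), in MB (-m), with timestamps (-t), every 5 seconds -
# watch await/svctm (latency) together with r/s and w/s (completed I/Os)
iostat -xmt 5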

You need to either improve the storage performance or throttle your backup I/O rate (for example, by reducing the number of RMAN channels or by using heavier compression so you'd be bottlenecked more by CPU...)
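
A sketch of what the throttling could look like on the RMAN side (the channel count and RATE value are just examples, and BASIC compression is the algorithm that doesn't need an extra license on 11.2):

rman target / <<'EOF'
CONFIGURE COMPRESSION ALGORITHM 'BASIC';
RUN {
  # two channels instead of many, each capped at reading 100 MB/s
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK RATE 100M;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK RATE 100M;
  BACKUP AS COMPRESSED BACKUPSET DATABASE;
}
EOF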

--
Tanel Poder
P.S. I just announced the Cyber Week sale for my training course videos:
https://tanelpoder.com/posts/cyber-week-sale-videos-only-classes-discounted/


On Sun, Nov 27, 2022 at 7:44 AM Quanwen Zhao <quanwenzhao_at_gmail.com> wrote:


> Hello my oracle friends :-),
>
> My client has an Oracle database 11.2.0.4.0 (the OS is RedHat 6.9) that is
> deployed on a virtual machine server. This issue is very weird. The Oracle
> RMAN full backup has not finished on the last two Sundays (while the
> incremental RMAN backup is fine and needs less than 1 hour, or even just 10
> minutes, to finish).
> The RMAN full backup is approximately 1.2 TB and it used to take about 8
> hours to finish each Sunday (the crontab job starts at 01:00 every day).
>
> But the RMAN full backup did not finish on either Nov 20 or Nov 27, 2022.
>
> Here is some data I've captured.
>
> tbm-oracle1:/home/oracle$w
>> 07:36:21 up 1 day, 22:39, 5 users, load average: *66.10, 66.04, 65.63*
>> USER     TTY      FROM            LOGIN@   IDLE   JCPU   PCPU WHAT
>> root     pts/2    192.168.254.6   07:11   23:30  0.00s  0.00s -bash
>> root     pts/3    192.168.254.6   07:13   23:01  0.01s  0.01s -bash
>> root     pts/4    192.168.254.6   07:17    3.00s  0.03s  0.01s w
>
>
> tbm-oracle1:/home/oracle/$free -g
>>                    total       used       free     shared    buffers     cached
>> Mem:                503        266        237        256          0        257
>> -/+ buffers/cache:              8          495
>> Swap:                15          0         15
>>
>
>
>>
-- http://www.freelists.org/webpage/oracle-l