Re: Why never finished deploying oracle data guard physical standby by rman duplicate?

From: Quanwen Zhao <quanwenzhao_at_gmail.com>
Date: Mon, 22 May 2023 12:43:17 +0800
Message-ID: <CABpiuuTS1P-EkoJE6-jw1_M_AK+tDWfe=fybYO+2BtCh2uta8A_at_mail.gmail.com>



Hi Mladen :-),

Thanks for your suggestion, I've set "vm.overcommit_memory = 1" and
"vm.hugetlb_shm_group = 501" (501 is the gid of the group the oracle user
belongs to) in /etc/sysctl.conf, and added "numa=off" to /etc/grub.conf as
well.
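
For reference, this is roughly what those two changes look like (a sketch
only; the gid 501 and the file paths are specific to my box):

# /etc/sysctl.conf
vm.overcommit_memory = 1
vm.hugetlb_shm_group = 501    # gid of the group the oracle user belongs to

# reload the sysctl settings without a reboot
sysctl -p

# /etc/grub.conf -- appended to the kernel boot line, takes effect after a reboot
numa=off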

By the way, sga_target and sga_max_size are both 300g and
pga_aggregate_target is 80g on the primary database, which together take up
roughly 75% of the 512g of physical memory; in other words, about 132g is
still left for the OS.
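
(A quick back-of-the-envelope check, assuming the 2 MB hugepage size shown
later in /proc/meminfo:)

# 2 MB hugepages needed to back a 300 GB SGA
echo $(( 300 * 1024 / 2 ))    # -> 153600
# memory left for the OS after the SGA and the PGA target
echo $(( 512 - 300 - 80 ))    # -> 132 (GB)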

Oops, SELinux turns out to be "Enforcing" in /etc/sysconfig/selinux.

[root_at_xxxxxx ~]# getenforce
Enforcing
[root_at_xxxxxx ~]#
[root_at_xxxxxx ~]# cat /etc/sysconfig/selinux

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=enforcing
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

[root_at_xxxxxx ~]#

Now I've set it to disabled. Hopefully the duplicate will run through
successfully this time (I've also increased the number of channels from 1
to 2 and lowered the per-channel rate limit from 40M to 30M).
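
Roughly what that looks like (a sketch; setenforce only drops the running
system to permissive, the config file edit takes effect after a reboot):

# turn off enforcement on the running system right away
setenforce 0
getenforce

# make it persistent across reboots
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux

# adjusted channel settings for the next attempt (inside the RMAN run {} block)
allocate channel p1 type disk maxpiecesize 16g maxopenfiles 4 rate 30M;
allocate channel p2 type disk maxpiecesize 16g maxopenfiles 4 rate 30M;
allocate auxiliary channel d1 type disk maxpiecesize 16g maxopenfiles 4 rate 30M;
allocate auxiliary channel d2 type disk maxpiecesize 16g maxopenfiles 4 rate 30M;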

Best Regards
Quanwen Zhao

On Mon, 22 May 2023 at 00:33, Mladen Gogala <gogala.mladen_at_gmail.com> wrote:

> On 5/21/23 09:40, Quanwen Zhao wrote:
>
> Hello my oracle friends :-),
>
> One of my customers wants to deploy an Oracle Data Guard physical standby
> via RMAN duplicate for a single-instance Oracle 11.2.0.4.0 database. After
> the tedious series of DG parameter changes on both the primary and the
> physical standby, I started an RMAN duplicate from the active primary to
> build the standby.
>
> The 11.2.0.4.0 database has about 2.0 TB of data files, and the hardware
> is as follows:
> Logical CPUs: 64, Physical Memory: 512 GB; disk read/write throughput is
> probably a bit of a bottleneck.
>
> First, I used the following shell script to deploy it:
>
> cat duplicate_dg.sh
>> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
>> /home/oracle/duplicate_dg_`date +%Y%m%d_%H%M%S`.log
>> run {
>> allocate channel p1 type disk;
>> allocate channel p2 type disk;
>> allocate channel p3 type disk;
>> allocate channel p4 type disk;
>> allocate channel p5 type disk;
>> allocate channel p6 type disk;
>> allocate auxiliary channel d1 type disk;
>> allocate auxiliary channel d2 type disk;
>> allocate auxiliary channel d3 type disk;
>> allocate auxiliary channel d4 type disk;
>> allocate auxiliary channel d5 type disk;
>> allocate auxiliary channel d6 type disk;
>> duplicate target database for standby nofilenamecheck from active
>> database;
>> release channel p6;
>> release channel p5;
>> release channel p4;
>> release channel p3;
>> release channel p2;
>> release channel p1;
>> release channel d6;
>> release channel d5;
>> release channel d4;
>> release channel d3;
>> release channel d2;
>> release channel d1;
>> }
>> exit;
>> EOF
>
>
> But after 3.5 hours the RMAN duplicate log file showed the following
> error:
>
> ......
>> executing command: SET NEWNAME
>>
>
>
> Starting backup at 2023-05-19 23:14:31
>> channel p1: starting datafile copy
>> input datafile file number=00041
>> name=/oracle/oradata/xxxxxx/datafile/data_1.dbf
>> channel p2: starting datafile copy
>> input datafile file number=00042
>> name=/oracle/oradata/xxxxxx/datafile/data_2.dbf
>> channel p3: starting datafile copy
>> input datafile file number=00043
>> name=/oracle/oradata/xxxxxx/datafile/data_3.dbf
>> channel p4: starting datafile copy
>> input datafile file number=00044
>> name=/oracle/oradata/xxxxxx/datafile/data_4.dbf
>> channel p5: starting datafile copy
>> input datafile file number=00045
>> name=/oracle/oradata/xxxxxx/datafile/data_5.dbf
>> channel p6: starting datafile copy
>> input datafile file number=00046
>> name=/oracle/oradata/xxxxxx/datafile/data_6.dbf
>> output file name=/oracle/oradata/xxxxxxx/datafile/data_7.dbf
>> tag=TAG20230519T231432
>> channel p1: datafile copy complete, elapsed time: 00:42:08
>> channel p1: starting datafile copy
>> input datafile file number=00007
>> name=/oracle/oradata/xxxxxx/datafile/data_7.1005681629
>> output file name=/oracle/oradata/xxxxxx/datafile/data_9.dbf
>> tag=TAG20230519T231432
>> channel p5: datafile copy complete, elapsed time: 00:42:07
>> channel p5: starting datafile copy
>> input datafile file number=00008
>> name=/oracle/oradata/xxxxxx/datafile/data_8.1005682331
>> output file name=/oracle/oradata/xxxxxx/datafile/data_10.dbf
>> tag=TAG20230519T231432
>> channel p6: datafile copy complete, elapsed time: 00:42:07
>> ......
>>
>>
>> RMAN-03009: failure of backup command on p1 channel at 05/20/2023 02:45:38
>> ORA-00603: ORACLE server session terminated by fatal error
>> ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 8262' for more than 900 seconds
>> ......
>
>
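> (A rough sketch of how the "osid 8262" reported by ORA-00239 can be mapped
> back to a database session; nothing here is specific to RMAN duplicate:)
>
>> sqlplus -S / as sysdba << 'EOF'
>> -- map the OS process id from the error text back to a session
>> select s.sid, s.serial#, s.program, s.event
>>   from v$process p, v$session s
>>  where p.spid = '8262' and s.paddr = p.addr;
>> EOF
>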
> My initial thought was to reduce the number of RMAN channels and limit
> their speed, so I changed the previous shell script as follows:
>
> cat duplicate_dg.sh
>> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
>> /home/oracle/duplicate_dg_`date +%Y%m%d-%H%M%S`.log
>> run {
>> allocate channel p1 type disk maxpiecesize 16g maxopenfiles 4 rate 40M;
>> allocate auxiliary channel d1 type disk maxpiecesize 16g maxopenfiles 4 rate 40M;
>> duplicate target database for standby nofilenamecheck from active
>> database;
>> release channel d1;
>> release channel p1;
>> }
>> exit;
>> EOF
>
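> (For watching how far the datafile copies have got while the duplicate is
> running, a rough sketch against v$session_longops on the primary:)
>
>> sqlplus -S / as sysdba << 'EOF'
>> -- rough progress of the running RMAN datafile copies
>> select sid, opname, round(sofar/totalwork*100, 1) pct_done, time_remaining
>>   from v$session_longops
>>  where opname like 'RMAN%' and totalwork > 0 and sofar < totalwork;
>> EOF
>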
>
> I re-ran that shell script and the ORA-00239 error never showed up again,
> but after 8 hours (with about 1.3 TB of data files copied) my primary
> database crashed outright. I saw the following in /var/log/messages:
>
> ......
>> May 21 16:03:19 aaaaaa kernel: [112149.846996] oracle invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
>> May 21 16:03:19 aaaaaa kernel: [112149.847013] oracle cpuset=/
>> mems_allowed=0
>> May 21 16:03:19 aaaaaa kernel: [112149.847036] CPU: 37 PID: 48996 Comm:
>> oracle Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
>> May 21 16:03:19 aaaaaa kernel: [112149.847038] Hardware name: Xen HVM
>> domU, BIOS 4.7.4-1.16 07/13/2018
>> May 21 16:03:19 aaaaaa kernel: [112149.847040] 0000000000000000
>> ffff88011c157528 ffffffff816c6e40 ffff887dd8bb3800
>> May 21 16:03:19 aaaaaa kernel: [112149.847043] 0000000000000000
>> ffff88011c157578 ffffffff8118c15e ffff880100000000
>> May 21 16:03:19 aaaaaa kernel: [112149.847046] ffffffff000200da
>> ffff880148932a00 ffff880d0e7c3800 ffff880d0e7c42b0
>> May 21 16:03:19 aaaaaa kernel: [112149.847048] Call Trace:
>> May 21 16:03:19 aaaaaa kernel: [112149.847057] [<ffffffff816c6e40>]
>> dump_stack+0x63/0x83
>> May 21 16:03:19 aaaaaa kernel: [112149.847062] [<ffffffff8118c15e>]
>> dump_header+0x8e/0xe0
>> May 21 16:03:19 aaaaaa kernel: [112149.847065] [<ffffffff8118c767>]
>> oom_kill_process+0x1d7/0x3c0
>> May 21 16:03:19 aaaaaa kernel: [112149.847071] [<ffffffff812a8485>] ?
>> security_capable_noaudit+0x15/0x20
>> May 21 16:03:19 aaaaaa kernel: [112149.847083] [<ffffffff8108d707>] ?
>> has_capability_noaudit+0x17/0x20
>> May 21 16:03:19 aaaaaa kernel: [112149.847089] [<ffffffff8118cc18>]
>> __out_of_memory+0x2c8/0x370
>> May 21 16:03:19 aaaaaa kernel: [112149.847094] [<ffffffff8118ce09>]
>> out_of_memory+0x69/0x90
>> May 21 16:03:19 aaaaaa kernel: [112149.847099] [<ffffffff8119209f>]
>> __alloc_pages_slowpath+0x6af/0x760
>> May 21 16:03:19 aaaaaa kernel: [112149.847102] [<ffffffff81192401>]
>> __alloc_pages_nodemask+0x2b1/0x2d0
>> May 21 16:03:19 aaaaaa kernel: [112149.847105] [<ffffffff810b61c8>] ?
>> sched_clock_cpu+0xa8/0xc0
>> May 21 16:03:19 aaaaaa kernel: [112149.847109] [<ffffffff811de104>]
>> alloc_pages_vma+0xd4/0x230
>> May 21 16:03:19 aaaaaa kernel: [112149.847113] [<ffffffff811cceed>]
>> read_swap_cache_async+0xfd/0x160
>> May 21 16:03:19 aaaaaa kernel: [112149.847115] [<ffffffff811cd05e>]
>> swapin_readahead+0x10e/0x1c0
>> May 21 16:03:19 aaaaaa kernel: [112149.847118] [<ffffffff811a148e>]
>> shmem_swapin+0x5e/0x90
>> May 21 16:03:19 aaaaaa kernel: [112149.847121] [<ffffffff816c6f3d>] ?
>> io_schedule_timeout+0xdd/0x110
>> May 21 16:03:19 aaaaaa kernel: [112149.847124] [<ffffffff811fad5e>] ?
>> swap_cgroup_record+0x4e/0x60
>> May 21 16:03:19 aaaaaa kernel: [112149.847127] [<ffffffff8131c703>] ?
>> radix_tree_lookup_slot+0x13/0x30
>> May 21 16:03:19 aaaaaa kernel: [112149.847129] [<ffffffff81187d6e>] ?
>> find_get_entry+0x1e/0xa0
>> May 21 16:03:19 aaaaaa kernel: [112149.847132] [<ffffffff81189788>] ?
>> pagecache_get_page+0x38/0x1c0
>> May 21 16:03:19 aaaaaa kernel: [112149.847135] [<ffffffff811a47a0>]
>> shmem_getpage_gfp+0x540/0x820
>> May 21 16:03:19 aaaaaa kernel: [112149.847137] [<ffffffff811a54ba>]
>> shmem_fault+0x6a/0x1c0
>> May 21 16:03:19 aaaaaa kernel: [112149.847141] [<ffffffff8129a0ae>]
>> shm_fault+0x1e/0x20
>> May 21 16:03:19 aaaaaa kernel: [112149.847144] [<ffffffff811b877d>]
>> __do_fault+0x3d/0xa0
>> May 21 16:03:19 aaaaaa kernel: [112149.847149] [<ffffffff810ebfd7>] ?
>> current_fs_time+0x27/0x30
>> May 21 16:03:19 aaaaaa kernel: [112149.847153] [<ffffffff810c8783>] ?
>> __wake_up+0x53/0x70
>> May 21 16:03:19 aaaaaa kernel: [112149.847155] [<ffffffff811b89c5>]
>> do_read_fault+0x1e5/0x300
>> May 21 16:03:19 aaaaaa kernel: [112149.847157] [<ffffffff811b8cfc>] ?
>> do_shared_fault+0x19c/0x1d0
>> May 21 16:03:19 aaaaaa kernel: [112149.847159] [<ffffffff811bc703>]
>> handle_pte_fault+0x1e3/0x230
>> May 21 16:03:19 aaaaaa kernel: [112149.847163] [<ffffffff810738a0>] ?
>> pte_alloc_one+0x30/0x50
>> May 21 16:03:19 aaaaaa kernel: [112149.847165] [<ffffffff811b7e27>] ?
>> __pte_alloc+0xd7/0x190
>> May 21 16:03:19 aaaaaa kernel: [112149.847167] [<ffffffff811bc90c>]
>> __handle_mm_fault+0x1bc/0x330
>> May 21 16:03:19 aaaaaa kernel: [112149.847169] [<ffffffff811bcb32>]
>> handle_mm_fault+0xb2/0x1a0
>> May 21 16:03:19 aaaaaa kernel: [112149.847171] [<ffffffff8106ddf3>] ?
>> __do_page_fault+0xe3/0x480
>> May 21 16:03:19 aaaaaa kernel: [112149.847173] [<ffffffff8106de7c>]
>> __do_page_fault+0x16c/0x480
>> May 21 16:03:19 aaaaaa kernel: [112149.847175] [<ffffffff8106e337>]
>> do_page_fault+0x37/0x90
>> May 21 16:03:19 aaaaaa kernel: [112149.847177] [<ffffffff816c7b9a>] ?
>> schedule_user+0x1a/0x60
>> May 21 16:03:19 aaaaaa kernel: [112149.847181] [<ffffffff816cdb18>]
>> page_fault+0x28/0x30
>> May 21 16:03:19 aaaaaa kernel: [112149.847183] Mem-Info:
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] active_anon:18878013
>> inactive_anon:30402260 isolated_anon:576
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] active_file:4697
>> inactive_file:4559 isolated_file:613
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] unevictable:0 dirty:0
>> writeback:0 unstable:0
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] slab_reclaimable:242933
>> slab_unreclaimable:49152
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] mapped:16529750
>> shmem:47921098 pagetables:2963894 bounce:0
>> May 21 16:03:19 aaaaaa kernel: [112149.847193] free:532442 free_pcp:223
>> free_cma:1
>> May 21 16:03:19 aaaaaa kernel: [112149.847197] Node 0 DMA free:15828kB
>> min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_
>> file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
>> isolated(file):0kB present:15988kB managed:15900kB mlocked:0kB dirty:0kB
>> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
>> slab_unreclaimable:68kB kernel_stack:0kB pagetables:0kB unstable:0kB
>> bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB
>> pages_scanned:0 all_unreclaimable? yes
>> May 21 16:03:19 aaaaaa kernel: [112149.847202] lowmem_reserve[]: 0 3455
>> 515685 515685
>> May 21 16:03:19 aaaaaa kernel: [112149.847206] Node 0 DMA32
>> free:2049148kB min:436kB low:544kB high:652kB active_anon:0kB
>> inactive_anon:28kB active_file:0kB inactive_file:0kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3540104kB
>> mlocked:0kB dirty:0kB writeback:0kB mapped:12kB shmem:8kB
>> slab_reclaimable:8240kB slab_unreclaimable:1700kB kernel_stack:32kB
>> pagetables:248kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
>> free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? yes
>> May 21 16:03:19 aaaaaa kernel: [112149.847210] lowmem_reserve[]: 0 0
>> 512229 512229
>> May 21 16:03:19 aaaaaa kernel: [112149.847214] Node 0 Normal free:64792kB
>> min:65092kB low:81364kB high:97636kB active_anon:75512308kB
>> inactive_anon:121608500kB active_file:18788kB inactive_file:18236kB
>> unevictable:0kB isolated(anon):2304kB isolated(file):2452kB
>> present:532930560kB managed:524523364kB mlocked:0kB dirty:0kB writeback:0kB
>> mapped:66118988kB shmem:191684384kB slab_reclaimable:963492kB
>> slab_unreclaimable:194840kB kernel_stack:19120kB pagetables:11855328kB
>> unstable:0kB bounce:0kB free_pcp:892kB local_pcp:0kB free_cma:4kB
>> writeback_tmp:0kB pages_scanned:312092 all_unreclaimable? yes
>> May 21 16:03:19 aaaaaa kernel: [112149.847218] lowmem_reserve[]: 0 0 0 0
>> May 21 16:03:19 aaaaaa kernel: [112149.847221] Node 0 DMA: 1*4kB (U)
>> 2*8kB (U) 2*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB
>> 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15828kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847233] Node 0 DMA32: 26*4kB (UM)
>> 25*8kB (UM) 33*16kB (UEM) 40*32kB (UEM) 17*64kB (UEM) 4*128kB (EM) 6*256kB
>> (UEM) 2*512kB (M) 5*1024kB (UEM) 3*2048kB (UMR) 496*4096kB (M) = 2049152kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847246] Node 0 Normal: 16336*4kB
>> (UMC) 81*8kB (UM) 11*16kB (UR) 9*32kB (UR) 1*64kB (R) 0*128kB 2*256kB (R)
>> 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 68568kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847258] Node 0 hugepages_total=0
>> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847259] Node 0
>> hugepages_total=153641 hugepages_free=153641 hugepages_surp=0
>> hugepages_size=2048kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847261] 48948485 total pagecache
>> pages
>> May 21 16:03:19 aaaaaa kernel: [112149.847262] 1018682 pages in swap cache
>> May 21 16:03:19 aaaaaa kernel: [112149.847264] Swap cache stats: add
>> 32771424, delete 31752742, find 10338170/12219018
>> May 21 16:03:19 aaaaaa kernel: [112149.847265] Free swap = 0kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847266] Total swap = 16498684kB
>> May 21 16:03:19 aaaaaa kernel: [112149.847267] 134215581 pages RAM
>> May 21 16:03:19 aaaaaa kernel: [112149.847268] 0 pages HighMem/MovableOnly
>> May 21 16:03:19 aaaaaa kernel: [112149.847269] 2191643 pages reserved
>> May 21 16:03:19 aaaaaa kernel: [112149.847270] 4096 pages cma reserved
>> May 21 16:03:19 aaaaaa kernel: [112149.847271] 0 pages hwpoisoned
>> ......
>
>
> Oh, my god!!! You know, I've already configured HugePages on the OS and
> disabled THP (Transparent Huge Pages).
>
>> [root_at_aaaaaa ~]# cat /proc/meminfo | grep Huge
>> AnonHugePages:         0 kB
>> HugePages_Total:   153641
>> HugePages_Free:     73222
>> HugePages_Rsvd:     73182
>> HugePages_Surp:         0
>> Hugepagesize:       2048 kB
>> [root_at_aaaaaa ~]#
>> [root_at_aaaaaa ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>> [root_at_aaaaaa ~]#
>
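> (One more thing worth double-checking, a sketch only: whether the SGA
> really went into the hugepage pool at instance startup. The diag path is
> just an assumption about my layout, and use_large_pages=only makes the
> instance refuse to start unless the whole SGA fits in hugepages:)
>
>> # 11.2.0.4 writes a "Large Pages Information" section to the alert log at startup
>> grep -i -A 10 "large pages information" $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log
>>
>> sqlplus -S / as sysdba << 'EOF'
>> alter system set use_large_pages = only scope = spfile;
>> EOF
>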
>
> As for how to avoid the out-of-memory killer: most articles suggest either
> setting "vm.lower_zone_protection = 250" in /etc/sysctl.conf on x86_64
> systems or installing the hugemem kernel rpm package on x86 systems.
>
> Could you help me troubleshoot this puzzling issue? Thanks in advance!
>
> Best Regards
> Quanwen Zhao
>
> You can turn off the out-of-memory murderer by setting vm.overcommit_memory
> = 1 (echo 1 > /proc/sys/vm/overcommit_memory). If you have enough memory
> and swap, some pages will be swapped out; if you don't, the machine will
> crash. In that case you'll have to decrease your SGA and leave more memory
> for Linux to manage.
>
> Regards
>
> --
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217
> https://dbwhisperer.wordpress.com
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon May 22 2023 - 06:43:17 CEST
