Re: Troubleshooting log shipping

From: Nassyam Basha <nassyambasha_at_gmail.com>
Date: Sat, 9 Jan 2016 20:22:12 +0530
Message-ID: <CAABGLuKmv8Z6fYsJqTSwWDU1vGS9v3fBC1MK+ahfeqS292Z7RQ_at_mail.gmail.com>



Hello Upendra,

Real-time apply is always worth considering and will not add overhead to the primary/production database. I assume you are in maximum performance mode, and in that case the primary never waits for the standby, regardless of whether the redo has been transferred or written into the standby redo log files. I would say it is a must-implement.
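
If it is not already enabled, this is roughly how to switch the standby over to real-time apply (a minimal sketch; it assumes standby redo logs are already in place, which the SRLs column of your dest status output suggests):

Standby> alter database recover managed standby database cancel;
Standby> alter database recover managed standby database using current logfile disconnect from session;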

As Tony said, changing the redo log size is not going to make any difference: if, for example, you have a redo log size of 1 GB, it is still split into small chunks (8192 bytes by default) for transfer over the network. In your earlier reply you mentioned there is no issue with bandwidth, but have you tested the required bandwidth?
The required bandwidth can be calculated with the standard formula based on the redo generated per second. You can refer to the links below for more information.
http://www.oracle.com/au/products/database/maa-wp-10gr2-dataguardnetworkbestpr-134557.pdf
Optimizing and Tuning Network -->
http://www.toadworld.com/platforms/oracle/w/wiki/11366.let-s-optimize-data-guard-configuration-ii
With ping or tnsping you may not always get consistent figures; I would say measure the actual bandwidth and then tune the network, for example with a larger SDU size.
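
As a rough sketch, run from the primary, of estimating the redo rate that feeds that formula (the dest_id value is an assumption; the 0.7 in the comment is the commonly quoted 30% network-overhead allowance):

-- redo generation in bytes per second, grouped by hour, local archive destination only
select to_char(completion_time,'YYYY-MM-DD HH24') "hour",
       round(sum(blocks*block_size)/3600) redo_bytes_per_sec
from v$archived_log
where dest_id = 1
group by to_char(completion_time,'YYYY-MM-DD HH24')
order by 1;
-- Required bandwidth (Mbps) ~= ((redo_bytes_per_sec / 0.7) * 8) / 1,000,000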

Thanks.

On Sat, Jan 9, 2016 at 7:27 PM, Upendra nerilla <nupendra_at_hotmail.com> wrote:

> I just tested using ping. Neither side shows any packet drops, and both
> sides show the same latency.
> Is there a way to set an oradebug event/debug to get additional information
> on why the archivelog is not being shipped to the other side? How should I
> troubleshoot this further?
>
> Real-time apply sounds interesting. I am concerned about adding the
> dependency of the DR site to Prod.
>
> -Upendra
>
> ------------------------------
> Date: Sat, 9 Jan 2016 13:19:57 +1000
> From: dedba_at_tpg.com.au
> To: nupendra_at_hotmail.com; nassyambasha_at_gmail.com
> CC: oracle-l_at_freelists.org
>
> Subject: Re: Troubleshooting log shipping
>
> I agree with Nassyam: the errors clearly point to network problems between
> the production and standby sites. I have had the same on various occasions,
> especially when production traffic and log shipping used the same network
> devices. Congestion can occur at various levels of the OSI stack on the
> servers involved, not just on the network devices.
>
> The solution is not to increase the file size, as that only increases the
> network traffic per copy, but to enable real-time apply as suggested. When
> real-time apply is enabled, the production log writer will write
> transactions to the standby log directly, in tandem with the local redo
> log. When the log is archived, a complete copy already exists on the DG
> server and no further network copy is needed. In addition, changes are
> applied to the standby database as they occur, so activity on that server
> is steady and does not rise and fall as would be the case with log
> shipping. This does not increase the load on the production server, but you
> must take care that the production database will not hang if for any reason
> the log writer cannot write to the standby log.
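>
> For what it's worth, a quick way to check where things currently stand (a
> minimal sketch; the dest_id values are assumptions based on the output
> posted earlier in the thread):
>
> Standby> select group#, thread#, bytes/1024/1024 mb, status from v$standby_log;
> Primary> select dest_id, recovery_mode from v$archive_dest_status where dest_id in (2,3);
>
> RECOVERY_MODE shows 'MANAGED REAL TIME APPLY' once real-time apply is
> enabled, and plain 'MANAGED' while recovery is applying archived logs only.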
>
> It also does no harm to check that the network configuration is
> sane, e.g. that a ping between the two servers does not show any lost
> packets. Usually when this happens, a server is set up to use two or more
> NICs in tandem (i.e. teamed/bonded/etc.) but the switches they are
> connected to are not set up to support this. E.g. with 2 teamed/bonded
> NICs, you'd see about 50% packet loss if the switches are not configured
> correctly.
>
> Cheers,
> Tony
>
> On 09/01/16 12:49, Upendra nerilla wrote:
>
> I reviewed the network throughput and we aren't saturated from the network
> pipe perspective. We do have Riverbed network gear which is compressing and
> optimizing the archive transport. I am trying to determine where the
> bottleneck is.
>
> Since the logs aren't getting shipped to the DR site, I wanted to get that
> resolved.
> BTW, I increased the redolog size in Prod from 1G to 2G, but I still see
> missing archivelogs at the DR site.
>
> Real-time apply is not really a requirement; I thought it had more
> overhead for the system compared to archive shipping. With the amount of
> data I am pushing, would it help?
> I checked the SDU size; we are using the default. How should I determine
> whether the SDU needs to be increased?
> I haven't done anything with the chunk size; any suggestions on where I
> should start?
>
> Thanks
> -Upendra
>
> ------------------------------
> Subject: Re: Troubleshooting log shipping
> From: nassyambasha_at_gmail.com
> Date: Sat, 9 Jan 2016 05:44:08 +0530
> CC: oracle-l_at_freelists.org
> To: nupendra_at_hotmail.com
>
> Okay, this information makes it clear to me that Data Guard is struggling
> with the network speed/bandwidth.
> One simple question: why not real-time apply? Then there is no need to wait
> until the archive log is archived and transmitted.
>
> If it is already implemented, then you should consider optimizing the
> network with a larger SDU (chunk) size.
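>
> For reference, a minimal sketch of setting a larger SDU in the connect
> descriptor used by log_archive_dest_3 (the 32767 value and host name are
> placeholders; the same SDU should also be set on the standby listener):
>
> SID_DR =
>   (DESCRIPTION =
>     (SDU = 32767)
>     (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1521))
>     (CONNECT_DATA = (SERVICE_NAME = SID_DR))
>   )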
>
> Sent from my iPhone
>
> On 08-Jan-2016, at 11:47 PM, Upendra nerilla <nupendra_at_hotmail.com> wrote:
>
> Hi Nassyam et al -
>
> Thanks for your suggestion to set the archive_dest_state to defer and then
> enable. That cleared the error. I don't see any errors now.
>
> Looking at v$archived_log, I see that we are behind by a few hundred logs.
> When I check those files with your second query, I do see that a bunch of
> archivelogs haven't reached the standby.
>
> A couple of follow-up questions:
> 1. This is a busy system; the redolog size is 1G and at peak we generate
> about 100-120 logs an hour. Is it possible that the archiver gets too busy
> and can't keep up with this?
> Is there a way to determine that? Should I consider increasing the redo
> log size to handle this?
>
> 2. Does anyone have a script or suggestions for a way to automate copying
> the missing archive logs to the Data Guard site? Oracle automatically
> pulls these files at the time of log apply, but because of the
> latency/throughput between the sites I would like to copy them ahead of
> time so that the log apply goes through faster.
>
>
> Thanks
> -Upendra
>
> ------------------------------
> From: nassyambasha_at_gmail.com
> Date: Fri, 8 Jan 2016 19:49:33 +0530
> Subject: Re: Troubleshooting log shipping
> To: nupendra_at_hotmail.com
> CC: oracle-l_at_freelists.org
>
> Hello Upendra,
> How is the network between the primary and standby sites? As a best
> practice, you can do the following to refresh the destination.
> Primary> alter system set log_archive_dest_state_3='defer';
> Primary> alter system set log_archive_dest_state_3='enable';
> Primary> alter system archive log current;
> Then check the above query again for the latest errors, and also check the
> standby alert log to see whether any sequences are catching up.
>
> From those alert log errors, I have seen many times that the cause can
> vary: a network issue, or the background processes (ARCH --> LNS --> RFS)
> being unable to communicate.
>
> Provide the information below first so we can understand what archive gap
> you have.
> A) Primary> select thread#,max(sequence#) from v$archived_log group by
> thread#;
> B) Standby>select thread#,max(sequence#) from v$archived_log where
> applied='YES' group by thread#;
>
> If there is any gap on the standby database, then we need to check whether
> those archives reached the standby or not.
> standby> select thread#,sequence#,name, applied from v$archived_log where
> sequence#=&outoutBThread1seq and thread#=1;
> standby> select thread#,sequence#,name, applied from v$archived_log where
> sequence#=&outoutBThread2seq and thread#=2;
>
> If they are unavailable on the standby, then we have to check whether those
> missing archives are available on the primary, using the same queries above.
> If they are available, then troubleshooting is required to find out why they
> could not be shipped to the standby.
> If the archives are unavailable there as well, then you know what to do :)
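>
> As a quicker cross-check, the standby also reports any gap it has detected
> directly (a small additional query):
>
> Standby> select thread#, low_sequence#, high_sequence# from v$archive_gap;
>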
> Thanks.
>
>
>
> On Fri, Jan 8, 2016 at 7:24 PM, Upendra nerilla <nupendra_at_hotmail.com>
> wrote:
>
> oops.. sorry about that.. still waking up.. :)
>
> Here is the output from the second query:
>
> SEVERITY ERROR_CODE timestamp MESSAGE
> --------------- ---------- --------------------
> ----------------------------------------------------------------------
> Error 16198 08-JAN-2016 07:23:37 WARN: ARC3: Terminating
> pid 18564 hung on an I/O operation
> Error 16198 08-JAN-2016 07:23:39 ARC3: Error 16198 due to
> hung I/O operation to LOG_ARCHIVE_DEST_3
>
>
> I see the following alert.log entry in Prod:
> Thu Jan 07 19:09:15 2016
> LNS: Standby redo logfile selected for thread 1 sequence 435322 for
> destination LOG_ARCHIVE_DEST_2
> krsv_proc_kill: Killing 1 processes (Process by index)
> krsv_proc_kill: Killing 1 processes (Process by index)
> ARC1: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_3
> ARC1: Detected ARCH process failure
> ARC1: Detected ARCH process failure
> ARC1: STARTING ARCH PROCESSES
>
> Thanks
> -Upendra
>
> ------------------------------
> From: nassyambasha_at_gmail.com
> Date: Fri, 8 Jan 2016 18:50:04 +0530
>
> Subject: Re: Troubleshooting log shipping
> To: nupendra_at_hotmail.com
> CC: oracle-l_at_freelists.org
>
> Hello Upendra,
> I understand, but you missed providing the output of the first query, i.e.
> set line 120 pages 100
> col severity for a15
> col message for a70
> col timestamp for a20
> select severity,error_code,to_char(timestamp,'DD-MON-YYYY HH24:MI:SS')
> "timestamp" , message from v$dataguard_status where dest_id=3;
> Thanks.
>
> On Fri, Jan 8, 2016 at 6:45 PM, Upendra nerilla <nupendra_at_hotmail.com>
> wrote:
>
> Hi Nassyam,
> I believe the configuration is correct; otherwise I wouldn't be receiving
> any archivelogs at the destination. DB_UNIQUE_NAME is set to "SID_DR"
> at the Data Guard site and there is a corresponding TNS entry at the PROD
> site (with the same name) pointing to the Data Guard site.
>
> The issue is that I am receiving SOME but not ALL of the archivelogs at the
> Data Guard site.
> As you can see from the archivelog output I pasted in my earlier message,
> the sequence numbers are all over the place.
> This is from ASM:
> 512 1860501 952576512 954204160 thread_1_seq_*434986*
> .5778.900455621
> 512 1883168 964182016 965738496 thread_1_seq_*434991*
> .8962.900456295
> 512 1850346 947377152 948961280 thread_1_seq_*434997*
> .9060.900456899
> 512 1865067 954914304 956301312 thread_1_seq_*435003*
> .3202.900457539
> 512 134582 68905984 70254592 thread_1_seq_*435008*
> .1471.900457605
> 512 1848325 946342400 947912704 thread_1_seq_*435009*
> .3990.900458259
> 512 1863147 953931264 955252736 thread_1_seq_*435017*
> .6484.900458793
> 512 1834310 939166720 940572672 thread_1_seq_*435023*
> .8233.900459079
> 512 1438234 736375808 738197504 thread_1_seq_*435024*
> .11250.900460605
>
>
> When the log apply hits the point where an archivelog is missing, it is
> fetched from Prod, which slows the log apply down significantly.
> From the alert.log at the Data Guard site:
> Media Recovery Waiting for thread 1 sequence 434987 (in transit)
>
>
> Here is the output of the query you sent (taken from Prod):
> ID STATUS DB_MODE TYPE RECOVERY_MODE
> PROTECTION_MODE SRLs ACTIVE
> ---------- --------- --------------- ---------- -----------------------
> -------------------- ---------- ----------
> ARCHIVED_SEQ#
> -------------
> 1 VALID OPEN ARCH IDLE
> MAXIMUM PERFORMANCE 0 0
> 355351
>
> 2 VALID MOUNTED-STANDBY LGWR MANAGED
> MAXIMUM PERFORMANCE 14 3
> 435651
>
> 3 VALID MOUNTED-STANDBY LGWR MANAGED
> MAXIMUM PERFORMANCE 14 3
> 353763
>
>
> Thanks
> -Upendra
>
> ------------------------------
> From: nassyambasha_at_gmail.com
> Date: Fri, 8 Jan 2016 15:08:10 +0530
> Subject: Re: Troubleshooting log shipping
> To: nupendra_at_hotmail.com
> CC: oracle-l_at_freelists.org
>
>
> Hello Upendra,
>
> If the configuration is correct, then the redo/archives should be received
> at the remote destination.
> Below, you have mentioned service=SID_DR; can you confirm that this Oracle
> Net service points to the related standby database? And what about
> db_unique_name? You have also set it to SID_DR; it should be the standby's
> unique name, not the SID, so it depends on how the db_unique_name, SID and
> service are actually configured.
>
> "*service="SID_DR", LGWR ASYNC NOAFFIRM delay=0 optional
> compression=disable max_failure=0 max_connections=1 reopen=15*
> * db_unique_name="SID_DR"
> net_timeout=30, valid_for=(all_logfiles,primary_role)*"
>
> Can you send the output of the queries below? Run them from the primary.
>
> set line 120 pages 100
> col severity for a15
> col message for a70
> col timestamp for a20
> select severity,error_code,to_char(timestamp,'DD-MON-YYYY HH24:MI:SS')
> "timestamp" , message from v$dataguard_status where dest_id=3;
> select ds.dest_id id
> , ad.status
> , ds.database_mode db_mode
> , ad.archiver type
> , ds.recovery_mode
> , ds.protection_mode
> , ds.standby_logfile_count "SRLs"
> , ds.standby_logfile_active active
> , ds.archived_seq#
> from v$archive_dest_status ds
> , v$archive_dest ad
> where ds.dest_id = ad.dest_id
> and ad.status != 'INACTIVE'
> order by
> ds.dest_id;
>
>
> Thanks.
>
> On Fri, Jan 8, 2016 at 1:22 PM, Upendra nerilla <nupendra_at_hotmail.com>
> wrote:
>
> Hello everyone -
> Happy new year!
>
> Oracle RAC 11.2.0.3
> OEL 5.x
>
> I have an environment in which the primary is shipping logs to a local
> standby and a remote Data Guard database.
> What I am seeing is that, somehow, not all the archivelogs are being
> shipped to the remote site; they are only transmitted at the time of log
> apply.
>
> Here is an example of how the archived logs appear at the remote site:
> 512 1860501 952576512 954204160
> thread_1_seq_434986.5778.900455621
> 512 1883168 964182016 965738496
> thread_1_seq_434991.8962.900456295
> 512 1850346 947377152 948961280
> thread_1_seq_434997.9060.900456899
> 512 1865067 954914304 956301312
> thread_1_seq_435003.3202.900457539
> 512 134582 68905984 70254592
> thread_1_seq_435008.1471.900457605
> 512 1848325 946342400 947912704
> thread_1_seq_435009.3990.900458259
> 512 1863147 953931264 955252736
> thread_1_seq_435017.6484.900458793
> 512 1834310 939166720 940572672
> thread_1_seq_435023.8233.900459079
> 512 1438234 736375808 738197504
> thread_1_seq_435024.11250.900460605
>
> Similar gaps for thread_2 as well..
>
> Here is the configuration of the log_archive_dest_3 on primary for the
> remote site:
>
> *service="SID_DR", LGWR ASYNC NOAFFIRM delay=0 optional
> compression=disable max_failure=0 max_connections=1 reopen=15
> db_unique_name="SID_DR"
> net_timeout=30, valid_for=(all_logfiles,primary_role)*
>
> I do not see any errors for the redo transport in the alert log on the
> primary site or on the remote DG site.
> Any thoughts on how I should troubleshoot this? Are there any known bugs
> around this?
>
> Your help is appreciated.
>
> Thanks
> -Upendra
>

-- 
Nassyam Basha.
*Oracle Database Consultant* | *Pythian* <http://www.pythian.com/>
ACED Profile <https://apex.oracle.com/pls/apex/f?p=19297:4:::NO:4:P4_ID:13140>
Oracle 11g Certified Master <http://education.oracle.com/education/otn/NassyamBasha.htm>
Co-Author: Oracle Data Guard 11gR2 <http://www.amazon.in/Oracle-Guard-11gR2-Administration-Beginners/dp/1849687900>
Co-founder of Oraworld-team <http://www.oraworld-team.com>

Facebook <https://www.facebook.com/nassyambasha> | Twitter <https://twitter.com/oracle_ckpt> | LinkedIn <https://in.linkedin.com/in/nassyambasha> | Google + <https://plus.google.com/+NassyamBasha> | CKPT Blog <http://www.oracle-ckpt.com/>

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Jan 09 2016 - 15:52:12 CET
