Re: DCD dead connection detection in 12c

From: April Sims <aprilcsims_at_gmail.com>
Date: Fri, 22 Aug 2014 10:26:53 -0500
Message-ID: <CAK+cZDcZDsk1aYZubBqSwzrMZKonBkQcHYwB5knfitO7nNsCJw_at_mail.gmail.com>



Thanks to all who chimed in on what I thought was a DCD issue. It turned out to be the network all along; it just coincided with our switchover to new hardware, huge pages, changed initialization parameters, a new version of the Oracle listener, etc.

What I learned along the way is that the original error gave a real clue as to where it was failing: look for the lowest non-zero error number in the stack. In our case that was 110, the standard Linux ETIMEDOUT ("Connection timed out") OS error. Had the lowest number been an Oracle error, the failure would have been at the Oracle level.

Fatal NI connect error 12170.

  VERSION INFORMATION:

        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Time: 18-AUG-2014 03:25:46
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out

    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out

    nt secondary err code: 110
    nt OS err code: 0
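
For anyone who wants to script that reading of the error stack, here is a minimal sketch. The function name `lowest_error_code` and the regular expression are my own illustration, not anything Oracle ships, and it assumes the "err code" lines look like the ones in the struct above. It collects every code, ignores zeros, and reports the smallest.

```python
import errno
import re

# Matches lines such as "    nt secondary err code: 110".
ERR_LINE = re.compile(
    r"^\s*(n[st] (?:main|secondary|OS) err code):\s*(\d+)\s*$",
    re.MULTILINE,
)


def lowest_error_code(stack_text):
    """Return (label, code) for the smallest non-zero code, or None if none found."""
    found = [(label, int(code)) for label, code in ERR_LINE.findall(stack_text)]
    nonzero = [item for item in found if item[1] != 0]
    if not nonzero:
        return None
    return min(nonzero, key=lambda item: item[1])


# Error struct copied from the sqlnet log excerpt in this message.
SAMPLE = """\
  Tns error struct:
    ns main err code: 12535
    ns secondary err code: 12560
    nt main err code: 505
    nt secondary err code: 110
    nt OS err code: 0
"""

if __name__ == "__main__":
    label, code = lowest_error_code(SAMPLE)
    # errno.errorcode is platform-specific; on Linux 110 maps to ETIMEDOUT.
    name = errno.errorcode.get(code, "not an errno on this platform")
    print(f"{label}: {code} ({name})")
```

Treat the result as a hint about which layer to look at first, not as a diagnosis.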

It was a week of torture because this is an Oracle Forms application that requires a dedicated connection for the entire workday, so any network flakiness shows up immediately.
Rebooting quite a few of the routers and switches along the path fixed things.

On Fri, Aug 15, 2014 at 11:52 AM, Jeremy Schneider < jeremy.schneider_at_ardentperf.com> wrote:

> probably most people wouldn't notice but i mistakenly got that example
> output from the database server... you actually should run it on the client
> server. Here's what it looks like from two different clients.
>
> with enable=broken in jdbc connect string:
> tcp 0 0 ::ffff:192.168.1.216:43940 ::ffff:192.168.1.130:1521 ESTABLISHED keepalive (2271.79/0/0)
> tcp 0 0 ::ffff:192.168.1.216:42615 ::ffff:192.168.1.130:1521 ESTABLISHED keepalive (4657.81/0/0)
> tcp 110 0 ::ffff:192.168.1.216:40552 ::ffff:192.168.1.130:1521 ESTABLISHED keepalive (1074.00/0/0)
>
> without enable=broken in jdbc connect string:
> tcp 2970 0 ::ffff:192.168.1.181:60553 ::ffff:192.168.1.170:1521 ESTABLISHED off (0.00/0/0)
> tcp 2910 0 ::ffff:192.168.1.181:59678 ::ffff:192.168.1.170:1521 ESTABLISHED off (0.00/0/0)
> tcp 0 0 ::ffff:192.168.1.181:60610 ::ffff:192.168.1.170:1521 ESTABLISHED off (0.00/0/0)
> tcp 2980 0 ::ffff:192.168.1.181:59744 ::ffff:192.168.1.170:1521 ESTABLISHED off (0.00/0/0)
>
> OS settings are identical on these two servers.
>
> -J
>
>
> --
> http://about.me/jeremy_schneider
>
>
> On Fri, Aug 15, 2014 at 12:42 PM, Jeremy Schneider <
> jeremy.schneider_at_ardentperf.com> wrote:
>
>> Adding just two more points, since I have been recently working on DCD
>> with RH linux myself.
>>
>> strace is quite detailed, but a much easier way to do the job is just use
>> "netstat -nto|grep 1521" or replace 1521 with your listener port if it's
>> non-default. The "o" option is the magic one for keepalive. In the far
>> right column you should see the string "keepalive" rather than "off" and it
>> will tell you the actual amount of time remaining on each keepalive
>> connection.
>>
>> Example Output:
>> tcp 0 0 192.168.1.130:1521 192.168.1.130:22335 ESTABLISHED keepalive (2380.58/0/0)
>> tcp 0 0 192.168.1.130:1521 192.168.1.104:56698 ESTABLISHED off (0.00/0/0)
>> tcp 0 0 192.168.1.130:1521 192.168.1.146:56850 ESTABLISHED off (0.00/0/0)
>> tcp 0 0 192.168.1.130:1521 192.168.1.130:31120 TIME_WAIT timewait (13.21/0/0)
>>
>> Notice that the connection from the db server to itself (130) above has
>> keepalive enabled, but the clients (104 and 146) do not have keepalive
>> enabled. Which brings up a second point. We were using the thin jdbc
>> client in some cases and discovered that keepalive was not enabled by this
>> driver unless you switched to the long format and explicitly specified
>> "(enable=broken)" in the long TNS entry. This is in addition to the kernel
>> settings which must be correctly configured.
>>
>> -Jeremy
>>
>>
>> --
>> http://about.me/jeremy_schneider
>>
>>
>> On Thu, Aug 14, 2014 at 12:43 PM, Riyaj Shamsudeen <
>> riyaj.shamsudeen_at_gmail.com> wrote:
>>
>>> Hello April,
>>> Since you have set sqlnet.expire_time to 10 minutes, every 10
>>> minutes a TCP/IP packet is sent to that client port. If a TCP ACK is
>>> received within a short interval, then both the tcp_keepalive and SQLNET
>>> timers are reset. If the TCP ACK is not received, then the TCP
>>> retransmission code kicks in: the packet is retransmitted tcp_retries2
>>> (default 15) times with an exponential backoff controlled by the TCP
>>> retransmission interval.
>>> So, in your case, TCP shouldn't kill the connection at 2 hours at
>>> all from the host side. However, I have seen port-level timeouts in
>>> switch/firewall configurations that are normally set to 2 hours. Check
>>> with the network group to see if that is happening.
>>> Also conduct this test:
>>> a. create a sqlplus connection from that client machine connecting to
>>> the database.
>>> b. Identify the dedicated server process for that connection. Strace
>>> the dedicated server process:
>>> strace -tttT -o /tmp/dcd.lst -p <pid>
>>> c. Keep the sqlplus connection idle during this period; do not even
>>> press Enter.
>>> Reading the /tmp/dcd.lst file, you should see packets every 10
>>> minutes. If the connection dies after 2 hours, then check with the
>>> firewall/network group.
>>>
>>> Hope this helps,
>>>
>>> Cheers
>>>
>>> Riyaj Shamsudeen
>>> Principal DBA,
>>> Ora!nternals - http://www.orainternals.com - Specialists in
>>> Performance, RAC and EBS
>>> Blog: http://orainternals.wordpress.com/
>>> Oracle ACE Director and OakTable member <http://www.oaktable.com/>
>>>
>>> Co-author of the books: Expert Oracle Practices
>>> <http://tinyurl.com/book-expert-oracle-practices/>, Pro Oracle SQL
>>> <http://tinyurl.com/ahpvms8>, Expert RAC Practices 12c
>>> <http://tinyurl.com/expert-rac-12c>, Expert PL/SQL Practices
>>> <http://tinyurl.com/book-expert-plsql-practices>
>>>
>>>
>>>
>>> On Thu, Aug 14, 2014 at 7:30 AM, April Sims <aprilcsims_at_gmail.com>
>>> wrote:
>>>
>>>> Need some help in resolving our new idle timeouts seen since going to
>>>> 12c.
>>>> I have a document
>>>>
>>>> Oracle Net 12c: New Implementation of Dead Connection Detection (DCD)
>>>> (Doc ID 1591874.1)
>>>>
>>>> We are on Linux RH 64-bit so this is applicable.
>>>>
>>>> Our current OS settings look like the following:
>>>>
>>>> # cat /proc/sys/net/ipv4/tcp_keepalive_time
>>>> 7200
>>>>
>>>> # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
>>>> 75
>>>>
>>>> # cat /proc/sys/net/ipv4/tcp_keepalive_probes
>>>> 9
>>>>
>>>>
>>>> sqlnet.ora
>>>> SQLNET.EXPIRE_TIME = 10
>>>>
>>>> SQLNET.INBOUND_CONNECT_TIMEOUT = 120
>>>>
>>>> listener.ora
>>>>
>>>> INBOUND_CONNECT_TIMEOUT_LISTENER_listenername = 120
>>>>
>>>> Any suggestions on the changes I need to make to prevent a 2 hour idle
>>>> timeout?
>>>>
>>>> thanks,
>>>>
>>>>
>>>> --
>>>> April C. Sims
>>>> http://aprilcsims.wordpress.com
>>>> Twitter, LinkedIn
>>>> Oracle Database 11g – Underground Advice for Database Administrators
>>>> https://www.packtpub.com/oracle-11g-database-implementations-guide/book
>>>> OCP 8i, 9i, 10g, 11g DBA
>>>> Southern Utah University
>>>> aprilcsims_at_gmail.com
>>>>
>>>
>>>
>>
>
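
The netstat check quoted above can also be scripted if you have a lot of sessions to look through. This is only a minimal sketch: it assumes the `netstat -nto` column layout shown in Jeremy's examples (proto, recv-q, send-q, local address, remote address, state, timer), and the function name `connections_without_keepalive` and the default port 1521 are just for illustration.

```python
def connections_without_keepalive(netstat_output, port=1521):
    """Return (local, remote) pairs for ESTABLISHED sockets whose timer is 'off'."""
    suffix = f":{port}"
    result = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected fields: proto recv-q send-q local remote state timer (countdown)
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local, remote, state, timer = fields[3], fields[4], fields[5], fields[6]
        if state != "ESTABLISHED":
            continue
        # Keep only sockets where either end is the listener port.
        if not (local.endswith(suffix) or remote.endswith(suffix)):
            continue
        if timer == "off":
            result.append((local, remote))
    return result


# Sample lines taken from the example output quoted in this thread.
SAMPLE = """\
tcp 0 0 192.168.1.130:1521 192.168.1.130:22335 ESTABLISHED keepalive (2380.58/0/0)
tcp 0 0 192.168.1.130:1521 192.168.1.104:56698 ESTABLISHED off (0.00/0/0)
tcp 0 0 192.168.1.130:1521 192.168.1.146:56850 ESTABLISHED off (0.00/0/0)
tcp 0 0 192.168.1.130:1521 192.168.1.130:31120 TIME_WAIT timewait (13.21/0/0)
"""

if __name__ == "__main__":
    for local, remote in connections_without_keepalive(SAMPLE):
        print(f"no keepalive: {local} <-> {remote}")
```

On a live system you would pass it the captured output of `netstat -nto` instead of the sample text, and change the port if your listener is not on 1521.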

-- 
April C. Sims
IOUG SELECT Journal Editor
http://aprilcsims.wordpress.com
Twitter, LinkedIn
Oracle Database 11g – Underground Advice for Database Administrators
<http://www.amazon.com/Oracle-Database-Underground-Advice-Administrators/dp/1849680000/ref=sr_1_1?ie=UTF8&s=books&qid=1272289339&sr=8-1#noop>
https://www.packtpub.com/oracle-11g-database-implementations-guide/book
OCP 8i, 9i, 10g, 11g DBA
Southern Utah University
aprilcsims_at_gmail.com

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 22 2014 - 17:26:53 CEST
