Re: AWS RDS Detecting failover

From: Schneider <schneider_at_ardentperf.com>
Date: Wed, 4 Oct 2017 15:20:31 -0700
Message-ID: <CA+fnDAa0fwH=7w0Mp5MPrxcyUPuxrkGkc_xb6Qk6N=R81n7WHQ_at_mail.gmail.com>


On Mon, Oct 2, 2017 at 12:48 PM, Steve T. Baldwin <stbaldwin_at_msts.com> wrote:
> In my testing, when a client is connected to a multi-az RDS instance and I
> force failover, that client doesn't 'see' it. If it makes any DB request
> after or during the failover it ends up timing out - eventually.
> Unfortunately this timeout is controlled by the tcp keepalive setting which
> defaults to 2 hours. Not very helpful when the actual failover can be
> complete in a couple of minutes.

FWIW, I remember having a similar "hanging dead connections" issue a couple of years ago, after several planned maintenance operations on RAC clusters (rolling updates that restarted each instance). For some reason we were unable to use ONS and FCF to automatically clean up cached connections. If I recall correctly, we ended up updating the Linux keepalive parameters across our fleet to improve the situation. One positive was that much of our fleet was configured through Ansible at the time, so we were able to roll the updates out in a controlled yet automated manner across multiple data centers.
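
For illustration only, here is a minimal Python sketch of what tuning keepalive means at the level of a single client socket. The values (60 s idle, 10 s interval, 5 probes) and the function names are assumptions chosen for the example, not the settings used on the fleet described above. The fleet-wide equivalents on Linux are the net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes sysctls, whose stock defaults are 7200, 75 and 9. Those apply only to sockets that have SO_KEEPALIVE enabled.

```python
import socket


def worst_case_detection(idle, interval, probes):
    """Seconds until an idle, dead connection is reported:
    the idle time before the first probe, plus every unanswered probe."""
    return idle + interval * probes


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Enable TCP keepalive on one socket with illustrative timings.

    Returns the worst-case detection time in seconds.
    """
    # Keepalive is off by default; without this the timings are ignored.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the kernel defaults. These constants are
    # Linux-specific, so guard them for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return worst_case_detection(idle, interval, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("example settings:", enable_keepalive(s), "seconds")
    print("Linux defaults:  ", worst_case_detection(7200, 75, 9), "seconds")
    s.close()
```

With the stock Linux defaults the first probe is not sent until the connection has been idle for two hours, which matches the timeout described in the quoted message. Shortening the idle time is what brings detection closer to the length of the failover itself.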

So this problem might not be limited to cloud deployments; maybe this is something you could run into with other HA/failover setups too?

-Jeremy

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Oct 05 2017 - 00:20:31 CEST
