RE: Overhead of load-balanced microservices architecture

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Thu, 13 Aug 2020 11:28:39 -0400
Message-ID: <1f1901d67186$6c1f1120$445d3360$_at_rsiz.com>



Keeping in mind the possible death-by-inches problem of TOO many open Oracle sessions (clarified, if you need it, by Graham Wood’s real-world demos and videos), the 1980s implementation of leaving a service daemon running with an open Oracle connection is a fast-response, low-cost way to do this. Back in the day, we programmed these in OCI (Oracle Call Interface) to make it easier to implement the daemons in C, which was the natural programming language for UNIX and for things built by copying the architecture and design of UNIX.

So your health check would:

  1. Ping the service daemon to see if it is running, and only if it is not running do a login to Oracle to start it.
  2. Fire an action code at the service daemon with a return vector for the answer.
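
The two steps above can be sketched as follows. This is a hedged, minimal illustration, not the original OCI harness: the daemon address, the `start_daemon` callback, and the `"HEALTH"` action code are all hypothetical names I've introduced; the point is only that the cheap ping happens first and the Oracle login happens only on the miss path.

```python
import socket

DAEMON_ADDR = ("127.0.0.1", 7777)  # hypothetical port the service daemon listens on

def daemon_is_running(addr=DAEMON_ADDR, timeout=0.5):
    """Step 1: ping the daemon with a cheap TCP connect; no Oracle login involved."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def health_check(start_daemon, send_action):
    """Log in to Oracle (inside start_daemon) only if the ping fails, then
    fire the action code at the daemon with a return vector for the answer (step 2)."""
    if not daemon_is_running():
        start_daemon()  # the only path that pays the full Oracle login cost
    return send_action("HEALTH")
```

The health check itself never opens a database session on the happy path; that is the whole cost saving.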

In olden days, each question code had to be built as a C subroutine compiled into the daemon. Now, of course, you would make stored, packaged PL/SQL procedures and/or functions, so the C harness would be quite simple.
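
With the real work pushed into stored PL/SQL, the harness reduces to little more than a dispatch table from action codes to calls over the daemon's one open connection. A sketch under assumed names (the codes, handlers, and the `health_pkg` package are illustrative, not from the original):

```python
def handle_health(payload):
    # Would invoke a packaged PL/SQL function over the open connection,
    # e.g. something like health_pkg.check_version (hypothetical name).
    return "OK"

def handle_stop(payload):
    # The disconnect-and-stop action code: close the Oracle session and exit.
    return "STOPPING"

# One entry per action code; adding a new question no longer means
# recompiling a C subroutine into the daemon.
ACTIONS = {
    "HEALTH": handle_health,
    "STOP": handle_stop,
}

def dispatch(action_code, payload=None):
    handler = ACTIONS.get(action_code)
    if handler is None:
        return "UNKNOWN_ACTION"
    return handler(payload)
```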

One of the action codes probably should be to disconnect and stop. While it is possible to have a supervisor at the OS level to check for and restart the service daemon, that creates the potential for a vampire that won’t die when you are trying to do routine maintenance on the database. (This is the same logic as having a quick swap-in URL picture for web application logins that says: “Please bear with us. Services are expected to resume at YYMMDD HH24:MI” [reading the expected return time from a file you control].)

Eliminating incessant restart attempt traffic by persistent machines and frustrated humans is an important thing to do, both for your convenience and for the worldwide zeitgeist.  

Conversely, if you have a start script for a database with a flag for whether or not to start your list of daemons, you can save a lot of time and energy and still allow individual requests to start a daemon that is not running. A configuration file you control would indicate whether to honor the start code, and you would slap that to “NO” when you also swap in the out-of-service URL entry-point screens.
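
One way to sketch that gate, assuming a one-line config file you control (the filename and YES/NO format are my own illustration):

```python
import os

def start_allowed(config_path="daemon_start.conf"):
    """Honor the start action code only when the config file says YES.
    Slap the file to NO when the out-of-service screens go in."""
    if not os.path.exists(config_path):
        return False  # fail safe: no file means no restarts during maintenance
    with open(config_path) as f:
        return f.read().strip().upper() == "YES"
```

Failing safe when the file is missing is a choice, of course; the key property is that one file you control silences every restart path at once.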

A small number of daemons (often one per database) will help you avoid the death-by-inches problem of TOO many open Oracle sessions, especially if the queries are pre-parsed and ready to execute. IF any of the queries can be lengthy or the system is prone to service-request storms, then you might need to build FIFO (First In, First Out) queueing of service-request messages into the daemon harness.
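
A minimal sketch of that FIFO queueing with the standard library, assuming a single worker draining requests in arrival order over the daemon's one open connection (the action names and the `handled:` result format are illustrative):

```python
import queue
import threading

requests = queue.Queue()  # FIFO: service requests wait here during a storm
results = {}

def worker():
    """Single consumer: drains requests one at a time, in arrival order,
    so a request storm queues up instead of opening more Oracle sessions."""
    while True:
        req_id, action = requests.get()
        if action == "STOP":
            requests.task_done()
            break
        results[req_id] = f"handled:{action}"  # stand-in for the PL/SQL call
        requests.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
for i, action in enumerate(["HEALTH", "HEALTH", "VERSION"]):
    requests.put((i, action))
requests.put((99, "STOP"))
requests.join()  # block until every queued request has been serviced
```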

The alternative of starting a whole bunch of dedicated listeners to avoid queueing delays is unappealing, at least to me, and you can skip the argument about whether you really need the extra listeners, which you won’t until the storm hits your aforementioned radar.

I don’t know whether this software harness is available off the shelf, and yes, you need to be able to control DoS (Denial of Service) attacks if your health checks face the public internet. (Hint: they probably don’t, but if you might have to troubleshoot them, an off-“LAN_CAMPUS” VPN or something might save you a trip to the office.)

Good luck. Whether or not your team can reduce the frequency of any particular check is a reasonable question. Running through the full login, security-check, and session-start overhead of an RDBMS session multiple times per second is begging for a storm to hit your radar. Being lucky enough that you never hit the radar might be the cheapest solution, but do you feel lucky?

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Clay Jackson (cjackson)
Sent: Thursday, August 13, 2020 1:03 AM
To: dougk5_at_cox.net; oracle-l_at_freelists.org
Subject: RE: Overhead of load-balanced microservices architecture

I’m by no means an expert on either F5 or Exadata hardware, and things have changed in the last 10 years.  

That said, what you might run into (and what I DID run into almost 10 years ago with F5s and Oracle in “another life”) is network queuing. At the network and OS level (“below” Oracle), the Oracle listener tells the OS to start listening for connections on a specified port. Seven per second is not THAT large; but consider what happens when each connect request is received: several network round trips as TCP negotiates the higher-level connection, then at some point a message “up” that tells the Oracle listener process to actually set up the database connection. Some of those steps are single-threaded, so you may start to see queuing for some of the connection requests, and when that happens, it can “cascade” very quickly.

I’ll dig back in my notes, and if I can find something that specifically relates to what happened, I’ll post it.

Clay Jackson    

From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> On Behalf Of DOUG KUSHNER
Sent: Wednesday, August 12, 2020 9:34 PM
To: oracle-l_at_freelists.org
Subject: Overhead of load-balanced microservices architecture  


Our dev team recently rolled out an application using an F5 load-balanced microservices architecture. There are several microservices, each load-balanced across up to 4 servers, and each with a health-check API that hits the database. While this may have looked good on paper, just the overhead of the health checks, with no work being processed, has resulted in roughly 7 connection attempts per second to the database. That amounts to a version-check query about 40K times per hour. The database is on an Exadata (2-node RAC) with several other production databases.

Of course the Exadata has been handling it, so unless you are looking for anomalies (which I always am), this will fly under the radar until it doesn't. :)

I'm wondering if anyone knows how to determine the theoretical max connections/sec that a listener can handle based on the number of cores licensed in the system?

Also wondering if anyone here has encountered this scenario before and how they dealt with it. I'm also looking for a good reference on the subject.

My immediate focus will be on determining why these health check connections do not appear to be utilizing the services' connection pools, while the dev team determines whether they can relax the frequency of these health checks.

Regards,

Doug

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Aug 13 2020 - 17:28:39 CEST
