Re: OOM killer terminating database on AWS EC2

From: Sandra Becker <sbecker6925_at_gmail.com>
Date: Fri, 17 Jan 2020 16:08:38 -0700
Message-ID: <CAJzM94DB-aBPoS57BQqXv7DfPzpUdHp1xFjKrL13WR-8frsbFQ_at_mail.gmail.com>



Thanks to everyone for their suggestions. We have adjusted the SGA and PGA, increased swap, and *correctly* configured HugePages. I also got the DBA team access to CloudWatch for this instance. We were able to bounce the server late yesterday afternoon, and I've been closely monitoring both the server and the database since then. Everything definitely looks better today, but it's only been one day. I will be reviewing all the suggestions again to see what else we can do to make and keep the system stable. A couple of times during heavy load today, we did see swap usage that exceeded the 5G the old swap configuration had defined.

I did get paged at 4:57am today regarding "connection issues, must be related to work done yesterday". I doubted that, but it's my job to check it out. Both the server and the database were humming along nicely. I also saw a recent connection, just not from the client server they were having problems with. Turns out their ssh tunnel was down. It was a good feeling to know that it wasn't a database issue.

Regarding AWS training: I asked for training for the team 3 months ago and was told we would get it when we needed it. Yeah, right. I'm looking into what I can do on my own. I've signed up for a couple of Udemy courses and I'm looking at books, googling, etc., to build my knowledge and skills. Book recommendations would be welcome.

Again, thank you very, very much for your helpful suggestions.

Sandy

On Thu, Jan 16, 2020 at 6:59 AM Frits Hoogland <frits.hoogland_at_gmail.com> wrote:

> All in all, if you run Red Hat you have a Linux instance, and thus you can
> apply all the Linux practices that you know of.
>
> The first thing to do is save the SAR reports from around the time of the
> issue in the following way:
> LC_ALL=C sar -A -f /var/log/sa/saXX > /tmp/sarXX.data.txt
> (where XX is your date)
> This will put everything that sar collects for you in a single file.
> Now you can use ksar (https://github.com/vlsi/ksar) to visualise your sar
> data, which makes it way easier to assess these statistics.
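>
> If you just want a quick look without ksar, sar can also report individual
> areas directly from the same file, for example (again with XX as your
> date):
> sar -r -f /var/log/sa/saXX    (memory utilisation)
> sar -S -f /var/log/sa/saXX    (swap space utilisation)
> sar -B -f /var/log/sa/saXX    (paging activity)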
>
> 1. Postfix
> Nobody can tell you that. What you should do is monitor memory usage
> closely.
> I’ve never heard of postfix taking huge amounts of memory (which doesn’t
> mean it can’t, I am only saying that in my opinion it is not likely), but
> maybe you are talking about it ‘leaking’ memory. Memory leaks typically
> show up as a steady increase in memory usage (the leak), or an increase
> that tracks how heavily the process is used.
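>
> A simple way to watch for that is to log the resident set size of the
> postfix daemons at an interval and see whether it keeps growing. A rough
> sketch (adjust the process names and the interval to your environment):
> while true; do
>     date
>     ps -o pid,rss,vsz,cmd -C master,qmgr,pickup
>     sleep 300
> done >> /tmp/postfix_mem.log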
>
> Please mind that back when I answered questions on Oracle’s Linux forum,
> every few months a sysadmin or an Oracle DBA would come up saying they had
> detected that several Oracle processes leaked memory, most notably the
> database writer. Whilst not unthinkable (Oracle is really complex, as we
> all know), in most cases this is probably not true, but comes from how
> Linux (actually all modern OSes, I think) works: when a process is
> created, it is told to load an executable and libraries, allocate memory,
> and so on. You might think it does all of that up front, but in reality it
> doesn’t. It will execute, it does what you tell it to do, but AT THE BARE
> MINIMUM. This (roughly) means it loads the first (4k) page of the
> executable you asked it to execute, and goes from there. If it has
> dynamically loaded libraries, it will not actually load them unless it
> absolutely has to. If the code says to allocate, let’s say, a gigabyte of
> memory, it will say yes, but it will not actually do that; it allocates
> along the way as you truly start using it. In other words: everything is
> done lazily.
> Back to the database writer: because the database writer is made to read
> all the blocks from the cache as they are dirtied by other processes, it
> looks like a process that is slowly increasing in size from the operating
> system’s point of view, because it gradually touches all the blocks in the
> buffer cache, increasing its RSS (resident set size).
> Back even further to postfix: this process will not do that in the same
> way, but there will be a ramp-up period for memory usage.
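>
> You can see this lazy behaviour for yourself by comparing the virtual size
> (VSZ) and the resident set size (RSS) of the database writer, for example:
> ps -o pid,vsz,rss,cmd -C ora_dbw0_dwprod
> Shortly after instance startup RSS is far smaller than VSZ, and it slowly
> creeps up towards the buffer cache size as the writer touches more blocks.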
>
> 2. Monitoring memory usage
> The simplest way to do this is to look at the sar output, which should be
> available if you have installed the sysstat package. I do believe it’s an
> Oracle requirement; even if it isn’t, you should install it.
> A reasonably good overview of memory usage and its breakdown is in
> /proc/meminfo. You could use something like Grafana (+ Prometheus) to
> graph that. I built a dashboard in Grafana that does exactly that (
> https://grafana.com/grafana/dashboards/2747).
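>
> For a quick manual check you can pick out the most interesting fields, for
> example:
> grep -E 'MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|HugePages' /proc/meminfo
> MemAvailable and SwapFree are the ones to watch: if MemAvailable trends
> towards zero and SwapFree starts dropping, trouble is on its way.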
>
> 3. Disable OOM killer
> The OOM killer is not the problem. The memory usage is. What the OOM
> killer does is kill a process with the idea of making room for the thing
> you are doing now.
> If total memory (physical memory + swap) is exhausted and more memory is
> needed, the OOM killer will choose a process that it thinks takes a lot of
> memory and kill it, in the hope of freeing memory for whatever you are
> doing. Guess what: if you read my story about leaks above, the database
> writer touches almost the entire (Oracle) buffer cache and thus has a high
> RSS, and thus is a likely candidate to be considered a process with a lot
> of memory.
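>
> You can see how attractive a process looks to the OOM killer, and check
> whether it has struck before, with something like:
> cat /proc/$(pgrep -f ora_dbw0_dwprod)/oom_score
> dmesg -T | grep -i 'killed process'
> A higher oom_score makes a process a more likely victim. (You can lower a
> process’ /proc/<pid>/oom_score_adj to protect it, but as said, that only
> moves the problem around, it doesn’t solve the memory shortage.)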
>
> The true solution is to make sure you don’t use more memory than is
> available. With hugepages that is a bit more hairy: hugepages are a
> distinct area of memory that is pre-allocated and not usable for normal
> page-size memory. So your SGA has to fit in hugepages, and then it can’t
> be swapped. Any excess allocation is taken from regular page-size memory
> (this can be seen in X$KSMSSINFO, and there is also a remark in the
> alert.log).
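>
> You can verify from the OS side that the hugepages pool actually covers
> the SGA, roughly like this:
> grep -i hugepages /proc/meminfo
> HugePages_Total multiplied by Hugepagesize should be at least the SGA
> size, and after instance startup the HugePages_Free/HugePages_Rsvd numbers
> should show the SGA sitting inside the pool; the alert.log startup
> messages tell you how much of the SGA actually ended up in hugepages.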
>
> Anything that does not explicitly use hugepages is allocated in the
> (remaining) small-page memory. That is probably everything except the
> Oracle SGA, and it includes the Oracle PGA.
> But first, with your memory settings, you should check whether your system
> is actively swapping when the database is not running (vmstat si/so
> columns). And then check again when the database is running; there should
> be no si/so activity.
> In fact, when a system starts swapping, it normally is “the beginning of
> the end”: performance is gone.
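>
> For example, to watch for swap activity for a few minutes:
> vmstat 10 30
> The si and so columns show the amount of memory swapped in and out per
> second; on a healthy system both should stay at (or very close to) zero.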
>
> The PGA memory is “special” in Oracle. Part of it is controllable (sort,
> hash, and bitmap memory) and part is not: any developer can implicitly
> allocate as much memory with collections or index-by arrays as they like.
> Please mind that pga_aggregate_target is exactly that: a target. If Oracle
> requires less, it will take less; if Oracle requires more, it will take
> more. Sadly, I still see it used as a means of ‘sizing’ Oracle, which, as
> I just said, is nonsense.
> With modern Oracle versions there is pga_aggregate_limit. I haven’t looked
> at it with recent versions, but what I’ve seen is that it acts very slowly
> on PGA oversubscription, meaning you could easily oversubscribe
> excessively; it would then act on any sessions actively allocating memory
> and throw them an ORA-4036. So a process could allocate huge amounts of
> PGA and then sit idle, while a process that would only take a tiny bit
> could get the ORA-4036.
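>
> To get a feel for how much PGA the instance actually uses versus the
> target, a rough sketch like this (run as a DBA) will do:
> sqlplus -s / as sysdba <<'EOF'
> select name, round(value/1024/1024) mb
> from   v$pgastat
> where  name in ('aggregate PGA target parameter',
>                 'total PGA allocated',
>                 'maximum PGA allocated');
> exit
> EOF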
>
> So in short, make sure that even with PGA allocated and in use, your
> system doesn’t show si/so activity; otherwise it will take a severe
> performance hit. If memory usage escalates further, the OOM killer will
> try to save you.
>
> Frits Hoogland
>
> http://fritshoogland.wordpress.com
> frits.hoogland_at_gmail.com
> Mobile: +31 6 14180860
>
> On 13 Jan 2020, at 20:31, Sandra Becker <sbecker6925_at_gmail.com> wrote:
>
> Server: AWS EC2
> RHEL: 7.6
> Oracle: 12.1.0.2
>
> We have a database on an AWS EC2 server that the OOM killer has terminated
> twice in the last 5 days; both times it was the ora_dbw0_dwprod process.
> On 1/8 postfix was enabled to allow us to email the DBA team through an AWS
> relay server when a backup failed. We stopped running daily backups and the
> cronjobs that did a quick check for expired accounts. We've left postfix
> enabled for sending emails. We are searching for answers but have none yet
> as to why this is happening. We also no longer have Oracle support
> available to us (management saving money again).
>
> Questions:
>
> 1. Could postfix be related to the memory issues even though we
> haven't sent any emails since the first crash 5 days ago?
> 2. How can we monitor the memory usage of an EC2 instance?
> 3. How do you disable the OOM killer in EC2 should we decide to go
> that route? (we have it disabled on our on-prem servers) The docs I've
> found so far have not been helpful.
>
> I appreciate any help you can give us, or any pointers in the right
> direction.
>
> Thank you,
> --
> Sandy B.
>
>
>

-- 
Sandy B.

--
http://www.freelists.org/webpage/oracle-l
Received on Sat Jan 18 2020 - 00:08:38 CET
