Re: OOM killer terminating database on AWS EC2

From: Frits Hoogland <frits.hoogland_at_gmail.com>
Date: Thu, 16 Jan 2020 14:59:49 +0100
Message-Id: <5D2C69C8-346D-42CA-8A68-2DCAFD353D43_at_gmail.com>



All in all, if you run Red Hat you have a Linux instance, and thus you can apply all the Linux practices you already know.

The first thing to do is save the sar data from around the time of the issue in the following way: LC_ALL=C sar -A -f /var/log/sa/saXX > /tmp/sarXX.data.txt (where XX is the day of the month)
This will put everything that sar collects for you into a single file. Now you can use ksar (https://github.com/vlsi/ksar) to visualise your sar data, which makes it way easier to assess these statistics.
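For example, to dump a few days at once (a rough sketch; the /var/log/sa/saXX paths assume the default sysstat setup on RHEL, where XX is the day of the month):

  # dump every daily sar file into a text file that ksar can read
  for f in /var/log/sa/sa[0-9][0-9]; do
      LC_ALL=C sar -A -f "$f" > /tmp/$(basename "$f").data.txt
  done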

1. Postfix
Nobody can tell you that for certain. What you should do is monitor memory usage closely. I've never heard of postfix taking huge amounts of memory (which doesn't mean it can't, just that in my opinion it is not likely), but maybe you are talking about it 'leaking' memory. Memory leaks are typically visible as a steady increase of memory usage (the leak), or an increase tied to how often the leaking code path is used.

Please mind that back when I answered on Oracle's Linux forum, every few months a sysadmin or an Oracle DBA would come along claiming they had detected that several Oracle processes leaked memory, most notably the database writer. Whilst not unthinkable (Oracle is really complex, as we all know), in most cases this is not true, but comes from how Linux (actually all modern OSes, I think) works: when a process is created, it is told to load an executable, load libraries, allocate memory, etc. You might think it does all of that up front, but in reality it doesn't. It will execute and do what you tell it to do, but AT THE BARE MINIMUM. This (roughly) means it will load the first (4k) page of the executable you asked it to execute, and it goes from there. If it has dynamically loaded libraries, it will not actually load them unless it absolutely has to. If the code says to allocate, let's say, a gigabyte of memory, it will say yes, but it will not actually do that; it will do it along the way as the memory truly starts being used. In other words: everything is done lazily. Back to the database writer: because the database writer is made to read all the blocks from the cache as they are dirtied by other processes, it looks from the operating system side like a process that is slowly increasing in size, because it gradually touches all the blocks in the buffer cache, increasing its RSS (resident set size). Back even further to postfix: that process will not do that in the same way, but there will be a ramp-up period for memory usage.
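You can see this lazy behaviour for yourself by comparing the virtual size (VSZ, what the process asked for) with the resident set size (RSS, what it has actually touched); the process name ora_dbw0_dwprod is taken from your original mail, adjust as needed:

  # VSZ versus RSS for the database writer; RSS creeps up as it touches buffer cache pages
  ps -o pid,vsz,rss,comm -p "$(pgrep -d, -f ora_dbw0_dwprod)"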

2. Monitoring memory usage
The simplest way to do this is to look at the sar output, which should be available if you have installed the sysstat package. I do believe it's an Oracle installation requirement; even if it isn't, you should install it. A reasonably good overview and breakdown of memory usage is in /proc/meminfo. You could use something like grafana (+prometheus) to graph that. I built a dashboard in grafana that does exactly that (https://grafana.com/grafana/dashboards/2747).
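If you just want the memory-related pieces quickly (sysstat assumed installed, as above):

  sar -r    # memory utilisation over the day
  sar -S    # swap space usage
  sar -W    # pages swapped in/out per second
  # instantaneous breakdown of where memory is going
  grep -E 'MemTotal|MemAvailable|Buffers|Cached|SwapTotal|SwapFree|HugePages' /proc/meminfo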

3. Disable OOM killer
The OOM killer is not the problem. The memory usage is. What the OOM killer does is kill a process with the idea of making room for the thing you are doing now. If total memory (physical memory + swap) is exhausted and more memory is needed, the OOM killer chooses a process that it thinks holds a lot of memory and kills it, in the hope of freeing memory for whatever you are doing. Guess what: if you read my story about leaks above, the database writer touches almost the entire (Oracle) buffer cache and thus has a high RSS, and therefore may well be considered a process with a lot of memory.
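If you want to steer the OOM killer rather than disable it, the kernel exposes a per-process knob; a sketch only (run as root, assumes a single matching pid, and it does nothing about the underlying memory shortage):

  cat /proc/$(pgrep -f ora_dbw0_dwprod)/oom_score                 # higher score = more likely victim
  echo -1000 > /proc/$(pgrep -f ora_dbw0_dwprod)/oom_score_adj    # -1000 exempts the process entirely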

The true solution is to make sure you don't use more memory than is available. With hugepages that is a bit more hairy: hugepages are a distinct area of memory that is pre-allocated and not usable for normal page-size memory. So your SGA has to fit in hugepages, and then it can't be swapped. Any excess allocation is taken from regular page-size memory (this can be seen in X$KSMSSINFO, and is also remarked on in the alert.log).
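You can check how much memory is set aside for hugepages, and whether the SGA actually went into them:

  # HugePages_Total * Hugepagesize is carved out and not usable for normal 4k pages;
  # HugePages_Free and HugePages_Rsvd indicate how many of them are actually in use
  grep ^Huge /proc/meminfo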

Anything that does not explicitly use hugepages is allocated from the (remaining) small-page memory. That is probably everything except the Oracle SGA, and it includes the Oracle PGA. But first, with your memory settings, you should check whether your system is actively swapping when the database is not running (the vmstat si/so columns), and then again when the database is running; there should be no si/so. In fact, when a system starts swapping, that is normally "the beginning of the end": performance is gone.
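Checking for swapping is as simple as:

  # si/so are swap-in and swap-out in kB/s; they should stay at 0
  vmstat 5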

The PGA memory is "special" in Oracle. Part of it is controllable (sort, hash, bitmap memory) and part is not: any developer can implicitly allocate as much memory with collections or index-by arrays as they like. Please mind that pga_aggregate_target is exactly that: a target. If Oracle requires less, it will take less; if Oracle requires more, it will take more. Sadly, I still see it used as a means of 'sizing' Oracle, which, as I just explained, is nonsense. Modern Oracle versions have pga_aggregate_limit. I haven't looked at it in recent versions, but what I've seen is that it acts very slowly on PGA oversubscription, meaning you could easily and excessively oversubscribe, and it would then act on any sessions actively allocating memory and throw them an ORA-4036. So one process could allocate huge amounts of PGA and then sit idle, and another process that only takes a tiny bit could get the ORA-4036.
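A quick way to compare actual PGA usage against the target (a sketch; run as the oracle software owner with the environment set for this instance):

  # v$pgastat shows the target plus the currently and maximally allocated PGA, among other statistics
  echo "select name, round(value/1024/1024) mb from v\$pgastat where name like '%PGA%';" |
  sqlplus -s / as sysdba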

So in short, make sure that even with the PGA allocated and in use, your system doesn't show si/so, otherwise it will take a severe performance hit. If memory usage escalates further, the OOM killer will try to save you.

Frits Hoogland

http://fritshoogland.wordpress.com
frits.hoogland_at_gmail.com
Mobile: +31 6 14180860

> On 13 Jan 2020, at 20:31, Sandra Becker <sbecker6925_at_gmail.com> wrote:
>
> Server: AWS EC2
> RHEL: 7.6
> Oracle: 12.1.0.2
>
> We have a database on an AWS EC2 server that the OOM killer has terminated twice in the last 5 days, both times it was the ora_dbw0_dwprod process. On 1/8 postfix was enabled to allow us to email the DBA team through an AWS relay server when a backup failed. We stopped running daily backups and cronjobs that did a quick check for expired accounts. We've left postfix enabled for sending emails. We are searching for answers but have none yet as to why this is happening. We also no longer have Oracle support available to us. (management saving money again).
>
> Questions:
> Could postfix be related to the memory issues even though we haven't sent any emails since the first crash 5 days ago?
> How can we monitor the memory usage of an EC2 instance?
> How do you disable the OOM killer in EC2 should we decide to go that route? (we have it disabled on our on-prem servers) The docs I've found so far have not been helpful.
> I appreciate any help you can give us or pointing us in the right direction.
>
> Thank you,
> --
> Sandy B.
>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jan 16 2020 - 14:59:49 CET
