Re: To Swap, or not to Swap

From: kyle Hailey <kylelf_at_gmail.com>
Date: Mon, 3 Apr 2023 09:35:10 -0700
Message-ID: <CADsdiQjEOhC0yGtGY4Kw5XJpqVBOe4JEP98sfoYYawTE0ar1QA_at_mail.gmail.com>



Was just thinking, there are so many folks here who know Unix internals way better than me ... what metric do you use to track memory pressure?

Let me frame the question with some similar context. CPU %utilization is a bit of a crappy metric: if we are at 100%, I don't know whether that is meeting the demand exactly or whether there is a huge backlog of executable code wanting to run. Run queue is a bit crappy too, because it has I/O waiters and other issues mixed in, including internal OS locks, waits on memory allocation, etc. AAS (average active sessions), on the other hand, is pretty cool. It tells me whether 8 processes/queries want to run on 8 vCPUs, which might be fine, or whether 80 processes want to run on those 8 vCPUs, which is some serious CPU concurrency.

What can one use to monitor memory demand? Just as CPU %utilization is bad, I'd say MemAvailable is similarly bad and even weaker. Page out is a canary in the coal mine. Page in is just bad. But that requires swap.

Working on systems without swap, I have seen scan rates mentioned over the years, and I never used them because I used page in/out. Then at RDS, working on a test suite that had major memory issues, I used perf and noticed that, before the system went down, perf showed the top function calls to be scans of memory lists.

Perf Top:

 21.90%  [kernel]                 [k] shrink_inactive_list
  3.28%  libjvm.so                [.] BacktraceBuilder::push
  3.26%  [kernel]                 [k] shrink_page_list
  3.25%  [kernel]                 [k] __lock_text_start
  2.00%  libjvm.so                [.] CodeHeap::find_blob_unsafe
  1.46%  [kernel]                 [k] __raw_callee_save___pv_queued_spin_unlock
  1.39%  [kernel]                 [k] finish_task_switch
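
For anyone who wants to watch that scan activity without perf, one option is to sample the kernel's reclaim counters. The following is only a rough sketch: it assumes a Linux kernel that exposes `pgscan_kswapd*`, `pgscan_direct*`, `pswpin` and `pswpout` in /proc/vmstat (the exact counter names differ between kernel versions), and the function names are made up for this example.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def _total(counters, prefix, exclude=()):
    """Sum every counter whose name starts with prefix, minus excluded names."""
    return sum(value for name, value in counters.items()
               if name.startswith(prefix) and name not in exclude)


def pressure_rates(before, after, seconds):
    """Per-second deltas between two parse_vmstat() snapshots."""
    def rate(prefix, exclude=()):
        delta = _total(after, prefix, exclude) - _total(before, prefix, exclude)
        return delta / seconds

    return {
        # pages scanned by the background reclaimer (kswapd)
        "kswapd_scan_per_s": rate("pgscan_kswapd"),
        # pages scanned by processes stalled in direct reclaim;
        # pgscan_direct_throttle counts events, not pages, so it is skipped
        "direct_scan_per_s": rate("pgscan_direct",
                                  exclude=("pgscan_direct_throttle",)),
        "swap_in_per_s": rate("pswpin"),
        "swap_out_per_s": rate("pswpout"),
    }


def read_vmstat(path="/proc/vmstat"):
    """Take one snapshot from the live system (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking two `read_vmstat()` snapshots a few seconds apart and passing them to `pressure_rates()` gives scan and swap rates per second. A sustained non-zero direct scan rate would be the part to look at on a box without swap, since page in/out stays at zero there.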



On Mon, Apr 3, 2023 at 12:56 AM Timur Akhmadeev <timur.akhmadeev_at_gmail.com> wrote:

> Just an example for zero swap from Netflix:
> https://www.brendangregg.com/Slides/AWSreInvent2017_performance_tuning_EC2.pdf
>
> Usage: - Swappiness is set to zero to disable swapping and favor ditching
>> the file system page cache first to free memory. (This tunable doesn’t make
>> much difference, as swap devices are usually absent.)
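
To check where a given host stands on this, a small sketch along the following lines may help. It assumes the usual Linux locations /proc/sys/vm/swappiness and /proc/swaps (one header line, then one row per swap device with sizes in KiB); the helper names are hypothetical.

```python
def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def parse_swaps(text):
    """Return (device, size_kib, used_kib) for each row of /proc/swaps text."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 4:
            rows.append((fields[0], int(fields[2]), int(fields[3])))
    return rows


def describe(swappiness, swaps):
    """Summarize swap configuration in one line."""
    if not swaps:
        return f"no swap devices; vm.swappiness={swappiness} has little effect"
    total = sum(size for _, size, _ in swaps)
    used = sum(used_kib for _, _, used_kib in swaps)
    return (f"{len(swaps)} swap device(s), {used}/{total} KiB used, "
            f"vm.swappiness={swappiness}")
```

On a live system this would be called as `describe(read_swappiness(), parse_swaps(open("/proc/swaps").read()))`.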
>
>
> On Mon, Apr 3, 2023 at 2:13 AM Jared Still <jkstill_at_gmail.com> wrote:
>
>> So, I would like to devise some testing for this, with and without swap.
>>
>> Suggestions for metrics to track?
>>
>> There are certain things I would like to track, mostly from an app
>> perspective.
>>
>> But also I would like to see how responsive the system is under severe
>> memory pressure, both with and without swap.
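
For the responsiveness part, one rough probe (an assumption on my side, not something anyone in this thread has described using) is to sleep for a fixed interval in a loop and record how late each wakeup is. On a healthy system the lateness stays tiny; when the machine is thrashing, the probe itself gets stalled and the numbers grow. The function names below are invented for the sketch.

```python
import time


def wakeup_delays(interval=0.1, samples=50):
    """Sleep `interval` seconds `samples` times; return each wakeup's lateness."""
    delays = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        late = time.monotonic() - start - interval
        delays.append(max(0.0, late))  # clamp clock-resolution noise
    return delays


def summarize(delays):
    """Return the worst and the median lateness in seconds."""
    ordered = sorted(delays)
    return {"max_s": ordered[-1], "median_s": ordered[len(ordered) // 2]}
```

Logging `summarize(wakeup_delays())` with a timestamp during each test run would give a simple number to compare between the swap and no-swap cases.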
>>
>>
>>
>> On Sat, Apr 1, 2023 at 06:29 Frits Hoogland <frits.hoogland_at_gmail.com>
>> wrote:
>>
>>> I too keep on coming across systems with no swap. Our YugabyteDB systems
>>> are setup with no swap.
>>> And my first reaction was identical to most others: what?! No swap?!
>>>
>>> Since then, my position on swap versus no swap has shifted much more toward
>>> seeing the benefit of no swap, and not being fiercely against it.
>>> Of course the only right answer to swap or not is: it depends.
>>>
>>> The way I see it, is that swap is like a soft pillow. If you are running
>>> into memory shortage, swap will soften the landing, making the system
>>> increasingly slower, then bringing it to a standstill, and then still
>>> killing the system. And therefore the question to ask is: do we want to get
>>> into a situation of unpredictable slowness before the OOM kill? The latter
>>> is how I look at it now: there have been countless hours spent on trying to
>>> make sense of swap, trying to understand and tune swap and swapping, whilst
>>> there always is, and must be, an actual problem that caused the swapping.
>>> So removing swap removes that discussion and gets you straight to facing
>>> the problem.
>>>
>>> There are two things that I see additionally:
>>> - you might argue that swap will only be used mildly (…in your
>>> case). I would argue that if you cannot control the server to only take the
>>> actual memory, and it swaps, how the hell can you control it to just mildly
>>> swap?
>>> - many servers perform some swapping whilst memory pressure is never
>>> seen. One common reason for that is that buffered IO is treated with equal
>>> priority as memory allocations. That means that if you start performing
>>> lots of IOs using buffered calls, the OS might, and will, start paging out
>>> existing memory allocations that have not recently been touched, such as
>>> bootstrap code for an application, because the buffered IO has gotten higher
>>> priority. One extremely common case where this happens is with most
>>> common backups. (I hope this will be an “aha” moment for lots of people
>>> asking why their database server starts to allocate some swap, whilst it
>>> never did get over-allocated)
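
To see which processes actually had pages pushed out in a case like the backup example, the per-process `VmSwap` line in /proc/<pid>/status can be listed. This is a minimal sketch assuming a Linux /proc layout; the function names are hypothetical, and kernel threads, which have no `VmSwap` line, are reported as zero.

```python
import os


def vmswap_kib(status_text):
    """Return the VmSwap value in KiB from /proc/<pid>/status text, or 0."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return 0


def process_name(status_text):
    """Return the Name field from /proc/<pid>/status text, or ''."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else ""
    return ""


def swapped_processes(proc_root="/proc"):
    """List (swap_kib, pid, name) for processes with swapped pages, largest first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or is not readable
        kib = vmswap_kib(text)
        if kib:
            results.append((kib, int(entry), process_name(text)))
    return sorted(results, reverse=True)
```

Running `swapped_processes()` before and after a backup window would show whether idle processes are the ones carrying the swap usage.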
>>>
>>>
>>> *Frits Hoogland*
>>>
>>>
>>>
>>>
>>> On 1 Apr 2023, at 15:10, Mark W. Farnham <mwf_at_rsiz.com> wrote:
>>>
>>> “Are there no longer any scenarios where the swapfile allows the system
>>> to recover, without failing or hanging?”
>>>
>>> First, good ask Jared, excellent analysis Tim, from my viewpoint.
>>>
>>> I would slightly alter Tim’s question:
>>>
>>> “For the goal of the server in question, are there any scenarios where a
>>> swapfile allows the system to recover without failing or hanging?”
>>>
>>> For a server with a primary goal of providing the support of one or more
>>> instances of Oracle which are allocated within the bounds of the server, I
>>> can imagine some “clients” of the database services being allowed to run
>>> directly on the database server to eliminate all the latencies that occur
>>> between servers.
>>> With a very fast swapfile AND a decently implemented sniping monitor, I
>>> further imagine the database services continue to deliver within the
>>> planned service quality while the rogue client is paused or killed with
>>> data and logs for analysis. (Hint, if the rogue client is holding a system
>>> lock or an application lock that needs to be shared, I’m thinking pausing
>>> is not an option.)
>>>
>>> There are certainly other scenarios where fail as fast as possible and
>>> recover is better for the goal than even trying to recover.
>>>
>>> So I think Clay is also right that “it depends,” begging the question of
>>> what is the best solution for someone supporting a fleet of generically
>>> configured servers. Frankly, it would never have occurred to me to
>>> *NOT* have swap on the popular OS copied from UNIX, mostly because I
>>> don’t know whether lack of a certain amount of swap still tosses warnings
>>> that freak out customers (or actually fail the install) when installing my
>>> favorite RDBMS.
>>>
>>> Seymour Cray, of course, used to say things about only implementing
>>> virtual memory if you want things to be slower than they need to be.
>>>
>>> For database services there is probably room for an OS built for a
>>> direct addressing cpu/memory complex. In that case programs would only
>>> start if real space declared to be needed is available and addresses are
>>> resolved to real addresses at program load time. I’m not even sure whether
>>> modern chip technology could be faster with direct addressing than with
>>> virtual addressing.
>>>
>>> I suppose quantum computing has problems if you try to use virtual
>>> memory and/or swapping….
>>>
>>> mwf
>>>
>>>
>>>
>>> *From:* oracle-l-bounce_at_freelists.org [mailto:
>>> oracle-l-bounce_at_freelists.org] *On Behalf Of *Tim Gorman
>>> *Sent:* Thursday, March 30, 2023 8:25 PM
>>> *To:* jkstill_at_gmail.com; Oracle-L Freelists
>>> *Subject:* Re: To Swap, or not to Swap
>>>
>>>
>>> Jared,
>>>
>>> You've made a good point with your testing. In essence, *fail fast*.
>>> If it is just *fail fast* versus *fail slow*, then of course we all
>>> choose to *fail fast* and then recover.
>>>
>>> The only question that comes to my mind is whether the presence of a
>>> swapfile always means slow failure.
>>>
>>> Are there no longer any scenarios where the swapfile allows the system
>>> to recover, without failing or hanging?
>>>
>>> For example, in Azure, VMs can use remote storage (a.k.a. OsDisk) for
>>> the swapfile, or VMs can locate the swapfile on optional direct-attached
>>> SSD storage that is considered "temporary" or ephemeral, because when the
>>> VM is stopped and deallocated, the direct-attached storage has to be
>>> erased, because another VM may be allocated to it in future. It is not
>>> quality of storage that makes it "ephemeral", just the use-case. Anyway,
>>> the OsDisk has I/O latency averaging 0.70 ms for both reads and writes, but
>>> the so-called "ephemeral" disk provides less than 0.05 ms I/O latency,
>>> which is about 14x faster.
>>>
>>> Clearly the performance of the storage on which the swapfile resides is
>>> going to make a difference in its usefulness. If your testing involved
>>> slow storage, then I can see where the machine would take 7-8 mins to
>>> fail. I'm not trying to denigrate the resources you used, but I'm trying
>>> to ask whether, if the swapfile is on fast storage, it could perhaps be more
>>> helpful, even in extreme situations?
>>>
>>> In other words, shouldn't we ensure that a swapfile is fast, as well as
>>> big enough? Wouldn't more performant storage allow the swapfile to recover
>>> the situation?
>>>
>>> Thanks so much for the thought exercise!
>>>
>>> -Tim
>>>
>>> On 3/30/2023 10:46 AM, Jared Still wrote:
>>>
>>> I was recently asked this same question by a colleague.
>>>
>>> He had been asked by a client, with a fairly well regarded sysadmin team.
>>>
>>> They wanted to eliminate swap: here's why.
>>>
>>> If a process is consuming memory at a prodigious rate, then the OOM (out
>>> of memory) killer is going to catch up to it and kill it eventually.
>>>
>>> Their position was that with a swap partition, this process was
>>> prolonged far too long.
>>>
>>> Without swap, the process gets killed relatively quickly.
>>>
>>> With swap, it can take many minutes. The CPU spends so much time
>>> managing memory on swap (remember, we are at an OOM condition), which is
>>> slow, that the time to kill the process is prolonged to many minutes.
>>>
>>> At first my position was "what, no swap! we can't do that!"
>>>
>>> But, I decided to test it a bit.
>>>
>>> A small physical server, 4 cores and 32G of RAM, is running Oracle 19.3.
>>>
>>> A swingbench test is running, 10 sessions per core.
>>>
>>> When I cause an OOM condition with the 16G swap partition enabled, it
>>> took the system between 7.5-8 minutes to kill the process.
>>>
>>> (For the client, the amount of time was 20+ minutes.)
>>>
>>> And during that time, it was impossible to logon to the server. The CPU
>>> was too busy thrashing around in the swap partition.
>>>
>>> The next step of course is to disable the swap.
>>>
>>> Same OOM condition caused. Time to resolution is now 7 seconds.
>>>
>>> There is no swap to manage as if it were RAM.
>>>
>>> That is quite a big difference.
>>>
>>> Of course I wondered 'what about paging in memory for new processes?',
>>> as that often uses a page in swap.
>>>
>>> Without swap, it just takes place in memory.
>>>
>>> Swap is also a landing place for some pages used to initialize
>>> processes, as they can only be used once.
>>>
>>> This is a minimal amount, and can just be left in memory.
>>>
>>> If one really wants to conserve, there is a thing called ZRAM
>>> (compressed memory) where those pages can be parked, instead of swap.
>>>
>>> So, does anyone see any other need for a swap partition?
>>>
>>> It seems to have outlived its usefulness.
>>>
>>> Jared Still
>>> Certifiable Oracle DBA and Part Time Perl Evangelist
>>> Principal Consultant at Pythian
>>> Oracle ACE Alumni
>>> Pythian Blog http://www.pythian.com/blog/author/still/
>>> Github: https://github.com/jkstill
>>>
>>> Personality: http://www.personalitypage.com/INTJ.html
>>>
>>>
>>>
>>> On Thu, Mar 30, 2023 at 9:24 AM Jared Still <jkstill_at_gmail.com> wrote:
>>>
>>> That is the question.
>>>
>>> I am curious about current thoughts on having or not having a swap
>>> partition on Linux based Oracle servers.
>>>
>>> Let's assume typical production standard servers with a reasonable
>>> amount of RAM, say 256G or more.
>>>
>>> I have some thoughts on this myself, but would like to see others'
>>> thoughts on this.
>>>
>>>
>>>
>>>
>>
>>
>>
>
> --
> Regards
> Timur Akhmadeev
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Apr 03 2023 - 18:35:10 CEST
