Re: Query Performance issue

From: Jonathan Lewis <>
Date: Thu, 31 Dec 2020 11:49:16 +0000
Message-ID: <>

It looks as if there's something going wrong - but possibly it's the nature of the algorithm rather than a bug.

Arguments for and against:

For: My guideline on 20% of (parameter - allocated) .... which should also be roughly 20% of "aggregate PGA auto target" .... is based on previous experience and what I would like to call rational inference, so reporting 5GB usage suggests there's still 10GB to play with, allocating 500MB shouldn't be a problem, and the global memory bound could easily be 2GB (although there seems to be a 1GB limit). This suggests there's a bug in the code that tries to re-evaluate the global memory bound as PGA is allocated and released.

Against: My observations may have been based on fairly "typical" use of workareas in the systems I was looking at, and I've just checked the 11.2 definition of the "global memory bound" and there's a hint there of an algorithm that I hadn't considered in my "rational inference". The description includes the sentence: "The global memory bound generally decreases when the number of active work areas is increasing in the system." I've been thinking about the SIZE of memory - what if Oracle is simply counting the NUMBER of allocations? Potentially a very bad algorithm, but imagine the following scenario:

You have allocated 5GB out of 15GB, leaving 10GB available. There are currently 1,000 workareas active, but they are all tiny sorts. What if the code that recalculates the boundary (which runs every 3 seconds, typically) says something like:
  There are 1,000 workareas active, so I'd better allow for another thousand to be created

   10GB / 1000 = 10MB.
Push that 1,000 up to 10,000 - which might be a side effect of some other Oracle bug, or some other code change in the application that is now leaving cursors open - and you hit the limit of 1MB even though you've got a huge amount of memory available.
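The speculation above reduces to simple arithmetic, which you can sketch in a query. To be clear, this is only my guess at the shape of the calculation, not Oracle's published algorithm:

```sql
-- Speculative arithmetic only: (target - allocated) / active workareas.
-- 15GB target, 5GB allocated, values expressed in MB.
select round((15*1024 - 5*1024) / 1000)  mb_per_wa_at_1000,
       round((15*1024 - 5*1024) / 10000) mb_per_wa_at_10000
from   dual;
-- 1,000 active workareas  => roughly 10MB each
-- 10,000 active workareas => roughly 1MB each
```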

A possible test of this theory is to query v$sql_workarea_active to see how many workareas are still active.
Another thought that has crossed my mind relating to this (possible) algorithm: your report of v$sql_workarea for the two tests of the query that you ran showed FIVE workareas, of which three were for "SEGMENT SCAN", and I was only expecting to see the two for the hash joins (maybe workareas for segment scans are an Exadata thing). If you've got lots of small tablescans that might give you a large count in v$sql_workarea_active.
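Something like the following would give a quick picture of the current count by type (assuming the documented columns of v$sql_workarea_active - check against your version):

```sql
-- How many workareas are active right now, and of what type?
select operation_type, count(*) active_workareas
from   v$sql_workarea_active
group  by operation_type
order  by count(*) desc;
```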

Another little check that's worth doing as a check for "background noise" is to start a SQL*Plus session, find its PID, and query v$process_memory for that PID to see how much memory a simple connection to the database takes. In my case it's about 1MB (with about 0.5MB allocated then freed). Possibly it's much larger with your larger setup.
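As a convenience you can run the check from the new session itself - this sketch joins the current session to its server process, assuming a dedicated server connection:

```sql
-- Memory categories for the current session's server process.
select p.pid, pm.category,
       round(pm.allocated/1024/1024, 1) allocated_mb,
       round(pm.used/1024/1024, 1)      used_mb
from   v$session s, v$process p, v$process_memory pm
where  s.sid  = sys_context('userenv', 'sid')
and    p.addr = s.paddr
and    pm.pid = p.pid
order  by pm.allocated desc;
```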

You've also mentioned that the "total PGA returned to OS" shows a sinusoidal pattern over time. It shouldn't do that: it's supposed to be cumulative, so it should always increase until you restart the instance. So that's another indication of some sort of accounting error. (Maybe there's some code that gets mixed up between 32-bit signed and unsigned values, or something of that sort, that's introducing a problem.)

It would be interesting to see a couple of reports of a simple "select * from v$pgastat" (with a suitable linesize and format). Run the query twice with 10 seconds between executions, and "set time on" so we can see the timestamp in the SQL prompt for each run.
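A minimal SQL*Plus script along these lines would do - the formatting values are just suggestions:

```sql
set linesize 120 pagesize 60 time on
column name  format a45
column value format 999,999,999,999
select name, value, unit from v$pgastat;
-- pause roughly 10 seconds, then run it again:
select name, value, unit from v$pgastat;
```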

Bottom line: I think you're going to have to raise an SR to find out how you can get a global memory bound of 1MB when you have 10GB of PGA unallocated, but there's a slight chance that a couple of v$pgastat queries could give us a better clue.

Jonathan Lewis

On Thu, 31 Dec 2020 at 09:15, Lok P <> wrote:

> Thank you, Jonathan.
> Seeing multiple things, I am not sure if I am mixing things up and thus
> missing the real issue.
> I found the "total PGA allocated" is currently showing as ~5GB with
> "global memory bound" as ~1MB in v$pgastat. Then I went to
> dba_hist_pgastat to see the max it touched in the past (at least in the last
> ~40 days of AWR), and I see a max of ~9GB at some point in time; even
> in that case, as per the calculation, 20% of (pga_aggregate_target i.e.
> 15GB - total PGA allocated i.e. 9GB) = >1GB, so that should not make the
> "global memory bound" go down to ~1MB. Why is it happening?
> Then I went to gv$sesstat using the below query, and it sums
> up to ~14GB.
> select, sum(st.value)/1024/1024/1024 total_gb
> from gv$sesstat st, gv$statname sn
> where st.STATISTIC# = sn.STATISTIC# and sn.inst_id = st.inst_id and
> sn.inst_id = 1
> and lower( like 'sess%mem%'
> group by
> order by
> /
> session pga memory 3.96
> session pga memory max 6.53
> session uga memory 1.51
> session uga memory max 1.85
> Then digging into the top sessions consuming PGA from gv$sesstat, I
> see the top consumers below; three of them seem to be GGS related and one
> application related, but all are inactive at the moment.
> session pga memory 145 MB 1 1464 1 SYS LogMiner reader: redo (idle)
> GoldenGate
> session pga memory 143.8 MB 2 2806 4195 XXXXXX SQL*Net message from client
> DDTek.Oracle.dll
> session pga memory 139.8 MB 3 1378 15 SYS LogMiner reader: redo (idle)
> GoldenGate
> session pga memory 139.7 MB 4 1421 3 SYS LogMiner reader: redo (idle)
> GoldenGate
> session pga memory 120.7 MB 5 99 4041 XXXXXX SQL*Net message from client
> DDTek.Oracle.dll
> session uga memory 138.7 MB 1 2806 4195 XXXXXX SQL*Net message from client
> DDTek.Oracle.dll
> session uga memory 115.5 MB 2 99 4041 XXXXXX SQL*Net message from client
> DDTek.Oracle.dll
> session uga memory 91.8 MB 3 390 25633 YYYYYY SQL*Net message from client
> session uga memory 91.1 MB 4 305 5667 YYYYYY SQL*Net message from client
> session uga memory 89.5 MB 5 3060 2773 YYYYYY SQL*Net message from client
> Then I tried peeking into v$process and v$process_memory for the top two
> sessions, i.e. 1464 and 2806; they are as below. I do see a significant
> allocation to the Other category, but I am not sure if that is really
> pointing to any bug.
> ROUND(PM.ALLOCATED/1024/1024) ROUND(PM.USED/1024/1024)
> 1464 SYS 188271 146 106 1 Other 145 215
> 1464 SYS 188271 146 106 1 Freeable 1 0
> 1464 SYS 188271 146 106 1 PL/SQL 0 0 0
> 1464 SYS 188271 146 106 1 SQL 0 0 1
> ROUND(PM.ALLOCATED/1024/1024) ROUND(PM.USED/1024/1024)
> 2806 XXXXXX 348186 148 209 223 SQL 83 77 87
> 2806 XXXXXX 348186 148 209 223 Other 60 107
> 2806 XXXXXX 348186 148 209 223 Freeable 4 0
> 2806 XXXXXX 348186 148 209 223 PL/SQL 1 0 3
> I also verified the trend of the resource "processes" in
> dba_hist_resource_limit, and I see there is an increase in the
> MAX_UTILIZATION value for processes from ~917 to ~1400, but that increase
> trend/timeline does not exactly match the "global memory bound"
> trend/timeline, which dropped from ~1GB to 1MB.
> Is this really pointing to a bug, or are we really suffering because of the
> high number of processes? I am not able to pinpoint any one odd
> session which is causing this issue.
> Regards
> Lok
> On Thu, Dec 31, 2020 at 1:32 AM Jonathan Lewis <>
> wrote:
>> The global memory bound is the immediate cause of the multipass hash
>> joins with a huge number of passes.
>> There are two main reasons why it might get very small:
>> a) lots of processes demanding lots of memory - for your version of
>> Oracle the global memory bound eventually drops to 20% of
>> (pga_aggregate_target (parameter) - (v$pgastat) "total PGA allocated"), and
>> that value is the memory allowed per workarea.
>> b) a memory leak, possibly with incorrect feedback to some of the
>> PGA-related views. There's a blog note here:
>> which is
>> probably not relevant, but access to the views (v$sesstat,
>> v$process, v$process_memory) should give you some idea of where the PGA has
>> gone and where it might be incorrectly accounted.
>> Regards
>> Jonathan Lewis
>> On Wed, 30 Dec 2020 at 19:28, Lok P <> wrote:
>>> Talking of the root cause, I am seeing something odd with respect to the
>>> overall system. As highlighted in one of the threads by Jonathan, the
>>> difference between the estimated optimal size and memory_used in
>>> gv$sql_workarea seems to be very high. He then pointed to two statistics,
>>> "global memory bound" and "PGA freed back to OS", though I was not used to
>>> checking these two in the past. When I tried to see the trend, "PGA
>>> freed back to OS" seems to vary in a sinusoidal wave fashion
>>> throughout the day, and while I see some change, there is no
>>> significant change in that pattern. However, the "global memory bound"
>>> statistic in dba_hist_pgastat dropped suddenly from ~1GB to
>>> ~1MB around 24th November (the same time this query started to crawl),
>>> and a few times I see it fluctuating between ~10 and 50MB, but that is too
>>> small compared to the consistent ~1GB in the past. The same thing happened
>>> on node-2, but there it dropped around 4th December. So I am trying to
>>> understand what could be the cause of this fall in the statistic, and
>>> whether it may be the reason why we are seeing exponential growth in the
>>> multipass figures in dba_hist_sql_workarea_hstgrm and for this SQL
>>> slowness too.

Received on Thu Dec 31 2020 - 12:49:16 CET
