Re: Oracle RAC - Any formula to calculate/preempt the right number of LMS processes?

From: PD Malik <pdthedba_at_gmail.com>
Date: Wed, 22 Jun 2011 22:15:08 +0100
Message-ID: <BANLkTi=JeUaUQNmEEPwveoqc+hCsq4R_TQ_at_mail.gmail.com>



And you say ... you are not an expert ;-)

Thanks Riyaj - You've given me the best possible leads that I could ever get ! Really appreciated! I'll get on with the job now ...

Many Thanks again!

On Wed, Jun 22, 2011 at 8:09 PM, Riyaj Shamsudeen < riyaj.shamsudeen_at_gmail.com> wrote:

> Hello Malik
>
> While I agree with you that you should consider reducing
> gcs_server_processes to 5 as a general advice, you should also review why
> LMS process is consuming CPU (if they are indeed the culprit). Since LMS
> process are running from the start of an instance, topoz or other utility
> might simply show them as top consumer of cpu time.
>
> So, first of all, I would say, do the basic performance analysis. Review
> AWR reports to find SQLs and processes consuming CPU time. Tune those
> statements to reduce CPU usage. This is probably where you will gain much.
> Consider this scenario too: Let's just say that execution plan changes
> increased GC workload leading to busier LMS processes. Now, if you reduce
> the LMS processes, you will not gain anything. One could argue that this can
> even affect performance.
>
> Then, check out the interconnect traffic. Graph the interconnect
> traffic using AWR data to see if the traffic has gone up. If LMS is suddenly
> consuming more time, then, it is possible that interconnect traffic has gone
> up. In that case, you need to identify the segments causing increase in
> interconnect traffic. gv$segment_statistics or AWR tables can be used to
> identify the segments generating more blocks to be transferred. Identify why
> the blocks are transferred aggressively now, may be that, execution plan
> changes caused the increase in interconnect traffic.
>
> Another reason is that working set of buffers for your application has
> gone up inducing more Global cache transfers. Is your buffer cache properly
> sized? Is Automatic Shared Memory Management is causing problems, refer
> v$sga_dynamic_components.
>
> If AWR report doesn't show any of these items, review OS level
> statistics of LMS process. Take many pstack and errorstate of LMS process to
> review why LMS is consuming CPU.
>
> About dynamic remastering, version 10.1 used file based remastering and
> from version 10.2+ onwards it is object based.
>
> Let me summarize this long email:
> 1. Check if the execution plan changes are causing increased cpu time.
> 2. Check if the interconnect traffic has gone up. Identify the segments
> causing increased traffic.
> 3. Check the SQLs associated with the segments to see if you can see
> obvious issues.
> 4. Check if the buffer cache is good enough. Review ASMM views.
> 5. Review OS statistics on that LMS process.
>
>
> Cheers
>
> Riyaj Shamsudeen
> Principal DBA,
> Ora!nternals - http://www.orainternals.com - Specialists in Performance,
> Recovery and EBS11i
>
> Blog: http://orainternals.wordpress.com
> OakTable member http://www.oaktable.com
>
> Co-author of the books: Expert Oracle Practices<http://tinyurl.com/book-expert-oracle-practices/>,
> Pro Oracle SQL, Expert PL/SQL Practices<http://tinyurl.com/book-expert-plsql-practices>
>
>
>
> On Wed, Jun 22, 2011 at 6:52 AM, PD Malik <pdthedba_at_gmail.com> wrote:
>
>> Riyaj & Gopal - Thanks for responding to my query.
>>
>> Riyaj Bhai - You are an expert - no denying there - for yourself you may
>> have the entire ocean to cover for people like me who are still learning to
>> swim - you are an expert :-)
>>
>> Just to answer your & Gopal's Que on RT to start with : We are on 10gR2 on
>> AIX 6 and LMS is in _high_priority_processes and _os_sched_high_priority
>> is also set to default 1 so I'd assume that it is set on high priority.
>>
>> Gopal, Is this somehow can be done in AIX as well pls do you know?
>> "One of the most common practices is to schedule the number of RT
>> processes < # of CPUs to avoid spin/hang on CPUs."
>> and also how to check whether these settings are working fine or not?
>>
>> A bit more explanation of our CPU behviour (very broadly) - We've seen
>> that in past some months our CPU usage has gone very high compared to the
>> incremental user load on the system and our capacity management departement
>> has highlighted LMS being one of the top consumers (amongst other
>> candidates). Its not that it always used to behave earlier no apparanely its
>> always been bad and has come to our attention now because of this CPU issue.
>>
>> While also looking at other things it seems that this LMS setting is
>> anyway too high and looks like Oracle is 'correcting' it in 11g so this is
>> something we shuold look at doing anyway.
>>
>> I had a bit of google digging on this CR Fabrication lead that you
>> provided (Thanks to Gopal and his answer he posted in one of the forums)
>> and this is probably you meant :
>>
>> SQL> select inst_id, cr_requests, light_works, data_requests,
>> fairness_down_converts, (fairness_down_converts/data_requests)*100
>> INST_ID CR_REQUESTS LIGHT_WORKS DATA_REQUESTS
>> FAIRNESS_DOWN_CONVERTS RATIO
>> ---------- ----------- ----------- ------------- ----------------------
>> ----------
>> 2 58831118 74411475 58831114 9720672
>> 16.523012
>> 1 350485652 884185071 350485608 59529981
>> 16.9850001
>> 6 14175101 20025830 14175101 2019579
>> 14.2473694
>> 4 13457882 19142318 13457880 1960022
>> 14.5641215
>> 3 440964324 909842437 440964274 79782554
>> 18.0927478
>>
>> And the ratio looks OK to me (less than 40% - even though aggregated but
>> still indicative I think) and it seems there isn't much of CR Fabrication
>> going on BUT I may be looking totally at wrong thing and if thats the case
>> then I'd request if you can point me to any writeups/whitepaperts etc. that
>> I can go thru to understand it a bit better - that'll be a great help.
>>
>> I know, to know RAC I'll need to read Gopal's book which is already in my
>> To Do list but unfortunately thats not an overnight thing.
>>
>> So our current approach is take these new settings (5 most probably) thru
>> a Performance Test and see how these settings are bahiving by looking and gc
>> waits and stats. I know its not the best approch as its 'trying to fix'
>> without getting to the root cause (due to lack of total mastery in this
>> area) but hopefully a performance test will give us a good confidence. Any
>> comments about our approach pls?
>>
>> The only other thing that I've been thinking is that in 11g as you know
>> remastering is at file level instead on object level so is it that this file
>> block remastering is more 'efficient' than object remastering hence Oracle
>> has massively downsized the formula to calculate the default number of LMS?
>> This along with other optimizations in building/pinning/sending a block etc.
>> in 11g might have reduced the pressure on LMS hence the lower number may be
>> OK in 11g but it may not be a good idea in 10g yet? Just thinking ...
>>
>> Sorry this email has gone a bit long ..
>>
>> Thanks.
>>
>>
>> On Tue, Jun 21, 2011 at 5:14 PM, Riyaj Shamsudeen <
>> riyaj.shamsudeen_at_gmail.com> wrote:
>>
>>> Hello Malik
>>>
>>> I am no RAC expert, seems like a ocean more to learn.
>>>
>>> I would agree with KG's comments with one enhancement. I would go with
>>> an odd number, 3 or 5, prior to 11g version. There is an internal bug (10.2)
>>> where the distribution of workload is non-uniform if # of LMS processes is
>>> an even number. ( I know, it is weird, but I don't have any more details,
>>> can't reproduce it, a RAC developer told me this few years ago, and seems
>>> like it is worth a risk to take making it a prime number. This comment does
>>> not apply to 11g and above. )
>>>
>>> As you have figured out, 11.2 by default, sets this parameter to more
>>> meaningful value. If you are looking for an algorithm, why not use 11.2
>>> default algorithm, then? In this case, for 80 CPUs, formula will be
>>> 2+(80/32)= 4.5, round that up to 5.So, set gcs_server_processes to 5.
>>>
>>>
>>> http://download.oracle.com/docs/cd/E14072_01/server.112/e10820/initparams088.htm
>>>
>>> Remember that just because LMS is running in RT priority (Solaris
>>> lingo) doesn't mean that it consumes CPU. RT simply bumps the LMS process to
>>> the front of the CPU run queue for immediate scheduling, if LMS needs to use
>>> CPU. You should also review to understand why LMS consuming CPU? CR
>>> fabrication or Current mode buffer creation (most common reasons) or is it
>>> due to some other reason?
>>>
>>> Cheers
>>>
>>> Riyaj Shamsudeen
>>> Principal DBA,
>>> Ora!nternals - http://www.orainternals.com - Specialists in
>>> Performance, RAC and EBS11i
>>> Blog: http://orainternals.wordpress.com
>>> OakTable member http://www.oaktable.com
>>>
>>> Co-author of the books: Expert Oracle Practices<http://tinyurl.com/book-expert-oracle-practices/>,
>>> Pro Oracle SQL, Expert PL/SQL Practices<http://tinyurl.com/book-expert-plsql-practices>
>>>
>>>
>>>
>>> On Tue, Jun 21, 2011 at 10:02 AM, K Gopalakrishnan <kaygopal_at_gmail.com>wrote:
>>>
>>>> Malik,
>>>>
>>>> Are you running them in real time priority? If yes, you can reduce it to
>>>> 4 or (# of instances-1) and monitor the latencies of the RAC wait
>>>> events/CPU usage of the LMS processes. AFAIK there is no magic formula for
>>>> calculating the # of LMSs.
>>>>
>>>> One of the most common practices is to schedule the number of RT
>>>> processes < # of CPUs to avoid spin/hang on CPUs. If your current settings
>>>> works fine for your workload, I would stick to it.
>>>>
>>>> -Gopal
>>>>
>>>>
>>>> On Tue, Jun 21, 2011 at 6:02 AM, PD Malik <pdthedba_at_gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are on 10gR2 5 node RAC (a huge and very busy DB) and also very high
>>>>> number of CPUs (80ish) hence the default number of LMS processes that Oracle
>>>>> generates for us (num_cpu dependant) is way too high and is causing CPU
>>>>> consumption issue (due to their high priority) for some time now so we need
>>>>> to look to reduce (and fix) them to a sensible number and going by RAC
>>>>> expert Riyaz's advice we shud reduce them down to 4 (one for each 'other'
>>>>> RAC node) which makes sense to us particularly becs if we were going 11g and
>>>>> left this parameter to Oracle's default calculation it'll come down to 3.
>>>>> (Current number of LMS processes : (Due to the varied number of CPUs ) - On
>>>>> 2 nodes, 10, on 1 12 and on the other 2 : 8).
>>>>>
>>>>> Moreover, we've done the analysis and there are no major RAC wait
>>>>> events (as in we are not much suffering from RAC) and we "think" we should
>>>>> be fine by reducing these and obviously we'll take them thru to Performance
>>>>> Test to gain some confidence.
>>>>>
>>>>> However the question is, is there any other mathematical way of getting
>>>>> to just the right number of LMS processes for a RAC instance pls? Is there
>>>>> any formula to predict them or even to predict if there will be any latency
>>>>> impact by reducing these?
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>
>>>>
>>>
>>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jun 22 2011 - 16:15:08 CDT

Original text of this message