What is your operating system?
What is the version of the operating system?
Does ps -elf show any growth in the dispatchers?
*****************************( Source: BUGDB )*******************************
Article-ID: <Bug:513974> Related: <BugMatrix:513974> Base: <Bug:>
Customer: Created: 09-JUL-97
Component: NET Comp Ver: 2.3.3 Rel St: P Updated: 21-JUL-97
Sub Comp: RDBMS Ver: 7.3.3 By: RWESSMAN
Status: 80,Q/A To Development
Sup Pri: 1,Complete Loss of Service Fixed In Ver: 2.3.4
O/S: 453 Sun Solaris V2 Sparc
PL Group: UNIX Gen/Port: G Assigned: RWESSMAN Error #: ORA 4030 Pub: Y
Abstract: POSSIBLE MEMORY LEAK IN DISPATCHER PROCESS - SWAP GRADUALLY DECREASES
*** MDEJESUS 07/09/97 01:03 pm ***
client's system crashed on the evening of 7/3/97 with ora 4030 and ora 7324.
rebooted system but could find no obvious problems; nothing being run should
have taken large amounts of space (ct has 2.1 gig ram and 7.x gig swap at
this time)
filed tar with australia and was advised on setting trace events which would
provide more info if this occurred again
.
not acceptable to client as system is 24x7 prod and they cannot afford many
outages
.
further monitoring of the system shows swap space decreasing at a rate of
about 150-200M per hour
.
investigation showed three dispatcher processes over 1 gig in size; these
were d000, d003 and d004
.
last night ct was down to 220M out of 7+gig; ct added another gig of swap and
scheduled downtime so that they could reboot and clear the swap (which they
did)
.
they reset the rlim_fd_max and rlim_fd_cur system parameters which had been
set out of bounds but this did no good; same growth in dispatchers was seen
.
ps -elf shows:
...
8 S marks 989 983 0 41 20 51eb1330 109 51eb1500 21:21:42
pts/0 0:00 tail -100f /oracle/trace_logs/odin/
8 S oracle 1001 993 0 41 20 51e62668 143 51e782ee 21:21:58
pts/1 0:00 -csh
8 S oracle 1292 1 0 41 20 51f56ce0 58690 51f5203c 21:23:40
? 0:22 ora_pmon_odin
8 S oracle 1304 1 0 41 20 51f56020 58719 51f5205c 21:23:41
? 0:08 ora_arch_odin
8 S marks 993 990 0 47 20 51eb0670 140 51eb0840 21:21:51
pts/1 0:00 -csh
8 S oracle 1554 1 0 41 20 521cacd0 58866 521a6826 21:27:34
? 5:23 oracleodin (LOCAL=NO)
8 S root 990 503 0 41 20 51e62008 188 51e8bbf6 21:21:51
? 0:00 in.telnetd
8 S oracle 1035 1 2 47 20 51e63988 508 51e8bd86 21:23:22
? 46:21 /oracle/7.3.3/bin/tnslsnr LISTENER
8 S root 981 503 0 41 20 51676cc8 188 51e8bc6e 21:21:34
? 0:00 in.telnetd
8 S root 1473 503 0 41 20 51f27998 188 51e8b53e 21:24:15
? 0:00 in.telnetd
8 S oracle 1316 1 0 41 20 5029a020 58694 51f5207c 21:23:42
? 0:04 ora_smon_odin
8 S oracle 1308 1 1 41 20 51eb0cd0 58701 51f5204c 21:23:41
? 12:01 ora_lgwr_odin
8 S root 4016 603 0 41 20 5298b9a0 219 51e8a4d6 10:51:58
? 0:00 /opt/USAssh/sbin/sshd
8 S oracle 1417 1 1 41 20 52104660 66722 51f5219c 21:23:45
? 20:49 ora_d004_odin
8 S oracle 1300 1 1 41 20 51e1e000 58693 51f5202c 21:23:41
? 17:36 ora_dbwr_odin
8 S oracle 1330 1 0 68 20 51f57340 58744 51f5209c 21:23:42
? 34:35 ora_snp0_odin
8 O oracle 1336 1 3 51 20 51eb0010 58747 21:23:42
? 37:44 ora_snp1_odin
8 S oracle 1313 1 0 40 20 51f579a0 58699 51f5206c 21:23:41
? 0:18 ora_ckpt_odin
8 S oracle 1324 1 0 41 20 51f26cd8 58692 51f5208c 21:23:42
? 0:01 ora_reco_odin
8 S oracle 5105 3527 0 41 20 524bb328 58703 516ae0d4 11:29:11
? 0:00 oracleodin (DESCRIPTION=(LOCAL=YES)
8 S oracle 1457 1 0 40 20 52148ce0 58757 51f522ec 21:23:48
? 0:16 ora_p003_odin
8 S oracle 4036 4028 0 41 20 529c4cc0 142 52619406 10:52:09
pts/6 0:00 -csh
8 S oracle 4501 1 3 41 20 5211a670 58772 30397a60 11:05:33
? 5:23 ora_s007_odin
8 S oracle 1411 1 0 41 20 51f26678 66647 51f5217c 21:23:44
? 19:44 ora_d002_odin
8 S oracle 1404 1 1 41 20 5210e668 68941 51f5215c 21:23:44
? 36:57 ora_d000_odin
8 S oracle 1406 1 0 41 20 51bb6678 63666 51f5216c 21:23:44
? 13:33 ora_d001_odin
8 S oracle 1415 1 1 41 20 52105980 70459 51f5218c 21:23:45
? 29:34 ora_d003_odin
8 S oracle 1419 1 1 45 20 5212d998 67018 51f521ac 21:23:45
? 21:56 ora_d005_odin
8 S oracle 1421 1 1 41 20 5211b330 67570 51f521bc 21:23:45
? 23:02 ora_d006_odin
8 S oracle 1423 1 3 56 20 5212ccd8 65518 51f521cc 21:23:45
? 20:40 ora_d007_odin
8 S oracle 1425 1 1 41 20 5212c678 64448 51f521dc 21:23:45
? 15:24 ora_d008_odin
8 S oracle 1427 1 1 45 20 5212c018 63207 51f521ec 21:23:45
? 12:30 ora_d009_odin
....
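To compare dispatcher sizes in a listing like the one above, the SZ column can be pulled out with a short script. The following is a minimal Python sketch, not part of the original bug log. It assumes the Solaris `ps -elf` column order (F S UID PID PPID C PRI NI ADDR SZ ...), that SZ is reported in pages, and that dispatcher processes are named `ora_dNNN_<sid>`. The function name `dispatcher_sizes` is made up for illustration. It accepts records whether or not each one is wrapped onto two lines.

```python
import re

# Matches MTS dispatcher process names such as ora_d004_odin
# (assumed naming pattern, based on the listing in the log).
DISPATCHER = re.compile(r"\bora_d\d{3}_\w+")


def dispatcher_sizes(ps_text):
    """Map each dispatcher process name to its SZ value (assumed to be pages)."""
    sizes = {}
    pending_sz = None
    for line in ps_text.splitlines():
        tokens = line.split()
        # A record starts with F S UID PID PPID C PRI NI ADDR SZ ...;
        # PID is the 4th field and SZ the 10th, and both are numeric.
        if len(tokens) >= 10 and tokens[3].isdigit() and tokens[9].isdigit():
            pending_sz = int(tokens[9])
        # The command name may sit on the same line or on the wrapped line.
        match = DISPATCHER.search(line)
        if match and pending_sz is not None:
            sizes[match.group(0)] = pending_sz
    return sizes


# A few records copied from the listing in the log.
SAMPLE = """\
8 S oracle 1417 1 1 41 20 52104660 66722 51f5219c 21:23:45
? 20:49 ora_d004_odin
8 S oracle 1300 1 1 41 20 51e1e000 58693 51f5202c 21:23:41
? 17:36 ora_dbwr_odin
8 S oracle 1404 1 1 41 20 5210e668 68941 51f5215c 21:23:44
? 36:57 ora_d000_odin
8 S oracle 1415 1 1 41 20 52105980 70459 51f5218c 21:23:45
? 29:34 ora_d003_odin
"""

# Print dispatchers from largest to smallest SZ.
for name, sz in sorted(dispatcher_sizes(SAMPLE).items(), key=lambda kv: -kv[1]):
    print(name, sz)
```

Running the same function over two listings taken some time apart and subtracting the values gives the per-dispatcher growth in pages.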
.
files from ct are stored on: wrvms
Directory BUG$$:[BUG.BUG513974]
ALERT_ODIN.LOG;1 ODIN_ORA_2274.TRC;1 SYSTEM.STUFF;1
.
The alert log is from before the 7/3/97 crash through yesterday; there is
nothing unusual pertaining to the dispatcher in there. It does show a series
of ora 4030 and ora 7324 before the system gave up and died on the 3rd, but
these are really just a by-product of the dispatcher problem
.
The file system.stuff contains: vmstat 5, ps -elf, ipcs -b, pkginfo,
showrev -p, /etc/system, init.ora, config.ora, tnsnames.ora, listener.ora,
and sqlnet.ora
.
ct has agreed to 24x7 commitment.....
.
ct will get you dial-in access
*** MDEJESUS 07/09/97 01:04 pm ***
Also, we have been talking with Mike Jaffee of Sun regarding this issue....
*** PTURNER 07/09/97 01:32 pm *** (CHG: G/P->G Asg->NETREP)
As of 733 all solaris bugs are to be fixed by the base dev groups.
We can work if it is a solaris problem but this looks like someone needs to
run purify on 733 dispatcher.
*** HNELLORE 07/09/97 01:41 pm *** (CHG: Asg->NWOO)
*** MDEJESUS 07/10/97 06:56 am ***
ct called in and stated that the swap space is decreasing at an ever
increasing rate; i.e. where it was taking about 4 days to eat 7 gig, it is
now taking 3 days (this may be due to increased call volume on their part)
.
they desperately need some kind of workaround as they CANNOT afford to reboot
frequently and there is a limit to the amount of disk space they can
reallocate to swap...
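The figures quoted in the notes can be turned into rough rates with simple arithmetic. The following Python sketch is an illustration only. It assumes 1 gig = 1024M and a constant consumption rate, neither of which the log states, and both function names are made up.

```python
def average_rate_mb_per_hour(consumed_gb, days, mb_per_gb=1024):
    """Average swap consumption in MB per hour over the given number of days."""
    return consumed_gb * mb_per_gb / (days * 24)


def hours_until_exhausted(free_mb, rate_mb_per_hour):
    """Hours left before free swap reaches zero, assuming a constant rate."""
    if rate_mb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return free_mb / rate_mb_per_hour


# Figures quoted in the log: 7 gig consumed in about 4 days, later in 3 days.
print(round(average_rate_mb_per_hour(7, 4), 1))
print(round(average_rate_mb_per_hour(7, 3), 1))

# 220M free at the 150-200M per hour observed during monitoring.
print(round(hours_until_exhausted(220, 200), 2))
print(round(hours_until_exhausted(220, 150), 2))
```

The multi-day averages and the hourly figure seen during monitoring need not agree, since the load on the dispatchers varies over the day.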
*** RSHAPIRO 07/10/97 10:36 am *** (CHG: Asg->RDBMSREP Prod->5)
*** RSHAPIRO 07/10/97 10:36 am *** (CHG: Asg->HBERGH)
*** HBERGH 07/10/97 10:40 am ***
*** HBERGH 07/10/97 10:45 am *** (CHG: Sta->30)
Immediate and drastic workaround would be to kill the big dispatcher process.
It will disconnect the users that were connected through this dispatcher, and
the dispatcher will be restarted automatically. If they can do this during a
time where users aren't too busy, then it shouldn't affect too many users.
*** HBERGH 07/10/97 11:49 am ***
Since I haven't heard back from support, I'll continue to update the bug.
Can you repeat the ps command several times, about 30 seconds apart, to show
the process growth?
How are the dispatchers being used? Do users connect often for a short time,
or do they connect a few times and remain connected for a long time?
Does the application use dblinks?
If yes, how are they defined?
Do the connections get spread over all dispatchers? Does it help to start
more of them?
Is the application receiving any kind of errors, besides the ones you
mentioned?
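The request to repeat the ps command about 30 seconds apart could be scripted along the following lines. This is a hedged Python sketch, not what the customer actually ran. The function names are made up, and the `run`, `sleep` and `clock` parameters are only there so the loop can be exercised without waiting.

```python
import subprocess
import time


def run_ps():
    """Return the text output of `ps -elf`, the command used in the log."""
    result = subprocess.run(
        ["ps", "-elf"], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect_snapshots(count, interval_s=30, run=run_ps,
                      sleep=time.sleep, clock=time.time):
    """Take `count` snapshots, pausing `interval_s` seconds between them."""
    snapshots = []
    for i in range(count):
        snapshots.append((clock(), run()))
        # No pause is needed after the final snapshot.
        if i < count - 1:
            sleep(interval_s)
    return snapshots


# Example use (takes about two minutes with the defaults):
#   for taken_at, text in collect_snapshots(count=5):
#       print(time.strftime("%H:%M:%S", time.localtime(taken_at)))
#       print(text)
```

Each snapshot keeps its timestamp, so the growth of a given dispatcher can be read off against elapsed time.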
*** HBERGH 07/10/97 11:52 am ***
Why are they using MTS?
*** MDEJESUS 07/10/97 01:23 pm ***
1. Can you repeat the ps command several times, about 30 seconds apart, to
show the process growth?
ct has done this and file is ps.data on wrvms
2. How are the dispatchers being used?
Dispatchers are being used for brief connections to support
international email.
3. Does the application use dblinks? No
4. Do the connections get spread over all dispatchers? Does it help to start
more of them?
Mostly the lower numbered dispatchers are being used, however all of them
are growing to extremely large sizes. d000, d003 and d004 appear to be the
ones used the most.
no
5. Is the application receiving any kind of errors, besides the ones you
mentioned?
In the alert log on wrvms there are ora 6052 and ora 600 [17285] errors
but we do not feel that they are related and they are being worked as
separate issues.
.
Client is willing to shut down mts briefly until the dispatchers no longer
have pending jobs and then kill them, but wants to be sure that
no one is using them before killing them.
*** MDEJESUS 07/10/97 01:30 pm ***
after hours numbers: Venkat - home: 719-380-8462; pager: 719-329-2192
*** MDEJESUS 07/10/97 01:30 pm *** (CHG: Sta->11)
*** MDEJESUS 07/10/97 03:42 pm ***
daytime number: 719-520-0852 ex 2802 (Venkat)
*** MDEJESUS 07/10/97 03:42 pm ***
Venkat can get you dial-in if you want it
*** HBERGH 07/10/97 05:37 pm *** (CHG: Sta->30)
Advised support to have customer terminate dispatchers using the command
alter system set mts_dispatchers = 'tcp, 0'
and have new connections routed to dedicated servers, until all dispatchers
have finished working on existing connections. Then they will exit.
After all dispatchers have exited, use the command
alter system set mts_dispatchers = 'tcp, 20'
to start them up again.
We believe that the memory leak may have been introduced in 7.3.3. If at all
possible, we would suggest the customer downgrades to 7.3.2.3.
*** HBERGH 07/11/97 07:12 am ***
Please get 3 heap dumps of a growing dispatcher process:
SVRMGR> oradebug setospid <big_dispatcher_pid>
SVRMGR> oradebug dump heapdump 3
wait a while, let dispatcher grow some more, then repeat the dump command
*** MDEJESUS 07/11/97 10:44 am ***
ct will send requested info and I will put on wrvms...
.
meanwhile, back at the site....
when they set mts_dispatchers = 'tcp, 0', Venkat noticed that two
dispatchers did not die off; he tried killing one but that also did not
work; when they set mts_dispatchers = 'tcp, 20' to start them up again,
everything recovered fine. The second time they did the workaround, 2 other
dispatchers did not die.... appears to be random.
1st time it was dispatchers 5 and 3; 2nd time, 2 and 18.
.
*** MDEJESUS 07/11/97 10:44 am *** (CHG: Sta->11)
*** HBERGH 07/13/97 12:38 pm *** (CHG: Sta->30)
*** HBERGH 07/13/97 12:38 pm ***
Please check to see if these dispatchers still have active sessions. If you
do a select from v$session and see any sessions where server is shared,
dispatcher or none, then there are still some (idle) mts sessions around.
Even alter system kill session won't get rid of them, so it would be up to
the dba to decide if these sessions can be terminated by doing an OS level
kill of the dispatcher processes. Is the workaround ok otherwise?
Please set back to 11 if the requested info is available.
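The check described in this note amounts to filtering v$session rows on the server column. The Python sketch below shows that filter on rows that have already been fetched. The three server values are taken from the note as written and have not been checked against a data dictionary reference. The query text, the function name and the sample rows are all made up for illustration.

```python
# Values the note asks the DBA to look for in v$session.server
# (taken from the bug text as written, not independently verified).
MTS_SERVER_VALUES = {"SHARED", "DISPATCHER", "NONE"}

# Hypothetical query whose rows would feed the function below.
QUERY = "SELECT sid, serial#, username, server FROM v$session"


def remaining_mts_sessions(rows):
    """Return the rows whose server column marks an MTS session."""
    return [
        row for row in rows
        if str(row[3]).strip().upper() in MTS_SERVER_VALUES
    ]


# Made-up sample rows in (sid, serial#, username, server) order.
rows = [
    (7, 101, "MAIL", "DEDICATED"),
    (9, 212, "MAIL", "NONE"),
    (12, 45, "MAIL", "SHARED"),
]
print(remaining_mts_sessions(rows))
```

An empty result would suggest that no sessions are still attached through a dispatcher, which is the condition the customer wanted to confirm before an OS level kill.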
*** MDEJESUS 07/14/97 07:39 am *** (CHG: Sta->11)
*** MDEJESUS 07/14/97 07:39 am ***
ct called and let me know that the debug commands are not working
SVRMGR> oradebug setospid <big_dispatcher_pid>
SVRMGR> oradebug dump heapdump 3
otcsun2% ps -ef | grep d0
osupport 28872 1 22 14:28:44 ? 0:00 ora_d000_V723
osupport 1935 25829 3 18:42:59 pts/45 0:00 grep d0
osupport 28873 1 21 14:28:46 ? 0:00 ora_d001_V723
...
SVRMGR> oradebug setospid 28872;
oradebug setospid 28872
*
ORA 900: invalid SQL statement
SVRMGR> exit
not working on otcsun2 either....
what did you leave out?
as per my email ct did make a series of ps -elf and lsnrctl services dumps
this is out on wrvms and named more.stuff
ct also made a series of sar dumps and this is there under sar.stuff
*** HBERGH 07/14/97 10:17 am *** (CHG: Sta->30)
*** HBERGH 07/14/97 10:17 am ***
Don't type a ";" after the oradebug command.
*** MDEJESUS 07/14/97 10:28 am *** (CHG: Sta->11)
*** MDEJESUS 07/14/97 10:28 am ***
If you don't put a ";", you get a numbered prompt "2>".
*** HBERGH 07/14/97 10:54 am *** (CHG: Sta->30)
*** HBERGH 07/14/97 10:54 am ***
Did you connect internal first?
*** HBERGH 07/14/97 10:55 am ***
Also, try typing 'oradebug help' to see if it accepts the oradebug command.
*** HBERGH 07/14/97 11:03 am ***
*** HBERGH 07/14/97 11:32 am ***
I think I may be able to reproduce this. I have 2 dispatchers and
8 users connecting, select from dual, disconnecting. I see some
growth in the dispatcher processes. I'll leave this running for an hour to
see if it is going to stabilize, or continues to grow.
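A load of this shape (several users repeatedly connecting, selecting from dual and disconnecting) could be driven by a small script. The sketch below is an assumption-laden Python illustration, not the developer's actual test. `connect` is a hypothetical caller-supplied function that returns a Python DB-API (PEP 249) connection routed through a dispatcher, and the iteration count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def churn(connect, iterations):
    """Connect, run one trivial query, disconnect; repeat `iterations` times."""
    for _ in range(iterations):
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT * FROM dual")
            cur.fetchall()
            cur.close()
        finally:
            # Always disconnect, so every pass opens a fresh connection.
            conn.close()
    return iterations


def run_users(connect, users=8, iterations=1000):
    """Run `users` concurrent churn loops; return total connections made."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(churn, connect, iterations) for _ in range(users)
        ]
        return sum(f.result() for f in futures)
```

While such a loop runs, the dispatcher sizes can be sampled with ps at intervals to see whether they level off or keep growing.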
*** HBERGH 07/14/97 11:40 am ***
*** MDEJESUS 07/14/97 01:22 pm ***
ran through steps on otcsun2 and got it to work; was using a different
version of oracle so that may have been why it did not work for me before
.
walked through it with client following your steps....
he gets the following error on the larger of his processes....
ora 72: process "unix process: ...." is not active
we tried on several and finally found one small enough that it processed
however, when we went to his bdump directory, there were trace files for all
processes we attempted the oradebug command on
.
ct will send two .trc files from each of 2 or 3 processes at a 10 min
interval
.
interestingly, while we were doing this we noticed one process jump from
about 99m to 150m in the 5 or 10 min interval we were watching
.
*** MDEJESUS 07/14/97 01:22 pm *** (CHG: Sta->11)
*** HBERGH 07/14/97 01:47 pm ***
*** MDEJESUS 07/15/97 11:55 am ***
the oradebug files you requested are on wrvms in oradebug.stuff
*** HBERGH 07/15/97 03:13 pm ***
Thanks. The output is not useful, because the dump failed somewhere halfway
through. I'm debugging this on my development system now, but am having a bit
of trouble with other problems I'm running into. I'll keep you updated.
*** MDEJESUS 07/16/97 01:34 pm ***
*** HBERGH 07/16/97 02:22 pm ***
*** KAREARDO 07/16/97 03:25 pm ***
*** HBERGH 07/16/97 05:47 pm ***
*** HBERGH 07/16/97 08:12 pm *** (CHG: Asg->NETREP Prod->115)
*** HBERGH 07/16/97 08:13 pm ***
*** ASWANG 07/17/97 02:26 pm *** (CHG: Asg->RWESSMAN)
*** ASWANG 07/17/97 02:26 pm ***
Based on HBERGH's purify output and email, native authentication code is
involved in the memory leak problem. Would you take care of this P1 ASAP?!
*** SURMAN 07/17/97 02:51 pm ***
*** RWESSMAN 07/17/97 04:06 pm *** (CHG: Sta->80)
*** RWESSMAN 07/17/97 04:06 pm *** (CHG: FixBy->2.3.4)
*** RWESSMAN 07/17/97 04:06 pm *** (CHG: Fixed->2.3.4)
*** RWESSMAN 07/17/97 04:06 pm ***
Bug was caused by an incorrect workaround for a compiler bug.
Fix is in /vobs/network_src/
nsna.c@@/main/st_network_2.3_dev/st_network_rwessman_bug-513974/LATEST
*** RWESSMAN 07/21/97 09:37 am ***
Created backport label ST_NETWORK_2.3.3_BACKPORT_513974
Contents of label:
nsna.c@@/main/st_network_2.3_dev/st_network_rwessman_backport_516462/1
Jan Dolman wrote in message <363f88e8.118373_at_news.a2000.nl>...
>Probably a Unix issue, but nevertheless:
>
>When my databases auto-started at 06:00 this morning, the server
>console popped up (and kept repeating) this error message:
>
>WARNING: Sorry, no swap space to grow stack for pid 11985 (oracle)
>
>As we were unable to login at the console and we were unable to
>connect using a telnet session, I do not know what process 11985 is.
>
>We had to restart the server, after which everything was fine again.
>As I am not too keen about an entire server locking up, I would like
>to know if anybody out there is familiar with this problem.
>
>Regards,
>Jan
Received on Tue Nov 03 1998 - 15:44:38 CST