Re: Error:no swap space to grow stack (solaris)

From: B.Sc Yassir Khogaly <yassir_at_khogaly.freeserve.co.uk>
Date: Tue, 3 Nov 1998 21:44:38 -0000
Message-ID: <71ntgi$ut8$1@newsreader2.core.theplanet.net>

What is your operating system?
What is the version of the operating system? Does ps -elf show any growth in the dispatchers?

*****************************( Source: BUGDB )*******************************
Article-ID: <Bug:513974>   Related: <BugMatrix:513974>   Base: <Bug:>
Customer:                                                 Created: 09-JUL-97
Component: NET          Comp Ver: 2.3.3        Rel St: P   Updated: 21-JUL-97
Sub Comp:              RDBMS Ver: 7.3.3                         By: RWESSMAN
Status: 80,Q/A To Development
Sup Pri: 1,Complete Loss of Service                   Fixed In Ver: 2.3.4

O/S: 453 Sun Solaris V2 Sparc
PL Group: UNIX Gen/Port: G Assigned: RWESSMAN Error #: ORA 4030 Pub: Y

Abstract: POSSIBLE MEMORY LEAK IN DISPATCHER PROCESS - SWAP GRADUALLY DECREASES

MDEJESUS 07/09/97 01:03 pm *** client's system crashed on evening of the 7/3/97 with ora 4030 and ora 7324. rebooted system but could find no obvious problems; nothing being run should have taken large amounts of space (ct has 2.1 gig ram and 7.x gig swap at this time). filed tar with australia and was advised on setting trace events which would provide more info if this occurred again.
not acceptable to client as system is 24x7 prod and they cannot afford many outages.
further monitoring of the system shows swap space decreasing at a rate of about 150-200M per hour.
investigation showed three dispatcher processes over 1 gig in size; these were d000, d003 and d004.
last night ct was down to 220M out of 7+gig; ct added another gig of swap and scheduled downtime so that they could reboot and clear the swap (which they did).
they reset the rlim_fd_max and rlim_fd_cur system parameters which had been set out of bounds but this did no good; same growth in dispatchers was seen.
ps -elf shows:
...
8 S marks 989 983 0 41 20 51eb1330 109 51eb1500 21:21:42 pts/0 0:00 tail -100f /oracle/trace_logs/odin/
8 S oracle 1001 993 0 41 20 51e62668 143 51e782ee 21:21:58 pts/1 0:00 -csh
8 S oracle 1292 1 0 41 20 51f56ce0 58690 51f5203c 21:23:40 ? 0:22 ora_pmon_odin
8 S oracle 1304 1 0 41 20 51f56020 58719 51f5205c 21:23:41 ? 0:08 ora_arch_odin
8 S marks 993 990 0 47 20 51eb0670 140 51eb0840 21:21:51 pts/1 0:00 -csh
8 S oracle 1554 1 0 41 20 521cacd0 58866 521a6826 21:27:34 ? 5:23 oracleodin (LOCAL=NO)
8 S root 990 503 0 41 20 51e62008 188 51e8bbf6 21:21:51 ? 0:00 in.telnetd
8 S oracle 1035 1 2 47 20 51e63988 508 51e8bd86 21:23:22 ? 46:21 /oracle/7.3.3/bin/tnslsnr LISTENER
8 S root 981 503 0 41 20 51676cc8 188 51e8bc6e 21:21:34 ? 0:00 in.telnetd
8 S root 1473 503 0 41 20 51f27998 188 51e8b53e 21:24:15 ? 0:00 in.telnetd
8 S oracle 1316 1 0 41 20 5029a020 58694 51f5207c 21:23:42 ? 0:04 ora_smon_odin
8 S oracle 1308 1 1 41 20 51eb0cd0 58701 51f5204c 21:23:41 ? 12:01 ora_lgwr_odin
8 S root 4016 603 0 41 20 5298b9a0 219 51e8a4d6 10:51:58 ? 0:00 /opt/USAssh/sbin/sshd
8 S oracle 1417 1 1 41 20 52104660 66722 51f5219c 21:23:45 ? 20:49 ora_d004_odin
8 S oracle 1300 1 1 41 20 51e1e000 58693 51f5202c 21:23:41 ? 17:36 ora_dbwr_odin
8 S oracle 1330 1 0 68 20 51f57340 58744 51f5209c 21:23:42 ? 34:35 ora_snp0_odin
8 O oracle 1336 1 3 51 20 51eb0010 58747 21:23:42 ? 37:44 ora_snp1_odin
8 S oracle 1313 1 0 40 20 51f579a0 58699 51f5206c 21:23:41 ? 0:18 ora_ckpt_odin
8 S oracle 1324 1 0 41 20 51f26cd8 58692 51f5208c 21:23:42 ? 0:01 ora_reco_odin
8 S oracle 5105 3527 0 41 20 524bb328 58703 516ae0d4 11:29:11 ? 0:00 oracleodin (DESCRIPTION=(LOCAL=YES)
8 S oracle 1457 1 0 40 20 52148ce0 58757 51f522ec 21:23:48 ? 0:16 ora_p003_odin
8 S oracle 4036 4028 0 41 20 529c4cc0 142 52619406 10:52:09 pts/6 0:00 -csh
8 S oracle 4501 1 3 41 20 5211a670 58772 30397a60 11:05:33 ? 5:23 ora_s007_odin
8 S oracle 1411 1 0 41 20 51f26678 66647 51f5217c 21:23:44 ? 19:44 ora_d002_odin
8 S oracle 1404 1 1 41 20 5210e668 68941 51f5215c 21:23:44 ? 36:57 ora_d000_odin
8 S oracle 1406 1 0 41 20 51bb6678 63666 51f5216c 21:23:44 ? 13:33 ora_d001_odin
8 S oracle 1415 1 1 41 20 52105980 70459 51f5218c 21:23:45 ? 29:34 ora_d003_odin
8 S oracle 1419 1 1 45 20 5212d998 67018 51f521ac 21:23:45 ? 21:56 ora_d005_odin
8 S oracle 1421 1 1 41 20 5211b330 67570 51f521bc 21:23:45 ? 23:02 ora_d006_odin
8 S oracle 1423 1 3 56 20 5212ccd8 65518 51f521cc 21:23:45 ? 20:40 ora_d007_odin
8 S oracle 1425 1 1 41 20 5212c678 64448 51f521dc 21:23:45 ? 15:24 ora_d008_odin
8 S oracle 1427 1 1 45 20 5212c018 63207 51f521ec 21:23:45 ? 12:30 ora_d009_odin
....
.
files from ct are stored on: wrvms Directory BUG$$:[BUG.BUG513974]

ALERT_ODIN.LOG;1 ODIN_ORA_2274.TRC;1 SYSTEM.STUFF;1 .
The alert log is from before the 7/3/97 crash through yesterday; there is nothing unusual pertaining to the dispatcher in there. It does show a series
of ora 4030 and ora 7324 before the system gave up and died on the 3rd, but these are really just a by-product of the dispatcher problem .
The file system stuff contains: vmstat 5, ps -elf, ipcs -b, pkginfo, showrev -p
/etc/system, init.ora, config.ora, tnsnames.ora, listener.ora, and sqlnet.ora
.
ct has agreed to 24x7 commitment.....
.
ct will get you dial-in access
*** MDEJESUS 07/09/97 01:04 pm ***

Also, We have been talking with Mike Jaffee of Sun regarding this issue....
*** PTURNER 07/09/97 01:32 pm *** (CHG: G/P->G Asg->NETREP)
As of 733 all solaris bugs are to be fixed by the base dev groups. We can work if it is a solaris problem but this looks like someone needs to run
purify on the 733 dispatcher.
*** HNELLORE 07/09/97 01:41 pm *** (CHG: Asg->NWOO)

MDEJESUS 07/10/97 06:56 am *** ct called in and stated that the swap space is decreasing at an ever increasing rate; i.e. where it was taking about 4 days to eat 7 gig, it is now taking 3 days (this may be due to increased call volume on their part) . they desperately need some kind of workaround as they CANNOT afford to reboot frequently and there is a limit to the amount of disk space they can reallocate to swap...
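For a sense of scale, the figures in this update can be turned into an average consumption rate. This is only a rough sketch of the arithmetic: the function name is made up for illustration, and it assumes "7 gig" means 7 x 1024 MB and that the swap is consumed at a steady rate.

```python
def swap_rate_mb_per_hour(swap_gb, days):
    """Average swap consumption in MB per hour (assumes 1 GB = 1024 MB)."""
    return swap_gb * 1024 / (days * 24)

# Figures from the update above: 7 gig used up in 4 days, then in 3 days.
earlier = swap_rate_mb_per_hour(7, 4)
later = swap_rate_mb_per_hour(7, 3)

print(f"earlier: {earlier:.1f} MB/hour")
print(f"later:   {later:.1f} MB/hour")
print(f"increase factor: {later / earlier:.2f}")
```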
RSHAPIRO 07/10/97 10:36 am *** (CHG: Asg->RDBMSREP Prod->5)
RSHAPIRO 07/10/97 10:36 am *** (CHG: Asg->HBERGH)
HBERGH 07/10/97 10:40 am ***
HBERGH 07/10/97 10:45 am *** (CHG: Sta->30) Immediate and drastic workaround would be to kill the big dispatcher process. It will disconnect the users that were connected through this dispatcher, and the dispatcher will be restarted automatically. If they can do this during a time where users aren't too busy, then it shouldn't affect too many users.
HBERGH 07/10/97 11:49 am *** Since I haven't heard back from support, I'll continue to update the bug. Can you repeat the ps command several times, about 30 seconds apart, to show the process growth? How are the dispatchers being used? Do users connect often for a short time, or do they connect a few times and remain connected for a long time? Does the application use dblinks? If yes, how are they defined? Do the connections get spread over all dispatchers? Does it help to start more of them? Is the application receiving any kind of errors, beside the ones you mentioned?
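The repeated-ps comparison requested here could be automated along these lines. This is a minimal sketch, not what support or the customer actually ran: it assumes the Solaris ps -elf column order shown earlier in this bug (SZ as the 10th field, reported in pages) and dispatcher names of the form ora_dNNN_<sid>. The first snapshot reuses rows from the bug; the second snapshot contains hypothetical values invented purely for illustration.

```python
import re

# Dispatcher process names look like ora_d003_odin (assumed naming pattern).
DISPATCHER = re.compile(r"ora_d\d{3}_\w+$")

def dispatcher_sizes(ps_output):
    """Return {process_name: SZ} for dispatcher rows in ps -elf output."""
    sizes = {}
    for line in ps_output.splitlines():
        fields = line.split()
        # Columns: F S UID PID PPID C PRI NI ADDR SZ [WCHAN] STIME TTY TIME CMD
        if len(fields) >= 14 and DISPATCHER.match(fields[-1]):
            sizes[fields[-1]] = int(fields[9])
    return sizes

def growth(before, after):
    """SZ difference for every dispatcher present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}

# Rows copied from the ps -elf listing in this bug.
snapshot_1 = """\
8 S oracle 1404 1 1 41 20 5210e668 68941 51f5215c 21:23:44 ? 36:57 ora_d000_odin
8 S oracle 1292 1 0 41 20 51f56ce0 58690 51f5203c 21:23:40 ? 0:22 ora_pmon_odin
8 S oracle 1415 1 1 41 20 52105980 70459 51f5218c 21:23:45 ? 29:34 ora_d003_odin
"""

# HYPOTHETICAL later snapshot: SZ values are invented for illustration only.
snapshot_2 = """\
8 S oracle 1404 1 1 41 20 5210e668 69005 51f5215c 21:23:44 ? 36:58 ora_d000_odin
8 S oracle 1415 1 1 41 20 52105980 70600 51f5218c 21:23:45 ? 29:35 ora_d003_odin
"""

deltas = growth(dispatcher_sizes(snapshot_1), dispatcher_sizes(snapshot_2))
for name, delta in sorted(deltas.items()):
    print(name, delta)
```

Rows for non-dispatcher processes such as ora_pmon_odin are ignored, so only the dispatchers show up in the comparison.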
HBERGH 07/10/97 11:52 am *** Why are they using MTS?
MDEJESUS 07/10/97 01:23 pm ***
1. Can you repeat the ps command several times, about 30 seconds apart, to show the process growth? ct has done this and file is ps.data on wrvms
2. How are the dispatchers being used? Dispatchers are being used for brief connections to support international email.
3. Does the application use dblinks? No
4. Do the connections get spread over all dispatchers? Mostly the lower numbered dispatchers are being used, however all of them are growing to extremely large sizes. d000, d003 and d004 appear to be the ones used the most. Does it help to start more of them? no
5. Is the application receiving any kind of errors, beside the ones you mentioned? In the alert log on wrvms there are ora 6052 and ora 600 [17285] errors but we do not feel that they are related and they are being worked as separate issues.
.
Client is willing to shutdown mts briefly until the dispatchers no longer have pending jobs and then kill them, but wants to be sure that no one is using them before killing them.
MDEJESUS 07/10/97 01:30 pm *** after hours numbers: Venkat - home: 719-380-8462; pager: 719-329-2192
MDEJESUS 07/10/97 01:30 pm *** (CHG: Sta->11)
MDEJESUS 07/10/97 03:42 pm *** daytime number: 719-520-0852 ex 2802 (Venkat)
MDEJESUS 07/10/97 03:42 pm *** Venkat can get you dial-in if you want it
HBERGH 07/10/97 05:37 pm *** (CHG: Sta->30) Advised support to have customer terminate dispatchers using the command alter system set mts_dispatchers = 'tcp, 0' and have new connections routed to dedicated servers, until all dispatchers have finished working on existing connections. Then they will exit. After all dispatchers have exited, use the command alter system set mts_dispatchers = 'tcp, 20' to start them up again. We believe that the memory leak may have been introduced in 7.3.3. If at all possible, we would suggest the customer downgrades to 7.3.2.3.
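The stop, wait, restart sequence described in this update might be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure: run_sql and get_ps_output are hypothetical callables that the DBA would have to supply (for example, wrappers around a Server Manager session and around ps -ef), and the poll count and delay are arbitrary. Only the two alter system statements come from the bug text.

```python
import re
import time

# Statements quoted from the advice in this bug.
STOP_DISPATCHERS = "alter system set mts_dispatchers = 'tcp, 0'"
START_DISPATCHERS = "alter system set mts_dispatchers = 'tcp, 20'"

def running_dispatchers(ps_output, sid):
    """Names of dispatcher processes (ora_dNNN_<sid>) found in ps output."""
    pattern = re.compile(r"ora_d\d{3}_" + re.escape(sid) + r"$")
    names = []
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and pattern.match(fields[-1]):
            names.append(fields[-1])
    return names

def restart_dispatchers(run_sql, get_ps_output, sid, polls=10, delay=30.0):
    """Stop the dispatchers, wait for them to exit, then start them again.

    run_sql and get_ps_output are hypothetical caller-supplied callables.
    Returns the dispatchers still running when polling stopped; the restart
    statement is only issued if that list is empty.
    """
    run_sql(STOP_DISPATCHERS)
    remaining = running_dispatchers(get_ps_output(), sid)
    for _ in range(polls):
        if not remaining:
            break
        time.sleep(delay)
        remaining = running_dispatchers(get_ps_output(), sid)
    if not remaining:
        run_sql(START_DISPATCHERS)
    return remaining
```

If any dispatchers are still listed when polling ends, the sketch deliberately does nothing further and leaves the decision about them to the DBA.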
HBERGH 07/11/97 07:12 am *** Please get 3 heap dumps of a growing dispatcher process:
SVRMGR> oradebug setospid <big_dispatcher_pid>
SVRMGR> oradebug dump heapdump 3
wait a while, let dispatcher grow some more, then repeat the dump command
MDEJESUS 07/11/97 10:44 am *** ct will send requested info and I will put on wrvms... . meanwhile, back at the site.... when they set mts_dispatchers = 'tcp, 0', Venkat noticed that two dispatchers did not die off; he tried killing one but that also did not work; when they set mts_dispatchers = 'tcp, 20' to start them up again, everything recovered fine. The second time they did the workaround, 2 other dispatchers did not die.... appears to be random. 1st time it was dispatchers 5 and 3; 2nd time, 2 and 18. .
MDEJESUS 07/11/97 10:44 am *** (CHG: Sta->11)
HBERGH 07/13/97 12:38 pm *** (CHG: Sta->30)
HBERGH 07/13/97 12:38 pm *** Please check to see if these dispatchers still have active sessions. If you do a select from v$session and see any sessions where server is shared, dispatcher or none, then there are still some (idle) mts sessions around. Even alter system kill session won't get rid of them, so it would be up to the dba to decide if these sessions can be terminated by doing an OS level kill of the dispatcher processes. Is the workaround ok otherwise? Please set back to 11 if the requested info is available.
MDEJESUS 07/14/97 07:39 am *** (CHG: Sta->11)
MDEJESUS 07/14/97 07:39 am *** ct called and let me know that the debug commands are not working
SVRMGR> oradebug setospid <big_dispatcher_pid>
SVRMGR> oradebug dump heapdump 3
otcsun2% ps -ef | grep d0
osupport 28872 1 22 14:28:44 ? 0:00 ora_d000_V723
osupport 1935 25829 3 18:42:59 pts/45 0:00 grep d0
osupport 28873 1 21 14:28:46 ? 0:00 ora_d001_V723
...
SVRMGR> oradebug setospid 28872;
oradebug setospid 28872
*
ORA 900: invalid SQL statement
SVRMGR> exit
not working on otcsun2 either.... what did you leave out?
as per my email ct did make a series of ps -elf and lsnrctl services dumps; this is out on wrvms and named more.stuff
ct also made a series of sar dumps and this is there under sar.stuff
HBERGH 07/14/97 10:17 am *** (CHG: Sta->30)
HBERGH 07/14/97 10:17 am *** Don't type a ";" after the oradebug command.
MDEJESUS 07/14/97 10:28 am *** (CHG: Sta->11)
MDEJESUS 07/14/97 10:28 am *** If you don't put a ";", you get a number "2>".
HBERGH 07/14/97 10:54 am *** (CHG: Sta->30)
HBERGH 07/14/97 10:54 am *** Did you connect internal first?
HBERGH 07/14/97 10:55 am *** Also, try typing 'oradebug help' to see if it accepts the oradebug command.
HBERGH 07/14/97 11:03 am ***
HBERGH 07/14/97 11:32 am *** I think I may be able to reproduce this. I have 2 dispatchers and 8 users connecting, select from dual, disconnecting. I see some growth in the dispatcher processes. I'll leave this running for an hour to see if it is going to stabilize, or continues to grow.
HBERGH 07/14/97 11:40 am ***
MDEJESUS 07/14/97 01:22 pm *** ran through steps on tcsun2 and got it to work; was using a different version of oracle so that may have been why it did not work for me before . walked through it with client following your steps.... he gets the following error on the larger of his processes.... ora 72: process "unix process: ...." is not active . we tried on several and finally found one small enough that it processed; however, when we went to his bdump directory, there were trace files for all processes we attempted the oradebug command on . ct will send two .trc files from each of 2 or 3 processes at a 10 min interval . interestingly, while we were doing this we noticed one process jump from about 99m to 150m in the 5 or 10 min interval we were watching .
MDEJESUS 07/14/97 01:22 pm *** (CHG: Sta->11)
HBERGH 07/14/97 01:47 pm ***
MDEJESUS 07/15/97 11:55 am *** the oradebug files you requested are on wrvms in oradebug.stuff
HBERGH 07/15/97 03:13 pm *** Thanks. The output is not useful, because the dump failed somewhere halfway through. I'm debugging this on my development system now, but am having a bit of trouble with other problems I'm running into. I'll keep you updated.
MDEJESUS 07/16/97 01:34 pm ***
HBERGH 07/16/97 02:22 pm ***
KAREARDO 07/16/97 03:25 pm ***
HBERGH 07/16/97 05:47 pm ***
HBERGH 07/16/97 08:12 pm *** (CHG: Asg->NETREP Prod->115)
HBERGH 07/16/97 08:13 pm ***
ASWANG 07/17/97 02:26 pm *** (CHG: Asg->RWESSMAN)
ASWANG 07/17/97 02:26 pm *** Based on HBERGH's purify output and email, native authentication code is involved in the memory leak problem. Would you take care of this P1 ASAP?!
SURMAN 07/17/97 02:51 pm ***
RWESSMAN 07/17/97 04:06 pm *** (CHG: Sta->80)
RWESSMAN 07/17/97 04:06 pm *** (CHG: FixBy->2.3.4)
RWESSMAN 07/17/97 04:06 pm *** (CHG: Fixed->2.3.4)
RWESSMAN 07/17/97 04:06 pm *** Bug was caused by an incorrect workaround for a compiler bug. Fix is in /vobs/network_src/ nsna.c@@/main/st_network_2.3_dev/st_network_rwessman_bug-513974/LATEST
RWESSMAN 07/21/97 09:37 am *** Created backport label ST_NETWORK_2.3.3_BACKPORT_513974 Contents of label: nsna.c@@/main/st_network_2.3_dev/st_network_rwessman_backport_516462/1

Jan Dolman wrote in message <363f88e8.118373_at_news.a2000.nl>...

>Probably a Unix issue, but nevertheless:
>
>When my databases auto-started at 06:00 this morning, the server
>console popped up (and kept repeating) this error message:
>
>WARNING: Sorry, no swap space to grow stack for pid 11985 (oracle)
>
>As we were unable to login at the console and we were unable to
>connect using a telnet session, I do not know what process 11985 is.
>
>We had to restart the server, after which everything was fine again.
>As I am not too keen about an entire server locking up, I would like
>to know if anybody out there is familiar with this problem.
>
>Regards,
>Jan

Received on Tue Nov 03 1998 - 15:44:38 CST