Re: ipcs problems - possibly

From: David Fitzjarrell <oratune_at_aol.com>
Date: Tue, 16 Jan 2001 20:06:51 GMT
Message-ID: <9429kk$8qe$1@nnrp1.deja.com>

In our last gripping episode David Fitzjarrell <oratune_at_aol.com> wrote:
> In article <979668449.24111.0.nnrp-10.c30bdde2_at_news.demon.co.uk>,
> "andrew_webby at hotmail" <spam_at_no.thanks.com> wrote:
> > Hi
> >
> > Can anyone give me a clue here?
> >
> > I've a couple of test databases recently upgraded to Oracle 8.1.6r2 on
> > Solaris 2.7.
> >
> > The test users are complaining that sometimes they can get in, sometimes
> > they can't (we've checked the usual suspects like listener, alert log etc,
> > but no dice). As nothing was changing in between (neither client nor
> > database), my suspicions fell on unix itself.
> >
> > About the only thing that looks out of the ordinary is shared memory ISM
> > attaches which worryingly looks like this:
> >
> > (ipcs -i)
> > T  ID  KEY         MODE         OWNER   GROUP       ISMATTCH  NATTCH
> > Shared Memory:
> > m   0  0x500e0dd9  --rw-r--r--  root    root               1
> > m   1  0x790       --rw-rw-rw-  root    root               0
> > m   2  0x9ae7520   --rw-r-----  oracle  dba               10      11  - RBTEST
> > m   3  0xbf67de8   --rw-r-----  oracle  dba               25      25  - HOULIVE
> > m   4  0x98a5f10   --rw-r-----  oracle  dba               37      36  - RBLIVE
> > m   5  0xb4e6bf8   --rw-r-----  oracle  dba                8       8  - HOUTEST
> > m   6  0x1503facc  --rw-r-----  oracle  dba    4,294,966,760      10  - RBDEV
> > m   7  0xb45ab788  --rw-r-----  oracle  dba    4,294,964,292       7  - HOUDEV
> > m   8  0x280267    --rw-r--r--  root    root               0
> >
> > (sorry, here's hoping you use Courier font... :-)
> >
> > As you'll see, the ism attch column is mental. Also, it's going down. I
> > appreciate that this may be more of a solaris problem, so I've posted there
> > as well. The man pages don't give too much detail on why this might occur
> > (though the large number leads me to believe an overflow of an unsigned
> > integer or similar has occurred).
> >
> > Any ideas if this is something for concern and if anyone can point me
> > towards something useful on ISMATTCH (I've tried Sun search, general
> > web/usenet search etc), that would be top (no unix-pun intended). I'm
> > certainly giving it a good go here, but our top unix man is off on
> > compassionate leave and there's danger of a crowd forming around my desk...
> >
> > Andrew
> >
> >
>
> From Metalink:
>
> Center of Expertise Research Articles
> Solaris ISM and Oracle, Frequently asked Questions
>
> What is ISM?
>
> ISM, or Intimate Shared Memory, is a way of handling the page table
> entries by the Sun Solaris operating system.
>
> There is one memory structure in Sun Solaris that is used for keeping
> track of all the process page table information. This structure is
> essentially a series of hash chains, similar to Oracle's concept of LRU
> latch chains or cache buffers chains. During memory operations, this
> structure is traversed by the operating system as it makes decisions
> about how to handle memory. Each process on the system has information
> in this memory structure. There are only 256 entry points into this
> structure and this number cannot be increased.
>
> Consequently, as more and more memory mappings and operations occur,
> the information stored behind each entry point grows (the "chains"
> lengthen). As the operating system responds to higher activity
> requiring memory mappings, it spends more and more time in kernel mode
> (shown as %sys in sar output) just walking around in this structure
> deciding what to do. Having a large number of users and a large Oracle
> SGA further aggravates this situation.
>
> The use of ISM reduces this problem significantly because it allows the
> processes to share the page table entries. Essentially, complex
> operations on these memory chains reduce to pointer operations
> involving small amounts of data.
>
> What are the issues related to ISM?
>
> ISM reduces the number of system calls, thereby improving overall
> performance when there are large memory allocations and a large number of
> users on the system. In theory, this should cause no adverse effects
> and increase the overall performance of the system. Unfortunately, due
> to various Solaris operating system bugs, enabling ISM can cause Oracle
> data corruptions or crashes. The relevant Sun bug numbers are:
>
> 4244523: Data corruption in ISM shared memory segs with heavy
> load/multi-threaded apps
> 4255955: With enable_grp_ism=1 on E10000, 5.6 -15 KJP, oracle 7.3.4
> crashes
>
> The Sun base bug number is 4228856, which is not published.
>
> In addition to the above Sun bugs there are many Oracle bugs and
> duplicate Sun bugs pointing to the same problem:
>
> If ISM is enabled on Sun E10000 systems, Oracle data corruptions may
> occur. This is especially true when the Domain Reconfiguration (DR) feature
> is also enabled.
>
> There are three very important points one should be aware of:
>
> 1. The problem is specific to Sun E10000 models.
> 2. The problem is more prominent when DR is enabled.
> 3. These problems are fixed in Sun Kernel patch level 16, a.k.a. Sun OS
> 5.6 Patch ID# 105181-16.
>
> What causes the corruption?
>
> Sun E10000 models have a new processor model featuring a new type of
> CPU register. When ISM is turned on, the shared memory image coherency
> across processes becomes inconsistent under some circumstances. This is
> caused by the CPU cache getting out-of-sync with the on-disk data and
> the register not being flushed. The effect of this problem is reflected
> as corrupt Oracle blocks, where the block header does not match the
> tail. In almost all cases, the block header has a pattern that exists
> in all of the ORA-1578 trace files.
>
> It is very important to remember that data block corruption can be
> caused by many other factors, such as hardware failure, logical volume
> manager bugs or Oracle bugs. Encountering a corruption on a Sun E10000
> does not automatically imply it is caused by ISM, and should be
> reported to the appropriate channels for detailed analysis.
>
> How is ISM enabled?
>
> To make full use of ISM, it must be enabled both at the Solaris and
> Oracle level. The default behavior of Solaris 5.6 and Oracle is to
> enable the use of ISM. However, due to the issues discussed above, support
> analysts used to recommend turning off ISM usage by setting the
> following parameters:
>
> In /etc/system:
>
> shmsys:ism_off=1
> shmsys:share_page_table=1
>
> In init.ora
>
> use_ism=false
>
> Note that the above parameters will turn OFF ISM. To enable ISM, simply
> remove those lines from the configuration files. However, Oracle will
> not use ISM if the entire SGA cannot fit in a contiguous shared memory
> area and will not report any error messages. If the system memory is
> fragmented and a contiguous shared area cannot be allocated for the
> Oracle SGA, the system needs to be rebooted.
>
> How much performance gain does ISM give?
>
> The performance gain one can expect after enabling ISM depends largely
> on the utilization of the system, especially the number of users and
> amount of shared memory used. On busy Oracle systems with many
> concurrent users (>200) and large SGA (>1G), we have observed as much
> as 30% performance improvements. There are also cases where the system
> may seem like it's hung but the kernel CPU usage is so high that no
> other activity can take place. Simply enabling the ISM reduced the
> kernel CPU usage, eliminating the hanging situations.
>
> In general, the less memory the system allocates for process page
> tables, the less the overhead. Unfortunately, there is no direct
> interface to see the memory allocated for this purpose, other than
> the "crash" utility. The cache name for process page tables
> is "sfmmu8_cache" and can be checked by running the "crash" utility as
> the "root" user as follows:
>
> # crash
> dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> > kmastat
>                  buf    buf    buf     memory      #allocations
> cache name      size  avail  total     in use      succeed  fail
> ------------   -----  -----  -----  ---------   ----------  ----
> ...
> sfmmu8_cache     232   8297  43925   10280960       124873     0
> ...
> ------------   -----  -----  -----  ---------   ----------  ----
> permanent          -      -      -     114688         1155     0
> oversize           -      -      -   16433152       381109     0
> ------------   -----  -----  -----  ---------   ----------  ----
> Total              -      -      -  122658816   1545645581     0
> > quit
> #
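
If you want to track that figure over time, the sfmmu8_cache row can be
pulled out of saved kmastat output with a few lines of script. A minimal
Python sketch, assuming the seven-column layout shown above and assuming
the "memory in use" column is in bytes (the note does not state the unit);
the function name is made up for illustration:

```python
def cache_memory_in_use(kmastat_text, cache_name="sfmmu8_cache"):
    """Return the 'memory in use' figure for one cache, or None if absent."""
    for line in kmastat_text.splitlines():
        fields = line.split()
        # Assumed columns: name, buf size, buf avail, buf total,
        # memory in use, allocations succeeded, allocations failed
        if len(fields) == 7 and fields[0] == cache_name:
            return int(fields[4])
    return None


# The row quoted in the note above
sample = "sfmmu8_cache 232 8297 43925 10280960 124873 0"
used = cache_memory_in_use(sample)
print(f"{used / 2**20:.1f} MB")  # prints "9.8 MB" if the unit is bytes
```

Sampling this while users connect should show whether the cache levels
off, which is the behaviour the note describes with ISM enabled.
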
>
> If ISM is enabled, this cache should not grow as more users connect;
> it should stay at the same level after reaching a stable value. We
> have observed less than 100M of memory usage on very busy systems with
> more than 300 users and an SGA size of 2G, as opposed to 1G of memory
> usage without ISM.
>
> What is the bottom line?
>
> With the introduction of Sun Kernel patch 16 (Sun OS 5.6 Patch ID#
> 105181-16), all known problems related to ISM have been fixed. We should
> encourage Oracle customers to make use of this facility, as it has a
> significant performance impact and is definitely worthwhile. Customers
> should make sure they have applied the latest Sun Kernel patch and
> removed the parameters disabling the use of ISM, if they were set to
> prevent corruption problems in the past.
>
> Sun already published a note informing customers on how to make this
> change:
>
> Infodoc ID: 20823
> Synopsis : Information about hangs on E10k systems due to disabling of
> Intimate Shared Memory (ISM) in /etc/system or Oracle's init.ora file
> Date : 7 Oct 1999
>
> There was a kernel bug, 4244523, that required a temporary
> workaround which was to turn off ISM in /etc/system and
> in the database application, for example, Oracle's init.ora file.
> This was fixed in the Solaris 5.6 kernel update patch 105181-15.
>
> Unfortunately some customers may forget to remove the modifications
> to /etc/system and init.ora after upgrading their kernel. ISM
> is enabled by default. The following should NOT be in /etc/system:
>
> shmsys:ism_off=1
> shmsys:share_page_table=1
>
> In addition Oracle's init.ora should NOT have:
>
> use_ism=false
>
> Turning off ISM can cause severe performance degradation and
> what appears to be a hung state.
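
Checking for leftover ISM-disabling lines is easy to script. A minimal
Python sketch, assuming you have read the contents of /etc/system or
init.ora into a string; the function name and the sample text are
hypothetical, and only the three settings named in the note are checked:

```python
import re

# Settings the note says should NOT be present once the kernel is patched.
ISM_DISABLERS = {
    "/etc/system": [r"shmsys:ism_off\s*=\s*1",
                    r"shmsys:share_page_table\s*=\s*1"],
    "init.ora": [r"use_ism\s*=\s*false"],
}


def find_ism_disablers(kind, text):
    """Return (line number, line) pairs that still disable ISM."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and lines commented out with '*' or '#'
        if not stripped or stripped[0] in "*#":
            continue
        for pattern in ISM_DISABLERS[kind]:
            if re.search(pattern, stripped, re.IGNORECASE):
                hits.append((lineno, stripped))
                break
    return hits


# Hypothetical /etc/system contents, for illustration only
sample_system = (
    "* shared memory tuning\n"
    "set shmsys:shminfo_shmmax=4294967295\n"
    "set shmsys:ism_off=1\n"
)
print(find_ism_disablers("/etc/system", sample_system))
```

An empty list means none of the three settings were found; it says
nothing about whether Oracle actually obtained an ISM segment.
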
>
> Acknowledgements
>
> Most of the information in this document is compiled from internal
> mailing lists, Sun Microsystems' Support Web Page (sunsolve.sun.com)
> and the author's field experience at various customer sites.
>
> I would also like to thank Vern E. Wagman for his case study on Solaris
> 2.6, Veritas 3.3.1 and ISM.
>
> --
> David Fitzjarrell
> Oracle Certified DBA
>
> Sent via Deja.com
> http://www.deja.com/
>
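
On the ISMATTCH figures in the original question: both oversized values
sit just under 2^32, which is what a small negative count looks like when
a signed 32-bit integer is printed as unsigned. A minimal Python sketch of
that reinterpretation, assuming the counter is 32 bits wide (an
assumption; nothing quoted above confirms the field width):

```python
def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


# ISMATTCH values reported for RBDEV and HOUDEV in the ipcs listing
for name, raw in (("RBDEV", 4_294_966_760), ("HOUDEV", 4_294_964_292)):
    print(name, as_signed32(raw))
```

If the assumption holds, the two counts read as -536 and -3004, which
would fit a counter being decremented below zero and would match the
observation that the number keeps going down.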

Yet more related information:

Doc ID: Note:69016.1
Subject: Solaris Bug Causes Listeners To Stop Responding To Incoming Connection Requests
Type: ALERT
Status: PUBLISHED
Content Type: TEXT/PLAIN
Creation Date: 11-MAR-1999
Last Revision Date: 12-MAR-1999
Language: USAENG

Platform Affected

Solaris 2.51 and 2.6

Description

Solaris bug causes listeners to stop responding to incoming connection requests
from browsers.

There is a Solaris Bug, number 4089811, that causes the accept() C function call
to return multiple file handles when a control parameter, so_qlen, is set to
1, which is the means of specifying to the TCP driver that only one file handle
is to be returned per accept() call. When this happens, the socket structure
associated with the file handle is damaged, and the socket must be closed and
reopened before new connection requests will be honored. The damage is to the
operating system's copy of the socket, so this is the only possible recovery.
The bug will happen when two or more connection requests are received between
CPU quanta (while the process is asleep and other processes are running), which
implies that it will only be seen under conditions of heavy load.

This bug was introduced in Solaris 2.51 in patch 103588 beginning with version
-11. It was fixed in 103588 beginning with version -20. It was present in
Solaris 2.6 from inception, for programs that use the old-style socket
structure (as we do), and is fixed in patch 105529 beginning with version -05.

Identifying the bug:

If the listener stops processing incoming browser requests, but the listener
process is still running and previous requests are honored even after it has
stopped servicing new ones, you may have encountered the bug.

Check the OS version with uname -a. Then check for 103588-11 through 103588-19
on 2.51, or 105529-01 through 105529-04, or 105529 missing completely, on 2.6.
You can check patches on Solaris with showrev -p. If you are running 2.51 with
103588-11 through -19, or 2.6 unpatched or with 105529-04 or less, then you may
be seeing this bug.
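
The revision check above can be sketched in Python. This assumes
`showrev -p` prints one "Patch: NNNNNN-RR ..." line per installed patch;
the function names and the sample text are hypothetical. The patch ID is
passed in as a parameter rather than hard-coded, because this note gives
the 2.51 patch as 103588 in one place and 103582 in another.

```python
import re


def patch_revisions(showrev_output, base_id):
    """Return every installed revision of patch `base_id` as integers."""
    pattern = re.compile(
        r"^Patch:\s*" + re.escape(base_id) + r"-(\d+)\b", re.MULTILINE
    )
    return [int(rev) for rev in pattern.findall(showrev_output)]


def may_be_affected(showrev_output, base_id, first_bad, first_fixed,
                    bad_if_missing):
    """True if the highest installed revision falls in the affected range."""
    revs = patch_revisions(showrev_output, base_id)
    if not revs:
        # 2.6 is affected with no patch at all; on 2.51 the bug arrived
        # with the patch, so a missing patch is not a match.
        return bad_if_missing
    return first_bad <= max(revs) < first_fixed


# Hypothetical showrev -p output from a Solaris 2.6 machine
sample = (
    "Patch: 105181-16 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWkvm\n"
    "Patch: 105529-04 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsr\n"
)
print(may_be_affected(sample, "105529", 0, 5, True))  # 2.6 rule
```

For 2.51 the equivalent call would use first_bad=11, first_fixed=20 and
bad_if_missing=False, with whichever patch ID Sun confirms.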

Fixing the bug:

You must download 103582-20 or later if you are running 2.51, or 105529-05 or
later if you are running 2.6. As with many Solaris patches, this one must be
installed as root, and a reboot is required. If you have questions about these
patches, you should call Sun Microsystems for assistance.

Workaround

There is a workaround available if you need an immediate fix, while you are in
the process of downloading the patch, but this workaround has an associated
risk. The original problem, both in 103582-11 and in 2.6, was introduced as a
result of a fix for Sun Bug #1182957, the infamous SYN Bombing bug. To avoid
this bug, and protect your Internet servers, the socket structure was changed,
but a mistake was made in the coding and bug #4089811 was introduced.
This workaround will temporarily undo the fix for #1182957, so you will be
vulnerable to SYN Bombing again. You must understand that Oracle cannot accept
responsibility for the results of a SYN Bombing attack on your system(s). The
results will be temporary denial of connection requests for a period of about 5
minutes following each attack. For complete details, you should visit the
SunSolve web site (http://sunsolve.sun.com/) and examine both of these bugs,
with especial attention to bug #1182957.

If you are still anxious to implement the workaround after all of this, then
give the following command as root:

# ndd -set /dev/tcp tcp_conn_req_max_q0 0

This command must be repeated after each reboot of the system to reset the
parameter in the TCP driver.

--
David Fitzjarrell
Oracle Certified DBA


Sent via Deja.com
http://www.deja.com/

Received on Tue Jan 16 2001 - 14:06:51 CST
