
Re: Resend: performance problem with fetch using 8.1.7 32-bit client to access a 64-bit server

From: Ciaran Harron <harronc_at_attglobal.net>
Date: 2 Mar 2004 06:03:55 -0800
Message-ID: <3f71b97e.0403020603.7fd791b2@posting.google.com>


<harronc_at_attglobal.net> wrote in message news:<4028570c_3_at_news1.prserv.net>...
> "Sybrand Bakker" <gooiditweg_at_sybrandb.demon.nl> wrote in message
> news:33pf20t0ks9si86ln3ee646s6cnkc5r95c_at_4ax.com...
> > On Mon, 9 Feb 2004 15:36:25 +0700, <harronc_at_attglobal.net> wrote:
> >
> > >Please ignore previous message.
> > >
> > >We are running 8.1.7.4 64-bit Oracle on Solaris 8/sparc3. We have a
> > >client which was compiled on a 32-bit sparcII platform. We find that
> > >when we run multiple instances of the client it doesn't scale very
> > >well, even though the product vendor believes that it should and we
> > >have evidence to prove this from different sites. We have already
> > >spent a lot of effort tuning the server and we now believe that it is
> > >well tuned. The server spends 90% of its time idle. We have found that
> > >the client also spends about 90% of its time sleeping even though it
> > >has work to do, and there are plenty of free resources (only 25% of
> > >the 24 CPUs are utilised, 40% of memory is free, and IO is healthy).
> > >
> > >When we dig deeper into what the client is doing using pstack, we find
> > >that it spends most of its time fetching from various cursors, and
> > >specifically in the read function. An example stack trace is below:
> > >
> > >fea9e570 read (a, 8077ee, 810)
> > >fefc1a88 sntpread (810, 80560c, 804aec, 8077ee, 804a9c, 7f2544) + 10
> > >fefbf114 ntprd (804aa8, 805108, 7fb3d8, ff257314, 0, 7f9088) + 134
> > >fee1f490 nsprecv (8048c8, ffbe875f, ff285ce8, 0, 7f9088, 7fb3d8) + 58c
> > >fee24184 nsrdr (8048c8, 8080f0, ff285ce0, 0, 27, a) + 1dc
> > >fee09bc8 nsdo (7fcd08, 1, 0, 0, 7f9088, 8048c8) + f14
> > >fee91044 nioqrc (7f250c, ff257314, c27, 7fcd08, 0, 7f9088) + 6c0
> > >fefca86c ttcdrv (7f5c78, 7f5fbc, ff257314, 7f5fac, 7f5fae, 7f5fac) + b7c
> > >fee996ec nioqwa (7f250c, 0, fefc9cf0, 7f5c78, 7f2490, 0) + 48
> > >fed19330 upirtrc (7f2468, 7f250c, 0, 0, 7f1df4, 7f2544) + 7f8
> > >fed6ab60 kpurcsc (7f25e8, ffbe9cae, 7f242c, 7f5c78, 7f6804, 0) + 70
> > >fedccd7c kpufch0 (1f5b2c0, 2, 7f2468, 7f5b74, 1f5b308, 6340) + a68
> > >fedcdc28 kpufch (1f5b2c0, 7f1f0c, 0, ff257314, 7f242c, 2) + 92c
> > >fecebec4 sqlcucFetch (1f5a5d0, ff2848b4, 1, 1f5a5d0, 0, 7f1f0c) + 20
> > >fecda5b4 sqlall (ff2848b4, 1, 22a0c28, 16, 58, 16) + 1bf4
> > >fecd17a8 sqlnst (382a62, ffbedb50, ffbedb50, 1, ff2848b4, a7f0f0) + 7b44
> > >fecb3d3c sqlcmex (0, a7f0f0, ffbedb50, 37d918, ff22d13c, ff257314) + 488
> > >fecb4180 sqlcxt (0, 7615ec, ffbedb50, 37d918, 37d800, d0) + 58
> > >
> > >We run pstack repeatedly and the client is invariably in the above
> > >function.
> > >
> > >The client uses a local bequeath connection so I assume that the
> > >read() is reading from a pipe. I've noticed that the third parameter
> > >of read(), the number of bytes, is always 810. It is possible that
> > >this is the bottleneck. It is the SDU that controls the number of
> > >bytes read. We have not set the SDU, and the Oracle documentation says
> > >that the default value is 2048. But then why are only 810 bytes read?
> > >(We have run pstack many times and it is always 810 bytes.)
> > >
> > >Even if the read size needs tuning, this does not explain why the
> > >client processes do not scale. The performance increases reasonably
> > >linearly up to eight parallel processes, but beyond this the increase
> > >in throughput slopes off, and beyond 14 processes there is no increase
> > >in throughput at all, even though the Oracle server is idle and there
> > >are plenty of free resources.
> > >
> > >We were told that there was a bug when using a 32-bit client to access
> > >a 64-bit server, but we have installed the relevant patch 2298968 and
> > >this did not improve the situation.
> > >
> > >We do not have source code for the client application, but as it calls
> > >sqlcxt I assume it uses Pro*C. Any similar experience or any idea of
> > >what could be causing the problem?
> > >
> > >Ciaran
> > >
> >
> > Most likely your analysis is completely inadequate, and you are just
> > barking up the wrong tree. The influence of sqlnet on total
> > performance is completely negligible. Also, your assumption that
> > sqlnet should always transmit 2048 bytes demonstrates you don't know
> > anything about the interaction between client and server. A sqlnet
> > packet will never ever cross statement boundaries. If you see 810-byte
> > packets all the time, the database only has 810 bytes to send for that
> > statement. For your sqlnet analysis you should rather use the trcasst
> > tool than pstack, and if you insist on using pstack you should trace
> > the dedicated server process serving your client instead of the
> > client. You should also check what is going on in terms of SQL
> > statements, and you should check disk I/O.
> > My crystal ball tells me that if you perform an adequate analysis,
> > you will find you are dealing with an unscalable application.
> >
>
> I would say our analysis is inadequate; it's more a last-ditch desperate
> attempt to solve a problem that we have been unable to solve for the last
> six months. The pstack output always says that the client is doing a
> fetch, although each time for different SQL.
>
> After 6 months of tuning the Oracle server and various SQL, we have the
> following typical statspack report for 1 hour:
>
>
>                                                                  Avg
>                                                     Total Wait   wait  Waits
> Event                             Waits   Timeouts   Time (cs)   (ms)   /txn
> ---------------------------- ---------- ---------- ----------- ------ ------
> db file sequential read         709,534          0     485,112      7   33.2
> latch free                       34,194     29,875      47,577     14    1.6
> db file parallel write           13,493          0      24,937     18    0.6
> log file sync                    22,394          0      21,367     10    1.0
> SQL*Net more data to client     848,722          0      16,448      0   39.7
> log file switch completion          181         37      13,305    735    0.0
> enqueue                              75         23      13,091   1745    0.0
> log file parallel write          21,532          0       9,684      4    1.0
> log file sequential read         65,552          0       3,883      1    3.1
> buffer busy waits                 2,499          0       1,422      6    0.1
> direct path read                 15,594          0         704      0    0.7
> file identify                        36          0         564    157    0.0
> file open                        25,192          0         555      0    1.2
> control file parallel write       1,270          0         218      2    0.1
> db file scattered read              429          0         136      3    0.0
> LGWR wait for redo copy             367          7          26      1    0.0
> control file sequential read        697          0          18      0    0.0
> row cache lock                       85          0           7      1    0.0
> direct path write                 3,147          0           5      0    0.1
> log file single write                16          0           4      3    0.0
> refresh controlfile command          10          0           2      2    0.0
> SQL*Net break/reset to clien          2          0           0      0    0.0
> db file single write                  2          0           0      0    0.0
> SQL*Net message from client  14,090,519          0  26,082,903     19  658.8
> PX Idle Wait                     21,012     21,012   4,307,460   2050    1.0
> SQL*Net message to client    14,090,518          0       7,631      0  658.8
> SQL*Net more data from clien    442,221          0       2,047      0   20.7
>
> Note that 75% of wait time is for db file sequential read, and with an
> average wait time of 7ms this looks reasonable. Most of the reads come from
> a single index cluster table with 300 million rows, so there's not much we
> can do to reduce the number of physical reads. Anyway, the volumes where
> the index cluster is located are never more than 15% busy (we're using
> multiple fibre channel connections to an enterprise storage array with 24G
> of internal cache; it's fast).
>
> We were worried about the latch free waits which are consistently 7.5% of
> the total wait time. The latch stats are as follows:
>                                     Get                       Spin &
> Latch Name                     Requests      Misses    Sleeps Sleeps 1->4
> -------------------------- ------------ ----------- --------- ---------------------
> cache buffers chains         94,983,959      44,728     1,670 43094/1608/19/7/0
> library cache                 8,691,838      34,134    10,298 27914/2268/3829/123/0
> checkpoint queue latch        6,135,872      25,742    21,883 3868/21865/9/0/0
> redo allocation               5,090,786      16,847       111 16736/111/0/0/0
> session idle bit             28,284,866       6,229        79 6150/79/0/0/0
> dml lock allocation           1,284,074       1,529        15 1514/15/0/0/0
> active checkpoint queue la       25,554         336        14 322/14/0/0/0
> enqueue hash chains           1,421,549         334        17 317/17/0/0/0
> undo global data                271,350         164         9 155/9/0/0/0
> redo writing                    135,560         130         8 122/8/0/0/0
> messages                        201,923         115         2 113/2/0/0/0
> cache buffers lru chain         676,086          69        45 24/45/0/0/0
> shared pool                     105,216          50        40 29/5/13/3/0
> session allocation              109,383          41         3 38/3/0/0/0
>
> There is some library cache latch contention. Our application vendor has
> done a lot of work to reduce to an absolute minimum the amount of parsing
> done by our application. It always uses bind variables, and some SQL which
> caused unexplained high parsing has been rewritten. There's nothing more we
> can do in this area. The remaining contention on the library cache is now
> caused by different applications accessing the same database.
>
> A big chunk of the latch free waits are on the checkpoint queue latch. I
> don't believe this directly impacts the user processes. We do, however,
> plan to reduce the number of database writers to see if we can reduce this.
>
> So we're not so worried about the latch free waits given the above analysis.
>
> Bear in mind that Oracle spent a lot of its time idle during the statspack
> interval. See the number of SQL*Net message from client wait events.
>
> It is following the above that I turned to using pstack to trace what the
> application is doing. As I said, Oracle spends a lot of its time idle, and
> the client application spends a lot of time sleeping, so I think it makes
> sense to try to find out what it is sleeping on, if it is not waiting for
> Oracle.
>
> Ciaran
>
> >
> > --
> > Sybrand Bakker, Senior Oracle DBA

Just a note to say that we have solved this problem. The problem was indeed in the Oracle client shared library, as we suspected. It is an Oracle bug (2142623 and 1239470) which is present in the Oracle 32-bit and 64-bit clients on Solaris 8 from 8.1.7.2 onwards and which has been fixed in 9.2.0.2. For every fetch, the client library opens, reads and closes the oraus.msb message file. This causes a large overhead in system calls, and the point of contention is in the multiple read/write locks on the file.
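To get a feel for why that pattern hurts, here is a rough, hypothetical micro-benchmark (not from our system; the file path and iteration count are placeholders) contrasting an open/read/close of a small file on every call, which is what the buggy client library does with oraus.msb, with reading the file once and serving it from memory afterwards:

/* Hypothetical micro-benchmark; PATH and ITERATIONS are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 100000
#define PATH "testfile.msb"   /* any small local file stands in for oraus.msb */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    char buf[2048], copy[2048];
    ssize_t len;
    double t;
    int i, fd;

    /* What the buggy library does: one open/read/close per fetch. */
    t = now();
    for (i = 0; i < ITERATIONS; i++) {
        fd = open(PATH, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        read(fd, buf, sizeof buf);
        close(fd);
    }
    printf("open/read/close per call: %.2f s\n", now() - t);

    /* What the workaround achieves: read once, then serve from memory. */
    fd = open(PATH, O_RDONLY);
    len = read(fd, buf, sizeof buf);
    close(fd);
    t = now();
    for (i = 0; i < ITERATIONS; i++)
        memcpy(copy, buf, (size_t)len);
    printf("cached copy per call:     %.2f s\n", now() - t);
    return 0;
}

The absolute numbers will vary by machine; the point is simply that three system calls per fetch, plus the file locking behind them, dwarf the cost of a memory copy.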

Oracle will not fix this bug. However, source code for a smart workaround is available on the Sun developers site at http://developers.sun.com/solaris/articles/oci_cache.html.
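For anyone who cannot reach that article: a workaround of this kind can be built with library interposition, i.e. preloading a small shared object that keeps the message file cached in memory and satisfies the repeated open()/read()/close() calls without touching the filesystem. The sketch below is only my illustration of that technique under simplifying assumptions (a single-threaded client, a single cached file, purely sequential reads with no lseek() handling); it is not the code from the Sun article.

/* Minimal sketch of an LD_PRELOAD interposer that caches *.msb files in
 * memory.  Illustration only: assumes a single-threaded client, one cached
 * file, and sequential read() calls; a production interposer would also
 * have to handle lseek() and concurrent use. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int     (*real_open)(const char *, int, ...);
static ssize_t (*real_read)(int, void *, size_t);
static int     (*real_close)(int);

static char  *cache_buf;       /* in-memory copy of the message file      */
static off_t  cache_len;
static int    cache_fd = -1;   /* descriptor handed back for cached opens */
static off_t  cache_pos;       /* current read offset within the cache    */

static void init(void)
{
    if (!real_open)  real_open  = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    if (!real_read)  real_read  = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    if (!real_close) real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
}

int open(const char *path, int flags, ...)
{
    size_t n = strlen(path);
    mode_t mode = 0;
    va_list ap;

    init();
    if (flags & O_CREAT) {
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    /* Pass everything except Oracle message files (*.msb) straight through. */
    if (n < 4 || strcmp(path + n - 4, ".msb") != 0)
        return real_open(path, flags, mode);

    if (cache_buf == NULL) {            /* first open: slurp the file once */
        struct stat st;
        int fd = real_open(path, O_RDONLY, 0);
        if (fd < 0)
            return fd;
        fstat(fd, &st);
        cache_buf = malloc(st.st_size);
        cache_len = real_read(fd, cache_buf, st.st_size);
        real_close(fd);
    }
    /* Hand back a real but harmless descriptor to stand for the cached file. */
    cache_fd  = real_open("/dev/null", O_RDONLY, 0);
    cache_pos = 0;
    return cache_fd;
}

ssize_t read(int fd, void *buf, size_t count)
{
    init();
    if (fd != -1 && fd == cache_fd) {   /* serve the cached file from memory */
        if (cache_pos + (off_t)count > cache_len)
            count = (size_t)(cache_len - cache_pos);
        memcpy(buf, cache_buf + cache_pos, count);
        cache_pos += (off_t)count;
        return (ssize_t)count;
    }
    return real_read(fd, buf, count);
}

int close(int fd)
{
    init();
    if (fd == cache_fd)
        cache_fd = -1;
    return real_close(fd);
}

On Solaris you would compile something like this into a shared object (e.g. cc -G -Kpic -o msbcache.so msbcache.c -ldl; file names and flags here are illustrative) and set LD_PRELOAD to point at it in the client's environment before starting it. Where the library from the Sun article is still available, it is the safer choice.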

Once we installed the workaround, our throughput increased four-fold. Anyone running a version of Oracle in this range on Sun should be aware of this issue.

Sybrand Bakker, it's time to buy a new crystal ball.

Ciaran
