Re: DBWR - How Many is Too Many?
Date: Thu, 28 Feb 2008 12:18:21 +0000
Just a few ideas etc.
- I don't know the IBM DS4800 in any detail but you say your using a "128k element size". Is this the stripe width? If it is then maybe it's a bit small. I believe ASM uses 1MB for stripe width.
- Identify a known part of the application e.g. a specific report, that is performing poorly i.e. worse now with the new SAN than previously. Then trace it with a 10046 trace level 8. This will give you the actual wait times for a specific part of the application. You can then see what's taking the time, as Oracle sees it. Is it buffer busy waits, db file scattered reads etc. This will help you in finding the real cause. (I know it's SAP but under the covers it's still an Oracle database no matter what SAP say.) Keep an open mind about the cause of the problem while you gather good evidence that can lead you to the real cause.
- Try and get as much info as possible from the OS and IBM DS4800 about real physical I/O rates, disk utilisation, I/O queue lengths etc. This isn't easy as many I/Os are satisfied from the SAN cache. Compare all this with what's realistic in terms of physical I/O given your 12 15K RPM disks. Remember that queuing time increases exponentially with utilisation (see Cary's book).
- Is the data "stripped" over all the disks in such a way that the I/O load is evenly spread across all disks? If not then you may have 1 or 2 "hot" disks that are causing the problem.
- You say your using a SAN. Is this a real SAN or a direct attached disk array? If the former then you may also need to consider the actual storage network itself including the switches.
Quoting David Barbour <david.barbour1_at_gmail.com>:
> We recently moved our database to a new SAN. Performance has just tanked.
> Here's the environment:
> Oracle 184.108.40.206
> SAN - IBM DS4800
> We've got 8 filesystems for Oracle data files. Redo, Archive, Undo and Temp
> are all on seperate disk/filesystems from the data files.
> All the Oracle datafiles are on RAID5 LUNs with 12 15K RPM 73 (68 usable) GB
> drives. SAN Read and Write Caching are both enabled.
> A statspack (generally for any given interval - this was for a period of
> "light" processing) shows me our biggest hit is:
> Buffer wait Statistics for DB: PR1 Instance: PR1 Snaps: 12609 -12615
> -> ordered by wait time desc, waits desc
> Tot Wait Avg
> Class Waits Time (s) Time (ms)
> ------------------ ----------- ---------- ---------
> data block 278,194 20,811 75
> sar is scary (just a small portion)
> AIX r3prdci1 3 5 00CE0B8A4C00 02/27/08
> System configuration: lcpu=8
> 00:00:00 %usr %sys %wio %idle physc
> 02:15:01 19 19 42 19 4.00
> 02:20:00 21 25 40 14 4.00
> 02:25:00 19 18 43 20 4.00
> 02:30:00 18 18 43 21 4.00
> 02:35:00 20 24 40 16 4.00
> We're running JFS2 filesystems with CIO enabled, 128k element size on the
> SAN and AIO Servers are set at minservers = 220 and maxservers = 440
> We've got 32GB of RAM on the server and 4 CPUs (which are dual core for all
> intents and purposes - they show up as eight). We're running SAP which has
> it's own memory requirements. I've configured my SGA and PGA using
> Automatic Memory Management and the SGA currently looks like:
> SQL> show sga
> Total System Global Area 1.0739E+10 bytes
> Fixed Size 757152 bytes
> Variable Size 8589934592 bytes
> Database Buffers 2147483648 bytes
> Redo Buffers 1323008 bytes
> filesystemio_options = setall
> I'm thinking the data block waits is the result of too many modified blocks
> in the buffer cache. Solution would be to increase the number of
> db_writer_processes, but we've already got 4. Metalink, manuals, training
> guides, Google, etc. seem to suggest two answers.
> 1. One db writer for each database disk - in our case that would be 8
> 2. CPUs/8 adjusted for multiples of CPU groups - in our case that would be
> Any thoughts?