Re: Anyone has information regarding Oracle7 and Raid configuration ???

From: Roland Knapp <rknapp_at_de.oracle.com>
Date: 1998/04/15
Message-ID: <35346728.BFDD89CA@de.oracle.com>

Hi,

I am attaching a note with some explanations.

RO
Senior Technical Analyst

Table of Contents

ABSTRACT
2. ORACLE7 AND RAID LEVELS
2.1 RAID LEVELS
  2.1.1 RAID 0:  STRIPING WITH NO PARITY
  2.1.2 RAID 1:  SHADOWING
  2.1.3 RAID 0+1:  STRIPING AND SHADOWING
  2.1.4 RAID 3:  STRIPING WITH STATIC PARITY
  2.1.5 RAID 5:  STRIPING WITH ROTATING PARITY
   2.1.5.1 SUMMARY: ORACLE7 AND RAID LEVELS
3. ORACLE7 AND CACHED I/OS
3.1 TERMS AND BASICS
3.2 TYPES OF DISK I/O CACHING
3.3 ORACLE AS A DISK I/O CACHING PRODUCT
3.4 USING ORACLE7 WITH DISK I/O CACHING PRODUCTS
  3.4.1 RELIABILITY
  3.4.2 MEMORY WASTE
  3.4.3 PERFORMANCE EXPECTATIONS
   3.4.3.1 ORACLE DBWR AND DISK I/O CACHING
   3.4.3.2 ORACLE7 LGWR AND DISK I/O CACHING
   3.4.3.3 SYSTEM-WIDE PERFORMANCE
3.5 SUMMARY: ORACLE7 AND DISK I/O CACHING PRODUCTS
4. DIGITAL UNIX-SPECIFIC TOPICS
4.1 LSM
4.2 RAW VS. FILESYSTEMS
  4.2.1 RAW
  4.2.2 UFS
  4.2.3 ADVFS
5. OPENVMS-SPECIFIC TOPICS
5.1 THE SPIRALOG FILESYSTEM
  5.1.1 WHAT IS SPIRALOG?
  5.1.2 SPIRALOG AND ORACLE7

ABSTRACT

This paper is meant to be an exhaustive treatment of the issues concerning
Oracle and disk I/O. It covers the interactions between Oracle7 I/O and:

RAID

disk caching products (hardware and software)

In addition, this paper covers Digital UNIX-specific and OpenVMS-specific
issues.

2. Oracle7 and RAID Levels

This section is a discussion of the various RAID levels, their advantages and
disadvantages, and their use with Oracle7.

2.1 RAID Levels

2.1.1 RAID 0: Striping with No Parity

RAID 0 offers striping only. It is not redundant (hence the name?); there is
no protection against drive failure at all. It is simply a collection of
drives in a stripe configuration.

During an I/O, a single drive gets <chunksize> bytes of I/O before the I/O
continues onto the next drive in the set. For I/Os that fit in a single chunk,
performance is the same as a single disk drive. For I/Os that span more than
one chunk, there may be a slight performance improvement since the disks are
able to do a little work in parallel.
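
As a rough illustration of the chunk-by-chunk layout described above, the
following Python sketch maps a byte range on a striped volume to the drives it
touches. The function name, the round-robin chunk placement, the 64 KB chunk
size, and the 4-drive set are all assumptions chosen for the example; they do
not describe any particular RAID product.

```python
def drives_touched(offset, length, chunk_size, num_drives):
    """Return the ordered list of drive indexes an I/O touches in a stripe set.

    Assumes chunks are placed round-robin: chunk 0 on drive 0, chunk 1 on
    drive 1, and so on, wrapping around after the last drive.
    """
    if length <= 0:
        return []
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return [chunk % num_drives for chunk in range(first_chunk, last_chunk + 1)]


CHUNK = 64 * 1024  # hypothetical chunk size in bytes

# An 8 KB I/O that fits inside one chunk lands on a single drive.
small_io = drives_touched(offset=0, length=8 * 1024, chunk_size=CHUNK, num_drives=4)

# A 256 KB I/O starting on a chunk boundary spans four chunks, one per drive.
large_io = drives_touched(offset=0, length=256 * 1024, chunk_size=CHUNK, num_drives=4)
```

Only the second I/O gives the drives any opportunity to work in parallel, which
is the point made in the paragraph above.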

RAID 0 is useful with Oracle to reduce disk hot spots for Oracle data files.
It is generally not recommended for other Oracle files.

2.1.2 RAID 1: Shadowing

RAID 1 provides redundancy by duplicating an entire disk drive onto another.
It provides complete protection against single drive failures. It is also the
most expensive (in $) form of RAID since it maintains entire copies of disk
drives (perhaps even more than 1 copy).

During a read, any of the drives in the shadow set can be used. During a write,
all drives will eventually be updated with the new data.

When all drives are functioning, reads complete slightly faster than a single
disk read since the controller will route the read to a free (not busy) disk.
Writes take slightly longer than a single disk write. Performance
characteristics are not affected much during a single drive failure. In the
worst case, performance is equivalent to a single disk.

RAID 1 is generally useful to Oracle (if the $ cost is acceptable). RAID 1 can
be used for any Oracle file. It is especially useful for Oracle redo log files
and control files; Oracle only has to issue one redo log I/O, saving code path
and context switching. However, the DBA/system administrator must use the RAID
controller utilities to keep up with failed disks since the shadowing of the
file is hidden from Oracle.

2.1.3 RAID 0+1: Striping and Shadowing

RAID 0+1 is often billed as a separate solution that offers the reduced hot
spot and performance benefits of striping (RAID 0) and the redundancy of
shadowing (RAID 1). It is just as costly as RAID 1.

While RAID 0+1 can be used with Oracle data files, it should not be used with
redo log files.

2.1.4 RAID 3: Striping with Static Parity

RAID 3 attempts to give performance and redundancy of RAID 0+1 without the high
cost associated with RAID 1's 1-for-1 drive redundancy. A number of drives are
ganged together in a RAID 0 stripe set. An additional drive is used to keep
parity information for the stripe set.

During normal operation, RAID 3 gives performance similar to RAID 0. Reads are
striped. Writes require two I/Os, however: one for the data drive and one for
the parity drive. In the event of a single disk failure, the set continues to
function, albeit at reduced performance. Disk blocks from the missing disk are
reconstructed by reading all remaining drives in the set and the parity drive.
RAID vendors typically include cache on board the RAID controller to increase
performance. Note that the parity disk in RAID 3 can be a performance
bottleneck, which is why most RAID vendors go to RAID 5.
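
The reconstruction step can be sketched in a few lines of Python. This sketch
assumes the parity block is a byte-wise XOR of the data blocks in the stripe;
that is the usual scheme, but it is an assumption here since this note does not
say how parity is computed. The block contents and the three-drive set are
made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives, each holding one 4-byte block of a stripe.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# What the dedicated parity drive would hold for this stripe.
parity = xor_blocks(data)

# Simulate losing drive 1: rebuild its block from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Rebuilding one block means reading every surviving drive in the set, which is
why performance drops while a drive is missing.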

RAID 3 is useful for Oracle data files, but not for redo log files.

2.1.5 RAID 5: Striping with Rotating Parity

RAID 5 has performance and redundancy characteristics similar to RAID 3, but
the parity information is spread across all drives, which eliminates the parity
drive as a bottleneck.
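
To make the difference concrete, here is a small Python sketch that lists which
drive holds parity for each stripe. The simple modulo rotation used for RAID 5
is only one possible layout and is an assumption for illustration; real
controllers use a variety of rotation schemes. The 4-drive set is likewise
hypothetical.

```python
def rotating_parity_drive(stripe_number, num_drives):
    """Drive index holding parity for a stripe, under a simple modulo rotation."""
    return stripe_number % num_drives


NUM_DRIVES = 4   # hypothetical set size
STRIPES = 8

# RAID 3 style: every stripe's parity goes to the same dedicated drive.
raid3_parity = [NUM_DRIVES - 1 for _ in range(STRIPES)]

# RAID 5 style: the parity location moves from stripe to stripe.
raid5_parity = [rotating_parity_drive(s, NUM_DRIVES) for s in range(STRIPES)]
```

With static parity every write in the set queues behind the one parity drive;
with rotation the parity writes are spread over all the drives.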

RAID 5 is useful for Oracle data files, but not for redo log files.

2.1.5.1 Summary: Oracle7 and RAID Levels

This is a summary of the various RAID levels and their use with Oracle7. The
numbers in parenthesis refer to the notes that follow the table.

RAID  Type of RAID          Control       Database      Redo Log      Archive Log
                            File          File          File          File

 0    Striping              avoid         OK            avoid         avoid

 1    Shadowing             recommended   OK            recommended   recommended

0+1   Striping +            OK            recommended   avoid         avoid
      Shadowing                           (1)

 3    Striping w/           OK            OK            avoid         avoid
      Static Parity

 5    Striping w/ Round-    OK            recommended   avoid         avoid
      robin Parity                        (2)

Notes:

1. RAID 0+1 is recommended for database files because this avoids hot spots
   and gives the best possible performance during a disk failure. This is a
   costly configuration though.
2. RAID 5 is recommended for database files if RAID 0+1 is too expensive.

3. Oracle7 and Cached I/Os

This section is meant to give perspective on disk caching products and the
Oracle RDBMS. It covers both hardware-based (ESE20 solid state disk drive,
HSZ40 disk controller, Prestoserve disk caching memory module, etc.) and
software-based (UFS, AdvFS, VIOC, I/O Express, RAM disk, etc.) caching products.

3.1 Terms and Basics

In general terms, a cache:

is some small amount of expensive storage

replicates selected portions of some larger, cheaper storage

provides better (faster) access to the stored contents.

For example, a small paint can could be considered a cache compared to a 5
gallon drum of paint. While we could walk back to the 5 gallon paint drum
between brush strokes, it makes more sense to carry a small can of paint with
us for quick, easy access to the paint.

In computing terms, a cache attempts to provide fast access to data that is
held on slow, cheap media by moving the actively used portions of it to faster,
more costly media. For us, the slow, cheap media is disk drives and the faster,
more costly media is memory. So a cache is some amount of memory that is used
to hold the selected contents of a disk drive so that the CPU has quicker
access to the information.

3.2 Types of Disk I/O Caching

Generally, there are two types of caching: write-through and write-back.
These types are differentiated by the policies used to maintain them.

Both types of caching treat read I/Os the same. During a read, the memory
cache is checked first to see if the needed data is there. If it is not, the
read completes from disk, and if appropriate, a copy is saved in the cache to
save the disk I/O on subsequent reads. Performance characteristics for write-
back and write-through cached reads are the same too. If the read is able to
complete from cache, the read will be fast. If the read has to go to disk, the
read will be slow.

The difference between write-through and write-back caching is how they handle
data writes. Both methods will write the data to the cache and the disk. In
write-through caching, the write is not considered complete until the data
makes it to the disk. In write-back caching, a write is complete when the
data makes it to the cache. This difference has both performance and
reliability implications.

Write-back caching performs faster than write-through caching. Because the
write-through cache write has to go to disk, it completes at disk speeds.
Since the write-back cache write completes when the data gets to the cache, it
completes at memory speeds. The increased write performance of write-back
caching comes at a price though.

Write-back caching has vulnerabilities to system failures that write-through
caching does not have. Write-back caching is dependent upon memory. Memory is
not persistent storage, i.e., when it loses power, it forgets everything. So,
writes that an application was told were complete may not actually complete if
the system crashes before the write-back cache has a chance to dump its
contents back to disk. This could leave the application's data in an
inconsistent state.

Writes          Write-Through                   Write-Back

Complete when   data gets to disk               data gets to cache

Write speed     slow - have to wait for disk    fast - memory-to-memory
                                                copy

Vulnerability   none - writes to disk are       high - writes to memory
                persistent                      aren't persistent
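
The two policies in the table can be modelled with a toy Python class. This is
a minimal sketch, not a description of any real caching product: the class
name, its methods, and the use of dictionaries to stand in for disk and memory
are all invented for the illustration.

```python
class CachedDisk:
    """Toy disk with a volatile memory cache in front of it."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.disk = {}       # persistent: survives a crash
        self.cache = {}      # volatile: lost on a crash
        self.dirty = set()   # cached blocks not yet on disk (write-back only)

    def write(self, block, value):
        """Return once the write counts as complete under the chosen policy."""
        self.cache[block] = value
        if self.write_back:
            self.dirty.add(block)       # complete at memory speed, disk later
        else:
            self.disk[block] = value    # write-through: complete after disk write

    def read(self, block):
        """Serve from cache if possible, otherwise from disk (and cache it)."""
        if block in self.cache:
            return self.cache[block]
        value = self.disk.get(block)
        if value is not None:
            self.cache[block] = value
        return value

    def flush(self):
        """Write every dirty cached block back to disk."""
        for block in self.dirty:
            self.disk[block] = self.cache[block]
        self.dirty.clear()

    def crash(self):
        """Power loss: memory contents vanish, disk contents remain."""
        self.cache.clear()
        self.dirty.clear()


through = CachedDisk(write_back=False)
back = CachedDisk(write_back=True)
for device in (through, back):
    device.write(7, "new data")   # both report the write as complete
    device.crash()                # crash before any flush

through_after_crash = through.read(7)   # the acknowledged write is still there
back_after_crash = back.read(7)         # the acknowledged write has been lost
```

The application was told both writes had completed, yet only the write-through
device still has the data after the crash.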

There are a number of variations on write-through and write-back caching, most
notably, write-behind caching. Write-behind caching behaves like write-back
caching (with the same dangers), but with a time guarantee: writes will get to
disk within N seconds after the write gets to the cache. This is an
interesting twist to write-back caching in that it reduces the window of
exposure, but the exposure is still there. OpenVMS' Spiralog sports
write-behind caching (discussed later).

3.3 Oracle as a Disk I/O Caching Product

This may be insulting to the Oracle7 developers, but it's true: Oracle7 is a
fancy disk caching product that happens to understand SQL. The cache is the
buffer cache portion of the SGA. The portion of the disk being cached is the
database files. Like any caching product, Oracle7 is trying to provide fast
access to data that is held on slow, cheap media (disk drives) by moving the
actively used portions of it to faster, more costly media (memory).

The Oracle7 buffer cache is maintained with a write-back algorithm. A read
will be satisfied by the cache if possible. If the data is not in the cache,
the read will be directed to disk and the results will be saved in the cache.
Writes change the block in the cache; they do not immediately go to disk.

Recall that while write-back caching has great performance characteristics on
writes, it also has reliability concerns during failures. To ensure that no
writes to the database blocks are lost, a redo log is maintained. The write to
the redo log includes a list of all database block changes and must occur
before commit is returned to the database user. A redo log write is faster
than a random disk write since it is a spiral write (i.e., no other disk
activity should be on the redo log disk). The combination of a write-back
algorithm and redo logging provides Oracle7 the fastest possible performance
while maintaining complete data integrity and recoverability.
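
The idea of pairing a write-back cache with a persistent log can be shown with
a toy Python model. It is a deliberately simplified sketch and does not attempt
to model Oracle7's actual redo format, checkpointing, or recovery algorithm;
the class, its methods, and the block names are hypothetical.

```python
class TinyDatabase:
    """Toy model: volatile buffer cache, persistent data file and redo log."""

    def __init__(self):
        self.datafile = {}       # persistent, but may be stale
        self.redo_log = []       # persistent, append-only
        self.buffer_cache = {}   # volatile

    def commit(self, block, value):
        # The redo record must reach persistent storage before commit returns.
        self.redo_log.append((block, value))
        # The data block is changed only in memory; it is written back later.
        self.buffer_cache[block] = value

    def write_back(self):
        """Copy cached blocks to the data file (a stand-in for DBWR's job)."""
        self.datafile.update(self.buffer_cache)

    def crash(self):
        """Lose everything held in memory."""
        self.buffer_cache.clear()

    def recover(self):
        """Replay the redo log, in order, over the stale data file."""
        for block, value in self.redo_log:
            self.datafile[block] = value


db = TinyDatabase()
db.commit("block-1", "balance=100")
db.commit("block-1", "balance=75")
db.crash()                              # cache lost before any write-back
stale = db.datafile.get("block-1")      # the data file never saw the changes
db.recover()
recovered = db.datafile["block-1"]      # the log brings the block up to date
```

The model only works because the append to the redo log is truly persistent,
which is exactly the property that write-back caching of redo log files would
take away.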

Note one of the differences between Oracle7 and other disk caching products.
Disk caching products allocate physical memory from the system for their cache.
Oracle7 does not. It allocates the SGA from virtual memory (in order to
maintain portability across platforms, among other reasons). It is up to the
DBA and system administrator to ensure that there is enough physical memory
available on the system so that the operating system does not have to
page or swap Oracle7's cache. The idea of paging a cache is self-defeating
(think about the purpose of a cache again, then think of the consequences of
paging the Oracle7 buffer cache to disk). Further discussion of this is
outside the scope of this paper and is better left to Oracle7 database tuning
guides.

3.4 Using Oracle7 with Disk I/O Caching Products

Disk caching services are provided by operating systems, and separate disk
caching products are available from 3rd party vendors targeted for I/O-intense
environments. For OpenVMS, Digital has VIOC (Virtual I/O Cache) and Executive
Software makes I/O Express. Digital UNIX has Prestoserve, AdvFS, and UFS.
These products have good performance records and are based on well known
technology.

This leads to the question, "How does Oracle (a disk caching product of sorts)
behave when used with other disk caching products?" There are several issues
that arise: reliability, memory waste, and performance.

3.4.1 Reliability

If you want the performance improvements of a disk caching product, it is
important to understand its reliability characteristics. Using unprotected
write-back caching with Oracle7 will probably lead to database corruption if a
system failure occurs. Write-through caching does not cause these reliability
problems. First we will see how these corruptions can happen in general terms,
then we will further define protected vs. unprotected.

Write-back cached database files. Oracle knows at all times where the current
copy of a database block is: either it is in the buffer cache or it is in the
database file (we will ignore Oracle Parallel Server for now, but the same
argument holds). Even when the current copy of a database block is in the
buffer cache, Oracle knows how stale the disk block is and how much information
it needs to keep in order to bring the stale block on disk up to date again in
case of a system failure. When write-back caching is used on database disks
and a system failure occurs, it is possible that Oracle's recovery mechanism
will find a disk database block to be more stale than expected, and have
insufficient information to bring the database block up-to-date.

Write-back cached redo log files. When Oracle says a transaction has been
committed, this really means that Oracle has written the transaction's redo to
a persistent store -- the redo log file -- so that if the system crashes,
Oracle can regenerate the transaction. When write-back caching is used on redo
log files, this redo log write is no longer persistent. During recovery from a
system failure, it is possible that Oracle will not recover transactions that
it said were committed before the failure.

How can these corruptions happen? In essence, Oracle has no idea that disk
caching software is running underneath it. Most disk caching products are
implemented as device drivers or disk controllers and fool the layers of
hardware or software above themselves into thinking the I/O is really done.
Oracle optimizes both the amount of disk I/O it does and the amount of
information it keeps around to bring stale disk blocks up-to-date. Oracle
depends on knowing what is on disk. If the caching software does not get the
data to disk eventually, Oracle cannot recover from system failures.

The distinction between protected and unprotected write-back caching is as
follows. Protected write-back caching ensures that the cached I/O will
eventually get to the disk drive. Protected write-back caching is typically
battery-backed memory implemented as an I/O controller (HSZ40) or as a separate
memory module (Prestoserve). Unprotected write-back caching simply uses the
computer's physical memory to implement the cache with no way to guarantee that
the I/O will get to disk in case of system failure.

Caution is in order even when using protected write-back caching. In case of
system failure, protection against database corruption is only as good as the
battery that is keeping the write-back cache warm. Make sure the write-back
cache hardware can complete the I/Os it said it would, especially after a
system failure. Oracle cannot be held liable for database corruptions due to
write-back cache hardware problems.

3.4.2 Memory Waste

For software-based cache products, one problem that arises from combining them
with Oracle7 is that Oracle data may be doubly cached, once by Oracle and once
by the disk I/O caching product. In the worst case, the user gets no
performance win from the disk I/O caching product and lots of memory is wasted
by storing the same information twice. Some cache products recognize this
situation and allow the DBA or system administrator to disable disk I/O caching
by the product for selected files if desired. OpenVMS' Spiralog and other 3rd
party products provide this selectivity. Unfortunately, the current UNIX
filesystems, UFS and AdvFS, don't.

Memory waste is not an issue with hardware-based caching solutions since they
do not use system memory.

3.4.3 Performance Expectations

It is important to understand how the DBWR and LGWR algorithms work in order to
understand the effect disk caching products can have on Oracle's performance.
Beyond that, the larger needs of the system will determine which kind of
caching product, if any, should be used to attain optimal performance. We
start first by restricting our view to Oracle DBWR and LGWR algorithm
performance with caching, then we will take the broader system-wide view.

3.4.3.1 Oracle DBWR and Disk I/O Caching

There are two notable problems with using disk caching with Oracle7's current
DBWR algorithm. First is that caching disk writes may not necessarily make
Oracle run any faster. Second, if caching does succeed in making DBWR run
faster, it may actually slow down users' ability to get read work done.

DBWR performs parallel, synchronous writes of batches of blocks at a time,
referred to as a write batch. DBWR issues a batch of I/Os (asynchronous I/Os
for both OpenVMS and Digital UNIX), then waits for them all to complete before
continuing processing. This means that the latency time before DBWR begins
doing useful work again is as long as the longest I/O in the batch. Because of
this, there is little reason to have a database disk farm with disks of widely
differing latency times. In other words, there is no performance gain to
having part of a database on a solid state disk (like the ESE20) and another
part on a traditional disk (like an RZ28).
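
The "longest I/O in the batch" argument can be checked with a couple of lines
of Python. The latency figures below are invented purely for illustration and
are not measurements of an ESE20, an RZ28, or any other device.

```python
def batch_wait_ms(io_latencies_ms):
    """Time until every I/O in a parallel write batch has completed."""
    return max(io_latencies_ms)


# Hypothetical per-write latencies in milliseconds (illustrative numbers only).
all_conventional = [12.0, 11.0, 13.0, 12.5]
mixed_with_solid_state = [0.5, 0.4, 13.0, 12.5]

wait_conventional = batch_wait_ms(all_conventional)
wait_mixed = batch_wait_ms(mixed_with_solid_state)
```

Both batches keep DBWR waiting for the same length of time, because a single
slow write in the batch sets the pace for the whole batch.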

Perhaps a system-wide disk caching product is used, caching all DBWR writes.
This simply means that DBWR is able to get back to the work of cleaning the
Oracle7 SGA buffer cache more quickly. Unless keeping the SGA clean is a
problem (and it rarely is), DBWR could be wasting processing time keeping the
cache too clean.

3.4.3.2 Oracle7 LGWR and Disk I/O Caching

The problem with caching LGWR writes is similar to the previous problem
mentioned with DBWR: LGWR is able to get back to work too quickly. The LGWR
design writes out batches of redo at a time, and uses the log write I/O latency
as a natural gating factor to determine the batch size. This design allows the
LGWR algorithm to degrade gracefully under load. As LGWR I/O latency becomes
smaller, so does its batching factor. In the worst case, LGWR is doing a
separate write for each transaction on the system, causing the LGWR code path
executed per transaction to skyrocket. It is possible (and we have seen it)
for LGWR to fire continuously when writing to a cached redo log file, consuming
an entire CPU in an SMP system.
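
A back-of-the-envelope model shows how the batching factor shrinks with write
latency. It assumes commits arrive at a steady rate and that each log write
carries whatever redo accumulated while the previous write was in flight; both
are simplifying assumptions for this sketch, not a description of the real
LGWR code. The commit rate and latencies are hypothetical.

```python
def lgwr_writes_per_second(commits_per_second, write_latency_s):
    """Approximate log writes per second under a simple batching model.

    The batch size is the number of commits that arrive during one write,
    with a floor of one commit per write.
    """
    batch_size = max(1.0, commits_per_second * write_latency_s)
    return commits_per_second / batch_size


RATE = 500  # hypothetical commits per second

# Uncached redo disk: several commits share each log write.
uncached = lgwr_writes_per_second(RATE, write_latency_s=0.010)

# Write-back cached redo: the write returns so fast that batching disappears
# and every commit gets its own log write.
cached = lgwr_writes_per_second(RATE, write_latency_s=0.0001)
```

In this model the cached case issues several times as many log writes for the
same transaction load, so the LGWR code path executed per transaction rises
accordingly.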

3.4.3.3 System-Wide Performance

The performance improvement attributable to a caching product is highly
dependent upon a number of factors:

whether the product is hardware or software based

whether the product is doing write-back or write-through caching

the size of the cache

the read vs. write mix on the system (heavy write favors write-back caching, heavy read favors write-through)

the locality of reference of the I/Os

the performance requirements for different applications on the same system

In the end, whether or not a situation calls for disk caching software depends
on the performance needs and the situation itself.

Assume we have a system where Oracle is the only performance-critical
application on the system. It does not make sense to use software-based disk
I/O caching. Any physical memory that we would have used on the caching
software could be better utilized by Oracle7. For additional performance, we
could consider adding more memory to the system (and giving it to Oracle7) or
perhaps using controller-based, protected write-back caching.

Now let's assume we have a system where both Oracle and another I/O-intense
application share the label "performance-critical". We might consider a
software-based, write-through disk I/O caching product for this situation. If
the disk caching product can be told to cache only the I/O-intense
application, then we will have two tuning "knobs" that we can use to
fine-tune the applications' performance -- one the size of the Oracle buffer
cache, the other the size of the caching product's cache. If the disk caching
product does not have this selectivity, then we might want to give enough
memory to it to make both applications run well and use a very small Oracle
buffer cache.

3.5 Summary: Oracle7 and Disk I/O Caching Products

This is a summary of the use of disk caching products with Oracle7. Numbers in
parenthesis refer to the notes following the table.

Type of Caching     Control    Database   Redo Log     Archive Log
                    File       File       File         File

Write-Through       OK         OK         avoid        avoid

Write-Back,         never      never      never        never
Unprotected

Write-Back,         OK (1)     OK (1)     avoid (2)    avoid
Protected

Notes:

1. Oracle cannot recommend using write-back caching. While it may benefit
   control and database files, there are too many implementation issues that
   affect database integrity to make a wholesale endorsement. If you choose
   to use protected write-back caching, test the cache's ability to recover
   from system failures before relying upon it in production systems.
2. Write-back caching could cause the LGWR to work too hard and consume
   an entire CPU.

4. Digital UNIX-Specific Topics

4.1 LSM
4.2 Raw vs. Filesystems

4.2.1     Raw
4.2.2     UFS
4.2.3     AdvFS

5. OpenVMS-Specific Topics

5.1 The Spiralog Filesystem

This section discusses Digital's new Spiralog filesystem and its use with
Oracle7. Spiralog has also been known as "The New Filesystem" and "Dollar"
(its code name).

5.1.1 What is Spiralog?

Spiralog is a log-structured file system that Digital is selling as a high-
performance alternative to OpenVMS' standard Files-11 filesystem. Spiralog has
the following features and characteristics:

files live on the volume end-to-end with each other

reads are cached in a really large O/S memory cache and support read-ahead
caching, so little disk head movement is expected due to reads

writes to files (only the changed blocks) are written to the end of the
volume, resulting in fast write times (due to spiral writes)

supports both write-through and write-behind (write-back with to-disk
guarantee within 30 seconds) caching

incremental backups are lightning fast since all changes are congregated at
the end of the volume

a nightly defragmentation utility re-coalesces free space to the end of
the volume

5.1.2 Spiralog and Oracle7

We have done a little testing of Oracle7 with Spiralog. These are the
conclusions so far:

Oracle7 works on Spiralog, with restrictions...

do not use write-behind caching on redo log files

do not use write-behind caching on files that contain rollback segments, or
the system tablespace (since it contains a rollback segment); if used, this
will result in a database that thinks it needs recovery and doesn't need
recovery at the same time, and an ORA-600 when the affected data is accessed

can use write-behind caching on other data files

no major transactions-per-second performance improvement

effect on incremental backup time expected to be good but not tested yet

We recommend staying away from Spiralog unless incremental backup time is of
utmost importance.

remove nospam to email me

Alain Barrette wrote:

> Hi,
>
> Any informations, ressources, pointers regarding Oracle7 and a RAID
> level5 configuration (or any other level for that matter) ?
>
> Specifically, performances improvements, i/o distribution across
> multiple disk, pro and con, etc...
>
> Some stats would be welcome too...

--

Received on Wed Apr 15 1998 - 00:00:00 CDT