Re: A Real Stinker to Solve - EXPERTS APPLY WITHIN (DB Crashing)

From: joel garry <joel-garry_at_home.com>
Date: 12 Oct 2006 15:17:29 -0700
Message-ID: <1160691449.285972.320230@m73g2000cwd.googlegroups.com>

Johne_uk wrote:
> Hi Joel. Some answers - not all complete. I'm not an experienced
> DBA - thanks for the time you have taken to respond.
>
>
> Q. What kind of file system and related hardware are you using? Please
> specify raw/cooked, sync/async, RAID level, number of controllers, SAN,
>
> etc.
> A. Pairs of mirrored disks (RAID 1). Built in RAID controller
> (hardware).
>
> Q. Where exactly are the log files, archived log files and data files?
> A. All data files reside on one mirrored pair
> Binaries and archives logs on second mirrored pair.

That's reasonable enough, considering just a pair. Binaries have low disk access. Archives, redo and undo tend to have the most access in an OLTP system, with the former being mostly sequential writes. It's worth checking usage (but not for this problem).

>
> Q. Does the system ever just kind of slow down for a while?
> A. I've not observed this happening.
>
> Q. Ever see any log file related waits? What is your log buffer size?
> A. In the past there were occasional log waits yes - this avenue was
> pursued by Oracle Support with no success. Ive justed look through a
> weeks worth of alert logs and there were no references to this.
> db_block_buffers=180000

This is for db blocks. log_buffer is the parameter we need to know. Since you didn't list it in the init.ora parameters, it must be default, which is 512 KB or 128 KB * CPU_COUNT, whichever is greater. See
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96533/memory.htm#38479 to see if you need to change it.

>
> Q.How often are you checkpointing? Any messages about it alert log? Any
>
> other messages in alert log besides log switches?
> A. Occasionally get this entry in alert log.
> ARC0: Unable to archive log 5 thread 1 sequence 2405
> Log actively being archived by another process

Could be RMAN backing up. Is that when you see this? If it isn't, could be a problem.

>
> Q.How big are your redo logs? How often do they switch (maximum)?
> How many archivers are running?
> A. There are 5 Log Groups of one member each (100MB in size).
> There are 2 arch processes running - arc0, arc1
> CHECKPOINT PARAMETERS
> db_block_checksum TRUE store checksum in db blocks and check during
> reads
> log_checkpoint_interval 0 # redo blocks checkpoint threshold
> log_checkpoint_timeout 1800 Maximum time interval between checkpoints
> in seconds
> log_checkpoints_to_alert FALSE log checkpoint begin/end to alert file
> db_block_checking FALSE data and index block checking

I meant for you to look in the alert log and see how often they are switching during your busiest times. You probably don't want to log checkpoints to the alert file, they just become annoying.

>
>
> Q.What else besides Oracle is on the server? Are you running multiple
> instances or databases? RAC involved?
> A. There is a Sybase ASA dbase product on the same server.
> Oracle not using RAC and single instance only.
>
> Q.What kernel parameter modifications have you made?
> A.set shmsys:shminfo_shmmax=4294967295
> set shmsys:shminfo_shmmin=1
> set shmsys:shminfo_shmmni=100
> set shmsys:shminfo_shmseg=10
> set semsys:seminfo_semmni=100
> set semsys:seminfo_semmsl=260
> set semsys:seminfo_semmns=720
> set semsys:seminfo_semopm=100
> set semsys:seminfo_semvmx=32767
>

That MAXFILES issue in the other posts could very well be it. Didn't even think of it. The archiver can start up a variable number of archive processes, so you might watch to see if something is happening like, you start up the instance, there's one arc process, one lgwr process, later it decides to start up another arc process, lgwr dies. Although I couldn't say why it wouldn't just barf on the second arc process. Perhaps that is the cause of the "archived by another process" alert - you run out of file handles, arc has a problem, tries to start another arc, lgwr crashes, or something similar. Normally, if you can't archive the logs, the db comes to a screeching halt.

Of course, MAXFILES is an hp-ux thing, a soft limit on the number of files a process may have open. I think the equivalent on solaris is rlim_fd_cur, but that defaults to a much higher number. Do you have thousands of data files? Or thousands of connections to the db? Or thousands of weblogic connections after a couple of days? Or some bizarro utility that looks at files and forgets to close them?

--
@home.com is bogus.
http://tailrank.com/662309/Jury-awards-$11-3M-over-defamatory-Internet-posts
http://www.perkinscoie.com/casedigest/icd_results.cfm?keyword1=defamation&topic=Defamation

Received on Thu Oct 12 2006 - 17:17:29 CDT