Re: Weird database hanging

From: Rajeev Prabhakar <rprabha01_at_gmail.com>
Date: Fri, 21 Sep 2007 20:30:58 -0400
Message-ID: <2ba656800709211730g48212570jde71fdfaec7282c3@mail.gmail.com>

Don,

That's interesting. I want to share our experience in case it helps anyone.

While stress testing a two-node 10.2.0.3 RAC/ASM, SAN-based database, we saw near-freezes and hangs, along with ORA-3136 errors and IPC timeouts followed by node evictions.

So we tried all the recommended things: bumped up the sqlnet/listener timeouts, sessions/processes, pga_aggregate_target, shared pool size, etc., without any luck. The near-freeze/hang continued beyond a particular number of concurrent database sessions. We double-checked our OS parameters just in case, but that didn't help.

Later, we decided to increase swap space, because we had observed low available swap during these tests even when memory was available. After the increase, the database hangs and node evictions no longer occurred, and the load tests ran through the allotted window. Concurrency continued to be the #1 wait during these windows, but all our instances (db/asm) survived the load test.
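For anyone who wants to watch for the same low-swap symptom during a load test, here is a minimal sketch of a check. It assumes a Linux host exposing /proc/meminfo (an assumption; the platform is not stated above). The 20% threshold and all function names are hypothetical, chosen only for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_free_fraction(fields):
    """Return SwapFree / SwapTotal, or None if no swap is configured."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return None
    return fields.get("SwapFree", 0) / total


def swap_is_low(text, threshold=0.20):
    """True if free swap is below the threshold (an arbitrary 20% by default)."""
    fraction = swap_free_fraction(parse_meminfo(text))
    return fraction is not None and fraction < threshold


def read_meminfo(path="/proc/meminfo"):
    """Read the live values; the path is Linux-specific."""
    with open(path) as handle:
        return handle.read()
```

Calling swap_is_low(read_meminfo()) at intervals during the test window would flag the condition described above. It reports only that free swap is low, not why, so it is a monitoring aid rather than a diagnosis.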

Now, it is quite possible that we haven't fixed the root cause and that this is just a distraction giving us a temporary breather.

Anyway, if we find something later (e.g. a bug), I'll let everyone know.

-Rajeev

On 9/21/07, Don Seiler <don_at_seiler.us> wrote:
>
> We *think* we have found the issue, and it isn't quite Oracle-related
> (of course).
>
> The SA had been doing a Veritas online relayout on the disk partition
> that is our archivelog destination. He aborted it, but rather than
> aborting, Veritas left it in a "paused" state. This happened 20
> minutes before the bulk load that caused our first instance hang.
> Note that we *were* able to archive logs, it just seemed to have
> caused some more waiting than normal. This was compounded during bulk
> loads, and in the end caused a crush of shared pool and library cache
> latches.
>
> This situation was discovered yesterday and the times seemed all too
> coincidental. The state was corrected and we've been happily bulk
> loading anything and everything since then.
>
> In the end, we recognize there is plenty of room for improvement in
> the application code (and horrible inefficiencies in the app database
> design), but we're quite certain that wasn't the root cause of this
> problem. I'm still pretty upset with Oracle support over their
> blinders, their insistence that the problem was "properly diagnosed",
> and the way they ignored all of my input and feedback.
>
> Don.
>
> --
> Don Seiler
> oracle: http://ora.seiler.us
> ultimate: http://www.mufc.us
> --
> http://www.freelists.org/webpage/oracle-l


Received on Fri Sep 21 2007 - 19:30:58 CDT