RE: Oracle 10g hangs intermittently waiting for I/O
Date: Sat, 16 May 2009 14:33:44 +0300
It looks like a classic case of extremely slow IO at some lower (hardware or hardware driver) level. The IO times are only known once the IOs complete, and that's why the average IO times jump up in iostat only after the "hang" is over.
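To illustrate the point about iostat, here is a minimal Python sketch of how an average IO time can be derived from two snapshots of /proc/diskstats. It assumes the standard whole-disk line layout (reads completed, ms spent reading, writes completed, ms spent writing, IOs in flight) and that the kernel adds IO time to the counters only when a request completes. The names DiskSample, parse_diskstats, interval_await and looks_stalled are made up for this example and are not part of iostat or any other tool.

```python
from typing import Dict, NamedTuple, Optional


class DiskSample(NamedTuple):
    ios_completed: int  # reads + writes completed since boot
    io_ms: int          # ms spent on reads + writes (assumed to be added at completion)
    in_flight: int      # requests issued but not yet completed


def parse_diskstats(text: str) -> Dict[str, DiskSample]:
    """Parse the contents of /proc/diskstats into per-device samples."""
    samples = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short lines (e.g. partition lines on older 2.6 kernels)
        reads, read_ms = int(f[3]), int(f[6])
        writes, write_ms = int(f[7]), int(f[10])
        samples[f[2]] = DiskSample(reads + writes, read_ms + write_ms, int(f[11]))
    return samples


def interval_await(before: DiskSample, after: DiskSample) -> Optional[float]:
    """Average ms per completed IO in the interval, or None if nothing completed."""
    completed = after.ios_completed - before.ios_completed
    if completed == 0:
        return None  # no completions, so there is nothing to average yet
    return (after.io_ms - before.io_ms) / completed


def looks_stalled(before: DiskSample, after: DiskSample) -> bool:
    """True if requests are outstanding but none completed during the interval."""
    return after.ios_completed == before.ios_completed and after.in_flight > 0
```

During the hang the first function returns None while the in-flight count stays above zero; the long service times only show up in the interval in which the stuck requests finally complete. Sampling the in-flight column during the hang is therefore one piece of evidence you can collect without waiting for the averages.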
In my experience the SAs and storage admins sometimes (or often) just perform some kind of healthcheck: they look into their equivalent of alert.log files and if they don't find anything there, they come back with "we did a healthcheck and everything looks fine from our side". What this statement really means is "we have no idea where to look and frankly we don't care, as it's easier to think that it must be a database problem anyway".
Another thing I've unfortunately found out too often is that the storage team (running high-end storage arrays) sometimes doesn't even have proper LUN/port level performance instrumentation enabled. They say it's gonna affect IO performance a lot (even though I'm not a storage guy, I find it a little hard to believe that today's most expensive DMX etc. arrays haven't gotten this right). And that's why their "healthcheck" doesn't show anything.
During most of my sudden IO problem troubleshooting cases we have eventually found out that there had been some change or misconfiguration (like putting the database on slow storage meant for backups, or forgetting to enable some HBAs for multipathing). DBAs can't look into the storage level, but it helps if you can point out (and show hard evidence) that there is definitely a difference in IO performance. That's when the SAs and storage admins go and do yet another "healthcheck", this time taking it seriously thanks to the evidence displayed, and oops, they find out that someone had forgotten to do their work properly.
When there's a lot of fingerpointing going on, visualizing IO stats (before and after) can be a good asset at meetings with different infrastructure teams, as I've written here: http://blog.tanelpoder.com/2008/12/28/performance-visualization-made-easy-perfsheet-20-beta/
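As a rough idea of what such before/after evidence can look like in text form, here is a small Python sketch that buckets two sets of IO latency samples (in ms) into power-of-two ranges and prints them side by side. This is not PerfSheet and not waitprof; the function names and the sample numbers in the usage note are purely hypothetical, and where the samples come from (wait interface, iostat, asmiostat) is up to you.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bucket_ms(latency_ms: float) -> int:
    """Return the power-of-two upper bound (in ms) of the bucket for one latency."""
    bound = 1
    while bound < latency_ms:
        bound *= 2
    return bound


def histogram(samples: Iterable[float]) -> Counter:
    """Count how many samples fall into each power-of-two bucket."""
    return Counter(bucket_ms(s) for s in samples)


def compare(before: Iterable[float], after: Iterable[float]) -> List[Tuple[int, int, int]]:
    """Return (bucket upper bound, count before, count after) rows, sorted by bucket."""
    hb, ha = histogram(before), histogram(after)
    return [(b, hb[b], ha[b]) for b in sorted(set(hb) | set(ha))]


def format_report(rows: List[Tuple[int, int, int]]) -> str:
    """Render the comparison as a plain-text table for pasting into an email."""
    lines = ["{:>8} {:>8} {:>8}".format("<= ms", "before", "after")]
    for bound, n_before, n_after in rows:
        lines.append("{:>8} {:>8} {:>8}".format(bound, n_before, n_after))
    return "\n".join(lines)
```

For example, print(format_report(compare([0.5, 3, 7, 12], [3, 12, 2500, 9000]))) would put the multi-second outliers in their own rows next to an empty "before" column, which is usually harder to argue with than an average.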
So I would first go and ask the SAs and storage admins *what exactly* they checked and saw during their "healthchecks".
If you want to get systematic about this troubleshooting, there are tools for monitoring IO requests at lower kernel levels. Linux has blktrace and systemtap for that. However, neither of these is 100% production-ready. Blktrace requires mounting debugfs and a recent kernel, 2.6.18 I think (which is standard in Red Hat 5.2 or equivalent), and systemtap requires installing the systemtap and kernel debuginfo RPMs.
You probably don't want to start hacking your production environment like this, so I would suggest asking what exactly the SAs and storage admins checked when they said that everything is fine...
-- Regards, Tanel Poder http://blog.tanelpoder.com
Received on Sat May 16 2009 - 06:33:44 CDT
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org
> [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Pawel Kotlarz
> Sent: 16 May 2009 00:35
> To: Rajeev Prabhakar
> Cc: oracle-l_at_freelists.org
> Subject: Re: Oracle 10g hangs intermittently waiting for I/O
> Oracle shows many sessions waiting for direct path read
> (temp). Tanel's waitprof reports single events taking many
> seconds though most of them are below 15ms.
> On the OS level vmstat shows normal reading for some time and
> then sessions in an uninterruptible sleep with no I/O taking
> place. iostat -x and asmiostat (ML 437996.1) show specific
> volumes. Just after the performance returns to normal these
> volumes show much greater queue length (iostat) or much
> greater average read time (asmiostat).
> I ran strace on a process servicing the session on which I
> used waitprof earlier. It stops on a read call.
> Currently I only know that the sysadmins found nothing in
> Linux logs and on a 'system management page'. Unfortunately
> it is difficult to obtain more information from them unless I
> tell what exactly to check...