Re: How I save Cingular Wireless USD 30M

From: Tom Pall <oracle.list_at_gmail.com>
Date: Mon, 27 Aug 2007 15:47:24 -0500
Message-ID: <7cf45f7f0708271347u7e3806b2k2dcbdb5ca05e313c@mail.gmail.com>

Cingular had done its upgrades in situ over the years, because these databases were always gigantic for the hardware they ran on.

It did not occur to Oracle Support experts, whose names I prefer to not mention, that this problem happens with upgrades. It is a /known/ problem with dropping partitioned IOTs that sometimes the recursive DDL does not complete, leaving the data dictionary in a /mess/.

I told everyone, if you'd only read, how you can tell if you have a corrupt data dictionary: run hcheck.sql.

There were no messages in the alert log besides deadlock messages. I repeat: Oracle was throwing /internal/ errors, not the errors listed in the message/event file, which contains all of the ORA-XXXX codes.

This is a known problem, this dropping of IOT partitions. But it does not have a bug number that I know of, because a bug implies something that's been caught by Oracle. They know it exists, but don't know any more about it. Or at least they didn't know any more about it when I left Cingular. I do know that it was not fixed in 10gR1, according to the Oracle Internals gurus I dealt with.

I just don't think everybody gets it. Please try to think outside the box just for a moment. This was not your common, everyday, run-of-the-mill problem that reveals itself in the alert log. I did everything a DBA could do: set events, ran traces, ran Statspack reports, iostat, vmstat. I was able to correlate the slowness of queries and batch jobs with the results from hcheck.sql. I had my boss, the SA, bring back 6-month-old and 1-year-old backups of the databases. In each I checked the results of hcheck and of the Statspack reports, and also looked at the application's log to see how long the batch jobs took to run. The more data dictionary corruption, the slower things ran; the less corruption, the faster things ran. And Oracle Support Internals Group and I predicted the demise of our biggest database. We gave it 5 months; it died after 7, as the hcheck output got bigger and bigger.

I feel that I've explained enough. If you discover that your database seems to be slowing down and deadlocks are appearing even though no one has changed the code or the load, then run hcheck.sql, take the results and contact Oracle Support Internals Group. They and their management will remember that guy at Cingular Wireless with those dozens and dozens of iTARs.

Tom Pall
An Oracle DBA who's very sorry he shared this with Oracle-L:

On 8/27/07, Bill Ferguson <wbfergus_at_gmail.com> wrote:
>
> Well along with "liking to know" how to fix the problem, which evidently
> we won't know unless or until the exact same symptoms appear on our systems
> and Oracle Support divulges the information then, I'd also like to know what
> caused the problems and what exactly the symptoms were.
>
> Just a "slow" database is rather vague. Were there consistent messages in
> the alert log that pointed to something, or was everything acting like 10x
> more users than normal were accessing the system, or what?
>
> Knowing what caused the problem as well would be beneficial, in case the
> same sort of process (or processes) were taken here. I've only done an
> "upgrade" once, usually I prefer to always do a clean install and then
> export from the old version and import into the new version, just so
> everything stays as "clean" as possible, but if this was done at Cingular,
> did anybody have any ideas on how the corruption occurred in the first place?
>
>
> --
> -- Bill Ferguson
>

--
http://www.freelists.org/webpage/oracle-l

Received on Mon Aug 27 2007 - 15:47:24 CDT