RE: How I save Cingular Wireless USD 30M

From: Randy Johnson <oraclelist_at_sbcglobal.net>
Date: Tue, 4 Sep 2007 20:38:35 -0500
Message-ID: <006c01c7ef5d$7a0148d0$3e126480@scraunch>

Well after reading Tom's comments I'd say there is not enough room in here for his ego anyhow.

Good riddance Tom. Go masturbate your ego somewhere else.

Ooo. Is it getting hot in here?

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Tom Pall
Sent: Sunday, August 26, 2007 12:02 AM
To: Jeremiah Wilton
Cc: Niall Litchfield; oracle-l
Subject: Re: How I save Cingular Wireless USD 30M

Let me state that this is the last challenge I will answer.

The database was spending extra CPU time.

The particular database which went belly up (but I had cloned and fixed and fed the backlogged data to) was unusable. It would not open up. So no steps could have been taken. I could quote iTARS from Oracle Support on this but that is Oracle and Cingular confidential.

The $30 million figure was my boss's. He told me how much they were planning to budget to get the databases working at a proper speed. Not prevent loss of data, just to upgrade hardware.

I'm sorry to say that yes, my solution would have required psychic abilities or perhaps a somewhat talented DBA, because Oracle Support had the problem of the slowness and deadlocks elevated and elevated. They couldn't go any higher and Oracle did not have a solution until I dreamed about data dictionary corruption and came upon hcheck.sql and deduced what that problem was. It took a while for Oracle Support to work with developers to verify that truly this corruption was the cause of the increasing slowness. Then it took a couple weeks of negotiations (threats from Cingular to go to DB2) before they agreed to allow me to fix the data dictionary and keep the database instead of their memorized method of fixing the data dictionary so you could export the data and import it into another database.

How many times do I have to tell you? I ran Statspack reports at the highest level of detail until I was blue in the face. I ran traces. I set events. But I also am by nature intuitive and tend often to use intuition to solve a problem, with facts to back up my intuitive conclusion. So after I provided all of this stuff to Oracle Support, they were at a loss. Well, they were very eager to look at corruption as a cause, because they didn't have another solution.

Yes, the problem was solved, over the duration of my stint with Cingular. (I had one database for which Oracle and I had to work up DML to the data dictionary for a couple of months, then apply it to a clone, which resulted in the clone pegging the CPU with SMON running for 6 weeks straight.) And I had many of these databases. The problem got cleared up when finally all of the 5 types of data dictionary corruption were fixed with a total of 12 techniques, which speeded up the databases (saving $30 million in hardware upgrades and perhaps having to go to RAC), and the databases were then converted to LMT. So yes, I started on the problem during my first week at Cingular and converted the last database to LMT during my last week at Cingular, working on this problem (and the usual development/production DBA work) for the duration of my tenure there. The databases now have 10X as much data as they had when they were built but run as fast as they did when they were built years before.

I am hereby ending my participation in this thread. Flame me all you want, I will just hit the delete key.

Tom in Austin

On 8/25/07, Jeremiah Wilton <jeremiah_at_ora-600.net> wrote:

Tom,

You say that the 'orphaned segments' caused a performance problem. What was the database spending time doing to cause this performance problem? If you had done nothing about the orphaned segments, what would have prevented someone from taking the same steps to manually update the data dictionary at the point that the database became so slow as to be unusable?

Your assertion that you saved Cingular $30MM seems to imply that, had you not taken action, there would have been complete loss of data. Can you characterize how that data loss would have occurred?

This response actually is not very technical. My chief gripe is that it doesn't say how a person like myself with no apparent psychic abilities vis-a-vis Oracle databases might have detected and resolved the problem.

Most people on this list (hopefully) use wait events, preferably via ASH, to detect the root cause of performance problems. How was the time being accounted for in the wait event interface? DD reads are accounted in that interface just as normal index and heap segment reads are. So you can see why some people here who approach problems in an empirical manner might have questions about the character of the problem.

My questions in no way are meant to invalidate the way that you solved the problem. After all, if you solved it, regardless of how you obtained the solution, wasn't the problem solved?

Thanks

Jeremiah Wilton
ORA-600 Consulting
http://www.ora-600.net

Tom Pall wrote:
> I did the traces, ran Statspack till I was blue in the face, set the
> events to trap deadlocks. I did all of the things a DBA would do but
> decided that there was something deeper than just two applications
> colliding, because as I worked the problem over a two week period, I
> noticed the database slowing down. Not waits slowing down, not I/O
> slowing down, just throughput slowing down. Slowing down in ways
> neither I nor Oracle Support could explain before my dream, research in
> Metalink and discovery of hcheck.sql in Metalink.
>
> Is this technical enough?

--
http://www.freelists.org/webpage/oracle-l

Received on Tue Sep 04 2007 - 20:38:35 CDT