Re: Oracle RAC cost justification?

From: Matthew Zito <mzito_at_gridapp.com>
Date: Fri, 3 Jun 2005 01:07:11 -0400
Message-Id: <90B8E05A-92BC-49E8-98EF-26C4139A6D8E@gridapp.com>

I can talk about a few:

Major organization in the NE region has two datacenters, each with its own SAN and about 100 servers attached to them - primary and DR site. They followed all the best practices - two independent fibre channel fabrics, every host connected to both fabrics through independent cards using automatic failover software, mapping to two independent ports on the storage array. The one thing they did, though, was merge the A fabric at the primary site and the DR site and the B fabric in the same fashion to make replication easier.

They had dark fiber between the two sites, and the A fabric was merged over one strand of dark fiber and the B fabric over the other strand. Everything runs fine, time goes by, and then it turns out that, due to a cabling error, both fabrics on the primary side were connected to the same card in the DWDM (dark fiber box). That card has a problem, and the fabrics segment, rejoin, segment, rejoin, segment, rejoin, etc. etc.

Now, in Fibre Channel, whenever there is a fabric segmentation or join event, every host has to log out and log back in to the network (think of it like dhcp). However, the drivers on the fibre channel cards turn out to have a safety feature - when there are a certain number of logins/logouts on the fabric in a short enough timeframe, the driver keeps the server logged out until a reboot.

So when the card on the DWDM gets flaky, every host in both datacenters on both fabrics logs in and out about 100 times in 20 seconds, gets permanently ejected from the fabric, and now needs to be rebooted. On top of that, this happened in both datacenters, so now the DR facility is ineligible for failover. All business operations are down for two hours until they can reboot all the servers. Whoops.
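For anyone who wants the lockout behaviour spelled out, here is a minimal Python sketch of a flap counter along the lines described above. It is only an illustration: the class name, the 50-event limit and the 20-second window are assumptions made up for the example, and the real driver's thresholds and logic are not known here.

```python
from collections import deque


class FlapLockout:
    """Hypothetical model of a driver that stays logged out after too many fabric events."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events          # assumed limit, not a real driver value
        self.window_seconds = window_seconds  # assumed sliding window length
        self.events = deque()                 # timestamps of recent login/logout events
        self.locked_out = False

    def record_event(self, timestamp):
        """Record one login/logout event; return True if the host is still on the fabric."""
        if self.locked_out:
            return False
        self.events.append(timestamp)
        # Forget events that have fallen out of the sliding window.
        while self.events and timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        if len(self.events) > self.max_events:
            self.locked_out = True  # stays set until reboot() is called
        return not self.locked_out

    def reboot(self):
        """Only a reboot clears the lockout, as in the incident."""
        self.events.clear()
        self.locked_out = False


host = FlapLockout(max_events=50, window_seconds=20.0)
# About 100 events in 20 seconds, as in the incident: one every 0.2 s.
for i in range(100):
    host.record_event(i * 0.2)
print(host.locked_out)
```

The point of the sketch is that the safety feature is per host, so a fault that flaps both fabrics at once trips it on every host simultaneously.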

-Much simpler one - Hard drive dies in a SAN array. Customer goes to yank out the failed drive, accidentally grabs the drive next to it, pulls it out. Yanking the spinning drive destabilizes the fibre channel drive bus, which takes down the array, and manages to crash both "redundant" array controllers.

-Redundant controller SAN array. Performance suddenly slows to a useless crawl for no apparent reason. After about four hours of troubleshooting, it becomes clear that the controller is only servicing one out of every four interrupts, which turns out to be the bare minimum it has to meet in order to not fail over to the standby controller. All of the applications were down, because the array was so slow as to be useless, but it didn't meet the controller's definition of failure.
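The gap between "slow enough to be useless" and "failed" can be shown with a rough Python sketch of a threshold check. The function name and the way the ratio is measured are hypothetical; only the one-in-four figure comes from the story above, and the actual firmware logic is unknown.

```python
def should_fail_over(serviced, total, min_ratio=0.25):
    """Return True only when the serviced fraction drops below min_ratio.

    min_ratio=0.25 mirrors the "one out of every four" figure; the real
    controller's failure criterion is an assumption here.
    """
    if total == 0:
        return False  # nothing to judge on, so do not fail over
    return serviced / total < min_ratio


# Exactly one in four interrupts serviced: crippled, but not "failed".
print(should_fail_over(25, 100))
# One fewer and the standby controller would take over.
print(should_fail_over(24, 100))
```

A controller sitting right at the threshold never triggers the check, which is why the standby never took over.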

I could go on and on. I have a particularly choice one that I think has another year or so left on the NDA (even without names - this one was big enough that someone on this list would recognize it). We're not even counting the "failures" that arise from DBAs, sysadmins, storage folks, etc. misunderstanding exactly what they've bought and how it works.

Matt

On Jun 2, 2005, at 6:57 PM, Jared Still wrote:

> On 6/2/05, Carel-Jan Engel <cjpengel.dbalert_at_xs4all.nl> wrote:
>
>>
>> On Fri, 2005-06-03 at 00:01, Jared Still wrote:
>>
>>> I agree it could still be a SPOF but it certainly is redundant
>>> component wise...
>>>
>> SANs will and do fail.
>>
>> Oh yes, they do fail. Don't get me started.......
>>
>>
>>
> Well, why not. :)
>
> Tell us about a SAN failure you can discuss.
>
> I've experienced some, but a public forum is not a good place
> for me to vent about it.
>
>
> --
> Jared Still
> Certifiable Oracle DBA and Part Time Perl Evangelist
>
> --
> http://www.freelists.org/webpage/oracle-l
>


Received on Fri Jun 03 2005 - 01:12:42 CDT