
Re: Quick Tale: Lost production database because of keyboard and Veritas Cluster Server

From: RSH <RSH_Oracle_at_worldnet.att.net>
Date: Thu, 28 Feb 2002 03:00:01 GMT
Message-ID: <R2hf8.4653$106.262531@bgtnsc05-news.ops.worldnet.att.net>


A couple of comments, starting with this: my prayers are with you. I couldn't tell from your note whether you have Production running again or not.

As far as what started it all, the same thing happened to me in Virginia... Important Safety Tip: DO NOT EVER try the one-console-for-many-Sun-servers thing with a standard setup using one of those switches that lets you "effortlessly monitor many systems from one workstation, saving space and energy" or whatever crap they say.

Our console switch was high quality, gold contacts, etc., blah blah blah.

But it sent something Solaris interpreted as an ASCII BRK (break) signal, dropping the switched-to or switched-from system down the well into firmware mode. I remember it in particular because the SA and I were meeting with a roomful of Oracle suits when our poor DBA [meaning one actually employed by the company, not a hired gun] burst into the room with a "Tom, Scott!! Atlanta's burning! The Yankees are coming!!!" sort of hysteria.

So, anyone tempted to go the one-console route (especially with the rack-mountable baby Suns): make sure you use an electronic switch, make sure Sun says it's okay, and test to be sure it is. Inadvertently unplugging the keyboard from SYSCON probably does the same thing. "STOP-A"? Is that BRK, or DC1? (Ctrl-Q, or is it Ctrl-S? I forget; I miss the old Teletype days when that stuff was printed right on the keys...) I don't think anything as simple as BREAK should actually cause the situation the signal is named for, but that's how it is; it at least ought to be user-configurable in the firmware. So yeah, I totally agree with you: no simple one-pulse signal of any sort should be able to put you on the express AMTRAK Metroliner to Hell.

In fact, nothing that could do anything like that has any business being an ASCII signal, or being interpreted as one; if anything, it should bring up some emergency screen from which you could choose to kill the system, or not. Heck, that can't be that hard to do with a firmware change.
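
For what it's worth, and double-check me against your own release's man pages, I believe Solaris lets you turn off the keyboard abort sequence in software these days, which at least takes the yanked-keyboard failure mode off the table. Roughly this sort of thing; treat it as a sketch, not gospel:

    # /etc/default/kbd -- make the console ignore Stop-A / BREAK
    # (verify the exact spelling against kbd(1) on your release)
    KEYBOARD_ABORT=disable

    # or, on a running system without a reboot:
    # kbd -a disable

It doesn't fix the real design problem (a single BREAK should never have this much power), but it beats losing a database over a KVM cable.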

In regard to OPS, I am a devoted fanatic of it, and I'll be getting hell for that from a lot of people here, so don't worry about the heat you're getting.

I don't know how your SCNs got out of whack; that's what my old boss would call 'very bothersome', and you ought to get some people in to find out how it happened. You already chastised yourself about backups, so we won't go there.

How are the machines interconnected and are you using Oracle's standby product or someone else's?

OPS requires a dual fiber link between the Suns and a dual connected shared RAID box, and the use of raw disk I/O, and costs a fair (meaning very substantial) bit of money. And, last time I checked in a vain effort to get my customer to use it, the price includes REQUIRED, no exceptions on-site Oracle support until your OPS is flying right. (Personally, I think that's a prudent move; it's bundled into the price anyway so you have no real choice.)

However, overall, OPS offers a few things that warm-failover things do not:

1. In normal mode, Machine A can be used for online processing, transactions, etc., while Machine B does work like reports and long-running queries. This gets at the root of the eternal conflict between OLTP (online transaction processing) and DSS (decision support system) activity on the same server and instance, while still working with real-time data. Most warm-standby products use the technique of shipping what are called redo log files over to the standby and applying them to the standby database. Depending on the interconnect and the degree of skill that went into setting it all up, this can kill you.

At one customer's site, 20%+ of the production machine's resources (you know, the machine customers use and expect fast performance from?), including its network bandwidth and CPU, were sucked up by the send-the-logs synchronization process. In OPS, both machines use the very same disks: no copying, no updating, nothing to synchronize.

If you don't have that problem (reporting and analysis against live transaction tables), this won't matter much to you, but it sure was killing us. I won't mention the name of the failover product involved, except to say it was NOT Oracle's.

2. Again, because there is only one set of data, you don't run into any synchronization problems. Where we have had OPS, failover was great [the fact that a failover was triggered at all wasn't nice, but there was a working environment, albeit slower than when both machines were up].

3. Which leads, of course, to a careful part of all this: writing scripts and separate initialization files for Oracle, so that, say, an instance used to having X amount of memory on Machine A comes up on an init file that demands a lot less (a rough sketch of what I mean follows right after this list). Don't forget, you're cramming everybody onto one machine, and you can't fit 20 pounds of, uh, er, Oracle into a 10-pound bag. That's a bit different from the warm-standby technique; but what I don't like about dedicated failover boxes (among a lot of other things) is that they basically sit there and do nothing, whereas in OPS you can put that horsepower to some very good uses (like the reports and huge queries some VP wants in the middle of the busiest part of your system's schedule).
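
To make that concrete, here's the flavor of what I mean. The file names and every number below are made up for illustration; your parameters and sizes will differ, so take it as a sketch rather than something to copy:

    # initPROD_normal.ora -- normal running, instance has Machine A to itself
    db_name          = PROD
    db_block_buffers = 200000      # generous buffer cache while A is all ours
    shared_pool_size = 300000000
    processes        = 400

    # initPROD_failover.ora -- same instance squeezed onto Machine B after failover
    db_name          = PROD
    db_block_buffers = 60000       # a lot less memory, since B is carrying its own load too
    shared_pool_size = 100000000
    processes        = 200

The failover scripts simply start the instance with the smaller init file whenever everything lands on one machine.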

'Needless to say', you want RAID-1 mirroring in your RAID box for this (I can sense another whipping coming...). If you really have the money, go for a hugely memory-buffered monster like EMC makes, which also offers a third tier of disks to which everything is mirrored; that tier can be taken offline and dumped to backup without causing I/O bottlenecks and other chaos in production. Other people make similar storage; I mention EMC only because I've met with them extensively and trust the product (because I know it, not because everything else is crap).

Well I said my piece. If there's nothing left of me but a skeleton after all the meat has been chewed off my bones, this'll be why. In any case, good luck and sorry to hear of your troubles.

RSH.

"milkfilk" <milkfilk_at_yahoo.com> wrote in message news:90d82e70.0202271308.3056cee3_at_posting.google.com...
> I'm posting to multiple groups so that I might save someone's neck.
>
> Me:
> I'm not a DBA and I'm no Veritas expert.
>
>
> Background:
> We have two 420r's that are Clustered with Veritas Cluster Server
> (VCS) and are running Veritas Oracle agents for High Availability
> (HA). The servers are hooked up to two A5200 drive arrays that are
> mirrored. The disks on each array are striped (RAID 0+1).
>
> The cluster is configured for failover so if one server blows up, the
> other server mounts the disk and starts the oracle processes and
> starts up the db instances.
>
>
> What happened:
> I pulled the keyboard plug on our Sun server while rewiring our KVM
> switch. Yes, I know.
>
> What this does, unfortunately (according to Usenet posts), is send a
> Stop-A signal (in the form of an electrical short, I suppose). This
> shouldn't be a problem, because the server is simply 'paused' and in
> most cases you can simply type 'go' and there shouldn't be any large
> consequences. Of course, you can't expect to hit Stop-A all the time
> and get away with it.
>
> Our cluster is configured for failover, like I said, and the other
> server mounted the arrays and started up the Oracle processes. The
> instance hiccupped, but it was running.
>
> Come Monday, my DBA told me that the DB wasn't coming back up, and we
> spent 17 hours finding out that the SCN was off and our SYSTEM
> tablespace was corrupted.
>
> We have an old backup but it's not great and we were in the process of
> getting our backup procedure tested / working.
>
> We looked at options with an underground tool, "Data Unloader" (DUL),
> which we were told would cost us $10,000 up front and would take a
> consultant 1 or 2 days to get out here. Our tables and columns would
> come back unnamed (column001, column002, etc.).
>
> Let me tell you, I'm shocked that an enterprise system can do this.
> A keyboard unplug started all of it. I'm looking at disabling the
> keyboard, and it's my job as a UNIX SysAdmin to know this stuff, but
> the Veritas Cluster should have worked!
>
> If one server simply blows up, the other server should pick up the
> database and certainly not corrupt this "SCN" ...
>
> I know I'm going to get flamed because I broke the golden rule (always
> have a current backup), but it's more a matter of my job and available
> time [ref: chickens with no heads]. And this avalanche was all started
> by me screwing with a production system.
>
>
> But amazingly, I still have a job and I have a few comments /
> questions:
>
> 1. We have our backups squared away now, and we are looking at Oracle
> Parallel Server. Is anyone using Oracle Parallel Server with software
> clustering? I was reading about an extension to Veritas that allows
> you to mount a volume more than once (a limitation of Unix-ish
> systems, or so I believe).
>
> 2. We are going to use archive log mode and cold backups.
> 3. Anyone having problems with Veritas Cluster Server?
> 4. Anyone have comments?
Received on Wed Feb 27 2002 - 21:00:01 CST
