
Re: Quick Tale: Lost production database because of keyboard and Veritas Cluster Server

From: RSH <RSH_Oracle_at_worldnet.att.net>
Date: Thu, 28 Feb 2002 03:26:13 GMT
Message-ID: <prhf8.4703$106.268250@bgtnsc05-news.ops.worldnet.att.net>


Mr./Ms. Wright, I generally agree with everything you said, or pretty much everything, as a matter of fact.

But the poor guy SAID he was NOT a DBA and wasn't really familiar with how Suns work, or Oracle, or Veritas, or failover. And all the things you mention are things you have to know in order to do them, which he didn't and therefore couldn't.

That's the trouble with these USENET forums. I hate seeing people in the middle of a disaster being yelled at; it's kind of like using a megaphone to yell at a family whose home is now under water, as the National Guard takes them away in a boat: "YOU SHOULDN'T HAVE BUILT IN A KNOWN FLOOD PLAIN AREA! I HOPE YOU AT LEAST TOOK OUT NATIONAL FLOOD INSURANCE!"

As I said, all you said was useful information, but don't you think it's a little mean to dump a whole ton of woulda / coulda / shouldas onto the shoulders of a guy who probably already feels pretty lousy, and who is just begging for help to get out of a mess, not to be lectured about all the things done wrong or not done that led to this crisis? Many of those were probably things over which he had little or no control.

Well, I guess it's a matter of different management philosophies. If someone on my team makes a big boo-boo, we get it fixed working together, then the team gets together to do a post mortem, and we turn tragedy into an educational experience. Making someone feel worse about something they did that they already know was wrong is not the way I handle my DBAs, programmers, SAs, and network folks. The man feels bad enough already.

Or, if you prefer (I think this is a Twain-ism):

"Teaching a pig to sing gets you nowhere. It won't work, and it just annoys the pig."

RSH.
"B.M. Wright" <bmwright_at_xmission.xmission.com> wrote in message news:a5josg$i9g$2_at_news.xmission.com...
> In comp.sys.sun.hardware milkfilk <milkfilk_at_yahoo.com> wrote:
> > I'm posting to multiple groups so that I might save someone's neck.
>
> > Me:
> > I'm not a DBA and I'm no Veritas expert.
>
>
> > Background:
>
> > The cluster is configured for failover so if one server blows up, the
> > other server mounts the disk and starts the oracle processes and
> > starts up the db instances.
>
>
> > What happened:
> > I pulled the keyboard plug on our Sun server while rewiring our KVM
> > switch. Yes, I know.
>
> As I'm sure others will point out, why the hell did you have a
> keyboard/graphics console hooked up to a database server?
>
> > What this does (according to usenet posts) unfortunately, is send a
> > Stop-A signal (in the form of an electrical short, I suppose). This
> > shouldn't be a problem, because the server is simply 'paused' and in
> > most instances you can simply type go and there shouldn't be any large
> > consequences. Of course, you can't expect to hit Stop-A all the time
> > and get away with it.
>
> Well, this is where you went wrong, you would have been much much
> better off by doing a 'sync' and panicking the machine at that point
> instead of typing 'go'.
>
> > Our cluster is configured for failover, like I said, and the other
> > server mounted the arrays and started up the oracle process. The
> > instance hiccupped but it was running.
>
> Which is what it would be expected to do, however the old instance
> was still running on the other machine that was "paused". By the time you
> "un-paused" it the other machine had already imported the shared diskgroup
> and started the database. Both were writing to the disks and they trashed
> the data. This is why you should have simply crashed the machine instead
> of trying to recover.
>
> > We have an old backup but it's not great and we were in the process of
> > getting our backup procedure tested / working.
>
> It's critical production and you're still testing backups then in
> the meantime monkeying around with hardware while the system is live? No
> more comments there really.
>
> Even if the backup is "old" then provided your DBAs are using
> transaction logging and they've been backing up the logs then you should
> be able to 1) restore "old" backup 2) roll all the transaction logs forward
> and be back to the point you were at when the last logs were dumped/backed up.
> Exceptions being if they had done bulk loads or other things that don't
> use transaction logs (or typically have the logging turned off during the
> operation for performance reasons).
>
> > Let me tell you that I'm shocked that an enterprise system can do
> > this. A keyboard unplug started all of this. I'm looking at
> > disabling the keyboard and this is my job as a UNIX SysAdmin to know
> > this stuff, but the Veritas Cluster should have worked!
>
> It was your mistake, you can't blame the software really when it DID
> work. It did take over the instance on the other machine. I'm pretty
> sure that if you read around in the documentation either for the Oracle agent,
> the Oracle docs, or the VCS docs and find a scenario like this that they
> will recommend you doing exactly what I said (crash the machine that is no
> longer owner of the service/database).
>
> > If one server simply blows up, the other server should pick up the
> > database and certainly not corrupt this "SCN" ...
>
>
> --
> B.M. Wright
> bmwright_at_xmission.com
Received on Wed Feb 27 2002 - 21:26:13 CST
