RE: Server failures

From: Freeman, Donald <dofreeman_at_state.pa.us>
Date: Tue, 30 Sep 2008 09:55:40 -0400
Message-ID: <55264C4C0484A547B34C0B1A28E219EA18794C122C@ENHBGMBX01.PA.LCL>


Just to follow up, responsibility for problems is hard to assign at my location. The application owners pay for the servers and manage the users, the server team operates and manages the servers, and the database team operates the database. I couldn't tell you how often we endure an outage because nobody is willing to step up and at least say something when something goes wrong. The servers are aging out and failing, and everybody waits for everybody else to take action. Everybody is paralyzed into inaction by fear of the response, "That's not your job, mind your own business." I have a six-year-old production DB server down right now that previously failed back in June. We have servers or VMs that we could have moved it to, but everybody is pretending that it's not their problem.

My DBAs also get testy when I ask them to look into something that is not strictly their responsibility. All of us get nervous when we are clearly on somebody else's turf. When they find something, I can take it up the chain and get something done for the benefit of all of us. I point out to my team that when things reach their logical conclusion and a system fails, they will be the ones working around the clock to move and restore a database on Christmas Eve.

Donald Freeman
Database Administrator II
Commonwealth of Pennsylvania
Department of Health
Bureau of Information Technology
2150 Herr Street
Harrisburg, PA 17103
dofreeman_at_state.pa.us



From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Freeman, Donald
Sent: Tuesday, September 30, 2008 9:34 AM
To: 'Chris.Taylor_at_ingrambarge.com'; ORACLE-L
Subject: RE: Server failures

I'm sure it depends, but I have access to all our database servers and review the server logs when something happens. If I find something, I open a ticket. I'm sure lines of authority vary widely in the field.

Donald Freeman



From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Taylor, Chris David
Sent: Tuesday, September 30, 2008 9:19 AM
To: ORACLE-L
Subject: Server failures

So how many of you are responsible for examining your database servers for hardware/software faults when they crash? Not the database, but the actual machine?

We recently had a server crash, and the machine reported problems when it came back up. It also saved a dump file to be examined, and it reported problems during the POST routine.

Now I get this email from my DBA manager: (paraphrased)

"Chris,

John [pc/lan mgr] requested that we try to put our finger on what caused MachineA to fail over on Saturday. I looked through the logs extensively today [uh huh] and couldn't find anything - can you look around too and see if you find anything?

-Bob"

(Obviously names changed)

Maybe I'm just in a bad mood this morning....grrrr

Chris Taylor
Sr. Oracle DBA
Ingram Barge Company
Nashville, TN 37205
Office: 615-517-3355
Cell: 615-354-4799
Email: chris.taylor_at_ingrambarge.com

--
http://www.freelists.org/webpage/oracle-l
Received on Tue Sep 30 2008 - 08:55:40 CDT
