What DBAs do at Easter

From: Ingrid Voigt <GiantPanda_at_gmx.net>
Date: Sun, 27 Mar 2016 14:05:33 +0200
Message-ID: <56F7CC8D.20104_at_gmx.net>



Hi,

happy Easter to those who celebrate. For everybody else - here is the fun some colleagues of mine and I had yesterday.

1:30 AM. One of our (physical) hosts loses its array controller. The RAID contained all datafiles for about a dozen database instances: 4 x prod, the rest test / dev. Customer notices at about 3:00 AM and calls our second-level support (SLS, the only person on duty in the whole story). Restarting the host does not help. SLS contacts the hardware vendor and waits.

5:30 AM. DBA (i.e. me) can't sleep, decides to check her email. Very bad idea. My restarting the host does not help either (of course). I send out a call for help to the backup team and start to see what can be done.

Two of the prod databases are protected by Data Guard.

7:00 AM. Failover PRD1 - done. Great.

Failover PRD2 - done. Great. Except the customer's application doesn't connect to the standby host. Firewall. So I send out another call for help to the network people. Offer to migrate the database directly to the application host. We think about this for a while and decide that the negatives (this is in the DMZ, we would have to find storage first, install Oracle, ... and have more databases to work on) outweigh the positives.
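A plain TCP connect to the listener port is one quick way to tell a firewall block from a problem on the database side. The following is only a minimal sketch in Python: the host name is hypothetical, and port 1521 is assumed because it is the default listener port (the actual port is not stated here).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical standby host name; 1521 is the default listener port.
    print(can_connect("standby-host.example", 1521))
```

Run from the application host, a False result points at the network path (firewall, routing, DNS) rather than at the database itself. A True result only shows that the port is reachable, not that the application is configured to use it.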

PRD3 and PRD4 are single instance. SPFiles / Password Files / alert logs are still there. Controlfiles are not. Three copies, all gone. (We might have to rethink our default setup.) Look up the disaster recovery procedure for restoring a controlfile from backup (no autobackup and no recovery catalogue!) to another host. Can be done, but I need help from the backup people for this.

9:00 AM. Network guy has woken up, checked his email too and has changed the firewall configuration. Customer of PRD2 still can't connect. So... log on to his application server, start changing tnsnames.ora. Doesn't help - SQL*Plus works, the application doesn't.

Customer sends call for help to his software vendor - where does the application store its database connection?

10:00 AM. In the meantime backup guy has called back, is on his way to the office and ready to help.

Hardware vendor has called back, re-activated the array, all the volumes seem to be back. We restart the host to test... oops. Volumes are gone again. Second level support and hardware vendor repeat the reactivation. I tell them to not touch anything and start copying files of PRD3 and PRD4 to the second host. Controlfiles, archivelogs, redologs, datafiles. Everything goes smoothly, but it takes another hour. Or two.

12:00 PM. Create new instances, change a couple of spfile parameters, startup nomount - mount - open - great. PRD3 is done.

PRD4 crashes immediately. And again. ORA-00600 [4194]. MOS says: undo corruption, restore from backup. I wouldn't have wanted the backup guy to be bored.

(Two more coworkers have called in and offered assistance. Apparently workaholism is contagious. But by now everything is under control.)

1:30 PM. Restore of PRD4 goes fine. Database opens, customer can't connect. Oops. I need to update DNS name PRD4.oracle.intern to point to the new host. Done. Connection works, we are getting there...

Start copying all the files of test / dev databases to the second host. Just in case.

2:30 PM. Customer's software vendor has called back. Turns out there is a configuration xml file stored somewhere. Update that and customer can connect to PRD2 again.

SLS and hardware vendor still have no idea how to persistently re-activate the array. They offer to send out a technician immediately with a new controller. We thankfully decline - this would take several more hours, all the PRD databases are up and running again, the others can wait.

Write up an emergency report to CIO / CTO / everybody else. Create a ToDo-List for Tuesday. Express my deepest gratitude to all the people involved. We haven't even broken any SLAs.

3:00 PM. Finally start the holiday (except for leaving my phone on...)

Hope you all have a better weekend than I did!

Best regards
Ingrid

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Mar 27 2016 - 14:05:33 CEST