Re: Bank Databases

From: Matthew Zito <matt_at_crackpotideas.com>
Date: Mon, 25 Jun 2012 13:35:42 -0400
Message-ID: <CAJ7936yUrH7HOaCcRL3TZ1e+yq8O+G6Xi0vorpPnCcqyWVXZqg_at_mail.gmail.com>



I think there's often a tendency to blame the outsourced team whenever these kinds of issues crop up and there are contractors, remote teams, or offshored folks involved. But in a failure like this, as I think I said upthread, there's plenty of blame to go around:
  • If the offshore people were unqualified, why was management allowing them to do this upgrade?
  • If engineering "got to work on a fix" Wednesday morning, why weren't they involved in the planning to ensure sufficient safeguards?
  • If this system was so critical, why was the vendor not already involved in the upgrade process?
  • If the vendor was involved, why the heck did it take days to get a fix for a major international bank?

I work with software far less operationally critical to normal business execution, and I *still* get direct calls from customers that say, "We're planning to upgrade to 8.2 of your software, and I was wondering if you can take a look at our plan and make sure we're not doing anything wrong?"

I can only imagine the planning they would do if my software would prevent them from allowing people to access their money.

Matt

PS - Full disclosure notice: I work for BMC Software, which makes a job scheduling product that competes with CA's, though I don't work with it and have never used it or even seen a demo - totally different side of the company. So I have no axe to grind against CA, wish them all the best, and my views are definitely not those of BMC.

On Mon, Jun 25, 2012 at 1:15 PM, Powell, Mark <mark.powell2_at_hp.com> wrote:
> The problem is not the CA-7 software in my opinion but the failure of the out-sourced staff to properly use the software.
>
>
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Matthew Zito
> Sent: Monday, June 25, 2012 7:53 AM
> To: Øyvind Isene
> Cc: howard.latham_at_gmail.com; niall.litchfield_at_gmail.com; oracle-l
> Subject: Re: Bank Databases
>
> Doh - resending as got dinged for overquoting:
>
> Timely enough, the Register is reporting that CA's job scheduler software may be responsible:
>
> http://www.theregister.co.uk/2012/06/25/rbs_natwest_what_went_wrong/
>
> Could certainly mean that Oracle was still involved (or Sybase, or some other database), but the inability to schedule jobs was the root issue.
>
> Matt
>
>
>>>>
>>>> I'm particularly interested as we test our failover every 3 months
>>>> and last time we did so there was a power outage on the standby
>>>> which was running temporarily as primary which we hadn't
>>>> anticipated. The start up script tried to bring what was currently a
>>>> primary db as a standby. I'm trying to automate this and yuk without
>>>> dg broker which has its own set of problems I'm a bit stymied!
>>>> I'm not suggesting Nat West hadn't tested their failover, but I
>>>> imagine it's difficult due to volumes.
>>>> On 25 June 2012 12:08, Matthew Zito <matt_at_crackpotideas.com> wrote:
>>>> > Yes, though I doubt it's anything as simple as an "Oracle issue".
>>>> > From my experience watching large organizations deal with complex
>>>> > crises like this, typically it's a series of cascading failures -
>>>> > so perhaps an Oracle database was involved, but many separate
>>>> > pieces had to fail in order to get to this point.
>>>> >
>>>> > For example, I once saw a major global company's firmwide email
>>>> > system go down for over a day due to a cascading series of:
>>>> > - storage array failure
>>>> > - misconfigured hardware
>>>> > - engineer typo
>>>> > - misunderstood recovery architecture
>>>> >
>>>> > I'm trying to keep it vague intentionally, but if any one of those
>>>> > things hadn't happened, they would have had an hour downtime on
>>>> > their email instead of a 30 hour downtime.  I suspect the natwest
>>>> > issue is similar, *though* I do expect that we'll get more info in
>>>> > the coming days/weeks, so maybe we can get some more details then.
>>>> >
>>>> > Matt
>>>> >
>>>> > On Mon, Jun 25, 2012 at 7:01 AM, Howard Latham <howard.latham_at_gmail.com> wrote:
>>>> > >
>>>> > > So Nat west being unable to process transactions for 5 days due to a
>>>> > > change in backup software and fail over could well be an Oracle issue.
>>>> > >
>>>> > > --
>>>> > > Howard A. Latham
> --
> http://www.freelists.org/webpage/oracle-l
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Jun 25 2012 - 12:35:42 CDT
