Re: Parallel Server and availability?

From: <mikegrof_at_rocketmail.com>
Date: Thu, 28 Jan 1999 22:24:57 GMT
Message-ID: <78qo3f$1va$1@nnrp1.dejanews.com>

In article <78piqa$de5$1_at_hermes.is.co.za>, "Billy Verreynne" <vslabs_at_onwe.co.za> wrote:
> mikegrof_at_rocketmail.com wrote in message
> <78li4n$l21$1_at_nnrp1.dejanews.com>...
> >In a traditional Parallel Server setup, <snipped>
> >However, the user/job will need to be setup to access a specific node
> >(based on the connect string given). So, if a user's PC is setup to
> >connect to node1 and node1 crashes, then the user is out of luck. This is
> >great for load balancing, but not for availability.
> >Is there a product, software and/or hardware, that provides for load
> >balancing and availability?
>
> These are two different issues IMHO, and neither of them is specific to
> Oracle Parallel Server (OPS). OPS's major function is breaking complex
> queries up into simpler processes and running these in parallel.

It sounds like you are talking about the Parallel Query Option, which is a very different product from Parallel Server.

> Is load balancing provided by OPS? I don't recall that OPS will actually
> look at CPU and I/O stats to determine which node's PQs to use. So you
> can wind up with node 1 having 15 PQs running while node 10 only has 1
> active PQ.

Can anyone confirm this for 7.x and 8.0? I think I read that 8i will have this capability.

> Availability is provided simply by having other nodes up when one node
> crashes. This is assuming that the crashed node's disks are automatically
> taken over (at o/s or hardware level) by the backup node for the crashed
> node. For example, if node 1 goes down, node 2 should automatically take
> over the SCSI bus of that node and provide access to those disks to the
> rest of the cluster. Also nothing to do with OPS.

What you are describing is a fail-over setup. Yes, this has nothing to do with OPS, but it's not related to my question.
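
To make the client-side problem from the original question concrete, here is a minimal sketch of a client that is given a list of nodes instead of a single one and simply tries them in order. It uses plain TCP sockets, not Oracle's own connect mechanism, and the function name, the host names and the port are assumptions made up for the illustration.

```python
import socket

def connect_first_available(nodes, connect=socket.create_connection, timeout=3.0):
    """Try each (host, port) in order; return (node, connection) for the first that answers.

    `connect` is injectable so the logic can be exercised without a network.
    """
    errors = []
    for node in nodes:
        try:
            return node, connect(node, timeout)
        except OSError as exc:
            # This node is down or unreachable: remember why and try the next one.
            errors.append((node, exc))
    raise ConnectionError(f"no node reachable: {errors}")

# Hypothetical usage: host names and port are placeholders, not a real configuration.
# node, conn = connect_first_available([("node1", 1521), ("node2", 1521)])
```

A client written this way survives the loss of node1 at connect time, but it does nothing for a session that was already running on the node that crashed.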

> So yes, you do have a kind of "high availability" and "load balancing"
> with OPS, but that is more like a bonus because of the hardware/software
> architecture that OPS needs to run on.
>
> If you for example want to have actual process recovery of the Oracle
> shadow processes that were running on the crashed node... that's a tough
> one. You will need a lot of special hardware and software to do this.
> However, if you want the users to simply reconnect after their clients
> report that the connection with Oracle has been lost - that can be done with
> distributed TCP/IP. You use a single virtual IP address for the cluster
> and the cluster will then connect the user to an up and running node. You
> can also configure distributed TCP/IP for a cluster for load balancing.
> So it can connect the user to the active node using the least amount of
> CPU.
>
> So to answer your question, yes, distributed TCP/IP is probably one of
> the products that fits the requirements of "high availability" and "load
> balancing". I'm using quotes as this depends on what you understand these
> two terms to be. As for distributed TCP/IP, my experience with this was
> fleeting and brief, on an MPP mesh. I think it could be configured for
> load balancing in three ways: round robin connections, connection to the
> node with the least CPU, and connection to the node with the least disk
> utilisation. But I'm speaking under correction here.

I know that some "clusters" work this way, but do they work with OPS? It sounds like they should, but I haven't heard from anyone who is actually running one.
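
For readers who have not seen this kind of setup, the connection policies mentioned above (round robin, or the node with the least CPU, skipping nodes that are down) can be sketched roughly as follows. This is only an illustration of the selection logic, not how any particular distributed TCP/IP product does it; the class names and the load figures are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NodeStatus:
    name: str
    up: bool
    cpu_load: float  # fraction of CPU in use, 0.0 to 1.0 (assumed to be reported by the node)

class Dispatcher:
    """Picks a node for each new connection arriving at the cluster's virtual address."""

    def __init__(self, nodes: List[NodeStatus]):
        self.nodes = nodes
        self._next = 0  # counter used by the round robin policy

    def _live(self) -> List[NodeStatus]:
        live = [n for n in self.nodes if n.up]
        if not live:
            raise RuntimeError("no node is up")
        return live

    def round_robin(self) -> NodeStatus:
        # Hand out live nodes in turn, ignoring how busy they are.
        live = self._live()
        node = live[self._next % len(live)]
        self._next += 1
        return node

    def least_cpu(self) -> NodeStatus:
        # Pick the live node currently reporting the lowest CPU load.
        return min(self._live(), key=lambda n: n.cpu_load)
```

Either policy gives availability for new connections only: a crashed node is skipped, but sessions that were on it still have to reconnect.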

> My suggestion is to look at what the actual requirements are. Often the
> users and managers insist on having high availability, redundancy, load
> balancing and all the other good stuff, without having the -faintest!-
> notion of what this really means and really entails. -Real- 24x7!?! Sounds
> nice, but forget just paying an arm and a leg. You have to donate all
> your body parts for this! :-)

True.

> Often these user requirements can be addressed via some "innovative use"
> of the OPS and the cluster (have ksh and script! ;-). Having different user
> groups connect to different OPS instances can, for example, spread the
> load. Or running the background admin stuff on one node and disabling the
> listener on that node - so distributed TCP/IP will connect the users to
> the other nodes. Or running UNIX scripts on the nodes, with one node checking
> the other every few minutes. It's even possible to initiate the startup
> of that crashed node again automatically (if the error can be solved by a
> simple reboot). Or reconfigure and remove that node from OPS/the cluster
> so that it can be fixed "outside" the cluster without causing problems to
> the rest of the cluster. Etc.
>
> My 2'c worth... :-)
>
> regards,
> Billy
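
The "one node checking the other every few minutes" idea can be sketched as a small watchdog loop. This is a rough illustration under stated assumptions, not a tested cluster script: `is_alive` and `restart` are hypothetical callables standing in for whatever check and restart commands the site would actually use, and the interval and failure threshold are arbitrary.

```python
import time

def watch_peer(is_alive, restart, interval_s=300.0, max_failures=3,
               rounds=None, sleep=time.sleep):
    """Poll a peer node; call restart() after max_failures consecutive failed checks.

    rounds=None loops forever; a number limits the loop (useful for trying it out).
    Returns how many times restart() was called.
    """
    failures = 0
    restarts = 0
    done = 0
    while rounds is None or done < rounds:
        if is_alive():
            failures = 0  # peer answered, so forget earlier misses
        else:
            failures += 1
            if failures >= max_failures:
                restart()  # e.g. trigger a reboot or instance startup (site-specific)
                restarts += 1
                failures = 0
        done += 1
        sleep(interval_s)
    return restarts
```

Requiring several consecutive failures before acting avoids restarting a node over a single missed check.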

Thanks,
Mike

Received on Thu Jan 28 1999 - 16:24:57 CST