Re: Parallel Server and availability?

From: Billy Verreynne <vslabs_at_onwe.co.za>
Date: Thu, 28 Jan 1999 13:46:27 +0200
Message-ID: <78piqa$de5$1@hermes.is.co.za>

mikegrof_at_rocketmail.com wrote in message <78li4n$l21$1_at_nnrp1.dejanews.com>...
>In a traditional Parallel Server setup, <snipped>
>However, the user/job will need to be setup to access a specific node (based
>on the connect string given). So, if a user's PC is setup to connect to
>node1 and node1 crashes, then the user is out of luck. This is great for
>load balancing, but not for availability.
>Is there a product, software and/or hardware, that provides for load balancing
>and availability?

These are two different issues, IMHO, and neither of them is specific to Oracle Parallel Server (OPS). OPS's major function is breaking complex queries up into simpler processes and running these in parallel.

Is load balancing provided by OPS? I don't recall that OPS actually looks at CPU and I/O stats to determine which node's PQs (parallel query processes) to use. So you can wind up with node 1 having 15 PQs running while node 10 has only 1 active PQ.

Availability is provided simply by having other nodes up when one node crashes. This assumes that the crashed node's disks are automatically taken over (at o/s or hardware level) by that node's backup node. For example, if node 1 goes down, node 2 should automatically take over the SCSI bus of that node and provide access to those disks to the rest of the cluster. This, too, has nothing to do with OPS.

So yes, you do have a kind of "high availability" and "load balancing" with OPS, but that is more of a bonus that comes from the hardware/software architecture that OPS needs to run on.

If, for example, you want actual process recovery of the Oracle shadow processes that were running on the crashed node... that's a tough one. You will need a lot of special hardware and software to do this. However, if you want the users simply to reconnect after their clients report that the connection with Oracle has been lost, that can be done with distributed TCP/IP. You use a single virtual IP address for the cluster, and the cluster then connects the user to a node that is up and running.

You can also configure distributed TCP/IP for load balancing across the cluster, so that it connects the user to the active node using the least amount of CPU. So to answer your question: yes, distributed TCP/IP is probably one of the products that fits the requirements of "high availability" and "load balancing". I'm using quotes as this depends on what you understand these two terms to mean.

As for distributed TCP/IP, my experience with it was fleeting and brief, on an MPP mesh. I think it could be configured for load balancing as round-robin connections, connection to the node with the least CPU utilisation, or connection to the node with the least disk utilisation. But I'm speaking under correction here.
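
To make the three connection policies a bit more concrete, here is a minimal Python sketch of how a dispatcher sitting behind one virtual address might pick a node. It is purely illustrative and is not how any particular distributed TCP/IP product works: the `Node` and `Dispatcher` names, the policy strings and the CPU/disk figures are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one cluster node as seen by the dispatcher."""
    name: str
    up: bool
    cpu_pct: float   # assumed CPU utilisation, 0-100
    disk_pct: float  # assumed disk utilisation, 0-100


class Dispatcher:
    """Picks a live node for each new connection (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # round-robin position

    def pick(self, policy="round_robin"):
        # Availability: nodes that are down are never offered to clients.
        live = [n for n in self.nodes if n.up]
        if not live:
            raise RuntimeError("no node available")
        if policy == "round_robin":
            node = live[self._next % len(live)]
            self._next += 1
            return node
        if policy == "least_cpu":
            return min(live, key=lambda n: n.cpu_pct)
        if policy == "least_disk":
            return min(live, key=lambda n: n.disk_pct)
        raise ValueError(f"unknown policy: {policy}")
```

With node1 marked as down, `Dispatcher(nodes).pick("least_cpu")` only ever returns one of the surviving nodes, which is the "reconnect and land on a live node" behaviour described above. Sessions that were running on the crashed node are still lost; the client just gets a working node on its next connect.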

My suggestion is to look at what the actual requirements are. Often the users and managers insist on having high availability, redundancy, load balancing and all the other good stuff, without having the -faintest!- notion of what this really means and really entails. -Real- 24x7!?! Sounds nice, but forget just paying an arm and a leg. You have to donate all your body parts for this! :-)

Often these user requirements can be addressed via some "innovative use" of OPS and the cluster (have ksh and script! ;-). Having different user groups connect to different OPS instances can, for example, spread the load. Or you can run the background admin stuff on one node and disable the listener on that node, so distributed TCP/IP will connect the users to the other nodes. You can run UNIX scripts on the nodes, with one node checking the other every few minutes. It's even possible to initiate the startup of a crashed node again automatically (if the error can be solved by a simple reboot). Or to reconfigure and remove that node from OPS/the cluster so that it can be fixed "outside" the cluster without causing problems for the rest of the cluster. Etc.
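
As a rough idea of the "one node checking the other" script, here is a small sketch. I mentioned ksh above, but it is written in Python to keep it short and self-contained. It assumes that "alive" simply means the peer's listener accepts a TCP connection; the host names and port in the usage note are placeholders, and the restart or remove-from-cluster action is left out because it depends entirely on the site.

```python
import socket


def listener_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, name not resolved, ...
        return False


def failed_peers(peers, probe=listener_is_up):
    """Return the names of peers whose listener does not answer.

    peers maps a node name to a (host, port) tuple. probe can be swapped
    for another check, e.g. one that actually logs on to the instance.
    """
    return [name for name, (host, port) in peers.items()
            if not probe(host, port)]
```

Run from cron every few minutes with something like `failed_peers({"node2": ("node2.example.com", 1521)})` (placeholder host, assumed listener port) and have the script page someone or kick off the restart procedure for whatever it returns. A port check like this only shows that the listener answers, not that the instance behind it is healthy.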

My 2c worth... :-)

regards,
Billy