Re: Wildfire/Oracle/NUMA/QBB affinity on OpenVms/Oracle73

From: Rob Young <young_r_at_eisner.encompasserve.org>
Date: 9 Feb 2001 12:44:24 -0500
Message-ID: <gksoPRM3DeA7@eisner.encompasserve.org>

In article <3A8422B7.669C5D9D_at_uk.sun.com>, andrew harrison <andrew.nospam_at_uk.sun.com> writes:
> Rob Young wrote:
>>
>
> Only you, Rob, could answer a query about someone's
> performance issues on a current brand spanking
> new WildFire with an advert for a processor
> that is currently not available running in the
> machine that will replace the poster's brand spanking
> new WildFire.
>

	Funny you should point a finger at that processor
	that isn't available.  It is helping to win major
	supercomputer bids, or haven't you noticed?

> They haven't had it that long and already it's
> out of date !!

It's called an upgrade.

> This sort of response could have been taken
> directly from the Microsoft book of marketing
> answers: when people complain about MS's latest
> product, what do they do? Sell them the next
> new product.

	Oh?  What part of "better latency" don't you get?  Maybe
	I'm off base on something I say below but in your typical
	fashion you quickly digress into unrelated tangents.

>
> Why not try a more sensible suggestion:
>
> 1. Use OPS
> 2. Upgrade to a more NUMA friendly
> version of Oracle like 8.1.7

Those were mentioned. Read the first sentence I wrote:

Don't overlook hardware helping out too.

> 3. Get Compaq to replace the WildFire
> with a GS140 it will be cheaper and
> it will probably be faster.
>
> With the money the poster saved
> he could have bought you a beer
> and even splashed out on nibbles.

You are losing your touch.

	Here... since it seems you need something to do... address this
	thread:

http://www.deja.com/getdoc.xp?AN=725489146&fmt=text

From: dsiebert_at_excisethis.khamsin.net (Douglas Siebert)
Subject: Re: Why SMP at all anymore?
Date: 08 Feb 2001
Newsgroups: comp.arch

Aaron Spink <spink_at_kraftwerk.pa.dec.com> writes:

>plugh_at_NO.SPAM.PLEASE (Caveman) writes:

>> It's probably the best high RAS box on the market now, IMHO, and
>> it clearly blows a HP V2600 off the planet in scalability.
>>

>I'm fairly confident that most people don't view an UE10K as a high
>RAS box. There have been some very widely published problems with the
>cache design and the lack of ECC on the off chip cache. When I think
>of a high RAS box, I think along the lines of IBM Z series, Tandem,
>and Stratus, and maybe OVMS clusters.

At a consulting gig I did last year, we had a boatload of E6500s, plus two E10Ks, all with the 400MHz US-II w/4MB cache that was the big problem. This was for a brand spanking new very large SAP environment that was being relocated from an existing environment that was all Sun (bit older stuff, but still E10K based). While we were building it, and after the switchover, there was rarely more than a couple days between crashes of one of the machines, what with nearly 300 CPUs in all. The best one I remember was when the main SAP DB server (E10K) fell over one evening, and the failover box (E6500) fell over while the Sun guys had the E10K open to replace the failed CPU. So the whole system was stone dead for nearly two hours. Not good for a system that runs your worldwide business. The customer was quite displeased with Sun, to say the least, to the point where HP has a really good chance at this point to get them to switch to Superdome. This is not a trivial move as this is one of the larger SAP installs around, and these guys have been Sun since they went live.

Anyway, Sun came up with the "patch", which basically "fixes" the problem of corrupt dirty data in the cache by having each CPU flush its cache every 10 seconds -- thus less dirty data is likely to be in the cache at any given time. That reduced the occurrences, but didn't fix it. The big fix was/is supposed to be the CPUs with mirrored cache modules. I talked earlier this week to one of the guys still on the project up there; a while back Sun was out replacing all the CPUs. The very next weekend the SAP DB server fell over, with the exact same problem (at least the failover didn't fail as well this time :) ). Sun is apparently making all kinds of concessions to try to keep the account, since it is large enough HP would probably be happy to put out a press release, since a "customer switches from E10K to Superdome for higher availability" story would be real good for them trying to market the all-new Superdome versus the well known E10K. But from what I hear, after this latest fiasco, Sun would have difficulty keeping the account if they promised them free hardware and support at this point.

Now tell me again about the RAS of E10Ks... I will grant you the comments about scalability, especially when you compare to the V2600, which is basically a rebadged Convex S class that dates back to 1993 or so, but the V2250 and V2500s we had there for other customers were a whole lot less trouble (though granted they ran much smaller SAP setups). The box itself certainly doesn't have near the RAS featureset the E10K has, but if your CPUs are congenitally defective, it really doesn't matter much what else you do, unless you have Tandem or Stratus systems and are running lockstep.

--
Doug Siebert
dsiebert_at_excisethis.khamsin.net


	Have Scotty, Zurg and Johnny Shoe put on their Magic Hats for
	this customer yet?

				Rob




>
> Regards
> Andrew Harrison
> Enterprise IT Architect
>

>> In article <3a7fb92b.2226779_at_news-server>, nsouto_at_nsw.bigpond.net.au.nospam (Nuno Souto) writes:
>> >
>> > With new releases for that specific h/w, it's possible the problem
>> > will go away, although I doubt it will be a complete solution.
>> >
>>
>>         Don't overlook hardware helping out too.  Some people get
>>         antsy when futures are trotted out.  But some folks make
>>         decisions based on futures (Sandia/Celera, Los Alamos, European
>>         SuperComputer Centre) so we shouldn't be that nervous.
>>
>>         This afternoon, the following presentation takes place:
>>
>> http://www.isscc.org/isscc/2001/ap/ap/AP_forWeb_Nov16.pdf
>>
>> 15.6    A 1.2 GHz Alpha Microprocessor with 44.8 GB/s Chip
>>             Pin Bandwidth
>>
>> A. Jain, et al.                               Feb 6, 4:15 p.m.
>> Compaq Computer Corporation, Shrewsbury, MA
>>
>> A 4th generation Alpha microprocessor running at 1.2 GHz delivers up to
>> 44.8 GB/s chip pin bandwidth and dissipates 125W at 1.5V.  It contains a
>> 1.75MB 2nd level write-back-cache, two memory controllers supporting 8
>> Rambus(tm) channels running at 800 MB/s, four 6.4 GB/s inter-processor
>> communications ports, and a separate IO port capable of 6.4 GB/s.  The chip
>> measures 21.1 x 18.8 mm^2 and contains 130M transistors.
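The 44.8 GB/s headline figure in the abstract above can be checked against the per-port numbers it lists. The sketch below is only a plausibility check, not anything from the presentation itself. It assumes each Rambus channel is 2 bytes wide and makes 800 million transfers per second (1.6 GB/s per channel). That reading is an assumption: taken literally, "800 MB/s" per channel would give a smaller total. All names in the code are made up for illustration.

```python
# Figures quoted in the ISSCC abstract (GB/s unless noted).
INTERPROCESSOR_PORTS = 4
INTERPROCESSOR_PORT_GBS = 6.4
IO_PORT_GBS = 6.4
RAMBUS_CHANNELS = 8
QUOTED_CHIP_PIN_GBS = 44.8

# ASSUMPTION (not stated in the abstract): a Rambus channel is 16 bits
# (2 bytes) wide and performs 800 million transfers per second.
ASSUMED_CHANNEL_WIDTH_BYTES = 2
ASSUMED_TRANSFERS_PER_SECOND = 800e6


def memory_bandwidth_gbs():
    """Aggregate memory bandwidth in GB/s under the channel assumption."""
    per_channel_bytes = ASSUMED_CHANNEL_WIDTH_BYTES * ASSUMED_TRANSFERS_PER_SECOND
    return RAMBUS_CHANNELS * per_channel_bytes / 1e9


def chip_pin_bandwidth_gbs():
    """Sum of memory, inter-processor, and IO port bandwidth in GB/s."""
    interprocessor = INTERPROCESSOR_PORTS * INTERPROCESSOR_PORT_GBS
    return memory_bandwidth_gbs() + interprocessor + IO_PORT_GBS


if __name__ == "__main__":
    print(f"memory:          {memory_bandwidth_gbs():.1f} GB/s")
    print(f"chip pin total:  {chip_pin_bandwidth_gbs():.1f} GB/s")
    print(f"quoted total:    {QUOTED_CHIP_PIN_GBS:.1f} GB/s")
```

Under that assumption the three groups of ports add up to the quoted total, which suggests the headline number counts every pin interface on the chip and not memory bandwidth alone.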

>>
>>         What interests me more than bandwidth is latency.  Latency is
>>         the issue (as everyone has bandwidth or soon will, i.e. Power4,
>>         Ultra III) now.  From what we see, latency gets much better
>>         with EV7.  This link is no longer there:
>>
>> http://www.alphapowered.com/alpha21364.ppt
>>
>>         But if it were there you would notice that local latency is
>>         "30 ns CAS latency pin to pin" (slide 17) and L2 latency is
>>         "12 ns load to use" (slide 16) with "15 ns processor to processor
>>         latency" (slide 18, i.e. remote memory routing) so it *appears* if the
>>         memory is two hops away, you may be looking at < 150 ns memory
>>         access if the page is open (sure, add a few dozen nanoseconds for
>>         routing, whatever).
>>
>>         Point is latency for Alpha gets MUCH better and NUMA *should*
>>         become less of an issue for future Alpha hardware.  Perhaps they
>>         talk more about latency this afternoon.
>>
>>                                 Rob
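The back-of-envelope latency estimate above can be written out explicitly. This is a rough sketch of one way to combine the three slide figures, not a model taken from the EV7 presentation. It assumes that a remote access pays the L2 load-to-use time, then the hop latency once per hop in each direction, then the local CAS latency at the home node. The function name and the routing slack parameter are hypothetical.

```python
# Figures quoted from the (no longer available) EV7 slides, in nanoseconds.
L2_LOAD_TO_USE_NS = 12.0   # slide 16
LOCAL_CAS_NS = 30.0        # slide 17, open page
HOP_NS = 15.0              # slide 18, processor to processor


def estimated_access_ns(hops, routing_slack_ns=0.0):
    """Rough open-page memory access time for memory `hops` away.

    ASSUMPTIONS (not from the slides): the L2 lookup time is paid before
    going to memory, and both the request and the reply cross every hop,
    so the hop latency is counted twice per hop.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    round_trip_ns = 2 * hops * HOP_NS
    return L2_LOAD_TO_USE_NS + LOCAL_CAS_NS + round_trip_ns + routing_slack_ns


if __name__ == "__main__":
    for hops in range(3):
        print(f"{hops} hop(s): {estimated_access_ns(hops):.0f} ns")
    # "A few dozen nanoseconds for routing", taken here as 36 ns.
    print(f"2 hops + slack: {estimated_access_ns(2, routing_slack_ns=36.0):.0f} ns")
```

With these assumptions a two-hop access comes to roughly 100 ns before any routing overhead, so the "< 150 ns" figure leaves room for a few dozen nanoseconds of slack.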

>
> -- 
> Andrew Harrison
> Enterprise IT Architect

Received on Fri Feb 09 2001 - 11:44:24 CST