Re: RAC or Large SMP...?

From: Tim X <timx_at_nospam.dev.null>
Date: Tue, 07 Oct 2008 19:40:21 +1100
Message-ID: <878wt0ol1m.fsf@lion.rapttech.com.au>


mccmx_at_hotmail.com writes:

> I support an OLTP application which handles 2 million transactions per
> day and is running on 10gR2 EE on RHEL 4 x86_64.
>
> I am investigating scaling options for the application and I'm trying
> to decide between 2 large SMP servers or a multi-node RAC
> configuration.
>
> As far as I can tell, the highest number of cores available in an
> x64_64 server is 24. This would only allow us to handle 6 times the
> current workload, and realistically we need to be able to support up
> to 20 times the load.
>
> Has anyone had any experience of comparing the 2 approaches with
> respect to cost, manageability, performance, etc. Can you offer any
> advise and/or pointers to resources to help out with this
> investigation.
>
> Specifically I'm interested in:
>
> What is the most powerful x86_64 machine available..?
> Can Oracle scale well on NUMA based architectures (as some of the high
> end x64_64 based servers seem to be)..?
> Is the cost difference between '2 x large SMP' and 'multi-node RAC'
> large enough to justify the extra complexity of administering a
> cluster environment...?
>
> Any assistance on this would be greatly appreciated.
>

We recently moved to RAC running on Red Hat 64-bit servers. We adopted RAC rather than fewer boxes with higher numbers of cores for a number of reasons. Note that much of this is based on assumption rather than quantifiable evidence.

  1. The extent to which any program can fully utilise multi-core architectures depends largely on how it was written. Parallel processing is non-trivial and we have found that many programs are just not written in a way that can make efficient use of multiple cores. Note that this relates to all the programs in your technology stack and not just Oracle. Even within Oracle, consider things like the Java runtime engine etc.
  2. Pricing seems more favorable with standard multi-core CPUs compared to CPUs with a higher number of cores. Obviously, this will change as higher core counts become more prevalent. When you have multiple cores, you also need the rest of the system to be able to keep the cores adequately 'fed'. There is no point having lots of cores if the system can't get the data to them fast enough. This tends to further increase the initial investment required to purchase another server.
  3. We felt that Oracle's RAC was probably more mature than Oracle's ability to take advantage of multiple cores. We also felt there was more knowledge, experience and expertise available for RAC related issues compared to possible multi-core related issues.
  4. The level of scaling we are likely to need seemed better suited to an environment where we could just add another box to the cluster rather than have to spend a lot of money to get another system with lots of cores that may be a lot more powerful than we require.
  5. We were able to do a good deal for our RAC license. As usual, Oracle's initial price was ridiculous, but after considerable haggling, we were able to find a solution that was quite good.
  6. Risk mitigation. We felt that a cluster of many servers rather than just a couple of servers with high core counts was a better risk mitigation strategy. If we lose one node, the impact may be felt, but probably not as much as losing one of just a couple of large servers, which would take out the equivalent of 4 or more RAC clustered boxes.
  7. We run many Oracle instances and applications. If you only have a single app, then YMMV. With the multiple database instances we have, RAC just seemed to be a better fit.
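
To put rough numbers on points 4 and 6, here is a back-of-envelope sketch in Python. The core counts are made up for illustration (they are not our actual configuration), and it assumes capacity scales linearly with cores, which ignores RAC interconnect overhead and everything else that matters in practice.

```python
from math import ceil


def servers_needed(total_cores: int, cores_per_server: int) -> int:
    """Number of equal-sized servers needed to supply total_cores."""
    if total_cores <= 0 or cores_per_server <= 0:
        raise ValueError("core counts must be positive")
    return ceil(total_cores / cores_per_server)


def capacity_lost_fraction(nodes: int, failed: int = 1) -> float:
    """Fraction of total capacity lost when `failed` of `nodes` equal servers go down.

    Assumes capacity is spread evenly and scales linearly (a simplification).
    """
    if nodes <= 0 or not 0 <= failed <= nodes:
        raise ValueError("need nodes > 0 and 0 <= failed <= nodes")
    return failed / nodes


# Hypothetical figures: the same 32 cores bought as big boxes or small ones.
TOTAL_CORES = 32
for label, cores_per_server in (("large SMP", 16), ("RAC node", 4)):
    n = servers_needed(TOTAL_CORES, cores_per_server)
    lost = capacity_lost_fraction(n)
    print(f"{label}: {n} servers, one failure removes {lost:.1%} of capacity")
```

With these made-up figures, losing one of two large boxes costs half the capacity, while losing one of eight small nodes costs an eighth. The same granularity applies when growing: the smallest increment is one small node rather than one large server.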

We have been running this configuration for the last 9 months or so. So far, results have been very good and we have had no significant problems moving any of our in-house developed applications (migrated from 8i/9i on Tru64). The increased number of servers hasn't been an issue as far as I can tell, and the sys admins don't seem to be moaning too much about the extra boxes to maintain. We did have some initial teething problems, but I think they were mainly due to the new SAN we implemented at the same time. On our other Linux based clusters, we have had issues with things like Tomcat, and I suspect some of the problems are just due to the relative immaturity of clustered Linux compared to other OSs that have had a longer time to develop mature clustering infrastructure. However, Oracle RAC has performed really well with minimal problems. Performance has been better than expected. Of course, the real test will come when we actually do need to add additional nodes!

Only one of our applications, an HRMS system, has caused us problems. However, our analysis and the analysis of external consultants brought in by HR (don't get me started!) have come to the same conclusion - the problems are all due to how the vendor has implemented their system. For example, we found a major bottleneck in a table called user_temp_table, which was defined as a normal table rather than taking advantage of Oracle's global temporary tables. End user activity is logged to this table and as the day progresses, the system gets slower and slower as the table blows out with millions of transactions. We dropped the table and re-created it as a global temporary table and the problems were gone. However, the vendor said that if we do that, they won't support us, so we had to go back to their brain dead solution.
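
For anyone who hasn't used them: in Oracle a global temporary table is created with CREATE GLOBAL TEMPORARY TABLE ... ON COMMIT DELETE ROWS (or PRESERVE ROWS). The definition is permanent, but each session sees only its own rows and they disappear at commit or at the end of the session. The Python sketch below is only an analogy using SQLite's TEMP tables, which are NOT the same thing (in SQLite the definition itself is per-connection), and the column names are invented since the vendor's real table layout isn't shown here. It just demonstrates the property that matters: per-session rows that never accumulate.

```python
import os
import sqlite3
import tempfile

# Hypothetical columns; the vendor's actual user_temp_table layout is unknown.
DDL = "CREATE TEMP TABLE user_temp_table (user_id INTEGER, action TEXT)"


def open_session(db_path):
    """Open a connection and create its private temp table (SQLite stand-in)."""
    conn = sqlite3.connect(db_path)
    conn.execute(DDL)
    return conn


def row_count(conn):
    """Rows visible to this connection in its own user_temp_table."""
    return conn.execute("SELECT COUNT(*) FROM user_temp_table").fetchone()[0]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "hrms_demo.db")
        session_a = open_session(db_path)
        session_b = open_session(db_path)

        # Session A logs some activity.
        session_a.executemany(
            "INSERT INTO user_temp_table VALUES (?, ?)",
            [(1, "login"), (1, "view_payslip")],
        )
        session_a.commit()

        seen_by_a = row_count(session_a)
        seen_by_b = row_count(session_b)  # B never sees A's rows

        # When A's session ends, its rows go with it.
        session_a.close()
        session_a2 = open_session(db_path)
        after_reconnect = row_count(session_a2)

        session_a2.close()
        session_b.close()
        return seen_by_a, seen_by_b, after_reconnect


print(demo())
```

The three numbers returned are the rows seen by the writing session, by a second concurrent session, and by a fresh session after the first one has closed.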

Watch out for your vendors. The same HRMS vendor originally signed off that their application was certified to run under RAC. Then, as soon as we began to get performance problems, they stated that their software was not certified to run under RAC and that, as we were running on a non-certified platform, the support contract was void. Luckily, we had a signed letter from them stating that they did certify their software to run on Oracle RAC, so they are pretty much stuffed and have to support us. Our analysis indicated that RAC had nothing to do with the performance problems. We have further proven this by running up the system on a non-RAC environment, where the performance is even worse. As far as I'm concerned, the problem is now with our senior managers and legal department - from a technical perspective, I think it's pretty straightforward.

Unfortunately, it's just another one of those situations where you had a great product from a great vendor who was bought out by another vendor who just doesn't have the same level of project management, development planning or technical expertise to both maintain and extend the product. Essentially, the more recent versions are just not as good as earlier ones. This has been combined with poor business practices within the HR department, who insist on blaming the technology rather than looking at how they are using it (same old story I'm afraid - we have all seen it!).

Luckily, I work with two very good DBAs and a couple of talented business analysts and project managers. They take care of most of the icky stuff that I don't have the patience for. This allows me to get into what I enjoy doing. This has further rewards in that the DBAs really seem to like assisting me, because they say the problems I bring them are interesting ones rather than non-problems brought to them by developers who can't be bothered learning about the tools and environment they are working with. This tends to help me even more, because they show and explain stuff to me that they just don't bother doing with the others and, as a consequence, I tend to get more interesting jobs and avoid boring maintenance work.

I still wonder when, if ever, I will feel I know what Oracle has to offer. It wasn't too bad back with Oracle 7, as it seemed to be around for years. Then came 8i, which had some exciting new features, but before I could blink, it was 9i. Then I was off doing other work for 3 years, and I come back and it's 10g, and before I'm even comfortable with all the new features it has to offer, I'm told to start getting ready for 11!

HTH

Tim

-- 
tcross (at) rapttech dot com dot au
Received on Tue Oct 07 2008 - 03:40:21 CDT
