Jeremy Schneider

Subscribe to Jeremy Schneider feed Jeremy Schneider
Jeremy Schneider
Updated: 17 hours 11 min ago

Patching Time

Wed, 2014-10-15 10:17

Just a quick note to point out that the October PSU was just released. The database has a few more vulnerabilities than usual (31), but they are mostly related to Java and the high CVSS score of 9 only applies to people running Oracle on windows. (On other operating systems, the highest score is 6.5.)

I did happen to glance at the announcement on the security blog, and I thought this short blurb was worth repeating:

In today’s Critical Patch Update Advisory, you will see a stronger than previously-used statement about the importance of applying security patches. Even though Oracle has consistently tried to encourage customers to apply Critical Patch Updates on a timely basis and recommended customers remain on actively-supported versions, Oracle continues to receive credible reports of attempts to exploit vulnerabilities for which fixes have been already published by Oracle. In many instances, these fixes were published by Oracle years ago, but their non-application by customers, particularly against Internet-facing systems, results in dangerous exposure for these customers. Keeping up with security releases is a good security practice and good IT governance.

The Oracle Database was first released in a different age than we live in today. Ordering physical parts involved navigating paper catalogs and faxing order sheets to the supplier. Physical inventory management relied heavily on notebooks and clipboards. Mainframes were processing data but manufacturing and supply chain had not yet been revolutionized by technology. Likewise, software base installs and upgrades were shipped on CDs through the mail and installed via physical consoles. The feedback cycle incorporating customer requests into software features took years.

Today, manufacturing is lean and the supply chain is digitized. Inventory is managed with the help of scanners and real-time analytics. Customer communication is more streamlined than ever before and developers respond quickly to the market. Bugs are exploited maliciously as soon as they’re discovered and the software development and delivery process has been optimized for fast response and rapid digital delivery of fixes.

Here’s the puzzle: Cell phones, web browsers and laptop operating systems all get security updates installed frequently. Even the linux OS running on your servers is easy to update with security patches. Oracle is no exception – they have streamlined delivery of database patches through the quarterly PSU program. Why do so many people simply ignore the whole area of Oracle database patches? Are we stuck in the old age of infrequent patching activity even though Oracle themselves have moved on?

Repetition

For many, it just seems overwhelming to think about patching. And honestly – it is. At first. The key is actually a little counter-intuitive: it’s painful, so you should in fact do it a lot! Believe it or not, it will actually become very easy once you get over the initial hump.

In my experience working at one small org (two dba’s), the key is doing it regularly. Lots of practice. You keep decent notes and setup scripts/tools where it makes sense and then you start to get a lot faster after several times around. By the way, my thinking has been influenced quite a bit here by the devops movement (like Jez Humble’s ’12 berlin talk and John Allspaw’s ’09 velocity talk). I think they have a nice articulation of this basic repetition principle. And it is very relevant to people who have Oracle databases.

So with all that said, happy patching! I know that I’ll be working with these PSUs over the next week or two. I hope that you’ll be working with them too!

Grid/CRS AddNode or runInstaller fails with NullPointerException

Fri, 2014-08-08 13:43

Posting this here mostly to archive it, so I can find it later if I ever see this problem again.

Today I was repeatedly getting this error while trying to add a node to a cluster:

(grid)$ $ORACLE_HOME/oui/bin/addNode.sh -silent -noCopy CRS_ADDNODE=true CRS_DHCP_ENABLED=false INVENTORY_LOCATION=/u01/oraInventory ORACLE_HOME=$ORACLE_HOME "CLUSTER_NEW_NODES={new-node}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={new-node-vip}"
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 24575 MB    Passed
Oracle Universal Installer, Version 11.2.0.3.0 Production
Copyright (C) 1999, 2011, Oracle. All rights reserved.

Exception java.lang.NullPointerException occurred..
java.lang.NullPointerException
        at oracle.sysman.oii.oiic.OiicAddNodeSession.initialize(OiicAddNodeSession.java:524)
        at oracle.sysman.oii.oiic.OiicAddNodeSession.<init>(OiicAddNodeSession.java:133)
        at oracle.sysman.oii.oiic.OiicSessionWrapper.createNewSession(OiicSessionWrapper.java:884)
        at oracle.sysman.oii.oiic.OiicSessionWrapper.<init>(OiicSessionWrapper.java:191)
        at oracle.sysman.oii.oiic.OiicInstaller.init(OiicInstaller.java:512)
        at oracle.sysman.oii.oiic.OiicInstaller.runInstaller(OiicInstaller.java:968)
        at oracle.sysman.oii.oiic.OiicInstaller.main(OiicInstaller.java:906)
SEVERE:Abnormal program termination. An internal error has occured. Please provide the following files to Oracle Support :

"Unknown"
"Unknown"
"Unknown"

There were two notes on MOS related to NullPointerExceptions from runInstaller (which is used behind the scenes for addNode in 11.2.0.3 on which I had this problem). Note 1073878.1 describes addNode failing in 10gR2, and the root cause was that the home containing CRS binaries was not registered in the central inventory. Note 1511859.1 describes attachHome failing, presumably on 11.2.0.1 – and the root cause was file permissions that blocked reading of oraInst.loc.

Based on these two notes, I had a suspicion that my problem had something to do with the inventory. Note that you can get runInstaller options by running “runInstaller -help” and on 11.2.0.3 you can debug with “-debug -logLevel finest” at the end of your addNode command line. The log file is produced in a logs directory under your inventory. However in this case, it produces absolutely nothing helpful at all…

After quite a bit of work (even running strace and ltrace on the runInstaller, which didn’t help one bit)… I finally figured it out:

(grid)$ grep oraInst $ORACLE_HOME/oui/bin/addNode.sh
INVPTRLOC=$OHOME/oraInst.loc

The addNode script was hardcoded to look only in the ORACLE_HOME for the oraInst.loc file. It would not read the file from /etc or /var/opt/oracle because of this parameter.

On this particular server, there was not an oraInst.loc file in the grid ORACLE_HOME. Usually the file is there when you do a normal cluster installation. In our case, it’s absence was an artifact of the specific cloning process we use to rapidly provision clusters. As soon as I copied the file from /etc into the grid ORACLE_HOME, the addNode process continued as normal.

Sometimes it would be nice if runInstaller could give more informative error messages or tracing info!

OSP #3a: Build a Standard Cluster Platform

Thu, 2014-04-17 06:15

This is the fifth article in a series called Operationally Scalable Practices. The first article gives an introduction and the second article contains a general overview. In short, this series suggests a comprehensive and cogent blueprint to best position organizations and DBAs for growth.

We’ve looked in some depth at the process of defining a standard platform with an eye toward Oracle database use cases. Before moving on, it would be worthwhile to briefly touch on clustering.

Most organizations should hold off as long as possible before bringing clusters into their infrastructure. Clusters introduce a very significant new level of complexity. They will immediately drive some very expensive training and/or hiring demands – in addition to the already-expensive software licenses and maintenance fees. There will also be new development and engineering needed – perhaps even within application code itself – to support running your apps on clusters. In some industries, clusters have been very well marketed and many small-to-medium companies have made premature deployments. (Admittedly, my advice to hold off is partly a reaction to this.)

When Clustering is Right

Nonetheless there definitely comes a point where clustering is the right move. There are four basic goals that drive cluster adoption:

  1. Parallel or distributed processing
  2. Fault tolerance
  3. Incremental growth
  4. Pooled resources for better utilization

I want to point out immediately that RAC is just one way of many ways to do clustering. Clustering can be done at many tiers (platform, database, application) and if you define it loosely then even an oracle database can be clustered in a number of ways.

Distributed Processing

Stop for a moment and re-read the list of goals above. If you wanted to design a system to meet these goals, what technology would you use? I already suggested clusters – but that might not have been what came to your mind first. How about grid computing? I once worked with some researchers in Illinois who wrote programs to simulate protein folding and DNS sequencing. They used the Illinois BioGrid – composed of servers and clusters managed independently by three different universities across the state. How about cloud computing? The Obama Campaign in 2008 used EC2 to build their volunteer logistics and coordination platforms to dramatically scale up and down very rapidly on demand. According to the book In Search of Clusters by Gregory Pfister, these four reasons are the main drivers for clustering – but if they also apply to grids and clouds then then what’s the difference? Doesn’t it all accomplish the same thing?

In fact the exact definition of “clustering” can be a little vague and there is a lot of overlap between clouds, grids, clusters – and simple groups of servers with strong & mature standards. In some cases these terms might be more interchangeable than you would expect. Nonetheless there are some general conventions. Here is what I have observed:

CLUSTER Old term, most strongly implies shared hardware resources of some kind, tight coupling and physical proximity of servers, and treatment of the group as a single unit for execution of tasks. While some level of single system image is presented to clients, each server may be individually administered and strong standards are desirable but not always implied. GRID Medium-aged term, implies looser coupling of servers, geographic dispersion, and perhaps cross-organizational ownership and administration. There will not be grid-wide standards for node configuration; individual nodes may be independently administered. The grid may be composed of multiple clusters. Strong standards do exist at a high level for management of jobs and inter-node communication.

Or, alternatively, the term “grid” may more loosely imply a group of servers where nodes/resources and jobs/services can easily be relocated as workload varies. CLOUD New term, implies service-based abstraction, virtualization and automation. It is extremely standardized with a bias toward enforcement through automation rather than policy. Servers are generally single-organization however service consumers are often external. Related to the term “utility computing” or the “as a service” terms (Software/SaaS, Platform/PaaS, Database/DaaS, Infrastructure/IaaS).

Or, alternatively, may (like “grid”) more loosely imply a group of servers where nodes/resources and jobs/services can easily be relocated as workload varies.

Google Trends for Computers and Electronics Category

Google Trends for Computers and Electronics Category

These days, the distributed processing field is a very exciting place because the technology is advancing rapidly on all fronts. Traditional relational databases are dealing with increasingly massive data volumes, and big data technology combined with pay-as-you-go cloud platforms and mature automation toolkits have given bootstrapped startups unforeseen access to extremely large-scale data processing.

Building for Distributed Processing

Your business probably does not have big data. But the business case for some level of distributed processing will probably find you eventually. As I pointed out before, the standards and driving principles at very large organizations can benefit your commodity servers right now and eliminate many growing pains down the road.

In the second half of this article I will take a look at how this specifically applies to clustered Oracle databases. But I’m curious, are your server build standards ready for distributed processing? Could they accommodate clustering, grids or clouds? What kinds of standards do you think are most important to be ready for distributed processing?

Chicago Oracle User Community Restart

Mon, 2014-03-24 13:47

Chicago is the third largest city in the United States. There are probably more professional Oracle users here than most other areas in the country – and yet for many years now there hasn’t been a cohesive user group.

But right now there’s an opportunity for change. If the professional community of Chicago Oracle users steps up to the plate.

Chicago Oracle User Group

First, the Chicago Oracle User Group has just elected a new president. Alfredo Abate is bringing a level of enthusiasm and energy to the position which we’ve been missing for a long time. He’s trying to figure out how to restart the COUG and re-engage the professional community here – but he needs input and assistance from you! If you’re an administrator or developer anywhere near Chicago and you have Oracle software anywhere in your company, then please help Alfredo get the user group going! Here are a few specific things you can do:

  1. Send Alfredo an email saying congrats and offering suggestions for the COUG. You can find him on LinkedIn or the COUG site below.
  2. Join the LinkedIn group that Alfredo set up for the COUG.
  3. Sign up for a free account at the COUG site: chicago.oracle.ioug.org
  4. Complete the survey at the COUG website (must sign up for free account, then look for “survey” link in the top navigation bar). This will help Alfredo think about planning the next event.
Lunch Huddles

A few years ago, I was part of a group of Oracle database users from different companies in Chicago who started hanging out regularly for lunches downtown. It was never a big event but it was a lot of fun to get together and catch up regularly. However I stopped organizing the lunches after a job change back into travel consulting and the birth of our daughter. I live on the north side of the city, I worked from home when I wasn’t traveling, and I wasn’t able to make trips downtown anymore.

Ever since, I’ve missed hanging out with friends downtown and I’ve always wanted to do these group lunches again. Besides the fact that I really enjoy catching up with people, I think that face-to-face meetups really help strengthen our sense of community as a whole in Chicago.

So – after far too long – I started the lunches again last week.

Oracle DB Lunch Downtown

Oracle DB Lunch Downtown

But it’s improved – there are now lunches happening all over ChicagoLand!

Tomorrow: Deerfield
This wednesday: Des Plaines
Next week wednesday: Downtown

Coming soon: Naperville?

Please join us for a lunch sometime! I promise you’ll find it to be both beneficial and fun! And also, please join the group on meetup.com – then you’ll get reminders about upcoming lunches in Chicago.

Spread the Word

Even if you don’t live in Chicago, you can help me out with this – send a brief tweet or quick email to any Oracle professionals you know around Chicago and direct them to this blog post. I hope to see some new life in the Oracle professional community here. It won’t happen by accident.

Command Line Attachment to Oracle Support Service Request

Fri, 2014-02-07 13:55

For those who haven’t looked at this in awhile: these days, it’s dirt simple to attach a file to your SR directly from the server command line.

curl –T /path/to/attachment.tgz 
     –u "your.oracle.support.login@domain.com" 
     "https://transport.oracle.com/upload/issue/0-0000000000/"

Or to use a proxy server,

curl –T /path/to/attachment.tgz
     –u "your.oracle.support.login@domain.com"
     "https://transport.oracle.com/upload/issue/0-0000000000/"
     -px proxyserver:port
     -U proxyuser

There is lots of info on MOS (really old people call it metalink); doc 1547088.2 is a good place to start. There are some other ways to do this too. But really you can skip all that, you just need the single line above!

OEM CLI Commands for Bulk Property Changes

Tue, 2014-01-14 14:05

This will be a brief post, mostly so I can save this command somewhere besides the bash_history file on my OEM server. It may prove useful to a few others too… it has been absolutely essential for me on several occasions! (I was just using it again recently which reminded me to stick it in this blog post.) This is how you can make bulk property changes to a large group of targets in OEM:

(oracle)$ emcli login -username=jeremy
(oracle)$ emcli get_targets -noheader -script | sed \
  's/Metric Collection Error/MCE/;s/Under Blackout/Blackout/;s/Status Pending/Pending/' >targets

(oracle)$ less targets
(oracle)$ awk '{print$4" "$5" "$6" "$7" "$8"~"$3"~Department~default"}' targets >inp

  or... awk '{print$4" "$5" "$6" "$7" "$8"~"$3"~Line of Business~test"}' targets >inp
  or... awk '{print$4" "$5" "$6" "$7" "$8"~"$3"~Location~chicago"}' targets >inp
  or... awk '{print$4" "$5" "$6" "$7" "$8"~"$3"~LifeCycle Status~Production"}' targets >inp

(oracle)$ less inp
(oracle)$ emcli set_target_property_value -property_records=REC_FILE \
  -input_file=REC_FILE:inp -separator=property_records="\n" -subseparator=property_records=~

(oracle)$ emcli logout

Note that the property name is case-sensitive: “Lifecycle” won’t work but “LifeCycle” does. Also, the commands above are of course intended to be tinkered with. Use grep to filter out targets; search on any regular expression you can dream up.

This process is important here because we use Administration Groups to automatically propagate monitoring templates (with standardized metric thresholds for paging) to all of our OEM targets. There have been a number of times when I’ve needed to make bulk property changes and it takes a very long time to do that through the UI. These commands are much faster.

November/December Highlights

Thu, 2014-01-02 12:04

In the Oracle technical universe, it seems that the end of the calendar year is always eventful. First there’s OpenWorld: obviously significant for official announcements and insight into Oracle’s strategy. It’s also the week when many top engineers around the world meet up in San Francisco to catch up over beers – justifying hotel and flight expenses by preparing technical presentations of their most interesting and recent problems or projects. UKOUG and DOAG happen shortly after OpenWorld with a similar (but more European) impact – and December seems to mingle the domino effect of tweets and blog posts inspired by the conference social activity with holiday anticipation at work.

I avoided any conference trips this year but I still noticed the usual surge in interesting twitter and blog activity. It seems worthwhile to record a few highlights of the past two months as the year wraps up.

First, four new scripts that look very interesting:
1. Utility: getMOSpatch (doc: blog)- useful script for downloading a specific patch from MOS. I had tried something similar for RACattack back in 2011. This script written by Maris Elsins looks to be very good. I’ve downloaded this and read up on it but haven’t tried it out yet.
2. Perf: ashtop and ash_wait_chains (doc: blog 1, blog 2) – from the author of snapper, here are two more excellent tools for general performance troubleshooting and quickly pulling information from the ASH. The chains script is an especially brilliant idea – it reports from ASH using a hierarchical join on the blocking_session column. Check out Tanel’s blog posts for details and examples. I’ve used both of these scripts while troubleshooting production issues during the past month.
3. Perf/CPU: fulltime (doc: slides) – Linux specific utility to drill down into CPU-intensive processes. Similar to Tanel’s OStackProf but seems a bit more robust (runs server-side without the windows/vbscript dependencies, also brings cpu/kernel together with wait info in a single report). Rather than oradebug, this uses new lightweight linux kernel instrumentation (perf) to report a sample-based profile of what the Oracle kernel is doing by internal function. This was a collaborative effort by Craig Shallahamer and Frits Hoogland and there are several related articles on both blogs about how it works. I’ve downloaded this but haven’t tried it out yet.
4. Perf/Visualization: [Ora/Py] LatencyMap (doc: blog/sqlplus, blog/python) – very cool looking program which gives a heatmap visual representation of metrics such as I/O. I’m a huge fan of visualizations and use graphical tools daily as a DBA. Make sure to check out the recorded demo of this utility!

I love exploring utilities like these. It brings out my nerdy side a little, that’s why I mentioned them first… :) But there are a few other highlights that come to mind from the past few months too!

On the topic of utilities, I have been working with Tanel’s “tpt” script collection quite a bit during the course of my day-to-day job. I fired out a question this month to the oracle-l mailing list about other publicly posted script collections, and I got the impression that there just aren’t many script collections posted publicly! Here’s the list I came up with:

Script Collections:
Tanel Poder (tpt), see also E2SN for even more
Tim Hall (oracle-base)
Dan Morgan
Kerry Osborne (2010 Hotsos Presentation), see also various blog articles for many updated scripts
Tim Gorman
Jeff Hunter

I’ve also read lots of other interesting stuff this month. Three things I remember off the top of my head:
– In 2012, Yury Velikanov wrote up a presentation about oracle database backups. Last month, Maris Elsins made a few tweaks and delivered the updated presentation at UKOUG. The slide deck is a worthwhile read – everybody should be able to learn something from it. If you didn’t see it when Yury first released it last year then take a few minutes to check it out.
– I was interested to read Kellyn Pot’Vin‘s slides about Database as a Service (DBaaS). This is an area I’ve been working on a lot lately and it intersects with my Operationally Scalable Practices series of articles. She’s always got good content about OEM on her blog too – being a heavy OEM user these days, I tend to read what Kellyn’s writing.
Kyle Hailey recorded Christo Kytrovsky‘s excellent OakTable World talk about Oracle, Memory and Linux. Worth listening to sometime.

You may already be aware but I have to mention that RAC Attack has hit the accelerator lately! Through an international collaborative effort, the curriculum was updated to version 12c of the database before OpenWorld 2013 and this was followed by a rapid series of workshops. During the past three months, there have been four workshops in three different countries – and there are more coming on the calendar!

Finally, two quick “news” type mentions. First, I’ve personally tryed to avoid much of the “engineered systems” buzz (not sure why)… but I did notice the new exadata release this month. Second, oracle made an acquisition this year which was particularly significantly to me: a chicago-based company called BigMachines. You may not have heard of this company – but it happens to be mentioned on my LinkedIn profile.

These are a handful of interesting things I remember seeing over the past two months. Please leave me a comment and mention anything else that you noticed recently – I’m very interested to hear any additional highlights!

Readable Code for Modify_Snapshot_Settings

Mon, 2013-12-16 12:32

It annoyed me slightly that when I googled modify_snapshot_settings just now and all of the examples used huge numbers for the retention with (at best) a brief comment saying what the number meant. Here is a better example with slightly more readable code. Hope a few people down the road cut-and-paste from this article instead and the world gets a few more lines of readable code as a result. :)

On a side note, let me re-iterate the importance of increasing the AWR retention defaults. There are a few opinions about the perfect settings but everyone agrees that the defaults are a “lowest common denominator” suitable for demos on laptops but never for production servers. The values below are what I’m currently using.

BEGIN 
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
    retention => 105 * 24 * 60,   -- days * hr/day * min/hr  (result is in minutes)
    interval  => 15);             -- minutes
END;
/



SQL> select * from dba_hist_wr_control;
Pivoting output using Tom Kyte's printtab....
==============================
DBID                          : 3943732569
SNAP_INTERVAL                 : +00000 00:15:00.0
RETENTION                     : +00105 00:00:00.0
TOPNSQL                       : DEFAULT

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.10

Largest Tables Including Indexes and LOBs

Wed, 2013-12-04 16:04

Just a quick code snippit. I do a lot of data pumps to move schemas between different databases; for example taking a copy of a schema to an internal database to try to reproduce a problem. Some of these schemas have some very large tables. The large tables aren’t always needed to research a particular problem.

Here’s a quick bit of SQL to list the 20 largest tables by total size – including space used by indexes and LOBs. A quick search on google didn’t reveal anything similar so I just wrote something up myself. I’m pretty sure this is somewhat efficient; if there’s a better way to do it then let me know! I’m posting here so I can reference it in the future. :)

with segment_rollup as (
  select owner, table_name, owner segment_owner, table_name segment_name from dba_tables
    union all
  select table_owner, table_name, owner segment_owner, index_name segment_name from dba_indexes
    union all
  select owner, table_name, owner segment_owner, segment_name from dba_lobs
    union all
  select owner, table_name, owner segment_owner, index_name segment_name from dba_lobs
), ranked_tables as (
  select rank() over (order by sum(blocks) desc) rank, sum(blocks) blocks, r.owner, r.table_name
  from segment_rollup r, dba_segments s
  where s.owner=r.segment_owner and s.segment_name=r.segment_name
    and r.owner=upper('&schema_name')
  group by r.owner, r.table_name
)
select rank, round(blocks*8/1024) mb, table_name
from ranked_tables
where rank<=20;

The output looks like this:

Enter value for schema_name: someschema

      RANK         MB TABLE_NAME
---------- ---------- ------------------------------
         1      14095 REALLYBIGTABLE_USESLOBS
         2       6695 VERYBIG_MORELOBS
         3       5762 VERYLARGE
         4       5547 BIGBIG_LOTSOFINDEXES
         5        446 MORE_REASONABLE
         6        412 REASONABLE_TABLE_2
         7        377 ANOTHERONE
         8        296 TABLE1235
         9        280 ANOTHER_MADEUP_NAME
        10        141 I_CANT_POST_PROD_NAMES_HERE
        11         99 SMALLERTABLE
        12         92 NICESIZETABLE
        13         89 ILIKETHISTABLE
        14         62 DATATABLE
        15         53 NODATATABLE
        16         48 NOSQLTABLE
        17         30 HERES_ANOTHER_TABLE
        18         28 TINYTABLE
        19         24 ACTUALLY_THERES_400_MORE_TABLES
        20         19 GLAD_I_DIDNT_LIST_THEM_ALL

20 rows selected.

And just a quick reminder – the syntax to exclude a table from a data pump schema export is:

expdp ... exclude=SCHEMA_EXPORT/TABLE:[TABNAME],[TABNAME],...

Hope this is useful!

OSP #2c: Build a Standard Platform from the Bottom-Up

Mon, 2013-12-02 15:07

This is the fourth of twelve articles in a series called Operationally Scalable Practices. The first article gives an introduction and the second article contains a general overview. In short, this series suggests a comprehensive and cogent blueprint to best position organizations and DBAs for growth.

This article – building a standard platform – has been broken into three parts. We’ve already discussed standardization in general and looked in-depth at storage. Now it’s time to look in-depth at three more key decisions: CPU and memory and networking.

Consolidation

One of the key ideas of Operationally Scalable Practices is to start early with standards that don’t get in the way of consolidation. As you grow, consolidation will be increasingly important – saving both money and time. Before we dig into specifics of standardizing CPU and memory, we need to briefly discuss consolidation in general.

Consolidation can happen at many levels:

  1. Single schema and multiple customers
  2. Single database and multiple schemas or tenants (12c CDB)
  3. Single OS and multiple databases
  4. Single hardware and multiple OS’s (virtualization)

Two important points about this list. First, it works a lot like performance tuning: the biggest wins are always highest in the stack. If you want to save time and money then you should push to consolidate as high as possible, ideally in the application. But there are often forces pushing consolidation lower in the stack as well. For example:

  • Google doesn’t spin up new VMs every time a new customer signs up for Google Apps. Their existing webapp stack handles new customers. This is a great model – but if your app wasn’t designed this way from the beginning, it could require a massive development effort to add it.
  • It’s obvious but worth stating: you can only push consolidation up a homogenous stack. If the DB runs on linux and the app runs on windows then naturally they’ll each need their own VM. Same goes for the other three tiers.
  • Server operating systems have robust multiuser capabilities – but sharing an Operating System can still be tricky and these days virtualization offers a strong value proposition (especially when combined with automation). Then there are containers, which fall somewhere in between single OS and virtualization.
  • Security or regulatory or contractual requirements may require separate storage, separate databases or separate operating systems.
  • A requirement for independent failover may drive separate databases. In data guard, whole databases (or whole container databases) must be failed over as a single unit.

The second important point is that realistically you will encounter all four levels of consolidation at some point as you grow. Great standards accommodate them all.

CPU

In my opinion, batch workloads can vary but interactive workloads should always be CPU-bound (not I/O-bound). To put it another way: there are times when your database is mainly servicing some app where end-users are clicking around. At those times, your “top activity” graph in enterprise manager should primarily be green. Not blue, not red, not any other color. (And not too much of that green!) I’m not talking about reports, backups, or scheduled jobs – just the interactive application itself. (Ideally you even have some way to distinguish between different categories of activity, in which case there are ways to look at the profile of the interactive traffic even when there is other activity in the database!)

This leads into the question of how much CPU you need. I don’t have any hard and fast rules for CPU minimums in a standard configuration. Just two important thoughts:

  1. Maximum unit of consolidation: CPU is a major factor in how many applications can be consolidated on a single server. (Assuming that we’re talking about interactive applications with effective DB caching – these should be primarily CPU-bound.)
  2. Minimum unit of licensing: If partitioning or encryption becomes a requirement for you six months down the road then you’ll have to license the number of cores in one server. Oracle requires you to license all CPUs physically present in the server if any feature is used on that server.

The goal is to limit future purchasing to this configuration. And as with storage, if you really must have more than one configuration, then try to keep it down to two (like a high-CPU option).

Memory

I don’t have a formula to tell you how much memory you should standardize on either. It’s surprising how often SGAs are still poorly sized today – both too small and too large. You need to understand your own applications and their behavior. It’s worthwhile to spend some time reading sar or AWR reports and looking at historical activity graphs.

Once you start to get a rough idea what your typical workload looks like, I would simply suggest to round up as you make the final decision on standard total server memory capacity. There are two reasons for this:

  1. OS and database consolidation have higher memory requirements. Application and schema/multitenant consolidation will not be as demanding on memory – but as we pointed out earlier, your standards should support all levels of consolidation.
  2. You’re probably not maxing out the memory capacity of your server and it’s probably not that expensive to bump it up a little bit.
Consolidation Level Common Bottleneck Single Schema (Multiple Customers) CPU Multiple Schemas/PDBs CPU Multiple Databases Memory Multiple OS’s (VMs) Memory Networking

Small companies generally start with one network. But these days, networking can quickly get complicated even at small companies since network gear allows you to define and deploy multiple logical networks on the physical equipment. Early on, even if it doesn’t all seem relevant yet, I would recommend discussing these networking topics:

  • Current traffic: Are you gathering data on current network usage? Do you know how much bandwidth is used by various services, and how bursty those services are?
  • Logical segregation: Which network should be used for application traffic? What about monitoring traffic, backup traffic, replication traffic (e.g. data guard or goldengate) and operations traffic (kickstarts, data copies between environments, etc)? What about I/O traffic (e.g. NFS or iSCSI)? What is the growth strategy and how will this likely evolve over the coming years?
  • Physical connections: How many connections do we need, accounting for redundancy and isolation/performance requirements and any necessary physical network separation?
  • Clustering: Clustering generally require a dedicated private network and tons of IPs (on both the private cluster network and your corporate network). Sometimes it has higher bandwidth and latency requirements than usual. Generally it is recommended to deploy RAC on at least 10G ethernet for the interconnect. Is there a general strategy for how this will be addressed when the need arises?

It will benefit you greatly to take these discussions into consideration early and account for growth as you build your standard platform.

Slots

One design pattern that I’ve found to be helpful is the idea of slots. The basic idea is similar to physical PCI or DIMM slots – but these are logical “slots” which databases or VMs can use. This is a simplified, practical version of the service catalog concept borrowed from ITIL for private cloud architectures – and this can provide a strong basis if you grow or migrate to that point.

  1. Determine the smallest amount of memory which a standardized database (SGA) or VM will use. This will determine a slot size.
  2. Determine the largest amount of memory which can be allocated on the server. For databases, about 70% of server memory for SGA is a good starting point if it’s an interactive system. For VMs it’s possible to even allow more memory than is physically present but I don’t know the latest conventional wisdom about doing this.
  3. Choose additional DB or VM size options as even multiples of the minimum size.

For example, a database server containing 64GB of memory might have a slot size of 5GB with 9 total available slots. Anyone who wants a database can choose either a small or large database; a small database uses 1 slot and its SGA is 5GB. A large database uses 5 slots and its SGA is 25GB.

After the basic slot definition has been decided, CPU limits can be drafted. If the database server has 8 physical cores then the small database might have a hard limit of 2 CPUs and a large database might have a hard limit of 6 CPUs.

One area which can be confusing with CPU limits is factoring in processor threads. When determining your limits for a consolidation environment, make sure that individual applications are capped before pushing the total load over the physical number of CPUs. But allow the aggregate workload to approach the logical number of CPUs in a period of general heavy load coming from lots of applications.

In practice, that means:

  1. For multiple databases, set cpu_limit on each one low according to the physical count and calibrate the aggregate total against the logical count.
  2. For multiple schemas in a single database: use resource manager to limit CPU for each schema according to physical count and set cpu_count high according to logical count.

Slot Example

Now you have a first draft of memory and CPU definitions for a small and large database. The next step is to define the application workload limits for each database size. As you’re consolidating applications into a few databases, how many instances of your app can be allowed in a small and large database respectively?

Suppose you’re a SAAS company who hosts and manages lots of SAP databases for small businesses. I don’t actually know what the CPU or memory requirements of SAP are so I’m making these numbers up – but you might decide that a small database (5GB/2cpu) can support one SAP instance and a large database (25GB/6cpu) can support 25 instances (with PDBs).

Remember that schema/multi-tenant consolidation is very efficient – so you can service many more applications with less memory compared to multiple databases. For a starting point, make sure that the large database uses more than half of the slots then use server CPU capacity to determine how many app instances can be serviced by a large database.

Another observation is that your production system probably uses more CPU than your test and dev systems. You may be able to double or triple the app instance limits for non-production servers.

It’s an iterative process to find the right slot sizes and workload limits. But the payoff is huge: something that’s easy to draw on a whiteboard and explain to stakeholders. Your management has some concrete figures to work with when projecting hardware needs against potential growth. The bottom line is that you have flexible yet strong standards – which will enable rapid growth while easing management.

Example Slot Definitions Database Size Slots Max SAP Instances (Production Server) Max SAP Instances (Test Server) Small 1 1 2 Large 5 25 50
Standard Server:
- 8 Cores
- 64 GB Memory
- 9 Slots

nr_hugepages = 23552 (45 GB plus 1 GB extra)
Standard Database:

sga_target/max_size = [slots * 5] GB
pga_aggregate_target = [slots * 2] GB
cpu_count = [slots + 1]
processes = [slots * 400]