Doug Burns
10053 Trace Files - Different Plan in Different Environments
Rather than just describing the contents of the trace file, I thought it might be a good idea to tie the various sections into how they might help you solve Real WorldTM problems. Which might not be immediately obvious when the first example I use is the trace file for :-
SELECT * FROM DUAL;
But here it is. The first thing to note is that it's a 66KB file of over 2000 lines, even for something so trivial, which is just a taste of just how massive these files can be. It will also be environment and version-specific, as you'll see. Such is the nature of low-level trace files.
Going through the initial sections at a very high level, we have ....
Lines 1-20 - The standard type of trace file pre-amble that you might have seen in other trace files including
- The trace file name
- Instance and version information
- Host and O/S information
- Session Instrumentation of the type set by calls to DBMS_APPLICATION_INFO
Then we get to the first 10053-specific information (lines 22-27) which registers the various Query Blocks in this query. Understandably there is only one entry here in the QUERY BLOCK SIGNATURE section, were Oracle automatically names the query block to SEL$1
Line 28 is a note from SQL Plan Management highlighting that this statement does not already exist in the SQL Management Base.
Lines 29-32 contain a note that 11g Auto-DOP is disabled and at this point hopefully you'll start to see that if you already have a reasonable understanding of the CBO and related features, the trace file is actually pretty descriptive and verbose. From memory, I'm not sure it was always as easy to read.
For more information on Predicate Move-Around (as mentioned in lines 33-35), this 1994 VLDB paper is worth a look. Of course, when your statement is SELECT * FROM DUAL, there aren't exactly a lot of predicates to move around! ![]()
Next we go into a long section describing OPTIMIZER INFORMATION gathered from a variety of sources.
Line 40 shows the SQL statement and dont underestimate how important this is as further confirmation that you're looking at the right tracefile.
Lines 42-94 are a very handy Legend that lists the abbreviations that are used in the trace file. Some of these might have been guessable but, with so many terms used, it's great that you don't have to guess any more.
Lines 95-419 are a section that I often find very handy in solving issues with bad plans in two different database environments. The classic case of a developer telling me that it runs fine in Test but not in Production. The developer might send along the two plans and, even with a couple of good SQL Monitoring reports or the output of DBMS_XPLAN, that doesn't really tell me why the two plans are different, just that they are different. Working in an environment with multiple Dev, Test and Prod environments, it's not unusual to find that there is some drift in the instance configurations or someone has different session parameters set. It's a quick job to just open up the two trace files in a visual diff tool and make absolutely sure that the parameters the optimizer references (and you'll see just how many there are these days!) are truly identical.
It's just a small tip, but you'd be surprised by the number of issues that's helped me identify!
10053 Trace Files - Getting Started
Essentially, setting event 10053 causes the Cost Based Optimizer to write information to a trace file describing the information it is using and the results of it's calculations whilst walking through the decision-making process to determine the best execution plan. It includes the options that it has considered and discarded, those that it has accepted and options which are unavailable for various reasons. Because the decision making process is detailed and extensive, the files tend to be large for all but the most trivial statements and I'd challenge most people to read and understand an entire 10053 trace file! However, you are often looking for the reason for a particular bad decision, which helps to narrow the scope and, personally, I've found recent versions of 10053 trace files much more verbose and readable.
The best and most detailed reference I've seen is Wolfgang Breitling's paper 'A Look under the Hood of the CBO' although it was written many years ago and, as with all low-level undocumented Oracle information, things change frequently and without warning. At the start of the paper he describes how to set event 10053 to generate the related trace file in the instance's user dump destination. However, since Oracle 11g, there is a more flexible way to generate the trace file that Maria Colgan describes in a couple of posts on the Optimizer Development Groups blog here and here. The latter post is particularly interesting because that approach automatically triggers a hard parse of the statement in order to generate the trace file.
Which solves what I suspect is one of the most confusing aspects of generating 10053 trace files when you're first getting used to it. The statement needs to be hard-parsed to ensure that the trace file will be generated. It might help you to remember what it is that is being traced - the CBO making it's decisions as it chooses an optimal execution plan. So, if the plan has already been generated then no trace file! One simple technique to get around this prior to 11g is to add a new comment to the statement to force a hard parse. Remember that if you find that the trace file is not being produced as expected, that might be the reason.
In the next post, I'll start to look at the contents of a 10053 trace file.
10053 Trace Files
I remember a while ago that Neil Chandler wrote a blog post about why you probably don't need 10046 trace files that I didn't completely agree with and I kept thinking I must comment on it or write a post from a different perspective. Neil's a mate, a London Oracle Beers regular and I understood where he was coming from in that post. He makes some excellent points, so I recommend you read it. But I don't think it's so hard to get a trace enabled these days as people think. You don't necessarily need to change a line of code with the availability of the DBMS_MONITOR package and (although I accept this is unusual) the client I've worked for most recently makes trace file access very easy for developers. We also have ASH, AWR and SQL Monitoring though and I tend to agree with Neil that the times when 10046 trace files are truly required to solve problems are limited. Now I read this back, it sounds like I agree with him, but I know that I use 10046 trace files more regularly than the post suggests. I would say 10046 trace files are incredibly useful sometimes. In fact, when they are useful they are the only correct tool for the job.
But the real reason for referencing Neil's post is this statement.
"I have never used a 10053 trace on a Production system. I have simply never needed to know the decisions taken by the optimizer in that much detail."
I've had this discussion with most well-known experts in the Oracle community at some point or other and there's general agreement that there's no need to bother with those pesky and ridiculously geeky 10053 trace files most of the time. Most SQL performance problems simply aren't that complicated. I've probably agreed whenever the argument has come up but the fact is that I increasingly find myself using them and it worries me a little that 'most of the time you don't need them - concentrate on the basics' is heard as - 'don't ever bother looking at them - they won't help' and I've been tempted to redress the balance for a while.
But something stopped me. 10053 trace files are exceptionally long and contain a lot of information that I wouldn't begin to pretend I understand and so if you can't cover a subject properly and it is quite a technical subject then, not only are you in danger of doing people a disservice but you are also opening yourself up to all sorts of challenges, corrections and debates. But, hey, that's what a community and what blogging is all about - people learning from each other. I'm also not a big fan of any technical writing which is about how clever the writer is above it actually being useful! The more I know, the more I find myself avoiding the wilfully geeky stuff.
I still wasn't sure but at least one person at my current client has been badgering me about this for ages so I've decided, what the hell, I'm going to write a few blog posts about the things I find useful about 10053 trace files and hopefully give some very high level hints about how you might use them too.
If it's not technical enough for some people then tough and if I go astray, there are plenty of people out there who probably know the subject much better who can keep me straight
Enough of the intro, but let me finish by saying what I think is the most important reason to use 10053 trace files. There are many posts about inaccurate row source cardinalities leading to bad plans (This is one of my favourites). But I believe that it's actually only the experts who can compare E-ROWS and A-ROWS and make educated guesses about why the cardinalities are wrong. (To give you an example, many is the time that Jonathan Lewis has said to me - 'Oh, that looks like a classic Optimizer 5% guess' - or words to that effect, anyway.) But most of us can't just 'see' those things when we look at a plan. At best, a 10053 trace file offers the possibility of knowing why the CBO picked the wrong plan.
Not all Deadlocks are created the same
I've blogged about deadlocks in Oracle at least once before. I said then that although the following message in deadlock trace files is usually true, it isn't always.
The following deadlock is not an Oracle error. Deadlocks of
this type can be expected if certain SQL statements are
issued. The following information may aid in determining the
cause of the deadlock.
So when I came across another example recently, it seemed worth a quick blog post. Not least for the benefit of other souls who hit the same issue (and probably hit Google moments later).
But while it's easy to say - "Hey! Look! I found an exception! Aren't I clever?" - it occurred to me that actually Oracle's capabilities in this area might be underrated by raising the occasional anomaly. Because the truth is
1) In most cases, deadlock errors are down to the way you've written your application or some documented restriction in Oracle. The type of restrictions that you're more likely to hit if you're handling high degrees of concurrency with lots of DDL, parallel query, partition management and the like. Such activities often have unusually restrictive locking requirements and most locking issues can be turned into deadlock issues quite easily if you have a few sessions running concurrently.
2) It's still the case that Oracle will handle the deadlock situation, at least to the extent of rolling back one of the statements causing the issue. (Although, whilst writing this post, I noticed that Jonathan Lewis raised the question of what exactly people mean when they suggest that Oracle resolves deadlock issues.
3) Deadlock trace files are typically very useful and not the most difficult to read. Yes, they tend to use Oracle kernel terminology (not surprising) but I'd wager that most people could have a rough idea of the root cause with some initial analysis and could have a very detailed idea, given more time. Even if you can't decipher the things yourself, it gives Oracle Support detailed information to help root cause analysis.
So, to the particular issue we hit. Towards the end of a data loading process that loads around a billion rows in a short period of time (30/60 minutes that also includes a bunch of surrounding activities), we would hit the occasional deadlock error. Fortunately, the site I'm working at just now has a very enlightened policy towards developer access to the alert log and trace files, so I can do my own initial investigation. On digging out the relevant deadlock trace file, it looked like this (some details changed)
Trace file /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_dia0_1627682.trc Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /app/ora/local/product/11.2.0.3/db_1 System name: Linux Node name: prod_server.ldn.orcldoug.com Release: 2.6.32-220.13.1.el6.x86_64 Version: #1 SMP Thu Mar 29 11:46:40 EDT 2012 Machine: x86_64 Instance name: PRD Redo thread mounted by this instance: 1 Oracle process number: 8 Unix process pid: 1627682, image: oracle@prod_server.ldn.orcldoug.com (DIA0) *** 2013-01-16 09:09:10.925 *** SESSION ID:(201.1) 2013-01-16 09:09:10.925 *** CLIENT ID:() 2013-01-16 09:09:10.925 *** SERVICE NAME:(SYS$BACKGROUND) 2013-01-16 09:09:10.925 *** MODULE NAME:() 2013-01-16 09:09:10.925 *** ACTION NAME:() 2013-01-16 09:09:10.925 ------------------------------------------------------------------------------- DEADLOCK DETECTED (id=0xd0102292) Chain Signature: 'library cache lock'<='row cache lock' (cycle) Chain Signature Hash: 0x52a8007d The following deadlock is not an Oracle error. Deadlocks of this type can be expected if certain SQL statements are issued. The following information may aid in determining the cause of the deadlock. Resolving deadlock by signaling ORA-00060 to 'instance: 1, os id: 3443329, session id: 161' dump location: /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_ora_3443329.trc Performing diagnostic dump on 'instance: 1, os id: 3443222, session id: 779' dump location: /app/ora/local/admin/PRD/diag/rdbms/PRD_prod_server/PRD/trace/PRD_ora_3443222.trc ------------------------------------------------------------------------------- Oracle session identified by: { instance: 1 (PRD_prod_server.PRD) os id: 3443222 process id: 127, oracle@prod_server.ldn.orcldoug.com session id: 779 session serial #: 605 } is waiting for 'row cache lock' with wait info: { p1: 'cache id'=0x8 p2: 'mode'=0x0 p3: 'request'=0x5 time in wait: 1 min 58 sec timeout after: never wait id: 1655 blocking: 1 session current sql: Begin run_manager_pkg.finalize_all_values_prc(:v0); End; wait history: * time between current wait and wait #1: 0.002176 sec 1. event: 'enq: PS - contention' time waited: 0.000082 sec wait id: 1654 p1: 'name|mode'=0x50530006 p2: 'instance'=0x1 p3: 'slave ID'=0x2f * time between wait #1 and #2: 0.000013 sec 2. event: 'PX Deq: Slave Session Stats' time waited: 0.000001 sec wait id: 1653 p1: 'sleeptime/senderid'=0x0 p2: 'passes'=0x0 * time between wait #2 and #3: 0.000001 sec 3. event: 'PX Deq: Slave Session Stats' time waited: 0.000002 sec wait id: 1652 p1: 'sleeptime/senderid'=0x0 p2: 'passes'=0x0 } and is blocked by => Oracle session identified by: { instance: 1 (PRD_prod_server.PRD) os id: 3443329 process id: 134, oracle@prod_server.ldn.orcldoug.com session id: 161 session serial #: 247 } which is waiting for 'library cache lock' with wait info: { p1: 'handle address'=0x101f8eac98 p2: 'lock address'=0xfdef83738 p3: '100*mode+namespace'=0x10f2000010003 time in wait: 1.739719 sec timeout after: never wait id: 508 blocking: 1 session current sql: ALTER INDEX "DOUG"."VALUE_PK" REBUILD PARTITION "SYS_P4089"
wait history: * time between current wait and wait #1: 0.000973 sec 1. event: 'enq: CR - block range reuse ckpt' time waited: 0.003220 sec wait id: 507 p1: 'name|mode'=0x43520006 p2: '2'=0x10086 p3: '0'=0x1 * time between wait #1 and #2: 0.000008 sec 2. event: 'reliable message' time waited: 0.000107 sec wait id: 506 p1: 'channel context'=0x101c5afa98 p2: 'channel handle'=0x101c0f2260 p3: 'broadcast message'=0x101b5cfd58 * time between wait #2 and #3: 0.003791 sec 3. event: 'db file sequential read' time waited: 0.000321 sec wait id: 505 p1: 'file#'=0x5e p2: 'block#'=0x118784 p3: 'blocks'=0x1 } and is blocked by the session at the start of the chain.
I would hope that a few things would be immediately obvious, particularly if it's 'your' application that generated this issue
1) The two sessions are running the following parts of the application.
Session 1: Begin run_manager_pkg.finalize_all_values_prc(:v0); End;
Session 2: ALTER INDEX "DOUG"."VALUE_PK" REBUILD PARTITION "SYS_P4089"
Which happened to fit in with what we were seeing. We have two concurrent runs which perform similar actions using different input files that load into different partitions.
2) The first session is using Parallel Query (note the enq: PS - contention and different PX Deq wait events)
3) The deadlock is a little unusual because it's not between two transactions trying to lock database objects or rows being locked by the other session but between in-memory structures. One session is waiting on 'row cache lock' and the other is waiting on
'library cache lock', as opposed to waiting for specific row or
table-level locks. This is also visible from the chain signature at the start of the trace file.
Chain Signature: 'library cache lock'<='row cache lock' (cycle)
Armed with 2) and 3) in particular, my next step was to go to My Oracle Support, as usual. I find that Google isn't too great with issues like this because some of them are quite specific and might not be affecting too many others. Sure enough, a search turned up :-
Bug 14356507 Deadlock between partition maintenance and parallel query operations
Which is confirmed as affecting versions 11.2.0.2 and 11.2.0.3. The fix is in Bundle Patch 12 for Exadata, in Oracle 12.1 and is also available as one-off patch that we're in the process of applying to different environments.
The issue is that "When a parallel query is hard parsed, first QC hard parses the query and then all the slaves. When a partition maintenance operation (DDL) comes in between the hard parses of QC and Slaves.", then you can hit the deadlock. There's more detail in the bug notes, but it's worth noting this phrase "This is basically a timing issue, in high concurrency environments.", which means it only affects us very intermittently and is a nightmare to prove we've eliminated without a lot of testing.
What I find a little disconcerting is that there seem to be quite a few of these library cache deadlock issues kicking around in recent versions that I haven't been used to seeing in prior versions. Given some of the library cache madness I've seen in my few years with 11g, I do wonder what on earth they've done to it!
TOAD Ambulance Button (and a little Google)
I've recently spent quite a while working in a performance team that supported Production performance incidents and worked with developers on application performance before the code hit Production. Whenever a developer asked me to look at the execution plan for their poorly performing statement, they would send over some evidence of the poor plan and in the majority of cases I found myself saying - "You didn't produce this using that damn Ambulance Button in TOAD, did you?"
Even since I've moved on to a new role, the last time I found myself saying that was only last week to an experienced Development Lead who knows Oracle pretty well for a Java lady. I then went on to show her DBMS_XPLAN and SQL Monitoring.
So when I was presenting on SQL Monitoring last week, I gave my usual minor rant about the Ambulance Button but I didn't realise until the end of the presentation and was looking over the online questions, that I appear to have upset one of the webinar attendees, who had subsequently left so I was unable to answer directly. That's why I've decided to blog about just exactly what it is I hate about that damn button, or at least the way people use it.
(Immediately after the presentation, a friend and colleague who knows TOAD inside out pointed out that TOAD doesn't even have an Ambulance Button any more! In version 11.6, there's been a major visual redesign and Explain Plan is now accessed using a button that looks like it has a little tree-like execution plan on it. Clearly my knowledge is out of date, but I need to balance that against the fact that at my current site, the vast majority of users are still using version 10.something.)
The webinar questioner raised a more important question though. Apparently you can configure the Ambulance Button so that it shows that actual plan used when the statement was executed, rather than using EXPLAIN PLAN to generate a projected plan. Because therein lies my main issue with the button. It effectively runs EXPLAIN PLAN and shows you the output. Experienced people probably understand that EXPLAIN PLAN can lie sometimes (sometimes is enough, though) and realise the limitations of not being able to compare Expected and Actual Cardinalities for the different row sources. If you use it and are aware of the limitations, then fair enough, I suppose (although it still wouldn't be my preference). But a lot of TOAD users are not necessarily Oracle experts and the Ambulance Button gives them easy access to enough information to start making assumptions, without the background knowledge to question those assumptions. In the end, I have spent far too much of my time having to explain to people why what they think they're seeing is not what is actually happening and that frustrates the hell out of me.
Maybe it's unfair for me to criticise a tool when it's really the way that people use it? Well, in the end, whether it's the tool or the people, the combination of the two is what sucks up time and if I can show people a better tool and get them to use that instead, then everyone's a winner, right?
In a similar vein, I often level the accusation at Google (the search engine) that it is the enemy of good software development. Why? Because it feels like the world is increasingly populated with people who think you can learn anything from Google and that seems to be the sum total of their training as a programmer - being a bright person and then being let lose on systems with Google at hand to help them when they get stuck. Am I really the only person who doesn't think of this as a good thing? Am I being a complete dinosaur to think that occasionally people might want to have a little formal training and mentoring first? Why should you be able to learn PL/SQL from Google any more than you could learn brain surgery or flying an aeroplane or ... You get my drift.
I should be clear on one important thing. I use TOAD every single day and have relatively few issues with it as a development tool, and have stuck with it, despite trying many others. I was a very early adopter! It does what I need it to do and it does it well. But when it comes to performance analysis, I hardly ever use it (other than to run queries against ASH data and the like). There is already an excellent execution plan generation tool in DBMS_XPLAN and, if you're lucky enough to have Diagnostics and Tuning Pack, then you have the OEM Performance Pages, Real-Time SQL Monitoring and lots of other cool graphical goodness. Even if you don't, you always have Extended Tracing and the like.
I just think they're superior tools and the people I've met who use them or who I've shown how to use them tend to make fewer inaccurate assertions about performance issues.
However, I was troubled that I'd made an inaccurate statement about TOAD (at the bare minimum) because that is a bad thing to do during a presentation, so I went to track down how I configure the Ambulance Button to show me the actual plan used, as the questioner suggested you can. I'm damned if I can work out how to, though, so if someone knows, I would appreciate an explanation in the comments below. Please don't tell me it's an 11 feature though because that isn't what most of the people who hassle me are using! Then again, I suppose I should tell them to upgrade to 11. In the meantime, the degree of annoyance that the questioner showed suggests to me that I have completely missed something really valuable that I should know about!
Presentation Updates
The first is later today and part of the ongoing series of excellent webinars hosted by the good people at Red Gate Software. This one will be on SQL Monitoring - by far the most interesting new performance analysis feature in Oracle 11g. I've presented on SQL Mon many times now so, as well as covering the basics and some of the most important features for me personally, I'll try to cover a few of the quirks and annoyances as well. It's such a practical feature that I use almost every working day that I think everyone should know more about it. It's at 4pm UK time on Wednesday and you can register here. Particularly for my Australian friends ;), Red Gate always records the sessions and the video usually appears very soon afterwards.
In April I'll be attending two of my favourite conferences in the same week - the Finnish and Norwegian Oracle User Group seminars, on their respective boats! I'm writing two new presentations for these.
The first is based on the client work I've been doing over the past 6 months or so and is called 'Fast ETL processes using Native Oracle Features'.
This is the type of presentation I haven't done for a while and should be clearly labelled Case Study or User Experiences and I'm a firm believer that its these presentations that really make a User Group conference, not just the deeply techie stuff. It will definitely be slightly technical but will be based on real work, real design requirements, real compromises, real organisational challenges (you get the idea), rather than experimentation. Just to be clear, I still like the latter, but we need some balance here. I'm also quite proud of some of the things the team have achieved in a short period of time, despite those challenges, so want to crow about it a little ;), particularly as I think that other people might be able to pick up on some of the ideas, even as a reminder of just how well Oracle can work when you use it well.
The second is about a subject dear to my heart - ASH Analytics. Although I've done the 'Pretty Pictures' version, (and I still think that is utterly compelling), I'd also like to do a slightly more technical version.
I'm not sure many of you will make it on to two boat cruises in the space of a week (although I'm sure you would be more than welcomed) but maybe I (won't) see some of you later today ...
That is just *weird*
It's not something I do often (... and, yes, I know dates on my blog posts would be nice. It isn't like people haven't told me. I've just had other stuff to do and that one is filed under 'Migrate to WordPress'. But that isn't the point here.)
Regardless of the lack of clear dates on posts, I can inform you reliably that my very first post was on 10th February, 2005. I don't know what possessed me to try blogging again today, but it's precisely 8 years to the day since I started .... (As usual, it's hardly mind-blowing or informative


