Feed aggregator

Troubleshooting done to make root.sh work after a 10gR2 CRS installation on HP-UX PA-RISC 64-bit OS

Gaurav Verma - Mon, 2008-03-10 08:48

Preface This posting describes the troubleshooting done to make root.sh work after a 10gR2 CRS installation on HP-UX PA-RISC 64-bit OS. We were upgrading from 9i to 10gR2.
The problem Running root.sh failed while upgrading the Oracle Cluster Registry:
(root@testdb):/stage11i/software# /u01/oracle/testdb/CRS/root.sh
WARNING: directory '/u01/oracle/testdb' is not owned by root
WARNING: directory '/u01/oracle' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured
Checking to see if any 9i GSD is up

Setting the permissions on OCR backup directory
Setting up NS directories
PROT-3: Failed to create export file
Failed to upgrade Oracle Cluster Registry configuration


We had to create the sub-directory "crs" under /u01/oracle/testdb/CRS/cdata/crs10g and re-run root.sh; root.sh expects to find a subdirectory there for the previous RAC cluster's name.
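A minimal sketch of that fix (run as root; the paths come from the output above, and re-running root.sh afterwards is the whole point):

# mkdir -p /u01/oracle/testdb/CRS/cdata/crs10g/crs
# /u01/oracle/testdb/CRS/root.sh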

More trouble.. There were some more errors while running root.sh:
Format of 1 voting devices complete.

/u01/oracle/testdb/CRS/install/rootconfig[938]: /sbin/init.d/init.crs: Execute permission denied.
/u01/oracle/testdb/CRS/install/rootconfig[941]: /sbin/init.d/init.cssd: Execute permission denied.
CRS daemons not set to start. See /var/adm/messages for details.

(root@testdb):/u01/oracle/testdb/CRS/cdata/crs10g# echo $?


As per SR 6807470.992, we performed the following action plan.

Log in as root and set the following permissions:

# cd /sbin/init.d
# chmod 755 init.crs
# chmod 755 init.crsd
# chmod 755 init.cssd
# chmod 555 init.evmd

Then verify that the permissions are correct:

(root@testdb):/sbin/init.d# ls -rlt init*
-rwxr-xr-x 1 root sys 2226 Mar 3 18:32 init.crs
-rwxr-xr-x 1 root sys 4845 Mar 3 18:32 init.crsd
-rwxr-xr-x 1 root sys 36798 Mar 3 18:32 init.cssd
-rwxr-xr-x 1 root sys 3185 Mar 3 18:32 init.evmd

When we tried to execute root.sh again, we got the following warnings:

(root@testdb):/stage11i/software# /u01/oracle/testdb/CRS/root.sh
WARNING: directory '/u01/oracle/testdb' is not owned by root
WARNING: directory '/u01/oracle' is not owned by root
WARNING: directory '/u01' is not owned by root
Checking to see if Oracle CRS stack is already configured

At this point, we edited the file /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal from disable to enable on both nodes (there is only one line in this file).

(root@testdb):more /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal   
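The same change can be scripted on each node roughly like this (a sketch; the oratest1 part of the path is node-specific, and taking a copy first is our own precaution, not part of the SR action plan):

# cp /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal.orig
# echo enable > /var/opt/oracle/scls_scr/testdb/oratest1/cssfatal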

We also set permissions as follows:

(root@testdb):/var/opt/oracle# chmod -R 777 /var/opt/oracle/scls_scr

Then we started the CRS stack:

# $ORA_CRS_HOME/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
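To confirm the stack really came up, the standard 10gR2 checks can be run afterwards (a sketch):

# $ORA_CRS_HOME/bin/crsctl check crs
# $ORA_CRS_HOME/bin/crs_stat -t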

Backups, cheaper and faster?

Hampus Linden - Sun, 2008-03-09 18:21
When did backups become easy and affordable?
I've been sort of slacking off in the backup game in the last year or so; our backups have been managed by other parts of the company and some backups have been outsourced to BSPs. I recently specced and deployed a basic Netbackup solution for our office servers at work and was amazed at how extremely simple the whole thing was. The first backup system I worked with was Legato Networker (now owned by EMC), and while it was at the time a pretty great product, it was quite difficult to manage: lots of things needed to be scripted, and setting up a new tape library required quite a lot of tweaking.
Secondly, the budget was at first glance fairly limited, but when we specced up what we needed and got a couple of quotes in, I was amazed at how much we actually got for our money.
I wanted basic disk staging then duplication to tape. The obvious choice for tape was LTO4, we get pretty good price / performance, market leading capacity and it's a solid standard.
Most suppliers actually quoted the same tape library, just branded by different vendors: HP, Dell, IBM. All were IBM 3573 tape libraries; it's an excellent and affordable library. Dell came in best in price as usual. We opted for SAS attached. The backup server, my first choice server these days, is the Dell PowerEdge 2950; I buy it for pretty much 75% of all requirements. It was specced out with a pair of quad-core 1.86GHz processors, 4GB RAM, 6x500GB SATA drives (in RAID5) for the disk staging pool and a SAS5 card to attach the library.
Software-wise, the server is running CentOS 5.1 and Symantec/Veritas Netbackup 6.5 “Linux Starter Edition (LSE)”. The LSE is a pretty good deal: you get a server license, a tape library license and 5 client licenses (with bare metal, open file etc.) for a fairly low cost.
Installing and configuring Netbackup was as easy as it gets; the tape library was detected on boot, with no sysctl.conf changes needed. I stepped through the Netbackup installation process, added the tape library (less than 5 clicks!!), defined a couple of tape label policies, created a 2TB disk staging pool with my new tape group as the target, and just added water (ehm, clients actually).

The total cost was under £13,000 (about $22,000) and installation time was less than one day. Included were:

  • Veritas Netbackup 6.5 Linux starter edition
  • Netbackup clients 5-pack

  • Netbackup Exchange agent

  • Dell PowerEdge 2950 server

  • Dell PowerVault TL2000 library w/ one LTO4 SAS drive

  • 25x LTO4 tapes

  • 3x LTO cleaning tapes

  • CentOS Linux 5.1 (free)

Job Vs Position

RameshKumar Shanmugam - Sun, 2008-03-09 16:23
As a functional consultant, the first thing we should decide before we can design the solution for a customer is whether the system is going to be a Single Business Group or a Multi Business Group setup.

The second important thing we need to decide is whether the customer is going to use Jobs or Positions.

When we put this question to the customer, the first question we can expect from the customer's side is: what is the difference between a Job and a Position?

The content in this blog is more of my own view and the simplistic approach I always like; review the documentation before you decide on the approach you want to take.

To explain it at a very high level:
Jobs are a generic title or role within a Business Group, independent of any single organization. They are required, and are usually more specific if Positions are not used.

Positions are a specific occurrence of one Job, fixed within an organization. They are not required.

If you are under US legislation, your Job will drive your FLSA and EEO reporting. Personally I feel maintaining Positions is hard in an unstructured organization; a Position Hierarchy suits universities, colleges, schools and government organizations.

Maintenance is higher with a Position Hierarchy than with Jobs. If your customer feels they need less maintenance activity, then you should recommend Jobs, not Positions.
Categories: APPS Blogs

Horrible practice of storing users passwords

Pawel Barut - Sun, 2008-03-09 13:24
Written by Paweł Barut
Today I read a terrifying article about the unethical storage of user passwords. It is a common practice that user passwords are stored unencrypted in a database, but this case is even more dreadful. It is about a piece of software, G-Archiver, that is available to everyone. If you have ever used this software, you should change your Gmail password right now.
So what is the problem:
  • G-Archiver has a built-in user name and password for one of the author's Gmail accounts.
  • Whenever someone uses G-Archiver and provides his Gmail credentials, the user name and password are sent to the author of this software.
  • Additionally, anyone who finds the user name and password stored in G-Archiver can get the passwords of thousands of previous users.
Go and read it. I hope there is not too much software that works like this one. It is really terrifying that some people are so irresponsible. It could be a well-planned action to collect Gmail user names and passwords, or it could be just a lack of imagination about the consequences. I would like to believe it was the second one.

But big corporations do not care enough about security either. It is a common practice that user passwords are stored internally as plain text. You can easily find this out by using the "I forgot my password" feature: if in return you get an email with your original password, then it is a crappy web site, and you should never reuse a password given to such a web site, as this is a big risk for you. Take a look at Password Security: It’s Not That Hard (But You Still Can’t Get It Right) for further explanation and examples.

Cheers Paweł

Related Articles on Paweł Barut blog:
Categories: Development

Troubleshooting case study for 9i RAC ..PRKC-1021 : Problem in the clusterware

Gaurav Verma - Sat, 2008-03-08 09:20

Preface This is a troubleshooting case study for a 9i RAC environment on HP-UX PA-RISC 64-bit, which was using the (third-party) HP ServiceGuard solution for cluster maintenance.
The problem The problem was that the gsdctl and srvctl commands were giving the following errors:

(oracle@rac2):  gsdctl stat
PRKC-1021 : Problem in the clusterware
Failed to get list of active nodes from clusterware

(oracle@rac1):  srvctl config
PRKC-1021 : Problem in the clusterware

(oracle@rac1):  pwd
(oracle@rac1):  ./lsnodes

Troubleshooting Approach We followed Note 178683.1 to edit $ORACLE_HOME/bin/gsd.sh as shown below and traced gsd.sh:

# added -DTRACING.ENABLED=true -DTRACING.LEVEL=2 before -classpath in the line ending with:
... -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME

$ $ORACLE_HOME/bin/gsd.sh
[main] [13:15:42:852] [line# N/A]  my property portnum=null
[main] [13:15:42:942] [NativeSystem.<init>:123]  Detected Cluster
[main] [13:15:42:946] [NativeSystem.<init>:124]  Cluster existence = true
[main] [13:15:42:947] [UnixSystem.<init>:118]  Going to load SRVM library
[main] [13:15:42:950] [UnixSystem.<init>:118]  loaded libraries
[main] [13:15:42:950] [OPSMDaemon.main:726]  Initializing the daemon ...
[main] [13:15:42:951] [OPSMDaemon.<init>:188]  NativeDebug is set to true
[main] [13:15:42:952] [OPSMDaemon.<init>:188]  UnixSystem.initializeDaemon: groupName is opsm
[main] [13:15:42:953] [OPSMDaemon.<init>:188]  Unsatisfied Link Error caught. Could not initialize the cluster
[main] [13:15:42:954] [OPSMDaemon.main:726]  initializeDaemon status = false
[main] [13:15:42:955] [OPSMDaemon.main:726]  Failed to initialize and register with clusterware
[main] [13:15:42:955] [OPSMDaemon.main:726]  OPSMErrCode = 1003
[main] [13:15:42:958] [OPSMDaemon.main:729]  java.rmi.RemoteException: Unable to initialize with the clusterware
java.rmi.RemoteException: Unable to initialize with the clusterware
        at oracle.ops.mgmt.daemon.OPSMDaemon.<init>(OPSMDaemon.java:195)
        at oracle.ops.mgmt.daemon.OPSMDaemon.main(OPSMDaemon.java:726)

[main] [13:15:42:958] [line# N/A]  Exiting from main..no point trying to start the daemon

At this point, one option was to initialize the srvconfig raw device (OCR) and then add the RAC instances etc. manually using the srvctl add commands:

(oracle@rac1):  srvconfig -init
java.lang.UnsatisfiedLinkError: readRawObject
        at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
        at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
Exception in thread "main" (oracle@rac1): 

(oracle@rac1):  srvconfig  -version  
java.lang.UnsatisfiedLinkError: readRawObject
        at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
        at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)

If the config file is pointing to a raw device the following type of output should be returned:

     $ raw device version ""

Since we were not getting that output, there was either a problem in accessing the OCR raw device or the soft link was not working due to a permissions issue. Outputting the contents of the OCR on the standard output using $ dd if=/dev/orar1/rrawuat.conf bs=1500 showed that the OCR device was readable all right.
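A sketch of that readability check, run as the oracle user so it also exercises the permissions on the device; the count and the pipe through strings are our additions to keep the output small and printable:

$ dd if=/dev/orar1/rrawuat.conf bs=1500 count=10 | strings | head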

$  more /var/opt/oracle/srvConfig.loc

$ ls -l /dev/orar1/rrawuat.conf
crw-rw----   1 oracle    dba         64 0x110004 Apr 11  2007 /dev/orar1/rrawuat.conf

Then one idea was to relink the srv* binaries using the make command, but that also resulted in an error:

(oracle@rac2):  cd $ORACLE_HOME/srvm/lib  
(oracle@rac2):  make -f ins_srvm.mk install
nm -gp /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a | grep T  | grep Java | awk '{ print "-u " $3 }' >  /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def;
/u01/oracle/uatdb/9.2/bin/echodo ld +s -G -b -o libsrvm.sl -c
/u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a                           -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/  -L/usr/lib -lc  -lclntsh -lwtc9 -lnls9  -lcore9 -lnls9 -lcore9  -lnls9 -lxml9 -lcore9
-lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm `cat /u01/oracle/uatdb/9.2/lib/sysliblist` ;
rm /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def
ld +s -G -b -o libsrvm.sl -c /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/
-L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm -l:libcl.sl -l:librt.sl -lpthread -l:libnss_dns.1 -l:libdld.sl
ld: Mismatched ABI (not an ELF file) for -lclntsh, found /u01/oracle/uatdb/9.2/lib32//libclntsh.sl
Fatal error.
*** Error exit code 1


The turning point We saw bug 6281672 and got a hint from there.

We looked at the $ORACLE_HOME/lib32/libsrvm* files on another RAC system (duat) for comparison:

(oracle@duat1):  ls -l $ORACLE_HOME/lib*/libsrvm*
-rwxr-xr-x   1 oracle    dba          57344 Nov  7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-rwxr-xr-x   1 oracle    dba          57344 Nov  7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Nov  7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-rwxr-xr-x   1 oracle    dba          36864 Nov  7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0

On rac1/2, the /u01/oracle/uatdb/9.2/lib32/libsrvm.sl file was missing. On duat, we saw that the .sl and .sl0 files were copies of each other.

So on rac1/2, we did the following:

$ cd /u01/oracle/uatdb/9.2/lib32
$ cp libsrvm.sl0 libsrvm.sl


(oracle@rac2):  ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x   1 oracle    dba          57344 Oct 17  2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Feb  9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x   1 oracle    dba         106496 Nov 30  2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0


(oracle@rac2):  ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x   1 oracle    dba          57344 Mar  5 15:14 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-r-xr-xr-x   1 oracle    dba          57344 Oct 17  2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Feb  9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x   1 oracle    dba         106496 Nov 30  2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0

The fact that the size of libsrvmocr.sl0 did not match that of libsrvmocr.sl did not seem to be a showstopper; the Oracle system seemed to be using libsrvm.sl for the srv* related commands.

After this, the gsdctl and srvctl commands started working:

(oracle@rac1):  gsdctl stat                                             
GSD is running on the local node

(oracle@rac1):  srvctl config

(oracle@rac2):  gsdctl stat
GSD is not running on the local node

(oracle@rac2):  gsdctl start
Successfully started GSD on local node

(oracle@rac2):  srvctl status database -d uat
Instance uat1 is not running on node rac1
Instance uat2 is not running on node rac2
But there was "just" one more problem.. Interestingly, srvctl stop database/instance worked fine at this point, but srvctl start did not.

(oracle@rac2):  srvctl start instance -d uat -i uat1
PRKP-1001 : Error starting instance uat1 on node rac1
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
ORA-01078: failure in processing system parameters local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.

(oracle@rac2):  srvctl start instance -d uat -i uat2
PRKP-1001 : Error starting instance uat2 on node rac2
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
ORA-01078: failure in processing system parameters
SQL> Disconnected
ORA-00132: syntax error or unresolved network name 'uat2' reserved.


(oracle@rac1):  strings spfileuat1.ora | grep listener

The tnsping utility was working for both the UAT1 and UAT2 service names:

(oracle@rac2):  tnsping uat1
TNS Ping Utility for HPUX: Version - Production on 05-MAR-2008 16:19:36

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat1)))
OK (0 msec)

(oracle@rac2):  tnsping uat2
TNS Ping Utility for HPUX: Version - Production on 05-MAR-2008 16:19:39

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat2)))
OK (0 msec)

We added entries for local_listener and remote_listener to $TNS_ADMIN/tnsnames.ora on rac1 (and rac2 respectively) and made sure that tnsping to them was working:


(oracle@rac2):  tnsping local_listener
TNS Ping Utility for HPUX: Version - Production on 05-MAR-2008 16:44:05

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)))
OK (10 msec)

(oracle@rac2):  tnsping remote_listener
TNS Ping Utility for HPUX: Version - Production on 05-MAR-2008 16:44:13
Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)))
OK (10 msec)
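For reference, the aliases added on rac2 would have looked roughly like this (a sketch reconstructed from the tnsping output above; the actual tnsnames.ora entries are not shown in the post, and on rac1 the two hosts are swapped):

LOCAL_LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2.test.com)(PORT = 1522))
  )

REMOTE_LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1.test.com)(PORT = 1522))
  )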

The final clincher I saw the light when I realized that the database ORACLE_HOME was using AutoConfig, which meant that the $TNS_ADMIN variable had $CONTEXT_NAME in it and was not just the plain-vanilla $ORACLE_HOME/network/admin. Therefore, the listener.ora and tnsnames.ora files in $TNS_ADMIN were not the same as those in $ORACLE_HOME/network/admin.

To get around this issue gracefully, the srvctl setenv command could be used:

$ srvctl setenv instance -d uat -i uat1 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1

$ srvctl getenv instance -d uat -i uat1

$ srvctl setenv instance -d uat -i uat2 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2

$ srvctl getenv instance -d uat -i uat2

(oracle@rac1):  srvctl start instance -d uat -i uat1

(oracle@rac1):  srvctl status instance -d uat -i uat1
Instance uat1 is running on node rac1

(oracle@rac1):  srvctl start instance -d uat -i uat2

(oracle@rac1):  srvctl status instance -d uat -i uat2
Instance uat2 is running on node rac2
Some pending issues.. Some issues were still pending, pointing towards some more misconfiguration, but at least we were able to get past the initial error.

(oracle@rac1):  srvconfig  -version
        at java.text.MessageFormat.format(Compiled Code)
        at java.text.MessageFormat.format(MessageFormat.java)
        at java.text.MessageFormat.format(MessageFormat.java)
        at oracle.ops.mgmt.nls.MessageBundle.getMessage(MessageBundle.java:225)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)

SOA on SOA !! - Bring the discipline of SOA to service development and creation in your organization.

Arvind Jain - Fri, 2008-03-07 18:22

SOA on SOA!!

It was difficult to put my thoughts into the most appropriate words, but what I am trying to bring out is that an SOA implementation should not burden service providers and consumers with learning all the latest standards, tools & technologies.

They should just worry about their business logic, and there should be a framework which transparently takes care of making that business logic a service in the SOA sense, while adhering to their company's enterprise policies, processes and standards.

How to enable this? Enterprise architects should closely watch two upcoming standards - SCA & JBI.

JBI is JSR 208, known as Java Business Integration. SCA is Service Component Architecture.

JBI is used by system integrators during physical deployment (customers and end users do not see this). It helps in management & interoperability of your SOA infrastructure.

SCA has a design and composition perspective. It is used by developers to annotate their code to describe services and their dependencies.

The aim is to create a virtual container for hosting services. This way services can be plugged into an ESB or into an existing Policy Manager. It will be independent of language and will serve as a framework for exposing business logic as a service.

The other significant benefits I see are
- Consistent deployment & management
- Location Transparency (Virtualization)
- Policy Enforcement
- Consistent Security Model
- SOA does not mean every developer needs to know about WSDL, WS-*, or other standards; they need to know only the core business logic.
- It might possibly help in transaction coordination.

So let us try to use our own methodology, SOA, to help in the implementation & adoption of SOA.


Upgrade from Jinitiator 1.3 to Java Plugin 1.6.0.x

Aviad Elbaz - Fri, 2008-03-07 05:51

Lately Oracle announced the end of Error Correction Support for Jinitiator 1.3 for E-Business Suite 11i, effective July 2009.

This is the sign it’s about time to upgrade to the native Java Plug-in… :-)

Among other things, one of the main advantages of upgrading from Jinitiator to the native Java Plug-in is the prevention of conflicts between them.

This upgrade is great news for all who are working with Oracle Discoverer Plus (with the Java Plug-in) and Oracle EBS 11i (with Jinitiator) and experiencing those conflicts.

I’ll skip all the other advantages and disadvantages of upgrading to the Java Plug-in as they are well described in Steven Chan’s post - Jinitiator 1.1.8 To Be Desupported for Apps 11i - and in Metalink Note: 290807.1 - Upgrading Sun JRE with Oracle Applications 11i.

So I will focus on the upgrade process itself - step by step.

I tested the upgrade on the following test environment:

  • EBS
  • Database
  • ATG Rollup 5
  • Developer 6i patchset 18
  • OS RHEL4.

Be aware that before upgrading to Java Plug-in you must upgrade to Developer 6i patchset 18 or later (currently the latest patchset is 19).

* You can use my previous post on upgrading Developer 6i with Oracle Apps 11i to patchset 18.

  1. Download JRE plug-in Oracle E-Business Suite interoperability patch - 6863618
  2. Download the Sun JRE Plug-in 
    • Select Java Runtime Environment (JRE) 6 Update X (select the latest available update, currently it’s 5)
    • Select Windows offline installation, multi-language
  3. Rename the downloaded installation file jre-6_uX-windows-i586-p.exe to j2se1600x.exe
    In my case, rename jre-6_u5-windows-i586-p.exe to j2se16005.exe
  4. Copy the j2se16005.exe file to $COMMON_TOP/util/jinitiator on the Apps Tier node
  5. If you are on Developer 6i patchset 18 you should apply forms patches 6195758 & 5884875.
    ** Skip this step if you are on Developer 6i patchset 19.
    • Download Patches 6195758 & 5884875
    • Apply patch 6195758
      • Stop all applications processes by adstpall.sh
      • Unzip p6195758_60827_GENERIC.zip
      • cd 6195758
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/UICommon.class $ORACLE_HOME/forms60/java/oracle/forms/
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/ComponentItem.class $ORACLE_HOME/forms60/java/oracle/forms/
      • cp oracle/forms/handler/UICommon.class $ORACLE_HOME/forms60/java/oracle/forms/
      • cp oracle/forms/handler/ComponentItem.class $ORACLE_HOME/forms60/java/oracle/forms/
    • Apply Patch 5884875
      • Unzip p5884875_60827_GENERIC.zip
      • cd 5884875
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/engine/Main.class $ORACLE_HOME/forms60/java/oracle/forms/
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/AlertDialog.class $ORACLE_HOME/forms60/java/oracle/forms/
      • cp oracle/forms/engine/Main.class $ORACLE_HOME/forms60/java/oracle/forms/engine/Main.class
      • cp oracle/forms/handler/AlertDialog.class $ORACLE_HOME/forms60/java/oracle/forms/
      • Run adadmin -> Generate Applications Files menu -> Generate product JAR files
  6. Apply the Interoperability patch 6863618
    • Make sure all applications processes are down
    • Enable maintenance mode:
      Execute adadmin -> Change Maintenance Mode (5) -> Enable Maintenance Mode (1)
    • Unzip p6863618_11i_GENERIC.zip
    • cd 6863618
    • Run adpatch to apply patch driver u6863618.drv
    • cd [PATCH_NUMBER]/fnd/bin
    • Execute the following command, where X represents the update number:
      $ txkSetPlugin.sh 1600X
      In my case:
      $ txkSetPlugin.sh 16005
    • Disable maintenance mode:
      Execute adadmin -> Change Maintenance Mode (5) -> Disable Maintenance Mode (2)
  7. Start all applications processes by adstrtall.sh
  8. Verify the installation by signing in to Oracle EBS and selecting a forms-based responsibility.

For those who worry about performance, take a look at this benchmark:
Benchmark comparison test with Jinitiator and Java Plug-in 1.5.0_05 – performance whitepaper

For more information - Note: 290807.1 - Upgrading Sun JRE with Oracle Applications 11i


Categories: APPS Blogs


Mary Ann Davidson - Thu, 2008-03-06 09:26

You know there are too many labor-saving devices in the world when you see the sheer number of professional do-gooders trying to solve problems hardly anybody else worries about. If you have a day job, having someone with too much free time tell you why you need to be concerned about "Making the World Better Through FOO" is often just about as irritating as those old TV commercials moaning about ugly yellow wax buildup on your kitchen floors (my solution: paint your kitchen walls yellow to match the floor).


There are, of course, many people who devote time and passion to making the world a better place. I'm not talking about them here. I am talking about the people who seize on something they care about without bothering to find out if there is an actual problem that needs to be solved. Or, if there is a "problem," asking what the cost is of fixing it and what one could do with those same resources that might solve a more pressing problem (a concept known as "opportunity cost" to economists). It's all you can do, when confronted with an earnest but clueless do-gooder, not to say, "Ask me if I care."


Where I live in Idaho, there are a couple of professional Do-Gooder Projects that engender a lot of whining in the local newspapers. One of them is the Relocate the Airport lobby. The claim is that we need to 1) build a new airport 2) with longer landing strips 3) so that larger commercial planes will fly here. (Never mind the fact that commercial airlines have said repeatedly they will not land larger planes here because there isn't enough demand to support it.) There isn't actually a problem the community needs to solve via a new airport, but we'd create a bunch of new problems, like people having to drive an hour or more to get to Sun Valley instead of the current half hour from Friedman Memorial Airport.


The other local Do-Gooder Project relates to "affordable housing." Mind you, there is no actual housing shortage in town: if you want to work here, you can easily find an affordable place to rent. Many people who work here who want to own property live in another county - where they can get a lot more land for a lot less money. The idea that anyone who works here - regardless of income - should be entitled to own a free-standing home isn't reasonable given market (and geographic) realities (e.g., the land around us is Bureau of Land Management land and cannot be developed). As one of my friends put it to a local Affordable Housing Do-Gooder: "You didn't live next door to your gardener in Marin, either."


My personal opinion is that a lot of these do-gooders retired early, miss running something and want to run everyone else in town by solving problems that don't exist.


There are Do-Gooder Initiatives in the IT industry, too, a number of which are in security.  Security Do-Gooder Initiatives sometimes come under the guise of a laundry list of 5,000 things that everyone should do to be more secure. Even if all 5,000 of those things are somewhat useful, just like New Year's Resolutions, they are likely to be more actionable and "accomplishable" if the list is shorter. Putting it differently, I know very well that I should eat less, exercise more, eat more nutritious food, read better books, improve my skate skiing technique by lengthening my glide and so on. I can't actually process 5,000 "should dos" so I try to parse them down to a smaller list of things that I can actually do that will also make the most difference to my health, my skate skiing, or whatever it is I am trying to improve upon. Many Do-Gooder Initiatives do not have any sense of "nobody can do everything all at once, so maybe doing something now and doing more later is a better way to slice the pie." The initiatives fail due to the expectations - and failure to prioritize - that they entail. You might actually just give up from the frustration of trying to comply with 5,000 "shoulds." 


(It turns out that the people who actually do make good on their New Year's Resolutions start with a small, actionable list instead of a 30-page life plan. A small list of things you can do and will do is better than a much larger list of things that you are never going to get to. Less really is more.)


The reality is that some things matter more than others if you are trying to make constructive change. If I drink a bottle of wine a night (I don't) and have 40 "better health things" I want to do, saving my liver might be among the most important ones. So maybe, trying to cut down to a glass or so a night would be the biggest payoff on my better health things list and I can skip the other 39 items or relegate them to next year. Unfortunately, there are a lot of Do-Gooder Initiatives that not only have too many things on the list; the list is not weighted at all for where the value is in making change. (Opportunity cost again: what could I do with the same resources that would have a bigger payoff?)


I wonder if a lot of Do-Gooders get out enough in the real world. Maybe they are academics who think "theory" is enough. ("Theory" of baking doesn't get you a pie.) Or think-tankers who are paid to develop secure Internet toaster protocols that they then want to standardize. (Does anybody really worry about who is accessing their bagels remotely?)


Whenever I participate in public-private partnerships where a lot of "improve security" initiatives are generated and where there is typically a broad tent of participants (a good thing, in general), I try to ask that the people putting the laundry lists together grab someone who is either a cost accountant or an economist to look at where the bang for the buck goes in what's being proposed. Because if they do not do that, these initiatives are doomed to fail. Or, they will be so expensive that nobody does them because they can't afford the entire megillah.


The one take-away lesson I got from my nerdy quantitative methods class in business school is that when you are trying to solve an optimization problem, you can't optimize on all parameters. Time is constrained. Resources are (ultimately) constrained. Answering the question, "How can I do X while making the best use of scarce resources?" means I need to take account of what I most want to accomplish and how valuable it is to me that I accomplish those things.


For example, there are security initiatives around "what metrics and artifacts at every stage of development you should produce to 'prove' assurance claims."  People measuring the assurance of software believe that there are things you ought to be able to produce and measure at each stage of development. However, there is a cost to producing metrics and artifacts. If the cost of producing these is greater than the value of more information, you shouldn't put the work in to produce them. Even if everything has some value, some things are more critical than others or provide greater value for the work you put into getting them. One of the ways I tranche our metrics project is to look at a) what can we data mine today to give us security metrics? b) what else would we like to know (in some order)? c) what will it cost to get that information? and d) is the cost less than or greater than the benefit of the information?


If you are a small company, maybe you can't - in the beginning - do every single Best Practice Recommendation (or produce every single metric or every single artifact that anybody in a theoretically perfect world would want). But you can do something, and you'd be willing to do something if someone helped you by telling you what the most important things are to do first that make the biggest impact. Something is almost always better than nothing.


Even people who know they ought to do more in security - and are willing to improve - will fight tooth and nail if they are confronted with a "my way or the highway" mandate that takes little account of real world constraints.


For example, consider the Federal Desktop Core Configuration (FDCC), a recent initiative to mandate that US Federal agencies lock down their environments to a specific Windows configuration (which, as a matter of course, means packaged applications will need to run on those locked down Windows configurations). I have said often and publicly that I think one of the easiest things vendors can do to help improve security is to lock down default configurations - better security out-of-the-box, cheaper lifecycle cost for customers. I've also said that one of the things customers can do to be "smart buyers" is to insist that their vendors lock down default configurations: "You don't ask; you don't get." I don't have any issue with the goodness of this concept (and we have a company-wide initiative related to locking down default configurations). In that sense, FDCC is not a "Do-Gooder Initiative" the way I've defined it since it actually does address a problem that people worry about, that needs looking after.


The problem with the way FDCC has been mandated is that it did not, first of all, define what a "desktop" configuration is. Is it desktop software? Or anything installed on the Microsoft operating system (which can and is used on desktops)? There might be a huge (and legitimate) difference between the configuration of middleware or servers on Windows and the client piece of an application configured on Windows. There's certainly a big scope difference between "validating how client pieces of applications running on desktops are configured to run with FDCC" and "validating how every single component of every application that runs on Windows is configured with FDCC." What problem, exactly, is it that is being solved? "Desktops used to launch attacks?" or "locking down the Windows operating system for every single application running on it?" Nobody knows, especially since this is called a "desktop" configuration initiative, and nobody on the mandate side of this issue has yet answered that basic question.


Most vendors have product lifecycles such that they do not make configuration changes in anything other than a major product release. That is, when customers uptake patch sets, their expectation is that there won't be configuration changes that could break their existing applications. One time in almost 20 years at Oracle, I tried to change a configuration parameter in a patch set (for good security reasons). The configuration change broke all our business applications, so we backed it out before the patch set shipped and I've been apologizing to the release manager ever since. (We later made the configuration change in a major product release.) Unfortunately, FDCC was mandated without adequately taking into account vendors' product lifecycles. Some vendors simply will need more time to phase in needed configuration changes. A lot more, if your major release product lifecycle is years and not months.


Nobody was evil-minded here, but even people who support the idea of FDCC are dead in the water until they can get some basic questions answered and a dialogue going. Ideally, this dialogue should have taken place before FDCC was mandated. Industry (including Oracle) is still working to try to get clarification on the specifics of FDCC and also asking that in future these types of configuration mandates be developed with industry and with adequate phase-in that allows for product lifecycles. How you implement change is as important as what the change is if you want people to move the security ball down the field. Otherwise, even a worthy initiative like FDCC can sink into the morass of Do-Gooder Projects.


A better example (where really, "there is no there there," to quote Gertrude Stein) is the recent proposal to develop an ISO standard for vulnerability disclosure. I know of no vendor who thinks this is a good idea. For a start, what problem are we trying to solve? Does anybody think that we can come up with a one-size-fits-all standard for how long it should take to fix a security bug, the exact "rules" on how much information gets put into security advisories and the specific format of how that vulnerability information is expressed? Software vendors have different release cycles, customer bases, risk profiles, and more. (One-size-fits-all pantyhose, as any woman knows, only fits Hilda Mae Throckmorton of Muncie, Indiana.) There are plenty of industry guidelines for good practice on vulnerability disclosure already. Most of these acknowledge that you can't standardize this business practice any more than you can standardize apple-pie making ("Allspice? Death to infidels!"). There are also existing standards on vulnerability disclosure that vendors are adopting, such as the Common Vulnerability Scoring System (CVSS). Oracle was an early adopter of CVSS and customers have told us that it's really useful to them.


It is unwise (no, make that "really stupid") to try to standardize what is in effect both a business process and a set of business practices. Ira Gershwin (who knew he was a security maven?) penned the perfect lyric that applies to this Unneeded Standard Attempt: "You say po-TAY-to, I say po-TAH-to, let's call the whole thing off."


I offer one last example that isn't quite in line with Do-Gooder Initiatives but relates to what problem to solve and at what price. It's also a big pet peeve of mine: I get a lot of phone calls from vendors trying to shill their security products to Oracle. (Though I do not have operational security responsibility - wonderful, capable colleagues look after that - vendors assume that since my title is "CSO," I am the person who buys Cool Security Products for the IT department.)


I hate to mention how many cold callers do not even do basic homework before trying to sell me true love and security happiness. My favorite was the cold caller who said his firm had expertise in securing Oracle Applications deployments. I had to point out to him that, "Uh, we are Oracle, we run on Oracle Applications, and since we build the software, we'd be unlikely to hire a third party to 'securely deploy' it for us." Or, the vendors selling solutions that run on a non-Oracle database. You know, that's just a religious problem for us: we are not going to deploy a third party security solution that runs on <insert name of competitor database here>.


My basic pet peeve is the people who do not think about the customer perspective before they launch into their "cure cancer, raise the dead, protect against every attack known to mankind with zero false positive" shill. They claim this shill will only be "twenty minutes of your time" (only "twenty minutes" is measured on a calendar, not a watch).


Forthwith, here is my script for parsing through shill-meisters as quickly as possible:


1. "What problem does this solve?" (If you can't articulate that in 25 words or less, do not waste my time or anyone else's.)

2. "Is it a problem we are worried about or care about solving?" (Secure remote bagel access is not something that concerns me, so forget the 'Internet Toaster Protocol' pitch.)

3. and 4. "Does it address the problem better, cheaper or faster than what I am doing now? How much better, cheaper or faster?" (If it doesn't, why would I switch from something that may not be sexy or "a breakthrough technology" but gets the job done? I don't have an electric salad tosser, either, because the salad spinner I have - or a pair of tongs - works just fine and has fewer moving parts.)

5. "How can it be broken?"  (Especially for a security product, knowing and being honest about how it can be broken is important. A claim of "zero false positives," for example, should cause anyone to run screaming in the opposite direction.)


Do-Gooders, the next time you come up with A Cause, a small request. Please, in the interests of making it a better world without wasting everyone else's time, use your skills on a problem that really needs a solution (or on a better, faster, or cheaper way of solving an existing problem), not on a solution in search of a problem to solve.


For More Information:


Book of the week: Hog Pilots, Blue Water Grunts by Robert Kaplan (who also wrote Imperial Grunts). If you want to know what the military really does, this is a great read. Robert Kaplan was embedded with a number of different types of units, in multiple services, around the globe: special forces, marines, aviators, and submariners. A really cool read. Mahalo nui loa, all you soldiers, sailors, airmen and marines for keeping us safe.




We aren't going to have "oldies" rap stations anytime in the future. If anybody has written a more clever lyric than Ira Gershwin (OK, maybe Cole Porter) I have yet to hear it. Songs with lyrics by Ira Gershwin:




Totally off topic, but Go! Airlines has just done a web page where you can book your next interisland trip totally in Hawaiian. E ola mau ka 'olelo Hawai'i (May the language of Hawai'i live!).


Check it out at:




When all else fails, follow instructions.

Claudia Zeiler - Tue, 2008-03-04 18:26
  1. Today I tried to open my test database and got:

ORA-16038: log 3 sequence# 729 cannot be archived
ORA-19809: limit exceeded for recovery files
ORA-00312: online log 3 thread 1: 'C:\ORACLE\ORA10\ORADATA\DB10\REDO03.LOG'

Researching the problem I found:

  • Maybe your Oracle instance has reached the maximum db recovery file destination file size assigned to it? If that is so, probably your oracle instance will stop working and you will get the 3 error messages listed above while trying to start it.
  • The relevant parameter is: DB_RECOVERY_FILE_DEST_SIZE - specifies (in bytes) the hard limit on the total space to be used by target database recovery files created in the flash recovery area.
  • It works in concert with DB_RECOVERY_FILE_DEST which specifies the default location for the flash recovery area. The flash recovery area contains multiplexed copies of current control files and online redo logs, as well as archived redo logs, flashback logs, and RMAN backups.
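A quick way to see both the limit and how much of it is currently used is the v$recovery_file_dest view (a sketch; output formatting omitted):

SQL> select name, space_limit, space_used, space_reclaimable, number_of_files
  2  from v$recovery_file_dest;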

  2. What is the maximum db recovery file destination file size assigned in my DB?

SQL> show parameter db_recovery_file_dest_size
db_recovery_file_dest_size 2G

At least this wasn't a surprise.

  3. I removed excess files

I shut down the database. I then removed the archivelog files and the flashback files from the file system, even though they didn't amount in total to the 2 GB that the system was complaining about. (This is a play database, so I can do things like kill these files.)

  4. Still had a problem.

I brought the database up to mount state, but immediately there was a trace file message:

ORA-19815: WARNING: db_recovery_file_dest_size of 2147483648 bytes is 99.85% used, and has 3320320 remaining bytes available.

*** 2008-03-03 10:03:19.635


You have following choices to free up space from flash recovery area:

1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard, then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating system command was used to delete files, then use RMAN CROSSCHECK and DELETE EXPIRED commands.
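For completeness, choice 3 above is a one-liner if the underlying filesystem has the room (a sketch; the 4G value is an arbitrary example, not a recommendation):

SQL> alter system set db_recovery_file_dest_size=4G scope=both;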

  5. The files were still being seen by RMAN! When all else fails, follow instructions, in this case #4 above. “4. Delete unnecessary files using RMAN DELETE command. If an operating system command was used to delete files, then use RMAN CROSSCHECK and DELETE EXPIRED commands.”

RMAN> crosscheck archivelog all;
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1

channel ORA_DISK_1: sid=146 devtype=DISK
validation failed for archived log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_733_3WRQ14DY_.ARC recid=70 stamp=648389669

validation failed for archived log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_734_3WRQ4V7R_.ARC recid=71 stamp=648389787

What a nice list of all the files that aren't there, and all the locations where they aren't; 68 files that don't exist! [apologies to any reader of English as a second language]

RMAN> delete expired archivelog all;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=146 devtype=DISK
List of Archived Log Copies
Key Thrd Seq S Low Time Name
------- ---- ------- - --------- ----



Do you really want to delete the above objects (enter YES or NO)? yes

deleted archive log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_733_3WRQ14DY_.ARC recid=70 stamp=648389669

deleted archive log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_734_3WRQ4V7R_.ARC recid=71 stamp=648389787


Deleted 68 EXPIRED objects

  6. Now all is well with the database. I bounced it and the alert log shows:

db_recovery_file_dest_size of 2048 MB is 0.38% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup.

Guest Blog: Smart Space from an end-user’s perspective

Oracle EPM Smart Space - Mon, 2008-03-03 13:54
I had the pleasure of being an early user of Smart Space while Mike, Matt and the team were building it.

I’ve been using software since the days of Wordstar, dBase II, and Visicalc (on CP/M before DOS no less). Using Smart Space was not like using software. This was an environment I was experiencing. I think Matt’s vision of “I just want to ride a bike” is spot on.

Smart Space is:

Intuitive: I didn’t know I was doing enterprise performance management, data mining, query & analysis – I thought I was making decisions (I further didn’t know I was making those decisions based on judgments & knowledge learned from information that came from data!). Gadgets solve problems.

Social: my favorite part of using Smart Space was collaborating with other users. I could slice & dice data and quickly see an anomaly (variance outside of a threshold, for example), and instantly ping a colleague about it, share my view with her and record the decisions/actions taken as a result.

Personal: my desktop was role & function specific, that is it was set up (by me) to only contain those gadgets, workflow, and points of view that had to do with the role I played. So it should be with enterprise performance management: finance cares about the balance sheet, Operations cares about the supply chain, sales cares about revenue and pipeline, and so on. Executives care about summarized results and trends, managers care about tracking targets and variance, and analysts care about cause and effect. And, in an “EPM enlightened” organization, people across all functions care about cross-functional things. They would have gadgets that let them get to revenue growth, operational efficiency, customer satisfaction, and so on.

Ready at hand: Smart Space is always-on EPM. It’s always there, always ready, and becomes an extension of my work – like the phone, IM, e-mail, and coffee.

Expandable: there is a fairly large (and growing) community of EPM and BI devotees: customers, consultants, vendors, analysts. These folks are candidates for contributing (dreaming-up & developing) the library of gadgets available to get your job done. I just thought of one while typing this (I’ll let you know what it is as soon as I get a developer to write it for me, any volunteers?)

This is pretty exciting stuff, and I can’t wait to see what Matt & Mike have in store for us in ’08.
Ron Dimon
Business Foundation
EPM Advisory
Categories: Development

Soft-skills (not oracle related post)

Fadi Hasweh - Sun, 2008-03-02 23:20
As part of my master's program we have to create a blog related to an e-business course. The blog will be evaluated, and one of the evaluation criteria is the number of users accessing it and commenting on it, so if you have some free time, could you access it and leave comments? The blog has good and helpful information.
The blog talks about soft skills and how important they are in our lives; check, for example, the speed reading and managing a busy schedule posts, among many others.
Please check it out and leave comments if you can.
Thank you

Interesting Corrupted Redo Log Case

Renaps' Blog - Sun, 2008-03-02 11:59

It has been a while since I wrote on Renaps’ Blog, most probably because I haven’t run into any interesting case to talk about for a while!

Yesterday, at one of our main clients, the production (soon to be, on Monday) database hung. Was it because it was Feb 29th of a bissextile year? Or simply because the weekend arriving in the next 4 hours was the go-live of the main data warehouse? It’s hard to say; I believe it was just a little wink from our old friend Murphy!

The environment is a 16-CPU Itanium Windows server; ASM is used even though the database runs in single-instance mode.

When the database hung, the alert.log showed the following error messages (of course, no one could connect or run any transaction on the database):

ARC0: All Archive destinations made inactive due to error 354
Committing creation of archivelog ‘+FLASH/inf1/archivelog/arc17972_0627748574.001’ (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
Fri Feb 29 11:48:05 2008
Errors in file c:\ora\admin\inf1\bdump\inf1_arc0_10236.trc:
ORA-16038: Message 16038 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [17972]
ORA-00354: Message 354 not found; No message file for product=RDBMS, facility=ORA
ORA-00312: Message 312 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [1] [+FLASH/inf1/onlinelog/group_1.257.627748575]

The first line of this message stack made me believe that the database just hung because there was no more space for the archiver to do its job. But after further analysis, it appeared that the problem was much more serious.

The actual cause of the issue was related to Redo-Log Corruption. The cause of the corruption was a “write or rewrite” SAN related error.

When Redo-Log corruption occurs, Oracle first tries to use the second (or third) member of the group showing corruption to avoid unavailability of the database. In our case, this was impossible since there was only one member per Redo-Log group; I was surprised to see that this client did not multiplex the Redo-Logs …

My first attempt to get the database back on its legs was to dump the content of the redo-log showing corruption issues (in this case, group 1).

SQL> ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;
ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575’
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 53504 change 70389305 time 02/29/2008 11:48:02
ORA-00334: archived log: ‘+FLASH/inf1/onlinelog/group_1.257.627748575’

As you can see, this attempt failed. I then attempted to clear the unarchived redo log:
SQL> alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;

At this point, my command hung indefinitely, and I realized that pmon was holding a lock on the Redo-Log header block…

My margin of maneuver had become much smaller; here is the sequence of commands that brought the database back on track:

  1. shutdown abort;
  2. startup mount;
  3. issue the same command that was previously hanging (this time, it worked right away) :
    alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;
  4. alter database open;
  5. shutdown immediate;
  6. Immediately take a FULL backup at this point;
  7. startup;
  8. Don’t forget to multiplex your Redo-Logs to avoid running into this situation again (see the sketch below)!
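A minimal sketch of that multiplexing step for this ASM setup (the three group numbers and the reuse of the +FLASH disk group are assumptions; ideally each new member goes into a different disk group than the existing one, and depending on your version and OMF settings you may need to spell out a full file name rather than just a disk group):

SQL> alter database add logfile member '+FLASH' to group 1;
SQL> alter database add logfile member '+FLASH' to group 2;
SQL> alter database add logfile member '+FLASH' to group 3;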

Here you go: Redo-Log corruption on the header block of a group when only one member is used does look bad at first sight, but I hope this article can help you if you run into a similar scenario.


Metalink Notes:

Patrick Hamou
E-Mail: phamou@renaps.com
Web: www.renaps.com

Categories: DBA Blogs

APEX 3.0.1 on XE

Jornica - Sat, 2008-03-01 09:17

On 29 February 2008 Oracle released Application Express 3.1. With this release you can upgrade the Application Express within Oracle XE to 3.1. For more information on this topic see Oracle Database XE and Application Express 3.1 and APEX 3.1 on XE.

Oracle released Application Express 3.0.1 recently. With this release you can upgrade APEX within Oracle XE from 2.1 to 3.0.1 (see Oracle Database XE and Application Express 3.0.1). But how do you upgrade when you have already installed APEX 3.0 on XE?

There is a patch available on Metalink to upgrade from APEX 3.0 to 3.0.1. According to the Patch Set Notes Oracle9i release 2 and higher support this patch if you have APEX already installed (the Patch Set Notes also includes a list of fixed bugs). In order to download the patch from Metalink, you will need a support contract (which is not included with XE).

Another option is to remove APEX 3.0 completely. After the removal, XE still has APEX 2.1 installed, and the next step is of course to install APEX 3.0.1. The big advantage of this option is that you follow the main route of upgrading XE/APEX as proposed by Oracle. However, there are some things to keep in mind.

  • As always start with making a backup of your database.

  • Export all applications you would like to see again in APEX 3.0.1. If you remove APEX 3.0 completely, you also remove all applications stored in the FLOWS_030000 (APEX 3.0) schema! Note that applications stored in the FLOWS_020100 schema will be migrated to FLOWS_030000 (APEX 3.0.1).

  • Make a backup of changed or added cascading style sheets (CSS), images and JavaScript files separately.

  • Run the apex_3.0.1\apxremov.sql script (the script is the same for both versions).

  • Run the apex_3.0.1\apxins.sql script and check the installation log for errors (see the sketch after this list).

  • Copy the apex_3.0.1\images directory to the /i/ directory. Also copy the previously backed-up CSS, images and JavaScript files to the /i/ directory.

  • Import all applications exported in the first step. Note that existing applications in APEX 2.1 are migrated to APEX 3.0.1 as part of the upgrade process.

  • Time to make a backup again.
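A rough sketch of the remove/reinstall steps above, run from SQL*Plus in the apex_3.0.1 directory. The apxins.sql arguments shown (tablespaces and the /i/ image prefix) are the values usually quoted for XE, so treat them as assumptions and check the patch set notes for your release:

sqlplus / as sysdba
SQL> spool apex301_reinstall.log
SQL> @apxremov.sql
-- tablespace for APEX, tablespace for files, temp tablespace, image prefix (assumed XE defaults)
SQL> @apxins.sql SYSAUX SYSAUX TEMP /i/
SQL> spool off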

Have a look at the APEX reports in SQL Developer: no errors any more.

APEX 3.0 on XE part 2

Jornica - Sat, 2008-03-01 09:16

On 29 February 2008 Oracle released Application Express 3.1. With this release you can upgrade Application Express within Oracle XE to 3.1. For more information on this topic see Oracle Database XE and Application Express 3.1 and APEX 3.1 on XE.

If you followed the guidelines of APEX 3.0 on XE, you have a working APEX 3.0 interface. However, the administrative interface of APEX is not installed on XE by default. Using this interface you can manage all aspects of the service, including creating, removing, and managing workspaces. Barney Mattox discovered a workaround to install this interface as well.

In order to install the administrative interface perform the following steps:

  1. Start SQL*Plus. Make sure your working folder is the APEX installation folder (where apexins.sql is located).

  2. Connect as user SYS.


  4. The following command imports the administrative interface: @script\f4050.sql

  5. The following command imports the request for workspace interface: @script\f4700.sql

  6. Change the password for the ADMIN user in order to log in to the administrative interface: @apxxepwd.sql

  7. Use the following URL to access the administrative interface: http://server:port/apex/f?p=4550:10 (substituting your own server and port).

Additional notes:

  • Issue a spool install.log after logging in to SQL*Plus. If an error occurs, SQL*Plus will exit; you can use the log file to find out what went wrong.

  • I experienced problems with step 3: Illegal security group id value for this schema. A workaround is to log in as FLOWS_030000 and run the scripts; first you need to unlock the FLOWS_030000 user, as sketched below:
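A minimal sketch of that unlock step, run as SYS (the password you assign is your own placeholder):

SQL> alter user FLOWS_030000 account unlock;
SQL> alter user FLOWS_030000 identified by <new_password>;
SQL> connect FLOWS_030000/<new_password>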

Workflow performance case study: Dont Repeat History, Learn from it

Gaurav Verma - Sat, 2008-03-01 02:06

A Sanskrit shlok underlining the importance of careful planning

Preface They say: "Those who do not know history are condemned to repeat it."
But why repeat history if you don't have to?
Many questions come to light on this topic of discussion:
  • Is there a better way of designing conditional implementation in workflow?
  • What are the merits/demerits of repeated checking of a particular condition as opposed to executing something when the actual condition happens? What are the performance implications?
  • Are there workflow APIs available which simulate event based subscription model?
  • What are the performance advantages of using the basic Workflow APIs that simulate this event based subscription model?
This case study discusses the performance benefits of keeping unnecessary looping execution in workflow design to a bare minimum and, if possible, eliminating it. If there are several short-timed loops in the workflow design that get executed very often, the bad news is that the system spends valuable resources spinning its wheels in vain.
The intended audience is workflow designers, support personnel and Oracle Apps DBAs alike. This case study can potentially help users design their workflows more optimally and succinctly, with minimal impact on execution. The I/O profile and the time taken by the Workflow Background Process bear out all this information. As far as possible, trends have been depicted from the customer's production system. The Business Scenario

The business requirement to be implemented was during the ATP (Available to Promise) phase in the life of an Order line. ATP simply means that we can guarantee the customer that a particular item can be made available from a particular warehouse on a particular date.

When scheduling an Order Line, if a financial hold existed on the line for some reason, the line needed to be unscheduled. Also, as soon as the financial hold was released by the Accounts Receivables department, the requirement was to proceed with the remaining business flow in the OEOL workflow design.

How It Was Designed

This is how it was designed:

Let us translate the flow:

  • When the WF background program executes custom ATP process, it checks if the order is on hold or not.
  • If the order header or one of the lines is on hold the custom activity "Unschedule Lines On Hold" will unschedule all those lines and waits for an hour before checking the line status again.
  • This loop repeats until all holds on the header/line are released.
Lets look more closely at the Wait Activity definition:

Checking its Node Attributes tab:

Now, checking its actual timeout value:

So, the bottom line is that the WAIT activity was timing out roughly every 15 minutes (the node's relative wait time was set to 0.0416). Just imagine the volume of OEOL item keys at this activity in any given hour. What kind of load would they be putting on the system?
Now, is that an intelligent way of designing the workflow for this requirement? Some shocking revelations.. A little analysis of the volume in WF_ITEM_ACTIVITY_STATUSES_H (history) table showed something shocking (!!):

The XXONT_CUSTOM_ATP->WAIT step was generating the most history records. In this case, it had created 27 million rows (75% of the entire volume) out of the 35 million rows in the WF_ITEM_ACTIVITY_STATUSES_H table.
PS: This is one of the reasons why this case study is so interesting: its impact was tremendous. How are history records created?

Let's clear the air and get our basics right about how and when history records are created.

When the workflow engine encounters an activity which has a status of COMPLETE, it moves (figuratively) that record from WF_ITEM_ACTIVITY_STATUSES to WF_ITEM_ACTIVITY_STATUSES_H and creates a new record in WF_ITEM_ACTIVITY_STATUSES with status ACTIVE.

Therefore, the more times a COMPLETEd activity is revisited, the more history records are created for that activity. This can happen when a loop containing a WAIT activity with a short timeout is designed. The more often the workflow background process flows through the activities of such a loop, the more records get created in WF_ITEM_ACTIVITY_STATUSES_H.
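To see which activities generate the bulk of the history rows, a query along the following lines can be used (a hedged sketch using the standard WF_ITEM_ACTIVITY_STATUSES_H / WF_PROCESS_ACTIVITIES join):

-- sketch: count history rows per workflow step for a given item type
select pa.process_name||'->'||pa.instance_label step,
       count(*) history_rows
from   wf_item_activity_statuses_h iash,
       wf_process_activities pa
where  iash.item_type = 'OEOL'
and    iash.process_activity = pa.instance_id
group by pa.process_name||'->'||pa.instance_label
order by count(*) desc;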

The performance implications...

Let's think about what the other implications of such poor design are:

  • The system keeps looping on idle timeout activity every 15 mins
  • Immediately after the timeout completes, the activity becomes eligible to be processed by the workflow.
  • The next activity then checks whether the hold has been released.
  • Statistically speaking, since a financial hold is unlikely to be released immediately, control goes back to the WAIT activity, and the cycle repeats
While this is happening:
  • Every 15 mins, the backlog (timed-out or deferred) count of the unprocessed OEOL item type bloats
  • Due to such frequent time-out activities, other more important processes/activities in the OEOL workflow are not executed in a realistic timeframe
  • Valuable system resources are spent on scheduling and executing the loop's activities on the Production machine
  • The OS on the Production machine spends vital I/O resources in servicing these application calls
Simply speaking, that is very inefficient.
This is a vicious cycle. The more data you have in the WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H tables, the more time it takes to query and insert into them (also, the more indexes you have, the more time deletes/updates/inserts take due to the automatic index maintenance required). It just kills online performance at times.
The following diagram bears out the above ideas. It shows the top activity (in terms of number of item_keys) in the breakup of the OEOL backlog. It can be seen that, without fail, XXONT_CUSTOM_ATP was the top wait activity in the OEOL backlog. Not only was it a liability to itself, it was also a liability to other legitimate activities perfectly eligible to be processed:
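A breakdown like the one in the diagram can be pulled with a query along these lines (a hedged sketch based on the standard WF_DEFERRED_TABLE_M and WF_PROCESS_ACTIVITIES tables):

-- sketch: deferred OEOL backlog broken up by workflow step
select pa.process_name||'->'||pa.instance_label step,
       count(*) item_keys
from   applsys.wf_deferred_table_m   wdtm,
       applsys.wf_process_activities pa
where  substr(wdtm.corrid,5) = 'OEOL'
and    wdtm.state = 0
and    pa.instance_id = wdtm.user_data.actid
group by pa.process_name||'->'||pa.instance_label
order by count(*) desc;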
 Design Considerations: The Danger Of Introducing Infinite Loops

There is another big danger here: that of introducing infinite loop in the execution of Workflow Background Process (also referred to as FNDWFBG) for OEOL.

Let us assume that there is a standard WAIT activity with a very short timeout/wait time in a loop with a condition checking activity. When the condition is checked, the activity will be deferred or timed out for a very short time.
Meanwhile, the Workflow Background Process would be busy processing all other deferred or timed-out activities. If FNDWFBG is not done processing those activities by the time the wait/time-out period of the earlier activities expires, the re-timed-out or re-deferred activities will be enqueued and processed AGAIN by the very same FNDWFBG (Workflow Background Process for, say, item_type=OEOL/OEOH) job over and over, as long as the condition (e.g. "line on hold") is met.
This creates an infinite loop: the FNDWFBG process almost never completes, eating up CPU time, creating massive archive logs (if archiving is turned on), and pumping rows into the WF activity status tables. It also creates unnecessary rows in the WF_ITEM_ACTIVITY_STATUSES_H table, which can become the nemesis of online order entry performance in the 11i E-Business Suite later on.

It is a good idea for workflow activities in custom workflows to have Wait/Timeout time greater than the longest normal execution time of Workflow Background process.
Also, a workflow activity having a near-zero Relative (Wait) Time, or a Wait Mode of Absolute Date with no Absolute Date specified (which is equivalent to a zero Relative Wait Time), can cause the same infinite-loop issue for FNDWFBG runs with Process Deferred = Yes.

Hence, these poor design practices should be avoided. Remember, once runtime data for a particular WF definition has been created, the existing runtime data will not be corrected automatically even if the workflow definition is fixed and uploaded later. So it's prudent to get the design right before letting your production data run on it.

More design considerations..

Whenever possible, the standard Wait activity (WF_STANDARD.WAIT()) should not be placed in a loop with a small wait time (shorter than several hours). Alternative workflow designs should be evaluated.
WF_STANDARD.WAIT() with a wait time of zero or near zero, or the standard Defer activity (WF_STANDARD.DEFER()), should NEVER be placed in a loop in a workflow process design diagram.

Extending the wait time should be used only if this alternative is not acceptable for a very good reason.

The Superior Design Alternative....

For the above business scenario, the best solution is not to use a Wait activity in a loop at all. A much more attractive option is to use the published Workflow blocking/continuation APIs: WF_STANDARD.BLOCK() or OE_STANDARD_WF.STANDARD_BLOCK() can be used to stop the processing flow of a workflow item_key, and WF_ENGINE.COMPLETEACTIVITY() can be used to resume it.

An additional advantage of using OE_STANDARD_WF.STANDARD_BLOCK is that any OEOL workflow item blocked on it can also be progressed through the Order Entry form via the "Actions" button's Progress Order option. This gives more flexibility to the user/designer.

If one wants to progress the flow of a workflow item programmatically, the WF_ENGINE.COMPLETEACTIVITY() API can be called. This API notifies the Workflow Engine that the specified asynchronous activity has been completed for a particular item. The procedure requires that the activity currently has a status of 'Notified' (the internal status equivalent to BLOCKED). An optional activity completion result can also be passed; the result can determine which transition the process takes next.


/* Complete the BLOCK activity which is used in multiple
* subprocesses in parallel splits. */

begin
  -- arguments: item type, item key, activity node label, optional completion result
  wf_engine.CompleteActivity('ORDER', '10045', 'BLOCK', null);
end;
/
To enhance the design AND achieve the same business functionality, we need to replace the WAIT activity with a BLOCK activity, which waits until the last hold is released from the order header/line without causing any looping. The COMPLETEACTIVITY API would then be called from the Financial Hold release form at Order Header level, roughly as sketched below.
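A hedged sketch of that hold-release call. The block activity label 'XXONT_BLOCK' and the variable names are hypothetical, not taken from the client's actual definition:

-- called from the Financial Hold release logic for each released order line
declare
  l_line_id number := 12345;   -- line_id of the released order line (placeholder)
begin
  wf_engine.CompleteActivity('OEOL',              -- item type
                             to_char(l_line_id),  -- item key = order line id
                             'XXONT_BLOCK',       -- hypothetical block activity label
                             null);               -- no completion result needed
end;
/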

It's as simple as that. The new design would look like this:

Seeing the Performance Benefits in Production..
Adopting this approach creates only one record in the WF_ITEM_ACTIVITY_STATUSES table and two records in the WF_ITEM_ACTIVITY_STATUSES_H table.

Contrast this with continuous looping until a condition is met: for each pass through the WF logic, more records are created in the WF_ITEM_ACTIVITY_STATUS* tables.

As a result of this change, the landscape of OEOL backlog also changed drastically.

The average runtime of the Workflow background Process for OEOL also came down consistently.

Summary of Learnings

  • It pays to know the published Workflow APIs. The Workflow Developer Guide is a good place to start. Spend time reading the fine manual. More information can never hurt.
  • It pays to spend time and effort in evaluating re-design of loops in workflow design early, especially since once initiated, a specific version of workflow runtime data does not automatically get updated after a newer version of workflow design is uploaded (through WFLOAD)
  • Poor design involving WAIT and poll activities in a loop can cause infinitely running Workflow Background Process (FNDWFBG)
  • Short-timeout activities in loops should be replaced with OE_STANDARD_WF.STANDARD_BLOCK or WF_STANDARD.BLOCK APIs and WF_ENGINE.COMPLETEACTIVITY APIs
  • Simple design changes can bring HUGE performance benefits, especially when a lot of volume is involved
  • One thing which was also learnt in hindsight was that the Development/Testing team needs to be extremely careful to account for all instantiated financial Holds.  We ran into a problem with the removal of the loop where a certain set of financial holds were not being completed.  For this set of scenarios (automatically released holds not going through the quick order form) we created a custom concurrent program to identify the blocks and call the WF_ENGINE.COMPLETEACTIVITY API. 

Simply put: know your facts, be smart, and don't repeat history if you don't have to. Above all, do your testing well.

Suggested Related Readings

When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization --

Posted on 17 May 2007; 12:09:31 PM


A Performance Case Study: Taming the Workflow Purge --

Posted on 21 March 2007; 6:58:11 PM


We're Skewed! Using 10053 event for being Histogram-wise --

Posted on 27 March 2007; 3:02:33 PM


When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization

Gaurav Verma - Sat, 2008-03-01 01:57


Order management workflow interacts with a variety of components like core workflow, order management and pricing application call interfaces.

Customizing it without understanding the implications can be a double edged sword and the nature of customization may unwittingly worsen the situation. An undesirable combination of the aspects can cause a very difficult to diagnose and troubleshoot situation as shall be demonstrated in this case study.

This article is essentially the distilled wisdom of a severity 1 situation at a client site that was unable to ship ordered items to its customers because the OM order line workflow background process had a severe performance issue. The business impact was tremendous: $1.5 to 2 million worth of shipping and invoicing was being held up.

Many times, performance troubleshooting involves having some functional insight as well. TKPROF and the 10046 trace are not the panacea for all performance issues. In this case, some insight into the nature of the Order Management APIs was also required to get traction on the performance problem. A painful discovery path followed, riddled with bumps and insightful discoveries. The troubleshooting approach used was really out of the box, and involved breaking down the code being executed and using logical deduction.

The generic learnings from the ordeal are presented in this case study. It is hoped that they will help the Oracle Applications user community be conscious of the hidden implications of customizing the OM workflow.

Feeling the pinch...
The client had heavily customized their order management workflow during an implementation of Order to cash business cycle of Oracle applications 11i.

Immediately after the April 2007 financial quarter end, the OEOL workflow background process started exhibiting extremely long execution times, such that the backlog of unprocessed OEOLs came to ~150,000. This was a sharp contrast from the normal processing throughput.

This was a moving problem. The hanging of the OEOL workflow background process was quite apparently due to a 'few' order lines, belonging to a few orders. Once the 'problematic' orders' OEOL workflows were either CANCELLED or ERRORED, the backlog started clearing at a slow pace. This went on for a little while, until similar order lines' OEOL workflows again seemed to take inordinately long to process and held up the entire workflow queue.

Coming to grips with it.. As a knee-jerk approach, many things were tried, including killing the currently running workflow background process and re-submitting it. The clincher was that, once or twice, when the workflow background process was killed, it marked the OEOL ITEM_KEY being processed into ERROR state and hence took it out of the deferred queue. This forced the Workflow Background Process to consider the next deferred OEOL item_keys in line.

In ~50% of the cases it would work, and the new thread would start exhibiting a more acceptable throughput. But after a certain point, even killing and re-submitting the workflow background process simply did not work. It did seem that the logic being executed by FNDWFBG was taking extremely long to process one order line.

The observation was baffling. The following query was used to find out approximately which order line a single thread of the workflow background process was working on:

Query 1:

SQL> select * from (select to_char(enq_time,'dd-mon-rr hh24:mi'), user_data
     from wf_deferred_table_m where state = 0 and corrid = 'APPSOEOL'
     order by priority, enq_time) where rownum < 5;



02-apr-07 14:25 WF_PAYLOAD_T('OEOL', '30273337', 141255, NULL, NULL, NULL)

02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458068', 140850, NULL, NULL, NULL)

02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458075', 140850, NULL, NULL, NULL)

02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458089', 140850, NULL, NULL, NULL)

4 rows selected.

Tying it back to the order management tables, the following query gives us order-level details like order_number, line_number, ordered_item etc., which is even more useful:

Query 2:

select distinct oh.org_id, oh.order_number,
ol.line_number, ol.shipment_number,              
ol.option_number, ol.ordered_item, ol.line_id, wiash.begin_date,
priority, wdtm.enq_time
from applsys.wf_deferred_table_m wdtm,
    applsys.wf_process_activities wpa,
     applsys.wf_item_activity_statuses wiash,
     apps.oe_order_lines_all ol,
     apps.oe_order_headers_all oh
where substr (wdtm.corrid,5) = 'OEOL'
and (wdtm.delay < sysdate or wdtm.delay is null)
and wpa.instance_id=wdtm.user_data.actid
and wiash.item_type=wdtm.user_data.itemtype
and wiash.item_key=wdtm.user_data.itemkey
and wiash.process_activity=wdtm.user_data.actid
and wiash.item_key = ol.line_id
and oh.header_id = ol.header_id
and wdtm.state = 0
group by
oh.org_id, oh.order_number, ol.line_number, ol.shipment_number,
ol.option_number, ol.ordered_item, ol.line_id, wiash.begin_date, priority, wdtm.enq_time
order by priority, wdtm.enq_time ;

Observing the output over some time made it possible to identify whether the workflow background process was stuck on a specific OEOL workflow.

Limited insight by conventional analysis..

The conventional debugging approaches provided limited insight into the root cause of the issue.

The 10046 trace could not clearly bring out the drill-down of time spent in PL/SQL API calls. In addition, there was an OS file-size limit on the trace file on most production servers, even when max_dump_file_size=unlimited was used.

Another option was to use the dbms_profiler API on the FNDWFBG call. It could have worked, except that the FNDWFBG call never completed, so the data could never be collected.

Searching Metalink gives a host of notes for getting OM debug (level 5), Pricing debug etc.:

1. Note 255040.1:bde_wf_data.sql - Profiling Workflow Runtime Data related

2. Note 163208.1:bde_last_analyzed.sql - Verifies Statistics for all

3. Note 353991.1:OMSuiteDataChk.sql and OMSuiteDataChk11i10.sql

This is all good information, except that there can also be too much meaningless, undirected information. Interpreting detailed application debug logs is useful only when it is known what should be checked.

Taking the whitebox approach....

A lot of debug/trace information could be generated, but it was probably best to take the FNDWFBG execution apart in slow motion. Once the execution model has been understood, further brainstorming using a divide-and-conquer strategy becomes effective.

An alternate way of looking at FNDWFBG is that it is nothing but a serial simulator of steps outlined in the oracle workflow designer.

A whitebox (a software engineering term, as opposed to blackbox) approach was the need of the hour, coupled with some simple thinking. Using the workflow status monitor from the Workflow Administrator responsibility in Oracle Applications 11i, the step at which an OEOL item currently sits can be seen.

This may also be verified from the output of $FND_TOP/sql/wfstat.sql. Combining the observations from Query 2 above and drilling down via the Workflow Monitor, it was clear that almost all the lines having the performance issue were at the Fulfill-Deferred step, which means that if FNDWFBG were to pick them up, it would then execute the next activity in line, the Fulfill step.

Note: A quick workaround was to 1) Kill the FNDWFBG process 2) SUSPEND (this status avoids any processing by workflow) the problematic OEOL workflow via workflow monitor and 3) resubmit the FNDWFBG process.

Irreconcilable paradoxes of a Test case..

We really needed to analyze the drill down on the bad test case.

From online examination of the SQLs being executed by the FNDWFBG session, it was seen that a lot of time was being spent on XXONT Cascade Terms and Agreement, multiple times, for apparently processing one OEOL ITEM_KEY. This was a really paradoxical observation and did not make any sense.

The only acid test of this was to call OE_FULFILL_WF.START_FULFILLMENT for specific OEOL line id or ITEM_KEY. The script run_start_fulfillment_with_om_debug.sql (Listing in Appendix A) was used to simulate a single execution of Fulfill step, which in turn called OE_FULFILL_WF.START_FULFILLMENT and did a rollback at the end. One such case ran for ~6 hours and finally finished.

Most surprisingly, an examination of the raw trace file did show that subsequent steps like XXONT Cascade Terms and Agreement and Invoice Interface - Line had indeed been called multiple times. Therefore, it seemed that a single call to OE_FULFILL_WF.START_FULFILLMENT for one kind of OEOL was triggering sequential execution of the OM workflow for other OEOLs in the same order, which were somehow related to the order line in question.

$ grep -i xxont_cascade_pkg.set_ln_trm_agr_proc prod_ora_677526.trc 
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end; 

This simply did not make any sense, because Fulfill and XXONT Cascade Terms and Agreement were discrete steps, as is verifiable from the OM workflow diagram above.

Another very interesting and puzzling observation was that, without fail, the top two SQL calls in the TKPROF output (raw trace sorted with sort=prsela,exeela,fchela) showed that ~45% of the time was spent in the OE_FULFILL_WF.START_FULFILLMENT API and another ~45% in the call to the xxont_cascade_pkg.set_ln_trm_agr_proc API.

"tkprof output for 10046 of start_fulfillment API call"

Resolving the paradox...

The answer, though difficult to digest, does make sense. Order Management has unique processing requirements for configured items created using the Oracle Configurator module. Since OE_FULFILL_WF.START_FULFILLMENT is a workflow-aware API (notice the _WF in its name), it was obviously calling more Workflow routines internally to progress the business flow to the next logical set of activities.

We first need to understand the concept of a fulfillment set. A fulfillment set is a set of related order lines that represent a configurable item. When a fulfillment set or a model is ready to fulfill or to be built on the production floor, i.e., when all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status, the last member reaching this activity completes the FULFILL_LINE activities of all the other members before proceeding itself. Since the XXONT Cascade Terms and Agreement step comes right after the FULFILL_LINE step, it will be executed for all members of the current model or fulfillment set in that one go.

The relation between other lines whose OM workflow was being resumed and the problematic line in question could be found using the following sql:

Select top_model_line_id from oe_order_lines_all where line_id=<problematic_line>;

The following query then defines a FULFILLMENT SET (a set of lines which are related together and constitute an orderable item):

Select line_id from oe_order_lines_all where top_model_line_id=<top_model_line_id_from_above_query>;

The user first configures the components of a configurable product/item and then the item is actually built. Depending on the kind of configured model (Assemble to Order -- ATO, or Pick to Order -- PTO), the oe_order_lines_all.flow_status_code (order management workflow status) stays at either SHIPPED or BOOKED status, or becomes AWAITING_FULFILLMENT (internally blocked).

Here is an example of this intermediate stage of processing for different kind of configured Items (ATO/PTO/Stock):

Now, since a lot of time was being spent on the XXONT Cascade Terms and Agreement step, the code in the package XXONT_CASCADE_PKG.pkb was reviewed, and it could be seen that the OE_ORDER_PUB.PROCESS_ORDER() API was being called to sync up the terms and agreements between the order header and the line in question.

There were penalties for doing this: updating the payment term and agreement fields on the order line triggered repricing of the order lines. Also, for especially large orders (~4000 lines) with mostly mismatched payment terms and agreements, the repeated calls to OE_ORDER_PUB.PROCESS_ORDER() caused a huge overhead.

The reason the timings for OE_FULFILL_WF.START_FULFILLMENT() showed up so high in the trace files was that the FULFILL_LINE step synchronizes across all components in a configurable model or fulfillment set. The following output from the workflow query in Appendix B lists the time recorded by workflow for executing the XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE step:

Visualizing it in slow motion...

This means that whenever a fulfillment set or a model is ready to fulfill, i.e., all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status, the last member reaching this activity will complete the FULFILL_LINE activities of all other members before proceeding itself.

Since the cascading activity comes right after FULFILL_LINE, it will be executed for all members of the current model or fulfillment set in that one go. The aggregate time for completing all the other related OEOL workflows in the same fulfillment set goes into the timing recorded for the OE_FULFILL_WF.START_FULFILLMENT() step.

The following diagrams attempt to represent the dynamics more clearly:



Projecting the poor performance..

Based on the situation described above, a simple projection of the poor performance can be done for a particular kind of order. Let's assume that there are 100 lines in a configurable model/item. Also, let's assume that for 75% of the order lines, the payment terms and/or agreements do not match those on the header. For each mismatched line, the time taken for repricing varied from 150 seconds to 750 seconds.

When we calculate the cumulative time taken to execute the xxont_cascade_pkg.set_ln_trm_agr_proc API for the 75 mismatched lines out of 100, at ~5 mins per line, the overall time comes to 75 x 5 mins = 375 mins, i.e. over 6 hrs. That's the time for processing the OE_FULFILL_WF.START_FULFILLMENT() step for one model, and that was exactly the performance problem being experienced.

Now, just imagine the performance strain if the order had 10 such configurable items.

Mathematically speaking:
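In symbols (my notation, not the author's), the projection for one configured model amounts to roughly:

T_{model} \approx N_{lines} \times f_{mismatch} \times t_{line} = 100 \times 0.75 \times 5\ \text{min} = 375\ \text{min} \approx 6.25\ \text{hours}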

The Solution: Simplify by taking the Highest Common Factor...

There were two important dimensions to the solution:

First, there was a bug in the 11.5.9 Order Management module which prevented the cascading of order header changes to order lines from working properly (the profile option OM: Sales Order Form: Cascade Header Changes to Line should be Yes). This was what prompted the calls to the OE_ORDER_PUB.PROCESS_LINE() API in the first place.

Patch 5981436 was to be delivered by Oracle Order Management development to resolve this issue. This fix is also part of 5950291 - 11.5.9 OM CUMULATIVE PATCH. It would automatically match the payment terms and agreements from the order header to the order lines in the majority of cases, except if the user purposely chose to say No to the cascade option.

Second, another smart option was to batch-process all the mismatched component lines of a configured model in ONE call to the PROCESS_LINE() API by passing them as a PL/SQL table argument, and to do this ONLY if the order line in question was for a shippable item. This was especially beneficial since only 1-5% of sub-components in a configured model are shippable, which made it quite restrictive. This single change took out 95-97% of the repetitive work being done by the OE_ORDER_PUB.PROCESS_ORDER() calls.

       -- Calling API to update Terms and/or Agreement value(s) on the order line
       oe_order_pub.process_line (p_line_tbl      => l_line_tbl,
                                  x_line_out_tbl  => o_line_tbl,
                                  x_return_status => l_return_status,
                                  x_msg_count     => l_msg_count,
                                  x_msg_data      => l_msg_data);
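As a hedged illustration of the batching idea (the line ids, attribute values and driving cursor below are placeholders, not the client's actual code), the PL/SQL table might be populated along these lines before the single call above:

declare
   l_line_tbl       oe_order_pub.line_tbl_type;
   l_out_line_tbl   oe_order_pub.line_tbl_type;
   l_return_status  varchar2(1);
   l_msg_count      number;
   l_msg_data       varchar2(2000);
   i                pls_integer := 0;
begin
   -- one entry per mismatched, shippable line of the model (ids are placeholders)
   for rec in (select line_id from oe_order_lines_all
               where  top_model_line_id = 29905479)    -- hypothetical model line id
   loop
      i := i + 1;
      l_line_tbl(i)                 := oe_order_pub.g_miss_line_rec;
      l_line_tbl(i).line_id         := rec.line_id;
      l_line_tbl(i).operation       := oe_globals.g_opr_update;
      l_line_tbl(i).payment_term_id := 1234;            -- header's term (placeholder)
      l_line_tbl(i).agreement_id    := 5678;            -- header's agreement (placeholder)
   end loop;

   oe_order_pub.process_line(p_line_tbl      => l_line_tbl,
                             x_line_out_tbl  => l_out_line_tbl,
                             x_return_status => l_return_status,
                             x_msg_count     => l_msg_count,
                             x_msg_data      => l_msg_data);
end;
/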

Seeing the performance Boost...

Before putting in the code change:

For a fulfillment set with top_model_line_id=9905479, there were 59 mis-matched lines, and the avg time for processing one call of process_line() was ~ 580 seconds.

After putting in the code change:

The time taken to process the same 59 mis-matched lines (of the same fulfillment set, top_model_line_id=29905479) passed in one PL/SQL table argument to a single process_line() API call was ~625 seconds.

Putting statspack data into perspective...

This also explained why Statspack reports showed xxont_cascade_pkg.set_ln_trm_agr_proc as one of the major I/O offenders per execution:

Summary of learnings from the Case study..

  • Don't have expensive custom activities defined after the START_FULFILLMENT step in the OM workflow. It will amplify the performance hit many times over, especially when the Oracle Configurator and Order Management modules are being used in tandem. (Better still, don't have expensive custom workflow activities at all.)
  • Batch processing should be preferred over piecemeal processing, especially when an API has provision for it. This eliminates most of the repetitive processing
  • The whitebox (drill down) approach works for taking apart a baffling performance problem. Trying to simulate a test case usually leads to the heart of the performance issue in a more reliable way
  • Getting extensive debugs and traces is great, but only when it is known what is being looked for. Asking the right probing questions is very important. Question, question, question.
  • A well-thought-out plan with minimal tracing and a drill-down approach can bring better results than a shotgun or blunderbuss approach
  • Sometimes, a high level functional knowledge of the processing being done can be very useful in understanding the nature of problem. A balance between strict technical and pure functional knowledge can be fruitful towards solving performance issues

TRIVIA: Why was the customization put in the first place?

Cascading is the process in which values from the sales order header level automatically populate to the corresponding fields at the sales order line level.

Due to a bug in 11.5.9, Agreement field at Header level was not populating to the child lines of a configuration.

Due to this, the autoinvoicing program split lines into separate invoices by agreement. There were thousands of invoices where the components were being invoiced separately from the top model. It was an invoicing fiasco.

Appendix A:

sandbox:dev> more run_start_fulfillment_with_om_debug.sql

var resultout varchar2(1000)
set serveroutput on time on timing on

prompt setting userid, responsibility and application id for JHASLAGE FOR order management super user

define user_id=7962
define resp_id=52560
define appl_id=660

define item_key=29904309
define activity_id=141236


l_debug_file varchar2(400);
l_org_id NUMBER;
l_count NUMBER;
l_result VARCHAR2(30);
l_spid varchar2(20);



select p.spid
into l_spid
from v$process p, v$session s
where s.audsid=userenv('sessionid')
and paddr = addr;

(p_itemtype => 'OEOL'
,p_itemkey => '&&item_key'
,p_actid => 12345
,p_funcmode => 'SET_CTX'
,p_result => l_result


--enable OM debug
l_debug_file := OE_DEBUG_PUB.Set_Debug_Mode('FILE');

dbms_output.put_line('Debug file is : '|| l_debug_file);

--enable trace
EXECUTE IMMEDIATE ('alter session set max_dump_file_size=unlimited');
EXECUTE IMMEDIATE ('alter session set tracefile_identifier=''gverma_start_fulfillment_with_om_debug_&&item_key''');


--disable trace

--turn debug off
dbms_output.put_line('File name '||OE_DEBUG_PUB.G_DIR||'/'||OE_DEBUG_PUB.G_FILE);
dbms_output.put_line('Trace file generated for this request:'||'/*'||l_spid||'*.trc');



dbms_output.put_line('reached error in script ' ||sqlerrm);


select :resultout from dual

Appendix B:

col ITEM_KEY for a10 Head "Item Key"
col BEGIN_DATE for a18 head "Begin Date"
col END_DATE for a18 head "End Date"
col ROOT_ACTIVITY for a30 head "Root Activity"
col RESULT for a15
col STATUS for a8 Head Status
col time_taken_mins for 99999 head "Time(Mins)"
col step format a60

set pages 100 time on timing on lines 150

spool get_details_for_wf_step

select ias.item_key,
       to_char(ias.begin_date,'DD-MON-RR HH24:MI:SS') begin_date,
       to_char(ias.end_date,'DD-MON-RR HH24:MI:SS') end_date,
       (ias.end_date-ias.begin_date)*24*60 time_taken_mins,
       pa.process_name||'->'||pa.instance_label step,
       ias.activity_status Status,
       ias.activity_result_code Result
from   wf_item_activity_statuses ias,
       wf_process_activities pa
where  ias.item_type = 'OEOL'
and    ias.process_activity = pa.instance_id
and    ias.begin_date >= sysdate-14
and    pa.process_name||'->'||pa.instance_label = 'XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE'
and    ias.activity_status = 'COMPLETE'
-- and the step was not cancelled
and    ias.activity_result_code != '#FORCE'
-- it took more than so many mins to complete this step
and    (ias.end_date-ias.begin_date)*24*60 > 30
order by (ias.end_date-ias.begin_date)*24*60 desc, ias.begin_date;

spool off

A SQL Tuning Case Study: Could we K.I.S.S. Please?

Gaurav Verma - Sat, 2008-03-01 01:47

 Preface They say it pays to K.I.S.S. (Keep It Simple and Smart). The simplest ideas are the greatest ideas. This philosophy also applies to SQL coding, where we tell the Oracle SQL engine what to fetch and (indirectly) how to fetch it. There are great rewards for simple thinking.
Let's discuss this principle through a case study in which the user was trying to fetch very simple data through an overkill query involving RA_* views which were internally using some of the same tables. The SQL engine was therefore just trying to fetch and filter too much data from the same sources. It was certainly not efficient.
In this case study, we see how it's important to understand the "why" of a query rather than the "how". Many times, just rethinking the whole approach pays great dividends. The Poorly Performing Query The client had a custom concurrent program which had this query as the worst offender: it was taking 90% of the time, and the program used to take ~1 hour to finish.
The formatted query is listed below:
Let's look at the TKPROF stats and the runtime SQL plan used, along with row counts.

It's interesting to observe that while only ~3000 rows are returned as the output of this SQL, a huge number of rows were fetched from one of the HZ* tables.
Let's check the underlying tables used by the views RA_ADDRESSES_ALL and RA_CUSTOMERS. RA_ADDRESSES_ALL was a synonym for RA_ADDRESSES_MORG, while RA_CUSTOMERS was a synonym for RA_HCUSTOMERS.
  1  select referenced_name, referenced_type
  2  from dba_dependencies
  3* where name='RA_HCUSTOMERS'
SQL> /
----------------------- -----------------
HZ_PARTIES              TABLE
SQL> l
  1  select referenced_name, referenced_type from dba_dependencies
  2  where name='RA_ADDRESSES_MORG'
  3* and referenced_type='TABLE'
SQL> /
------------------------------ -----------------
HZ_PARTY_SITES                 TABLE
HZ_LOCATIONS                   TABLE
Let's also check the data volume involved in the base HZ* tables being referenced, and the estimation sample percentage (~70%) used by the Gather Statistics program:
Do we need this complexity?
This is a good time for a reality check: this is certainly overkill.
Why would you want to join complex views with 3-4 base tables referenced inside each view? Not an intelligent approach, shall we say. Better, why wouldn't one want to query the same data by understanding the data-model relationships between the base tables involved, and see if the same results can be achieved through them?
Spending time checking the data volume involved, the estimation percentage used by Gather Stats, index cardinality etc. would not lead to anything conclusive or to any direct inference in this case.
Another good learning from the http://etrm.oracle.com website (requires a Metalink login; Note 150230.1) regarding the RA_* views is that they should never be used directly in ANY customization! All the more reason to eliminate the views from the query altogether and find an alternate way of getting the same result.

The Better Way...
Now, only one column was really being queried. In spite of this, after understanding the object-model relationship diagrams for the HZ* tables, at least five HZ* tables needed to be queried for a supplied customer_number value.
What was the SQL trying to achieve?
For a particular customer_number, it wanted to find the last_update_date of an ACTIVE site which was being used as an ACTIVE BILL_TO location, with an address of ICC LEASE in its information.
Let's look at the index information:
select column_position, column_name
from dba_ind_columns
where index_name='&1'
Enter value for 1: HZ_LOCATIONS_N1
--------------- ------------------------------
              1 ADDRESS1
SQL> /
Enter value for 1: HZ_PARTY_SITES_U1
--------------- ------------------------------
              1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_PARTY_SITES_N2
--------------- ------------------------------
              1 LOCATION_ID
SQL> /
Enter value for 1: HZ_CUST_ACCT_SITES_N1
--------------- ------------------------------
              1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_CUST_SITE_USES_N1
--------------- ------------------------------
              1 CUST_ACCT_SITE_ID
              2 SITE_USE_CODE
              3 PRIMARY_FLAG
The same SQL query could be rewritten to fetch the same data using the base HZ_* tables, after connecting the foreign key relationships.
Such an approach had many advantages:

  • The RA_* views are avoided, as they are not supported outside standard 11i products; also, it's highly inefficient to access 7 tables/views to get 1 column
  • The new query fetches less than half the number of consistent-read buffers compared to the previous query (verified using the SET AUTOTRACE ON SQL*Plus command)
  • The new query has just 14 access steps as opposed to 21 access steps for the previous untuned query
  • The new query uses the unique primary key indexes of three of the five tables and foreign key indexes for the remaining two, which is a really selective way of querying data
It simply cannot get better than that!
The performance improvement was very visible as the improved query ran within 2 minutes. Compare that with its previous cousin which used to run for ~1 hour.
select SUBSTR(hca.account_number,1,8) BILL_CUST_ID
from
   hz_cust_accounts       hca,
   hz_cust_acct_sites_all hcas,
   hz_cust_site_uses_all  hcsu,
   hz_party_sites         hps,
   hz_locations           loc
where
-- Joining all the tables through primary keys
-- and foreign key relationships first
      hca.account_number = :B1
  and hca.cust_account_id = hcas.cust_account_id
  and hcas.party_site_id = hps.party_site_id
  and hca.party_id = hps.party_id
  and hps.location_id = loc.location_id
  and hcas.cust_acct_site_id = hcsu.cust_acct_site_id
-- the actual data query criteria starts here
  AND hcas.STATUS = 'A'
  AND hcsu.STATUS = 'A'
-- Could perhaps add this criteria too ???
-- AND hcas.bill_to_flag = 'Y'

 Summary It pays to ask the "why" questions sometimes, rather than focusing on the "how" and "what" and going on a wild goose chase. The "why" questions can get you to the answer more quickly by making you rethink the strategy.
Einstein himself said, "everything should be made as simple as possible, but no simpler."
The defence rests.
Related Links
Other Performance Related Articles

An Application Tuning Case Study: The Deadly Deadlock

Gaurav Verma - Sat, 2008-03-01 01:36


Let's face it: deadlock and latching issues have always been hard to debug. Many times, we don't focus enough on application tuning, even though it is always advisable to look at the application design and user access trends first. These things don't come to mind too readily.
Many times, we do face situations where users are deadlocked, sitting and watching blue screens, or the application is freezing up. While there are many reasons for this to happen (poor OS implementation with swapping/disk/memory issues, RDBMS bugs, RDBMS initialization parameter values, having a lot of non-sharable SQL), a good share of cases is also caused by poor application coding done without understanding the sequence in which resources should be accessed or acted upon (updated/deleted etc.).
This is one such case study, in which a deadlock was traced back to deadlock-unaware code, and the solution was to enhance the code so that it would not cause deadlocks in the future. While the change was really simple, picturing the deadlock in action was really difficult.
This article also goes into the mechanics of reconstructing the deadlock situation in slow motion, which makes it an instructive case study to read through for future reference. Getting the Bad Press...
The client had a big OLTP implementation of Order Management (the Order to Cash business flow) in the 11i E-Business Suite, and their bread and butter was taking orders for their configurable products.
Many times during any business day, the customer service representatives (~200 people) would get blue screens or get hung up while trying to work on orders. Frantic helpdesk calls from angry customer service reps were normal, and obviously this was under the management's scanner since it was highly visible.
Customer service reps would keep making small talk with their clients, who had called in to order chairs etc., and eventually the clients would give up on the wait. As a fallout of this, the client was irate (they were losing their customers) and frustrated with the Order Management implementation.
Everybody in customer service would say: Why is Oracle so slow? Scratching the surface...
This continued for months, and it was normal for several myths about bad Oracle performance to float around. Worse, no one had any idea what was happening: someone blamed workflow, someone blamed the network, and so on.
The customer had done a lot of customizations. Preliminary investigations revealed that often there was a deadlock detected between a concurrent program session and a forms user session.
As a corollary, the custom "Order Acknowledgment" program would sometimes hang inordinately for more than half an hour, whereas it would otherwise "normally" finish in 1-2 mins.
Analyzing wait events (using v$session_wait) and 10046 event session tracing didn't lead anywhere. The pattern was not predictable either. It was really puzzling behaviour. Understanding the deadlock in slow motion...
The "XX Sales Order Acknowledgement XML publishing" program used to run every minute. There had to be some correlation between the program's code (its table/object access and locking pattern) and the activity being done on the Order Entry forms.
After analyzing the code and seeing how the underlying tables were being accessed and updated, it became clear that there was a problem in the "XX Sales Order Acknowledgement XML publishing" code.
The complete picture of the deadlock pattern was now available. Let's see it in slow motion. Deadlock Situation 1:
From the entire set of qualifying order header records to be processed for order acknowledgements, a batch of 25 orders was updated at a time. After the update of the 25 orders, a commit was issued, which then freed the locks.
Now, out of these 25 records, if the "XX Sales Order Acknowledgement XML publishing" concurrent program (say A) was trying to update the first order, and there was a customer rep (say B) working on the very same order in the OE screen, "A" HAD TO WAIT till B freed the record by either a commit or a rollback in the Order Entry form.
UNTIL B freed the record, "A" was not able to proceed with the remaining 24 orders in the batch and kept waiting for the row-level exclusive lock.
It eventually either timed out or acquired the lock on the row. Similarly, "A" was not able to proceed till "B" was done with the updates on the order header in the OEXOETEL (Order Entry) form.
The same applied for any other order which A had to update. Thus arose the situation where a single run of the "XX Sales Order Acknowledgement XML publishing" concurrent program would take 20 mins or so. During this time, one would witness high Order Entry updating activity at the customer support site.
Now, let's reverse the situation. Deadlock Situation 2:
At any time, there may be, say, 20 customer reps (say "B") working on 25 or 30 orders. Therefore, in theory, around 20-25 order entry header rows may be locked in exclusive mode.
Now, every minute, the "XX Sales Order Acknowledgement XML publishing" concurrent program was trying to update these very order entry header rows and was having a hard time waiting for the row-level exclusive locks to be released.
When "B" saved its changes, the commit released the row-level exclusive locks; only then was "A" able to proceed with its job. The real-time deadlocks seen during the day were caused by a mix of deadlock situations 1 and 2.
Both the above situations were happening at the same time, vying for the same resource (the same row in the order entry headers table) to become free.
This was a vicious, endless cycle, and it caused locking grief for the people working on Order Entry. Breaking the cycle
Once the nature of the application design and the sequence of accessing resources were understood, it was clear that the vicious cycle had to be broken.
It was best to make the custom program more intelligent by first checking whether the order header record was already locked (which can be done using the SELECT ... FOR UPDATE NOWAIT clause).
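A minimal sketch of that check, assuming hypothetical variable names (the real program's structure is not reproduced here):

declare
   l_header_id  oe_order_headers_all.header_id%type := 12345;  -- placeholder
   l_dummy      number;
   row_locked   exception;
   pragma exception_init(row_locked, -54);  -- ORA-00054: resource busy, NOWAIT specified
begin
   -- try to lock the order header without waiting
   select 1
     into l_dummy
     from oe_order_headers_all
    where header_id = l_header_id
      for update nowait;

   -- lock acquired: safe to update this header and continue with the batch
   null;  -- (acknowledgement processing would go here)
exception
   when row_locked then
      -- a forms user holds the row: skip it, it will be picked up on the next run
      null;
end;
/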
Any record that was already locked was skipped for update. The logic was that the skipped order header record would be picked up for processing by the custom program the next time around, AFTER the customer service rep had finally committed or rolled back on the Order Entry form (the program was running every minute, remember). Summary
After analyzing the table/object access and locking pattern and understanding the code, it was clear that the problem lay with the custom application design. Essentially, what was happening was that the custom order acknowledgment program used to update a batch of 25 order header records sequentially, and only then issue a commit.
The update statement in the culprit concurrent program never checked whether the order header record was already locked by another session, and hence, per the default behavior, the update statement waited for the lock to be released -- which can be a long time unless the NOWAIT clause is specified.
This could happen for any of the 25 order header records it tried to update. When a customer service rep wanted to update the same order that was being processed by the order acknowledgment program, the person would hang, since the form session couldn't acquire a lock on the order header record until a commit was issued by the order acknowledgment program.
On the other hand, when a customer service rep updated an order through the Order Entry screen, they would lock up the order for a good 10-20 mins or so (they kept doing stuff to it while talking to the customer on the phone), and the custom order acknowledgment program would keep waiting for that exclusive lock on the order header record.
Again, this could happen for any of those 25 records in the batch, as per the design. So the deadlock was happening both ways.
And it was really subtle. No amount of tracing, latch analysis or Metalink research would have brought the troubleshooter to this stage of understanding the real issue.
Related Links
Other Performance Related Articles

Yet Another Case Study: The over-commit trap!

Gaurav Verma - Sat, 2008-03-01 01:33

Introduction It's always rewarding to talk about lesser-known performance tuning scenarios, which are not all that well appreciated in the DBA or developer world. Ask anyone what performance tuning they have worked on and, quite likely, you may hear: ah well, I made sure the stats were freshly gathered, the right index was being used, or I used a hint to force the right index. How often do we look at other aspects, like the frequency of commits? An atypical performance problem scenario At a particular client site, a custom concurrent program had a history of taking more than 8-10 hrs to complete.
The application load was a mixture of high OLTP during regular business-day hours and more-than-moderate batch processing during the evening to early-morning hours. Needless to say, a lot of long-running batch jobs were contending for shared resources like rollback and temp segments.
A lot of diagnosis was done as to why the custom concurrent program took so long. Quite often, the program would terminate with the notorious ORA-1555 (snapshot too old) error, even though the data volume involved was quite small (7000 records) and the rollback segment space was abundant -- 120 GB.
Even setting the undo retention initialization parameter undo_retention as high as 8 hours did not help iron out the performance issue. Examining the possibilities... There could have been several reasons why the performance lag was manifesting:

  • The query had a bad plan and the compile-time plan was different from the runtime plan. This happens many times, but it didn't seem to be the case here
  • A hint to use the unique index (instead of index range scans) to avoid the db file sequential read wait event seemed to make no difference
  • The level of DML activity against the tables within the process could simply be too much for the batch load to complete successfully in a suitable time window. This theory did not sound right since the rollback retention period was ~8 hours
  • The program was not committing or rolling back often enough to stay within the rollback retention period?
Foxy observations.. While SQL tracing (event 10046) identified that a simple select on a custom table with not much data was having a hard time, the corresponding SQL plan was using a unique index (!!). What more could one ask for? The situation was almost weird.
At the same time, PL/SQL profiling (dbms_profiler) identified that the FETCH phase of that very simple select was taking long. The HTML output generated from the PL/SQL profiler runs, using the profiler.sql script of Metalink Note 243755.1, was especially helpful in getting to this stage of analysis and determining where the maximum delay in the entire process was. Getting to the bottom of it Again, one more thing identified in the tkprof'ed output of the raw SQL trace, which could easily be missed, was that the CR (consistent read) blocks read for the fetch phase of the cursor were high. The SQL engine had to perform a huge number of block reads to build a read-consistent image for the FETCH phase. This aspect is not readily discernible, since the regular thought process is to look at elapsed time and the SQL plan.
What could be the reason for such a high number of block reads to form a read-consistent image? Too few commits/rollbacks, or too many? Overdoing anything is usually not a rewarding experience. Looking closer into the code... On re-examining the PL/SQL profiler HTML output, it was seen that the program logic was executing a commit statement within a loop, for every input record processed.
This would obviously overtax the rollback segments, forcing the user's server process to reconstruct a read-consistent image of the corresponding transaction data at every commit.
Taking a cue from this observation, the commits were done in a batched fashion at a threshold, and the entire load then ran in 4 hours (input volume: 7000 lines).
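A hedged sketch of the batched-commit pattern, assuming a hypothetical driving cursor and a commit threshold of 500 rows (the actual threshold used at the client is not stated):

declare
   l_threshold constant pls_integer := 500;   -- hypothetical batch size
   l_counter            pls_integer := 0;
begin
   for rec in (select line_id from xx_custom_interface_lines)  -- hypothetical driving cursor
   loop
      -- process one input record (placeholder for the real logic)
      null;

      l_counter := l_counter + 1;
      if l_counter >= l_threshold then
         commit;                -- commit once per batch instead of once per record
         l_counter := 0;
      end if;
   end loop;
   commit;                      -- pick up the final partial batch
end;
/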

Conclusion
Neither is it good to commit too infrequently, nor is it good to over-commit. Moderation is the key to a happy medium, and at times it has a bearing on performance problems as well.
Who said software cannot learn from Philosophy?

A tuning case study: The goofy optimizer (9i.x RDBMS )

Gaurav Verma - Sat, 2008-03-01 01:23

Preface They say: tune the application first. And then, many times, the CBO behaviour is not predictable or favourable, in spite of its best intentions, its internal heuristic rules, or the availability of up-to-date statistics with a good sampling percentage.
This is one such example, showing how hints had to be used as one of the last options to improve the performance of a concurrent program.
Ideally speaking, the CBO chooses what is usually either the best or one of the best available data access plans for executing the SQL and fetching its output. It should take into consideration all the underlying objects' statistics and use the least costly data access plan. Sometimes, that doesn't happen out of the box.
Note: Some seeded 11i Oracle Applications modules also depend heavily on hints to make the performance of their code acceptable. One such heavy user is the QP (Advanced Pricing) module. The Problem Scenario... At a client site, a shipping-related custom concurrent program started taking ~5 hrs to complete on average, rather than the usual ~3 hrs. Preliminary investigation using RDBMS trace (event 10046) and the PL/SQL dbms_profiler showed that the maximum time (about ~4.5 hrs) was taken up by the following innocent-looking SQL itself!!
         ORDER_SHIP_TOT ,
       DECODE (H.ORG_ID, 362, 'US', 365, 'CAD',
                         5624, 'MEX', H.ORG_ID) OPERATING_UNIT ,
AND OS.NAME <> 'Satcom Closed'

The top time taking code section from pl/sql DBMS profiler data. As can be seen, the maximum time is spent in the Fetch of the cursor executing this query:
Evaluating options

The first thing that comes to mind is: can we play with different sampling estimates for the statistics, e.g. 10%, 50%, 70%? While the idea is appealing, it can destabilize many other SQL queries, and it is an uphill task to judge the impact of such a change on the rest of the application's performance. A heavy price to pay for tuning one lone SQL.
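For reference, re-sampling a single table's statistics at a different percentage would look something like the sketch below. The owner and table names are illustrative, and on an 11i system the supported route is FND_STATS / Gather Schema Statistics rather than calling DBMS_STATS directly.

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'ONT',                 -- illustrative owner
    tabname          => 'OE_ORDER_LINES_ALL',  -- illustrative table
    estimate_percent => 50,                    -- the sampling percentage being "played with"
    cascade          => TRUE);                 -- refresh the index statistics as well
END;
/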
Besides, the statistics on the objects were recent enough (stats were gathered weekly, as usual), and there had been no big change in the data volumes.

Getting into the optimizer's thought process

From the SQL plan being used, it can readily be seen that the optimizer acted greedily, like this: it saw several small HZ* tables that could easily fit into PGA/SGA memory and thought -- cool, I can do a quick hash join of these folks (that would also take care of the outer-join predicate evaluation). After that, it proceeded to retrofit the remaining tables into the plan it was developing.
A fine approach. The only problem is that the resulting index range scan on the header_id column, using the non-unique index on the oe_order_lines_all table, simply kills the performance of the fetch phase. It is, quite simply, very inefficient:
19101503 INDEX RANGE SCAN OE_ORDER_LINES_N1 (object id 334899)
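As an aside, a quick hedged way to pull such a plan on 9iR2 and later is EXPLAIN PLAN plus DBMS_XPLAN; the statement below is just a stand-in for the real query and assumes the standard PLAN_TABLE exists.

EXPLAIN PLAN FOR
  SELECT COUNT(*) FROM apps.oe_order_lines_all;   -- stand-in for the problem query

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);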
A much better approach would have been to first filter the header_id values from the OE_ORDER_HEADERS_ALL table, so that the query would use the unique primary key index of OE_ORDER_HEADERS_ALL and then join to the OE_ORDER_LINES_ALL table using a nested loop operation.
This by itself would filter off a huge chunk of the data (which is mostly concentrated in the OE_ORDER_LINES_ALL table). The remaining HZ* and miscellaneous data could easily be plugged in via hash joins. The idea was to drive the query from OE_ORDER_HEADERS_ALL so as to eliminate the largest chunk of unneeded data first.
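As an illustration of that driving-table idea only (the actual change below relies on index hints instead), join-order hints could express it directly; the aliases and predicates here simply mirror the query that follows.

SELECT /*+ leading(h) use_nl(l) */
       COUNT(*)
  FROM ont.oe_order_headers_all  h
      ,apps.oe_order_lines_all   l
 WHERE h.header_id   = l.header_id
   AND h.booked_flag = 'Y'
   AND h.flow_status_code NOT IN ('CANCELLED','ENTERED');

With the headers filtered first, the lines index on header_id is probed only for the surviving headers instead of being range-scanned wholesale.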

Also, the predicate AND h.booked_date IS NOT NULL was replaced with AND h.booked_flag = 'Y'. Why use a NOT NULL clause when a flag is available in OE_ORDER_HEADERS_ALL?

Also, a custom index xxoe_order_lines_n11 was found to exist on the oe_order_lines_all.actual_shipment_date column, and we thought we would make use of it too. The more the merrier.

So, after grouping the related predicates together, formatting the query and putting these ideas into practice, the original query was changed as follows:
SELECT /*+ index (h,oe_order_headers_u1)
           index (l,xxoe_order_lines_n11) */
      hp.category_code  category_code
     ,hc.sales_channel_code sales_channel_code
     ,hc.customer_class_code customer_class_code
     ,SUM(l.shipped_quantity * l.unit_selling_price) order_ship_tot
     ,l.actual_shipment_date ship_date
     ,DECODE (h.org_id,
        362,  'US',
        365,  'CAD',
        5624, 'MEX',
        h.org_id)  operating_unit
     ,tt.NAME  order_type
  FROM   ont.oe_order_headers_all  h
        ,apps.oe_order_lines_all   l
        ,apps.hz_parties           hp
     ,apps.hz_cust_accounts        hc
     ,apps.hz_cust_site_uses_all   hcsu
     ,apps.hz_cust_acct_sites_all  hcas
     ,apps.oe_order_sources        os
     ,apps.oe_transaction_types_tl tt
     ,xxiw.xxiw_bs_om_order_types  xot
     WHERE h.header_id                    = l.header_id
      AND h.ORDER_SOURCE_ID              = os.ORDER_SOURCE_ID
      AND h.ORDER_TYPE_ID                = tt.TRANSACTION_TYPE_ID
      AND h.flow_status_code not in ('CANCELLED','ENTERED')
      -- AND h.booked_date IS NOT NULL
      -- Replaced with the following condition:
      AND h.booked_flag = 'Y'
      AND tt.LANGUAGE                    = 'US'
      AND tt.NAME                 not like 'INTERNAL%'
      AND l.flow_status_code             = 'CLOSED'
      AND l.LINE_CATEGORY_CODE          <>'RETURN'
      AND h.invoice_to_org_id             = hcsu.site_use_id (+)
      AND hcsu.cust_acct_site_id = hcas.cust_acct_site_id(+)
      AND hcas.cust_account_id           = hc.cust_account_id(+)
      AND hc.party_id                    = hp.party_id(+)
      AND UPPER(hc.customer_class_code) <> 'INTERNAL'
      AND UPPER(hc.sales_channel_code)  <> 'INTERNAL'
      and tt.name                        = xot.order_type
      and xot.type                       = 'ALL'
      AND os.name                       <> 'Satcom Closed'

The Improvement

The same query now executed in under 5 minutes!!! From ~4.5 hrs down to ~5 mins. That's a quantum leap.
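A quick, hedged way to confirm this kind of before/after difference from SQL*Plus is to compare timings and logical I/O for the two versions of the query.

SET TIMING ON
SET AUTOTRACE TRACEONLY STATISTICS   -- report consistent gets / physical reads without spooling rows
-- run the original query, then the hinted query, and compare the two sets of statistics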

The improved plan confirmed that the hinted indexes were now being used to drive the query.


While it may not be ideal, at times it is necessary to supplement or coach the CBO to achieve acceptable levels of application performance. The effect is likely to be longer lasting if the "relative" data volume pattern of the underlying objects is not expected to change drastically in the system.

