Feed aggregator

Upgrade from Jinitiator 1.3 to Java Plugin 1.6.0.x

Aviad Elbaz - Fri, 2008-03-07 05:51

Oracle recently announced the end of Error Correction Support for Jinitiator 1.3 for E-Business Suite 11i, effective July 2009.

This is the sign it’s about time to upgrade to the native Java Plug-in… :-)

Among other things, one of the main advantages of upgrading from Jinitiator to the native Java Plug-in is the prevention of conflicts between them.

This upgrade is great news for all who are working with Oracle Discoverer Plus (with the Java Plug-in) and Oracle EBS 11i (with Jinitiator) and experiencing those conflicts.

I'll skip all the other advantages and disadvantages of upgrading to the Java Plug-in as they are well described in Steven Chan's post - Jinitiator 1.1.8 To Be Desupported for Apps 11i and in Metalink Note: 290807.1 - Upgrading Sun JRE with Oracle Applications 11i.

So I will focus on the upgrade process itself - step by step.

I tested the upgrade on the following test environment:

  • EBS 11.5.10.2
  • Database 10.2.0.3
  • ATG Rollup 5
  • Developer 6i patchset 18
  • OS RHEL4.

Be aware that before upgrading to Java Plug-in you must upgrade to Developer 6i patchset 18 or later (currently the latest patchset is 19).

* You can use my previous post, Upgrading Developer 6i with Oracle Apps 11i to patchset 18, as a guide.

  1. Download JRE plug-in Oracle E-Business Suite interoperability patch - 6863618
  2. Download the Sun JRE Plug-in 
    • Select Java Runtime Environment (JRE) 6 Update X (select the latest available update, currently it’s 5)
    • Select Windows offline installation, multi-language
  3. Rename the downloaded installation file jre-6_uX-windows-i586-p.exe to j2se1600x.exe
    In my case, rename jre-6_u5-windows-i586-p.exe to j2se16005.exe
  4. Copy the j2se16005.exe file to $COMMON_TOP/util/jinitiator on the Apps Tier node
  5. If you are on Developer 6i patchset 18 you should apply forms patches 6195758 & 5884875.
    ** Skip this step if you are on Developer 6i patchset 19.
    • Download Patches 6195758 & 5884875
    • Apply patch 6195758
      • Stop all applications processes by adstpall.sh
      • Unzip p6195758_60827_GENERIC.zip
      • cd 6195758
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/UICommon.class $ORACLE_HOME/forms60/java/oracle/forms/handler/UICommon.class.PRE_BUG6195758
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/ComponentItem.class $ORACLE_HOME/forms60/java/oracle/forms/handler/ComponentItem.class.PRE_BUG6195758
      • cp oracle/forms/handler/UICommon.class $ORACLE_HOME/forms60/java/oracle/forms/handler/UICommon.class
      • cp oracle/forms/handler/ComponentItem.class $ORACLE_HOME/forms60/java/oracle/forms/handler/ComponentItem.class
    • Apply Patch 5884875
      • Unzip p5884875_60827_GENERIC.zip
      • cd 5884875
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/engine/Main.class $ORACLE_HOME/forms60/java/oracle/forms/engine/Main.class.PRE_BUG5884875
      • cp -r $ORACLE_HOME/forms60/java/oracle/forms/handler/AlertDialog.class $ORACLE_HOME/forms60/java/oracle/forms/handler/AlertDialog.class.PRE_BUG5884875
      • cp oracle/forms/engine/Main.class $ORACLE_HOME/forms60/java/oracle/forms/engine/Main.class
      • cp oracle/forms/handler/AlertDialog.class $ORACLE_HOME/forms60/java/oracle/forms/handler/AlertDialog.class
      • Run adadmin -> Generate Applications Files menu -> Generate product JAR files
  6. Apply the Interoperability patch 6863618
    • Make sure all applications processes are down
    • Enable maintenance mode:
      Execute adadmin -> Change Maintenance Mode (5) -> Enable Maintenance Mode (1)
    • Unzip p6863618_11i_GENERIC.zip
    • cd 6863618
    • Run adpatch to apply patch driver u6863618.drv
    • cd [PATCH_NUMBER]/fnd/bin
    • Execute the following command, where X represents the update number:
      $ txkSetPlugin.sh 1600X
      In my case:
      $ txkSetPlugin.sh 16005
    • Disable maintenance mode:
      Execute adadmin -> Change Maintenance Mode (5) -> Disable Maintenance Mode (2)
  7. Start all applications processes by adstrtall.sh
  8. Verify the installation by signing in to Oracle EBS and selecting a forms-based responsibility.

For those who worry about performance, take a look at this benchmark:
Benchmark comparison test with Jinitiator 1.3.1.23 and Java Plug-in 1.5.0_05 – performance whitepaper

For more information - Note: 290807.1 - Upgrading Sun JRE with Oracle Applications 11i

Aviad

Categories: APPS Blogs

Do-Gooderitis

Mary Ann Davidson - Thu, 2008-03-06 09:26

You know there are too many labor-saving devices in the world when you see the sheer number of professional do-gooders trying to solve problems hardly anybody else worries about. If you have a day job, having someone with too much free time tell you why you need to be concerned about "Making the World Better Through FOO" is often just about as irritating as those old TV commercials moaning about ugly yellow wax buildup on your kitchen floors (my solution: paint your kitchen walls yellow to match the floor).

 

There are, of course, many people who devote time and passion to making the world a better place. I'm not talking about them here. I am talking about the people who seize on something they care about without bothering to find out if there is an actual problem that needs to be solved. Or, if there is a "problem," asking what the cost is of fixing it and what one could do with those same resources that might solve a more pressing problem (a concept known as "opportunity cost" to economists). It's all you can do, when confronted with an earnest but clueless do-gooder, not to say, "Ask me if I care."

 

Where I live in Idaho, there are a couple of professional Do-Gooder Projects that engender a lot of whining in the local newspapers. One of them is the Relocate the Airport lobby. The claim is that we need to 1) build a new airport 2) with longer landing strips 3) so that larger commercial planes will fly here. (Never mind the fact that commercial airlines have said repeatedly they will not land larger planes here because there isn't enough demand to support it.) There isn't actually a problem the community needs to solve via a new airport, but we'd create a bunch of new problems, like people having to drive an hour or more to get to Sun Valley instead of the current half hour from Friedman Memorial Airport.

 

The other local Do-Gooder Project relates to "affordable housing." Mind you, there is no actual housing shortage in town: if you want to work here, you can easily find an affordable place to rent. Many people who work here who want to own property live in another county - where they can get a lot more land for a lot less money. The idea that anyone who works here - regardless of income - should be entitled to own a free-standing home isn't reasonable given market (and geographic) realities (e.g., the land around us is Bureau of Land Management land and cannot be developed). As one of my friends put it to a local Affordable Housing Do-Gooder: "You didn't live next door to your gardener in Marin, either."

 

My personal opinion is that a lot of these do-gooders retired early, miss running something and want to run everyone else in town by solving problems that don't exist.

 

There are Do-Gooder Initiatives in the IT industry, too, a number of which are in security.  Security Do-Gooder Initiatives sometimes come under the guise of a laundry list of 5,000 things that everyone should do to be more secure. Even if all 5,000 of those things are somewhat useful, just like New Year's Resolutions, they are likely to be more actionable and "accomplishable" if the list is shorter. Putting it differently, I know very well that I should eat less, exercise more, eat more nutritious food, read better books, improve my skate skiing technique by lengthening my glide and so on. I can't actually process 5,000 "should dos" so I try to parse them down to a smaller list of things that I can actually do that will also make the most difference to my health, my skate skiing, or whatever it is I am trying to improve upon. Many Do-Gooder Initiatives do not have any sense of "nobody can do everything all at once, so maybe doing something now and doing more later is a better way to slice the pie." The initiatives fail due to the expectations - and failure to prioritize - that they entail. You might actually just give up from the frustration of trying to comply with 5,000 "shoulds." 

 

(It turns out that the people who actually do make good on their New Year's Resolutions start with a small, actionable list instead of a 30-page life plan. A small list of things you can do and will do is better than a much larger list of things that you are never going to get to. Less really is more.)

 

The reality is that some things matter more than others if you are trying to make constructive change. If I drink a bottle of wine a night (I don't) and have 40 "better health things" I want to do, saving my liver might be among the most important ones. So maybe, trying to cut down to a glass or so a night would be the biggest payoff on my better health things list and I can skip the other 39 items or relegate them to next year. Unfortunately, there are a lot of Do-Gooder Initiatives that not only have too many things on the list; the list is not weighted at all for where the value is in making change. (Opportunity cost again: what could I do with the same resources that would have a bigger payoff?)

 

I wonder if a lot of Do-Gooders get out enough in the real world. Maybe they are academics who think "theory" is enough. ("Theory" of baking doesn't get you a pie.) Or think-tankers who are paid to develop secure Internet toaster protocols that they then want to standardize. (Does anybody really worry about who is accessing their bagels remotely?)

 

Whenever I participate in public-private partnerships where a lot of "improve security" initiatives are generated and where there is typically a broad tent of participants (a good thing, in general), I try to ask that the people putting the laundry lists together grab someone who is either a cost accountant or an economist to look at where the bang for the buck goes in what's being proposed. Because if they do not do that, these initiatives are doomed to fail. Or, they will be so expensive that nobody does them because they can't afford the entire megillah.

 

The one take-away lesson I got from my nerdy quantitative methods class in business school is that when you are trying to solve an optimization problem, you can't optimize on all parameters. Time is constrained. Resources are (ultimately) constrained. Answering the question, "How can I do X while making best use of scarce resources?" means I need to take account of what I most want to accomplish and how valuable it is to me that I accomplish those things.

 

For example, there are security initiatives around "what metrics and artifacts at every stage of development you should produce to 'prove' assurance claims."  People measuring the assurance of software believe that there are things you ought to be able to produce and measure at each stage of development. However, there is a cost to producing metrics and artifacts. If the cost of producing these is greater than the value of more information, you shouldn't put in the work to produce them. Even if everything has some value, some things are more critical than others or provide greater value for the work you put into getting them. One of the ways I tranche our metrics project is to look at a) what can we data mine today to give us security metrics? b) what else would we like to know (in some order)? c) what will it cost to get that information? and d) is the cost less than or greater than the benefit of the information?

 

If you are a small company, maybe you can't - in the beginning - do every single Best Practice Recommendation (or produce every single metric or every single artifact that anybody in a theoretically perfect world would want). But you can do something, and you'd be willing to do something if someone helped you by telling you what the most important things are to do first that make the biggest impact. Something is almost always better than nothing.

 

Even people who know they ought to do more in security - and are willing to improve - will fight tooth and nail if they are confronted with a "my way or the highway" mandate that takes little account of real world constraints.

 

For example, consider the Federal Desktop Core Configuration (FDCC), a recent initiative to mandate that US Federal agencies lock down their environments to a specific Windows configuration (which, as a matter of course, means packaged applications will need to run on those locked down Windows configurations). I have said often and publicly that I think one of the easiest things vendors can do to help improve security is to lock down default configurations - better security out-of-the-box, cheaper lifecycle cost for customers. I've also said that one of the things customers can do to be "smart buyers" is to insist that their vendors lock down default configurations: "You don't ask; you don't get." I don't have any issue with the goodness of this concept (and we have a company-wide initiative related to locking down default configurations). In that sense, FDCC is not a "Do-Gooder Initiative" the way I've defined it since it actually does address a problem that people worry about, that needs looking after.

 

The problem with the way FDCC has been mandated is that it did not, first of all, define what a "desktop" configuration is. Is it desktop software? Or anything installed on the Microsoft operating system (which can and is used on desktops)? There might be a huge (and legitimate) difference between the configuration of middleware or servers on Windows and the client piece of an application configured on Windows. There's certainly a big scope difference between "validating how client pieces of applications running on desktops are configured to run with FDCC" and "validating how every single component of every application that runs on Windows is configured with FDCC." What problem, exactly, is it that is being solved? "Desktops used to launch attacks?" or "locking down the Windows operating system for every single application running on it?" Nobody knows, especially since this is called a "desktop" configuration initiative, and nobody on the mandate side of this issue has yet answered that basic question.

 

Most vendors have product lifecycles such that they do not make configuration changes in anything other than a major product release. That is, when customers uptake patch sets, their expectation is that there won't be configuration changes that could break their existing applications. One time in almost 20 years at Oracle, I tried to change a configuration parameter in a patch set (for good security reasons). The configuration change broke all our business applications, so we backed it out before the patch set shipped and I've been apologizing to the release manager ever since. (We later made the configuration change in a major product release.) Unfortunately, FDCC was mandated without adequately taking into account vendors' product lifecycles. Some vendors simply will need more time to phase in needed configuration changes. A lot more, if your major release product lifecycle is years and not months.

 

Nobody was evil-minded here, but even people who support the idea of FDCC are dead in the water until they can get some basic questions answered and a dialogue going. Ideally, this dialogue should have taken place before FDCC was mandated. Industry (including Oracle) is still working to try to get clarification on the specifics of FDCC and also asking that in future these types of configuration mandates be developed with industry and with adequate phase-in that allows for product lifecycles. How you implement change is as important as what the change is if you want people to move the security ball down the field. Otherwise, even a worthy initiative like FDCC can sink into the morass of Do-Gooder Projects.

 

A better example (where really, "there is no there there," to quote Gertrude Stein) is the recent proposal to develop an ISO standard for vulnerability disclosure. I know of no vendor who thinks this is a good idea. For a start, what problem are we trying to solve? Does anybody think that we can come up with a one-size-fits-all standard for how long it should take to fix a security bug, the exact "rules" on how much information gets put into security advisories and the specific format of how that vulnerability information is expressed? Software vendors have different release cycles, customer bases, risk profiles, and more. (One-size-fits-all pantyhose, as any woman knows, only fits Hilda Mae Throckmorton of Muncie, Indiana.) There are plenty of industry guidelines for good practice on vulnerability disclosure already. Most of these acknowledge that you can't standardize this business practice any more than you can standardize apple-pie making ("Allspice? Death to infidels!"). There are also existing standards on vulnerability disclosure that vendors are adopting, such as the Common Vulnerability Scoring System (CVSS). Oracle was an early adopter of CVSS and customers have told us that it's really useful to them.

 

It is unwise (no, make that "really stupid") to try to standardize what is in effect both a business process and a set of business practices. Ira Gershwin (who knew he was a security maven?) penned the perfect lyric that applies to this Unneeded Standard Attempt: "You say po-TAY-to, I say po-TAH-to, let's call the whole thing off."

 

I offer one last example that isn't quite in line with Do-Gooder Initiatives but relates to what problem to solve and at what price. It's also a big pet peeve of mine: I get a lot of phone calls from vendors trying to shill their security products to Oracle. (Though I do not have operational security responsibility - wonderful, capable colleagues look after that - vendors assume that since my title is "CSO," I am the person who buys Cool Security Products for the IT department.)

 

I hate to mention how many cold callers do not even do basic homework before trying to sell me true love and security happiness. My favorite was the cold caller who said his firm had expertise in securing Oracle Applications deployments. I had to point out to him that, "Uh, we are Oracle, we run on Oracle Applications, and since we build the software, we'd be unlikely to hire a third party to 'securely deploy' it for us." Or, the vendors selling solutions that run on a non-Oracle database. You know, that's just a religious problem for us: we are not going to deploy a third party security solution that runs on <insert name of competitor database here>.

 

My basic pet peeve is the people who do not think about the customer perspective before they launch into their "cure cancer, raise the dead, protect against every attack known to mankind with zero false positive" shill. They claim this shill will only be "twenty minutes of your time" (only "twenty minutes" is measured on a calendar, not a watch).

 

Forthwith, here is my script for parsing through shill-meisters as quickly as possible:

 

1. "What problem does this solve?" (If you can't articulate that in 25 words or less, do not waste my time or anyone else's.)

2. "Is it a problem we are worried about or care about solving?" (Secure remote bagel access is not something that concerns me, so forget the 'Internet Toaster Protocol' pitch.)

3. and 4. "Does it address the problem better, cheaper or faster than what I am doing now? How much better, cheaper or faster?" (If it doesn't, why would I switch from something that may not be sexy or "a breakthrough technology" but gets the job done? I don't have an electric salad tosser, either, because the salad spinner I have - or a pair of tongs - works just fine and has fewer moving parts.)

5. "How can it be broken?"  (Especially for a security product, knowing and being honest about how it can be broken is important. A claim of "zero false positives," for example, should cause anyone to run screaming in the opposite direction.)

 

Do-Gooders, the next time you come up with A Cause, a small request. Please, in the interests of making it a better world without wasting everyone else's time, use your skills on a problem that really needs a solution (or on a better, faster, or cheaper way of solving an existing problem), not on a solution in search of a problem to solve.

 

For More Information:

 

Book of the week: Hog Pilots, Blue Water Grunts by Robert Kaplan (who also wrote Imperial Grunts). If you want to know what the military really does, this is a great read. Robert Kaplan was embedded with a number of different types of units, in multiple services, around the globe: special forces, marines, aviators, and submariners. A really cool read. Mahalo nui loa, all you soldiers, sailors, airmen and marines for keeping us safe.

 

http://www.amazon.com/Hog-Pilots-Blue-Water-Grunts/dp/1400061334

 

We aren't going to have "oldies" rap stations anytime in the future. If anybody has written a more clever lyric than Ira Gershwin (OK, maybe Cole Porter) I have yet to hear it. Songs with lyrics by Ira Gershwin:

 

http://en.wikipedia.org/wiki/Category:Songs_with_lyrics_by_Ira_Gershwin

 

Totally off topic, but Go! Airlines has just done a web page where you can book your next interisland trip totally in Hawaiian. E ola mau ka 'olelo Hawai'i (May the language of Hawai'i live!).

 

Check it out at:

 

http://www.lelegowau.com/

 

When all else fails, follow instructions.

Claudia Zeiler - Tue, 2008-03-04 18:26
  1. Today I tried to open my test database and got:

ORA-16038: log 3 sequence# 729 cannot be archived
ORA-19809: limit exceeded for recovery files
ORA-00312: online log 3 thread 1: 'C:\ORACLE\ORA10\ORADATA\DB10\REDO03.LOG'

Researching the problem I found:

  • Your Oracle instance may have reached the maximum db recovery file destination size assigned to it. If so, the instance will stop working and you will get the three error messages listed above when trying to start it.
  • The relevant parameter is DB_RECOVERY_FILE_DEST_SIZE, which specifies (in bytes) the hard limit on the total space to be used by target database recovery files created in the flash recovery area.
  • It works in concert with DB_RECOVERY_FILE_DEST, which specifies the default location for the flash recovery area. The flash recovery area contains multiplexed copies of current control files and online redo logs, as well as archived redo logs, flashback logs, and RMAN backups (see the query sketch after this list).
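To see how full the flash recovery area actually is, the recovery area views can be queried directly. A minimal sketch, assuming the standard 10gR2 dynamic performance views V$RECOVERY_FILE_DEST and V$FLASH_RECOVERY_AREA_USAGE (verify they exist on your release):

-- how much of the limit is used, and what file types are consuming it
SELECT name, space_limit, space_used, space_reclaimable, number_of_files
FROM   v$recovery_file_dest;

SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
FROM   v$flash_recovery_area_usage;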

  2. What is the maximum db recovery file destination file size assigned in my DB?

SQL> show parameter db_recovery_file_dest_size
db_recovery_file_dest_size 2G

At least this wasn't a surprise.


  3. I removed excess files

I shut down the database. I then removed the archive log files and the flashback files from the file system, even though they didn't amount to the 2 GB in total that the system was complaining about. (This is a play database, so I can do things like kill these files.)

  4. Still had a problem.

I brought the database up to mount state, but immediately there was a trace file message:

ORA-19815: WARNING: db_recovery_file_dest_size of 2147483648 bytes is 99.85% used, and has 3320320 remaining bytes available.

*** 2008-03-03 10:03:19.635

************************************************************

You have following choices to free up space from flash recovery area:

1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard, then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating system command was used to delete files, then use RMAN CROSSCHECK and DELETE EXPIRED commands.
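For the record, choice 3 is a single parameter change; a sketch only, where the 4G value is an arbitrary example and assumes the underlying file system really has the space:

-- raise the hard limit on the flash recovery area (example value only)
ALTER SYSTEM SET db_recovery_file_dest_size = 4G SCOPE=BOTH;

In this case, though, deleting the files was the point, so choice 4 it is.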

  5. The files were still being seen by RMAN! When all else fails, follow instructions, in this case #4 above. “4. Delete unnecessary files using RMAN DELETE command. If an operating system command was used to delete files, then use RMAN CROSSCHECK and DELETE EXPIRED commands.”

RMAN> crosscheck archivelog all;
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1

channel ORA_DISK_1: sid=146 devtype=DISK
validation failed for archived log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_733_3WRQ14DY_.ARC recid=70 stamp=648389669

validation failed for archived log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_734_3WRQ4V7R_.ARC recid=71 stamp=648389787


What a nice list of all the files that aren't there, and all the locations where they aren't: 68 files that don't exist! [apologies to any reader of English as a second language]

RMAN> delete expired archivelog all;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=146 devtype=DISK
List of Archived Log Copies
Key Thrd Seq S Low Time Name
------- ---- ------- - --------- ----
70 1 733 X 02-MAR-08 C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2
008_03_03\O1_MF_1_733_3WRQ14DY_.ARC

71 1 734 X 03-MAR-08 C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2
008_03_03\O1_MF_1_734_3WRQ4V7R_.ARC

(etc)

Do you really want to delete the above objects (enter YES or NO)? yes

deleted archive log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_733_3WRQ14DY_.ARC recid=70 stamp=648389669

deleted archive log
archive log filename=C:\ORACLE\FLASH_RECOVERY_AREA\DB10\ARCHIVELOG\2008_03_03\O1
_MF_1_734_3WRQ4V7R_.ARC recid=71 stamp=648389787


(etc)

Deleted 68 EXPIRED objects


  6. Now all is well. I bounced the database and the alert log shows:

db_recovery_file_dest_size of 2048 MB is 0.38% used. This is a user-specified limit on the amount of space that will be used by this database for recovery-related files, and does not reflect the amount of space available in the underlying filesystem or ASM diskgroup.




Guest Blog: Smart Space from an end-user’s perspective

Oracle EPM Smart Space - Mon, 2008-03-03 13:54
I had the pleasure of being an early user of Smart Space while Mike, Matt and the team were building it.

I’ve been using software since the days of Wordstar, dBase II, and Visicalc (on CP/M before DOS no less). Using Smart Space was not like using software. This was an environment I was experiencing. I think Matt’s vision of “I just want to ride a bike” is spot on.

Smart Space is:

Intuitive: I didn’t know I was doing enterprise performance management, data mining, query & analysis – I thought I was making decisions (I further didn’t know I was making those decisions based on judgments & knowledge learned from information that came from data!). Gadgets solve problems.

Social: my favorite part of using Smart Space was collaborating with other users. I could slice & dice data and quickly see an anomaly (variance outside of a threshold, for example), and instantly ping a colleague about it, share my view with her and record the decisions/actions taken as a result.

Personal: my desktop was role & function specific, that is it was set up (by me) to only contain those gadgets, workflow, and points of view that had to do with the role I played. So it should be with enterprise performance management: finance cares about the balance sheet, Operations cares about the supply chain, sales cares about revenue and pipeline, and so on. Executives care about summarized results and trends, managers care about tracking targets and variance, and analysts care about cause and effect. And, in an “EPM enlightened” organization, people across all functions care about cross-functional things. They would have gadgets that let them get to revenue growth, operational efficiency, customer satisfaction, and so on.

Ready at hand: Smart Space is always-on EPM. It's always there, always ready, and becomes an extension of my work – like the phone, IM, e-mail, and coffee.

Expandable: there is a fairly large (and growing) community of EPM and BI devotees: customers, consultants, vendors, analysts. These folks are candidates for contributing (dreaming-up & developing) the library of gadgets available to get your job done. I just thought of one while typing this (I’ll let you know what it is as soon as I get a developer to write it for me, any volunteers?)

This is pretty exciting stuff, and I can’t wait to see what Matt & Mike have in store for us in ’08.
-Ron
Ron Dimon
Business Foundation
EPM Advisory
Categories: Development

Soft-skills (not oracle related post)

Fadi Hasweh - Sun, 2008-03-02 23:20
Dears,
As part of my Master's program we have to create a blog related to an e-business course. The blog will be evaluated, and one of the evaluation criteria is the number of users accessing it and commenting, so if you have some free time please visit it and leave a comment; the blog has good and helpful information.
The blog talks about soft skills and how important they are in our lives. Check, for example, the speed reading post, the managing a busy schedule post, and many others.
Please have a look and leave comments if you can.
Thank you
Fadi

Interesting Corrupted Redo Log Case

Renaps' Blog - Sun, 2008-03-02 11:59

It has been a while since I wrote on Renaps' Blog, most probably because I hadn't run into any interesting case to talk about for a while!

Yesterday, at one of our main clients, the production (soon to be – on Monday) database hung. Was it because it was Feb 29th of a bissextile (leap) year? Or simply because the weekend arriving in the next 4 hours was the go-live of the main data warehouse? It's hard to say; I believe it was just a little wink from our old friend Murphy!

The environment is a 16-CPU Itanium Windows server running 10.2.0.3.0; ASM is used even though the database runs in single-instance mode.

When the database hung, the alert.log showed the following error messages (of course, no one could connect or run any transaction on the database):

ARC0: All Archive destinations made inactive due to error 354
Committing creation of archivelog ‘+FLASH/inf1/archivelog/arc17972_0627748574.001’ (error 354)
ARCH: Archival stopped, error occurred. Will continue retrying
Fri Feb 29 11:48:05 2008
Errors in file c:oraadmininf1bdumpinf1_arc0_10236.trc:
ORA-16038: Message 16038 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [17972]
ORA-00354: Message 354 not found; No message file for product=RDBMS, facility=ORA
ORA-00312: Message 312 not found; No message file for product=RDBMS, facility=ORA; arguments: [1] [1] [+FLASH/inf1/onlinelog/group_1.257.627748575]

The first line of this message stack made me believe that the database just hung because there was no more space for the archiver to do its job. But after further analysis, it appeared that the problem was much more serious.

The actual cause of the issue was related to Redo-Log Corruption. The cause of the corruption was a “write or rewrite” SAN related error.

When redo log corruption occurs, Oracle first tries to use the second (or third) member of the corrupted group to avoid unavailability of the database. In our case, this was impossible since there was only one member per redo log group; I was surprised to see that this client did not multiplex the redo logs…

My first attempt to get the database back on its legs was to dump the content of the redo-log showing corruption issues (in this case, group 1).

SQL> ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;
ALTER SYSTEM DUMP LOGFILE ‘+FLASH/inf1/onlinelog/group_1.257.627748575’
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 53504 change 70389305 time 02/29/2008 11:48:02
ORA-00334: archived log: ‘+FLASH/inf1/onlinelog/group_1.257.627748575’

As you can see, this attempt failed. I then attempted to clear the unarchived redo log:
SQL> alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;

At this point, my command hung indefinitely, and I realized that PMON was holding a lock on the redo log header block…

My margin of maneuver had become much smaller. Here is the sequence of commands that brought the database back on track:

  1. shutdown abort;
  2. startup mount;
  3. issue the same command that was previously hanging (this time, it worked right away):
    alter database clear unarchived logfile ‘+FLASH/inf1/onlinelog/group_1.257.627748575’;
  4. alter database open;
  5. shutdown immediate;
  6. Immediately take a FULL backup at this point;
  7. startup;
  8. Don't forget to multiplex your redo logs to avoid running into this situation again (see the sketch below).
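To that last point, adding a second member to each group is a one-line statement per group. A minimal sketch, re-using the '+FLASH' disk group from this environment (ideally the extra member would go to a different disk group or different storage than the first one):

-- repeat for every redo log group; ASM/OMF will name the new member
ALTER DATABASE ADD LOGFILE MEMBER '+FLASH' TO GROUP 1;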

Here you go: redo log corruption on the header block of a group when only one member is used does look bad at first sight, but I hope this article can help you if you run into a similar scenario.

References:

Metalink Notes:
145769.1
332672.1
438176.1
114734.1

Patrick Hamou
E-Mail: phamou@renaps.com
Web: www.renaps.com


Categories: DBA Blogs

APEX 3.0.1 on XE

Jornica - Sat, 2008-03-01 09:17

On 29 February 2008 Oracle released Application Express 3.1. With this release you can upgrade Application Express within Oracle XE to 3.1. For more information on this topic see Oracle Database XE and Application Express 3.1 and APEX 3.1 on XE.

Oracle released Application Express 3.0.1 recently. With this release you can upgrade APEX within Oracle XE from 2.1 to 3.0.1 (see Oracle Database XE and Application Express 3.0.1). But how do you upgrade when you have already installed APEX 3.0 on XE?

There is a patch available on Metalink to upgrade from APEX 3.0 to 3.0.1. According to the Patch Set Notes, Oracle9i Release 2 and higher support this patch if you already have APEX 3.0.0.00.20 installed (the Patch Set Notes also include a list of fixed bugs). In order to download the patch from Metalink, you will need a support contract (which is not included with XE).

Another option is to remove APEX 3.0 completely. After the removal, XE (still) has APEX 2.1 installed, and the next step is of course to install APEX 3.0.1. The big advantage of this option is that you follow the main route of upgrading XE/APEX as proposed by Oracle. However, there are some things to keep in mind.

  • As always start with making a backup of your database.

  • Export all applications you would like to see again in APEX 3.0.1. If you remove APEX 3.0 completely, you also remove all applications stored in the FLOWS_030000 (APEX 3.0) schema! Note that applications stored in the FLOWS_020100 schema will be migrated to FLOWS_030000 (APEX 3.0.1).

  • Make a backup of changed or added cascading style sheets (CSS), images and JavaScript files separately.

  • Run the apex_3.0.1\apxremov.sql script (the script is the same for both versions).

  • Run the apex_3.0.1\apxins.sql script. Check the installation log for errors.

  • Copy the apex_3.0.1\images directory to the /i/ directory. Also copy the previously backed-up CSS, images and JavaScript files to the /i/ directory.

  • Import all applications exported in the first step. Note that existing applications in APEX 2.1 are migrated to APEX 3.0.1 as part of the upgrade process.

  • Time to make a backup again.
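After the import, a quick sanity check is to ask the APEX dictionary which release you ended up on. A minimal sketch, assuming the APEX_RELEASE dictionary view is available on your installation (verify on your release):

SELECT version_no FROM apex_release;

It should now report a 3.0.1 version number.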

Have a look at the APEX reports in SQL Developer: no errors any more.

APEX 3.0 on XE part 2

Jornica - Sat, 2008-03-01 09:16

On 29 February 2008 Oracle released Application Express 3.1. With this release you can upgrade Application Express within Oracle XE to 3.1. For more information on this topic see Oracle Database XE and Application Express 3.1 and APEX 3.1 on XE.

If you followed the guidelines of APEX 3.0 on XE, you have a working APEX 3.0 interface. However, the administrative interface of APEX is not installed on XE by default. Using this interface you can manage all aspects of the service, including creating, removing, and managing workspaces. Barney Mattox discovered a workaround to install this interface as well.

In order to install the administrative interface perform the following steps:

  1. Start SQL*Plus. Make sure your working folder is the APEX installation folder (where apexins.sql is located).

  2. Connect as user SYS.

  3. ALTER SESSION SET CURRENT_SCHEMA = FLOWS_030000
    /

  4. The following command imports the administrative interface: @script\f4050.sql

  5. The following command imports the request for workspace interface: @script\f4700.sql

  6. Change the password for the ADMIN user in order to log in to the administrative interface: @apxxepwd.sql

  7. Use the following URL to access the administrative interface: http://server:port/apex/f?p=4550:10 (e.g. http://127.0.0.1:8080/apex/f?p=4550:10).

Additional notes:

  • Issue a SPOOL install.log after logging in to SQL*Plus. If an error occurs, SQL*Plus will exit, and you can use the log file to find out what went wrong.

  • I experienced problems with step 3: "Illegal security group id value for this schema". A workaround is to log in as FLOWS_030000 and run the scripts. First you need to unlock the FLOWS_030000 user:
    ALTER USER FLOWS_030000 IDENTIFIED BY password ACCOUNT UNLOCK
    /
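    If you use this workaround, it is probably wise to lock the FLOWS_030000 account again once the scripts have run; a one-line sketch:

    -- re-lock the APEX schema owner after running the scripts
    ALTER USER FLOWS_030000 ACCOUNT LOCK
    /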

Workflow performance case study: Dont Repeat History, Learn from it

Gaurav Verma - Sat, 2008-03-01 02:06

A Sanskrit shloka underlining the importance of careful planning

Preface

They say: "Those who do not know history are condemned to repeat it."
But why repeat history if you don't have to?
Many questions come to light on this topic of discussion:
  • Is there a better way of designing conditional implementation in workflow?
  • What are the merits/demerits of repeated checking of a particular condition as opposed to executing something when the actual condition happens? What are the performance implications?
  • Are there workflow APIs available which simulate event based subscription model?
  • What are the performance advantages of using the basic Workflow APIs that simulate this event based subscription model?
This case study discusses the performance benefits of keeping unnecessary looping execution in workflow design to a bare minimum and, if possible, eliminating it altogether. If there are several short-timed loops in the workflow design that get executed very often, the bad news is that the system spends valuable resources spinning its wheels in vain.
The intended audience is workflow designers, support personnel and Oracle Apps DBAs alike. This case study can potentially help users design their workflows more optimally and succinctly, with minimal impact on execution. The I/O profile and the time taken by Workflow Background Process bear this out. As far as possible, trends have been depicted from the customer's Production system.

The Business Scenario

The business requirement to be implemented arose during the ATP (Available to Promise) phase in the life of an Order line. ATP simply means that we can guarantee the customer that a particular item can be made available from a particular warehouse on a particular date.

When scheduling an Order Line, if a financial hold existed on the line for some reason, it needed to be unscheduled. Also, as soon as the financial hold was released by the Accounts Receivables department, the requirement was to proceed with the remaining business flow in the OEOL workflow design.

How It Was Designed

This is how it was designed:

Let us translate the flow:

  • When the WF background program executes the custom ATP process, it checks whether the order is on hold or not.
  • If the order header or one of the lines is on hold, the custom activity "Unschedule Lines On Hold" unschedules all those lines and waits for an hour before checking the line status again.
  • This loop repeats until all holds on the header/line are released.
Let's look more closely at the Wait Activity definition:

Checking its Node Attributes tab:

Now, checking its actual timeout value:

So, the bottom line is that the WAIT activity was timing out every 15 minutes (60 x 0.0416 ~ 15 mins). Just imagine the volume of OEOL item keys at this activity in any given hour, and what kind of load they would be putting on the system.
Now, is that an intelligent way of designing the workflow for this requirement?

Some shocking revelations...

A little analysis of the volume in the WF_ITEM_ACTIVITY_STATUSES_H (history) table showed something shocking (!!):

The XXONT_CUSTOM_ATP->WAIT step was generating the most history records. In this case, it had created 27 million rows (75% of the entire volume) out of the 35 million rows in the WF_ITEM_ACTIVITY_STATUSES_H table.
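A query along the following lines can surface the top history-generating activities. This is a sketch only, assuming the standard APPLSYS workflow tables and the usual instance_id join; verify the column names on your release:

select wpa.process_name, wpa.activity_name, count(*) history_rows
from   applsys.wf_item_activity_statuses_h h,
       applsys.wf_process_activities wpa
where  h.item_type = 'OEOL'
and    wpa.instance_id = h.process_activity
group by wpa.process_name, wpa.activity_name
order by history_rows desc;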
PS: This is one of the reasons why this case study is so interesting: its impact was tremendous.

How are history records created?

Let's clear the air and get our basics right about how and when history records are created.

When the workflow engine encounters an activity which has a status of COMPLETE, it moves (figuratively) that record from WF_ITEM_ACTIVITY_STATUSES to WF_ITEM_ACTIVITY_STATUSES_H and creates a new record in WF_ITEM_ACTIVITY_STATUSES with status ACTIVE.

Therefore, the more times a COMPLETed activity is re-visited, the more history records for that activity are created. This can happen when a loop having a WAIT activity with a short timeout is designed. The more the workflow background process flows through the activities of such a loop, the more records will get created in WF_ITEM_ACTIVITY_STATUSES_H.

The performance implications...

Let's think through the other implications of such poor design:

  • The system keeps looping on idle timeout activity every 15 mins
  • Immediately after the timeout completes, the activity becomes eligible to be processed by the workflow.
  • The next activity checks whether the hold has been released
  • Statistically speaking, as the probability of a financial hold being released is not immediate, the control goes back to the WAIT activity, wherein the cycle repeats
While this is happening:
  • Every 15 minutes, the backlog (timed-out or deferred) count of unprocessed OEOL item keys bloats
  • Due to such frequent time-out activities, other more important processes/activities in the OEOL workflow are not executed in a realistic timeframe
  • Valuable system resources are spent on scheduling and executing the loop's activities on the Production machine
  • The OS on the Production machine spends vital I/O resources servicing these application calls
Simply speaking, that is very inefficient.
This is like a vicious cycle. The more data you have in the WF_ITEM_ACTIVITY_STATUSES and WF_ITEM_ACTIVITY_STATUSES_H tables, the more time it takes to query and insert into them (also, the more indexes you have, the more time it takes to delete/update/insert due to the automatic index re-organization effort required). It just kills online performance at times.
The following diagram bears out the above ideas. It shows the top activity (in terms of number of item_keys) in the breakup of the OEOL backlog. It can be seen that, without fail, XXONT_CUSTOM_ATP was the top wait activity in the OEOL backlog. Not only was it a liability to itself, it was also a liability to other legitimate activities perfectly eligible to be processed:

Design Considerations: The Danger Of Introducing Infinite Loops

There is another big danger here: that of introducing an infinite loop in the execution of Workflow Background Process (also referred to as FNDWFBG) for OEOL.

Let us assume that there is a standard WAIT activity with a very short timeout/wait time in a loop with a condition-checking activity. When the condition is checked, the activity will be deferred or timed out for a very short time.
Meanwhile, Workflow Background Process would be busy processing all other deferred or timed-out activities. If FNDWFBG is not done processing other deferred activities by the time the wait/time-out of the previous activities expires, the re-timed-out or deferred activities will be enqueued and re-processed AGAIN by the very same FNDWFBG (Workflow Background Process for, say, item_type=OEOL/OEOH) job over and over, as long as the condition (e.g. "line on hold") is met.
This creates an infinite loop, so the FNDWFBG process almost never completes, eating up CPU time, creating massive DBMS archive logs (if turned on), and pumping rows into the WF activity statuses table. This would also create unnecessary rows in the WF_ITEM_ACTIVITY_STATUSES_H table, which can be the nemesis of online order entry performance in the 11i E-Business Suite later on.

It is a good idea for activities in custom workflows to have a Wait/Timeout time greater than the longest normal execution time of the Workflow Background Process.
Also, a workflow activity having a near-zero Relative (Wait) Time, or a Wait Mode of Absolute Date with no Absolute Date specified (which is equivalent to a zero Relative Wait Time), can cause the same issue (an infinite loop) to occur for FNDWFBG (Workflow Background Process for, say, item_type=OEOL/OEOH) runs with Process Deferred = Yes.

Hence, these poor design practices should be avoided because, remember, once runtime data for a particular WF definition has been initiated, the previous runtime data will not be automatically corrected even if the workflow definition is corrected later and uploaded. So it's prudent to build good design considerations in before letting your production data run on it.

More design considerations..

Whenever possible, the standard Wait activity (WF_STANDARD.WAIT()) should not be placed in a loop with a small wait time (shorter than several hours). Alternative workflow designs should be evaluated.
WF_STANDARD.WAIT() with a wait time of zero or near zero, or the standard Defer activity (WF_STANDARD.DEFER()), should NEVER be placed in a loop in a workflow process design diagram.

Extending the Wait time should only be used if such a redesign is not acceptable for a very good reason.

The Superior Design Alternative....

For the above business scenario, the best solution is not to use a Wait activity in a loop at all. A much more attractive option is to use the published Workflow blocking/continuation APIs: WF_STANDARD.BLOCK() or OE_STANDARD_WF.STANDARD_BLOCK() can be used to stop the processing flow of a workflow item_key, and WF_ENGINE.COMPLETEACTIVITY() can be used to resume it.

An additional advantage of using OE_STANDARD_WF.STANDARD_BLOCK is that any OEOL workflow item blocked on it can also be progressed through the Order Entry form via the "Actions" button's Progress Order option. This gives more flexibility to the user/designer.

If one wants to progress the flow of a workflow item programmatically, the WF_ENGINE.COMPLETEACTIVITY() API can be called. This API notifies the Workflow Engine that the specified asynchronous activity has been completed for a particular item. It requires that the activity currently has a status of 'Notified' (the internal status equivalent to BLOCKED). An optional activity completion result can also be passed; the result can determine what transition the process takes next.

Example

/* Complete the BLOCK activity which is used in multiple
* subprocesses in parallel splits. */

wf_engine.CompleteActivity('ORDER', '10045',
'XXONT_CUSTOM_ATP:WAIT', 'null');

To enhance the design AND achieve the same business functionality, we need to replace the WAIT activity with a BLOCK activity which will wait till the last hold is released from the order header/line without causing any looping. The COMPLETEACTIVITY API would be called from the Financial Hold release form at Order Header level.
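For illustration, the kind of call the hold-release logic might issue could look like the sketch below. The header_id value and the 'XXONT_CUSTOM_ATP_BLOCK' activity label are placeholders, not the actual names used in this implementation:

declare
  l_header_id number := 12345;  -- hypothetical order header whose hold was just released
begin
  for r in (select ol.line_id
            from   apps.oe_order_lines_all ol
            where  ol.header_id = l_header_id)
  loop
    -- resume the blocked OEOL workflow for each line of the order;
    -- the activity must currently be in 'Notified' (blocked) status
    wf_engine.completeactivity('OEOL', to_char(r.line_id),
                               'XXONT_CUSTOM_ATP_BLOCK', null);
  end loop;
end;
/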

It's as simple as that. The new design would look like this:

Seeing the Performance Benefits in Production..
Adopting this approach will create only 1 record in WF_ITEM_ACTIVITY_STATUSES table and 2 records in WF_ITEM_ACTIVITY_STATUSES_H table.


Contrast this against continuous looping till a condition is met: for each execution of the WF logic, records are created in WF_ITEM_ACTIVITY_STATUS* tables.

As a result of this change, the landscape of OEOL backlog also changed drastically.

The average runtime of the Workflow background Process for OEOL also came down consistently.

Summary of Learnings


  • It pays to know the published Workflow APIs. The Workflow Developer Guide is a good place to start. Spend time reading the fine manual. More information can never hurt.
  • It pays to spend time and effort in evaluating re-design of loops in workflow design early, especially since once initiated, a specific version of workflow runtime data does not automatically get updated after a newer version of workflow design is uploaded (through WFLOAD)
  • Poor design involving WAIT and poll activities in a loop can cause infinitely running Workflow Background Process (FNDWFBG)
  • Short-timeout activities in loops should be replaced with OE_STANDARD_WF.STANDARD_BLOCK or WF_STANDARD.BLOCK APIs and WF_ENGINE.COMPLETEACTIVITY APIs
  • Simple design changes can bring HUGE performance benefits, especially when a lot of volume is involved
  • One thing which was also learnt in hindsight was that the Development/Testing team needs to be extremely careful to account for all instantiated financial Holds.  We ran into a problem with the removal of the loop where a certain set of financial holds were not being completed.  For this set of scenarios (automatically released holds not going through the quick order form) we created a custom concurrent program to identify the blocks and call the WF_ENGINE.COMPLETEACTIVITY API. 


Simply put: know your facts, be smart, and don't repeat history if you don't have to. Above all, do your testing well.

Suggested Related Readings

When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization --

Posted on 17 May 2007; 12:09:31 PM


http://blogs.oracle.com/gverma/performance/om_wf_xxont_cascade

A Performance Case Study: Taming the Workflow Purge --

Posted on 21 March 2007; 6:58:11 PM


http://blogs.oracle.com/gverma/2007/03/21#a72

We're Skewed! Using 10053 event for being Histogram-wise --

Posted on 27 March 2007; 3:02:33 PM


http://blogs.oracle.com/gverma/2007/03/27#a80

When Conventional Thinking Fails: A Performance Case Study in Order Management Workflow customization

Gaurav Verma - Sat, 2008-03-01 01:57

Foreword

Order management workflow interacts with a variety of components like core workflow, order management and pricing application call interfaces.

Customizing it without understanding the implications can be a double-edged sword, and the nature of the customization may unwittingly worsen the situation. An undesirable combination of these aspects can cause a situation that is very difficult to diagnose and troubleshoot, as this case study demonstrates.

This article is essentially the distilled wisdom of a severity 1 situation at a client site that was unable to ship ordered items to its customers because the OM order line workflow background process had a severe performance issue. The business impact was tremendous: $1.5 to 2 million worth of shipping and invoicing was being held up.

Many a time, performance troubleshooting involves some functional insight as well. TKPROF and 10046 traces are not the panacea for all performance issues. In this case, some insight into the nature of the Order Management APIs was also required to get traction on the performance problem. A painful discovery path followed, riddled with bumps and insightful discoveries. The troubleshooting approach used was really out of the box and involved breaking down the code being executed and using logical deduction.

The generic learnings from the ordeal are presented in this case study. It is hoped that they will help the Oracle Applications user community be conscious of the hidden implications of customizing the OM workflow.

Feeling the pinch...
The client had heavily customized their order management workflow during an implementation of Order to cash business cycle of Oracle applications 11i.

Immediately after the April 2007 financial quarter end, the OEOL workflow background process started exhibiting extremely long execution times, such that the backlog of unprocessed OEOLs came to ~150,000. This was a sharp contrast from the normal processing throughput.

This was a moving problem. The hanging of the OEOL workflow background process was quite apparently due to a 'few' order lines, belonging to a few orders. Once the 'problematic' orders' OEOL workflows were either CANCELLED or ERRORED, the backlog started clearing at a slow pace. This went on for a little while, until similar order lines' OEOL workflows seemed to take inordinately long to process and held up the entire workflow queue.

Coming to grips with it...

As a knee-jerk approach, many things were tried, including killing the currently running workflow background process and re-submitting it. The clincher was that, once or twice when the workflow background process was killed, it marked the OEOL ITEM_KEY being processed with an ERROR status and hence took it out of the deferred queue. This forced the Workflow Background Process to consider the next deferred OEOL item_keys in line.

In ~50% of the cases this would work, and the new thread would start exhibiting a more acceptable throughput. But after a certain point, even killing and re-submitting the workflow background process was simply not working. It did seem that the logic being executed by FNDWFBG was taking extremely long to process one order line.

The observation was baffling. The following query was used to find out approximately which order line a single thread of the workflow background process was working on:

Query 1:

SQL> select * from (select to_char(enq_time,'dd-mon-rr hh24:mi'), user_data
     from wf_deferred_table_m
     where state = 0 and corrid = 'APPSOEOL'
     order by priority, enq_time)
     where rownum < 5;

TO_CHAR(ENQ_TIM USER_DATA(ITEMTYPE, ITEMKEY, ACTID, FUNCTION_NAME, PARAM_LIST, RESULT)
--------------------------------------------------------------------------------
02-apr-07 14:25 WF_PAYLOAD_T('OEOL', '30273337', 141255, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458068', 140850, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458075', 140850, NULL, NULL, NULL)
02-apr-07 16:25 WF_PAYLOAD_T('OEOL', '30458089', 140850, NULL, NULL, NULL)

4 rows selected.


Tying it back to the order management tables, the following query gives us order-level details like order_number, line_number, ordered_item, etc., which is even more useful:

Query 2:

select distinct oh.org_id, oh.order_number,
       ol.line_number, ol.shipment_number,
       ol.option_number, ol.ordered_item, ol.line_id,
       wiash.begin_date, priority, wdtm.enq_time
from   applsys.wf_deferred_table_m wdtm,
       applsys.wf_process_activities wpa,
       applsys.wf_item_activity_statuses wiash,
       apps.oe_order_lines_all ol,
       apps.oe_order_headers_all oh
where  substr(wdtm.corrid, 5) = 'OEOL'
and    (wdtm.delay < sysdate or wdtm.delay is null)
and    wpa.instance_id = wdtm.user_data.actid
and    wiash.item_type = wdtm.user_data.itemtype
and    wiash.item_key = wdtm.user_data.itemkey
and    wiash.process_activity = wdtm.user_data.actid
and    wiash.item_key = ol.line_id
and    oh.header_id = ol.header_id
and    wdtm.state = 0
group by oh.org_id, oh.order_number, ol.line_number, ol.shipment_number,
         ol.option_number, ol.ordered_item, ol.line_id,
         wiash.begin_date, priority, wdtm.enq_time
order by priority, wdtm.enq_time;

Observing this output over time made it possible to identify whether the workflow background process was stuck on a specific OEOL workflow.

Limited insight by conventional analysis..

The conventional debugging approaches provided limited insight into the root cause of the issue.

A 10046 trace could not clearly bring out the drill-down of time spent in PL/SQL API calls. In addition, there was an OS file-size limit on the trace file on most production servers, even when max_dump_file_size=unlimited was used.

Another option was to use the dbms_profiler API on the FNDWFBG call. It could have worked, except that the FNDWFBG call never completed, so the data could never be collected.

Searching Metalink gives a host of notes for getting OM debug (level 5), Pricing debug, etc.

1. Note 255040.1:bde_wf_data.sql - Profiling Workflow Runtime Data related

2. Note 163208.1:bde_last_analyzed.sql - Verifies Statistics for all

3. Note 353991.1:OMSuiteDataChk.sql and OMSuiteDataChk11i10.sql


This is all good information, except that there can also be too much unfocused and undirected information. Interpreting detailed application debug logs is useful only when it is known what to look for.

Taking the whitebox approach....

A lot of debug/trace information could be generated, but it was probably best to take the FNDWFBG execution apart in slow motion. Once the execution model is understood, brainstorming with a divide-and-conquer strategy becomes effective.

An alternate way of looking at FNDWFBG is that it is nothing but a serial simulator of steps outlined in the oracle workflow designer.

A whitebox approach (a software engineering term, as opposed to blackbox) was the need of the hour, coupled with some simple thinking. Using the workflow status monitor from the Workflow Administrator responsibility in Oracle Applications 11i, the step at which an OEOL workflow currently sits can be seen.

This may also be verified from the output of $FND_TOP/sql/wfstat.sql. Combining the observations from Query 2 above with drilling down via the Workflow Monitor, it was clear that almost all the lines having the performance issue were at the Fulfill-Deferred step, which means that if FNDWFBG were to pick them up, it would then execute the next activity in line, which was the Fulfill step.
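For reference, a hedged sketch of a query that summarizes how many OEOL items are waiting at each workflow step (it reuses the same wf_item_activity_statuses / wf_process_activities join as the Appendix B script; restricting to the NOTIFIED and DEFERRED statuses is an assumption about which statuses represent "waiting" work in a given flow):

select pa.process_name, pa.instance_label, ias.activity_status, count(*) cnt
from   wf_item_activity_statuses ias,
       wf_process_activities pa
where  ias.item_type = 'OEOL'
and    ias.process_activity = pa.instance_id
and    ias.activity_status in ('NOTIFIED','DEFERRED')
group by pa.process_name, pa.instance_label, ias.activity_status
order by cnt desc;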

Note: A quick workaround was to 1) kill the FNDWFBG process, 2) SUSPEND the problematic OEOL workflow via the Workflow Monitor (this status prevents any further processing of the item by workflow), and 3) resubmit the FNDWFBG process.
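Step 2 can also be done programmatically; a minimal sketch using the standard WF_ENGINE API (the item key shown is just the example value from the Query 1 output above, and WF_ENGINE.Resume can be used later to release the line once the root cause is fixed):

begin
   -- suspend the problematic OEOL workflow so that FNDWFBG skips past it
   wf_engine.suspend(itemtype => 'OEOL',
                     itemkey  => '30273337');   -- example item key from the Query 1 output
   commit;
end;
/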

Irreconcilable paradoxes of a Test case..

We really needed to analyze the drill down on the bad test case.

From online examination of the SQLs being executed by the FNDWFBG session, it was seen that a lot of time was being spent in XXONT Cascade Terms and Agreement, apparently multiple times for processing a single OEOL ITEM_KEY. This was a really paradoxical observation and did not make any sense.
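The online examination itself was done with simple v$ queries; a minimal sketch of the idea, where &fndwfbg_sid is a placeholder for the SID of the workflow background process session (found from v$session, for example by its OS process id):

select s.sid, s.serial#, s.status, st.piece, st.sql_text
from   v$session s,
       v$sqltext st
where  s.sid = &fndwfbg_sid
and    st.address = s.sql_address
and    st.hash_value = s.sql_hash_value
order by st.piece;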

The only acid test of this was to call OE_FULFILL_WF.START_FULFILLMENT for a specific OEOL line_id or ITEM_KEY. The script run_start_fulfillment_with_om_debug.sql (listing in Appendix A) was used to simulate a single execution of the Fulfill step, which in turn called OE_FULFILL_WF.START_FULFILLMENT and did a rollback at the end. One such case ran for ~6 hours before finally finishing.

Most surprisingly, an examination of the raw trace file did show that subsequent steps like XXONT Cascade Terms and Agreement and Invoice Interface - Line had indeed been called multiple times. Therefore, it seemed that a single call to OE_FULFILL_WF.START_FULFILLMENT for one kind of OEOL was triggering the sequential execution of the OM workflow for other OEOLs in the same order, which were somehow related to the order line in question.

$ grep -i xxont_cascade_pkg.set_ln_trm_agr_proc prod_ora_677526.trc 
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end;
begin xxont_cascade_pkg.set_ln_trm_agr_proc (:v1, :v2,:v3, :v4, :v5); end; 
value="XXONT_CASCADE_PKG"
value="XXONT_CASCADE_PKG"

This simply did not make any sense, because Fulfill and XXONT Cascade Terms and Agreement were discrete steps, as verifiable from the OM workflow diagram above.

Another very interesting and puzzling observation was that without fail, the top 2 sql calls in TKPROF output (raw trace sorted with sort=prsela,exeela,fchela) showed that ~45% of the time was spent on OE_FULFILL_WF.START_FULFILLMENT API and ~45% of it was spent on the call to xxont_cascade_pkg.set_ln_trm_agr_proc API.

"tkprof output for 10046 of start_fulfillment API call"

Resolving the paradox...

The answer, which is difficult to digest, does make sense. Order management has unique processing requirements for configured ordered items created using Oracle Configurator module. Since OE_FULFILL_WF.START_FULFILLMENT is a workflow aware API (notice the _WF in its name), it was obviously calling more Workflow routines inside it to progress the business flow to next logical set of activities.

We first need to understand the concept of a fulfillment set. A fulfillment set is a set of related order lines that represent a configurable item. Whenever a fulfillment set or a model is ready to fulfill or be built on the production floor - i.e., all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status - the last member reaching this activity completes the FULFILL_LINE activities of all the other members before proceeding itself. Since the XXONT Cascade Terms and Agreement step comes right after the FULFILL_LINE step, the XXONT Cascade Terms and Agreement step gets executed for all members of the current model or fulfillment set in that one go.

The relation between other lines whose OM workflow was being resumed and the problematic line in question could be found using the following sql:

Select top_model_line_id from oe_order_lines_all where line_id=<problematic_line>;

This query defines a FULFILLMENT SET (a set of lines which are related together and constitute an orderable item):

Select line_id from oe_order_lines_all where top_model_line_id=<top_model_line_id_from_above_query>;
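The two lookups can also be folded into a single statement (a trivial consolidation of the two queries above):

Select line_id
from oe_order_lines_all
where top_model_line_id = (select top_model_line_id
                           from oe_order_lines_all
                           where line_id = <problematic_line>);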

The user first configures the components of a configurable product/item, and then the item is actually built. Depending on the kind of configured model (Assemble-to-order - ATO - or Pick-to-order - PTO), the oe_order_lines_all.flow_status_code (order management workflow status) stays at either SHIPPED or BOOKED status, or becomes AWAITING_FULFILLMENT (internally blocked).

Here is an example of this intermediate stage of processing for different kinds of configured items (ATO/PTO/Stock):

Now, since a lot of time was being spent in the XXONT Cascade Terms and Agreement step, the code in the package XXONT_CASCADE_PKG.pkb was reviewed. It could be seen that the OE_ORDER_PUB.PROCESS_ORDER() API was being called to sync up the terms and agreements between the order header and the order line in question.

There were penalties for doing this: updating the payment term and agreement fields on an order line triggered repricing of that order line. Also, for especially large orders (~4000 lines) where most lines' payment terms and agreements did not match the header, the repeated calls to OE_ORDER_PUB.PROCESS_ORDER() caused a huge overhead.

The reason the timings for OE_FULFILL_WF.START_FULFILLMENT() showed up so high in the trace files was that the FULFILL_LINE step synchronizes across all components in a configurable model or fulfillment set. The following output from the workflow query in Appendix B lists the time recorded by workflow for executing the XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE step:

Visualizing it in slow motion...

This means that whenever a fulfillment set or a model is ready to fulfill - i.e., all members' OEOL workflows are internally in FULFILL_LINE:Notified (blocked) status - the last member reaching this activity will complete the FULFILL_LINE activities of all the other members before proceeding itself.

Since the Cascading activity comes right after FULFILL_LINE, it will be executed for all members of the current model or fulfillment set in that one go. The aggregate time for completing all the other related OEOL workflows in the same fulfillment set goes into the timing recorded for the OE_FULFILL_WF.START_FULFILLMENT() step.

The following diagrams attempt to represent the dynamics more clearly:

BEFORE calling OE_FULFILL_WF.START_FULFILLMENT() step:

AFTER calling OE_FULFILL_WF.START_FULFILLMENT() step:

Projecting the poor performance..

Based on the situation described above, a simple projection of the poor performance can be done for a particular kind of order. Let's assume that there are 100 lines in a configurable model/item. Let's also assume that for 75% of the order lines, the payment terms and/or agreements do not match those on the header. For each mis-matched line, the time taken for repricing varied from 150 seconds to 750 seconds.

When we calculate the cumulative time taken to execute the xxont_cascade_pkg.set_ln_trm_agr_proc API for the ~75 mis-matched lines out of 100, at ~5 minutes per line, the overall time comes to 75 x 5 minutes = 375 minutes, roughly 6 hours. That is the time for processing the OE_FULFILL_WF.START_FULFILLMENT() step for just one model - which is exactly the performance problem being experienced.

Now, just imagine the performance strain if the order had 10 such configurable items.

Mathematically speaking: total time per model ~ (number of mis-matched lines) x (average repricing time per line) = 75 x ~5 minutes, i.e. roughly 6 hours.

The Solution: Simplify by taking the Highest Common Factor...

There were two important dimensions to the solution:

There was a bug in the 11.5.9 Order Management module which prevented the cascading of order header changes to order lines from working properly (the profile option OM: Sales Order Form: Cascade Header Changes to Line should be set to Yes). This is what prompted the calls to the OE_ORDER_PUB.PROCESS_LINE() APIs in the customization.

Patch 5981436 is to be delivered by Oracle Order Management development to resolve this issue. The fix is also part of 5950291 - 11.5.9 OM CUMULATIVE PATCH. It automatically matches the payment terms and agreements from the order header to the order lines in the majority of cases, except when the user purposely chooses to say No to the cascade option.

Another smart option was to do batch processing of all the mis-matched component lines of a configured model in ONE call to the PROCESS_LINE() API by passing them as a PL/SQL table argument. Also, this was to be done ONLY if the order line in question was for a shippable item, which was especially beneficial since only 1-5% of the sub-components in a configured model are shippable; that made it quite restrictive. This single change took out 95-97% of the repetitive work being done by the OE_ORDER_PUB.PROCESS_ORDER() calls.

       --
       -- Calling API to update Terms and/or Agreement value(s) on the order line
       --
       oe_order_pub.process_line (p_line_tbl      => l_line_tbl,
                                  x_line_out_tbl  => o_line_tbl,
                                  x_return_status => l_return_status,
                                  x_msg_count     => l_msg_count,
                                  x_msg_data      => l_msg_data);
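For context, a hedged sketch of how such a batched call might be assembled. The driving cursor, the placeholder values and the local variable names are illustrative assumptions and not the actual customization; only the PROCESS_LINE signature shown above and the G_MISS/G_OPR constants are the standard ones from OE_ORDER_PUB and OE_GLOBALS:

declare
   l_line_tbl            oe_order_pub.line_tbl_type;
   o_line_tbl            oe_order_pub.line_tbl_type;
   l_return_status       varchar2(1);
   l_msg_count           number;
   l_msg_data            varchar2(2000);
   i                     pls_integer := 0;
   -- placeholder values: the real code derives these from the order header
   l_top_model_line_id   number := 9905479;
   l_hdr_payment_term_id number := 1234;
   l_hdr_agreement_id    number := 5678;
begin
   -- collect only the shippable component lines of one configured model
   for r in (select ol.line_id
             from   oe_order_lines_all ol
             where  ol.top_model_line_id = l_top_model_line_id
             and    ol.shippable_flag = 'Y')
   loop
      i := i + 1;
      l_line_tbl(i)                 := oe_order_pub.g_miss_line_rec;
      l_line_tbl(i).line_id         := r.line_id;
      l_line_tbl(i).payment_term_id := l_hdr_payment_term_id;  -- cascade the header value
      l_line_tbl(i).agreement_id    := l_hdr_agreement_id;     -- cascade the header value
      l_line_tbl(i).operation       := oe_globals.g_opr_update;
   end loop;

   -- ONE call for the whole batch instead of one call per mis-matched line
   oe_order_pub.process_line (p_line_tbl      => l_line_tbl,
                              x_line_out_tbl  => o_line_tbl,
                              x_return_status => l_return_status,
                              x_msg_count     => l_msg_count,
                              x_msg_data      => l_msg_data);
end;
/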

Seeing the performance Boost...

Before putting in the code change:

For a fulfillment set with top_model_line_id=9905479, there were 59 mis-matched lines, and the average time for processing one call of process_line() was ~580 seconds (59 such calls: roughly 9.5 hours in total).

After putting in the code change:

The time taken to process the same 59 mis-matched lines (of the same fulfillment set with top_model_line_id=29905479), passed as one PL/SQL table argument in a single process_line() API call, was ~625 seconds.

Putting statspack data into perspective...

This also explained why statspack reports showed xxont_cascade_pkg.set_ln_trm_agr_proc as one of  the major I/O offenders per execution:

Summary of learnings from the Case study..


  • Don't define expensive custom activities after the START_FULFILLMENT step in the OM workflow. Doing so multiplies the performance hit many times over, especially when the Oracle Configurator and Order Management modules are being used in tandem (better still, don't have expensive custom workflow activities at all)
  • Prefer batch processing over piecemeal processing, especially when an API provides for it. This eliminates most of the repetitive processing
  • The whitebox (drill-down) approach works for taking apart a baffling performance problem. Trying to simulate a test case usually leads to the heart of the performance issue in a more reliable way
  • Getting extensive debug logs and traces is great, but only when it is known what is being looked for. Asking the right probing questions is very important. Question, question, question.
  • A well-thought-out plan with minimal tracing and a drill-down approach can bring better results than a shotgun or blunderbuss approach
  • Sometimes, a high-level functional knowledge of the processing being done can be very useful in understanding the nature of the problem. A balance between strictly technical and purely functional knowledge can be fruitful in solving performance issues

TRIVIA: Why was the customization put in place in the first place?

Cascading is the process in which values from the sales order header level automatically populate to the corresponding fields at the sales order line level.

Due to a bug in 11.5.9, Agreement field at Header level was not populating to the child lines of a configuration.

Because of this, the autoinvoicing program split lines into separate invoices by agreement. There were thousands of invoices where the components were being invoiced separately from the top model. It was an invoicing fiasco.

Appendix A:

sandbox:dev> more run_start_fulfillment_with_om_debug.sql

var resultout varchar2(1000)
set serveroutput on time on timing on

prompt
prompt setting userid, responsibility and application id for JHASLAGE for Order Management Super User
prompt

define user_id=7962
define resp_id=52560
define appl_id=660

define item_key=29904309
define activity_id=141236

declare

l_debug_file varchar2(400);
l_org_id NUMBER;
l_count NUMBER;
l_result VARCHAR2(30);
l_spid varchar2(20);

begin

fnd_global.apps_initialize(&user_id,&resp_id,&appl_id);

select p.spid
into l_spid
from v$process p, v$session s
where s.audsid=userenv('sessionid')
and paddr = addr;

OE_Standard_WF.OEOL_SELECTOR
(p_itemtype => 'OEOL'
,p_itemkey => '&&item_key'
,p_actid => 12345
,p_funcmode => 'SET_CTX'
,p_result => l_result
);

oe_debug_pub.debug_off;

--enable OM debug
oe_debug_pub.debug_on;
oe_debug_pub.initialize;
l_debug_file := OE_DEBUG_PUB.Set_Debug_Mode('FILE');
oe_debug_pub.setdebuglevel(5);

dbms_output.put_line('Debug file is : '|| l_debug_file);

--enable trace
EXECUTE IMMEDIATE ('alter session set max_dump_file_size=unlimited');
EXECUTE IMMEDIATE ('alter session set tracefile_identifier=''gverma_start_fulfillment_with_om_debug_&&item_key''');
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT FOREVER, LEVEL 12''');

OE_FULFILL_WF.START_FULFILLMENT('OEOL','&&item_key',&&activity_id,'RUN',:resultout);

--disable trace
EXECUTE IMMEDIATE ('ALTER SESSION SET EVENTS ''10046 TRACE NAME CONTEXT OFF''');

--turn debug off
oe_debug_pub.debug_off;
dbms_output.put_line('File name '||OE_DEBUG_PUB.G_DIR||'/'||OE_DEBUG_PUB.G_FILE);
dbms_output.put_line('Trace file generated for this request:'||'/*'||l_spid||'*.trc');

rollback;

EXCEPTION WHEN OTHERS THEN

dbms_output.put_line('reached error in script ' ||sqlerrm);

rollback;
end;
/

select :resultout from dual
/

Appendix B:

col ITEM_KEY for a10 head "Item Key"
col BEGIN_DATE for a18 head "Begin Date"
col END_DATE for a18 head "End Date"
col ROOT_ACTIVITY for a30 head "Root Activity"
col RESULT for a15
col STATUS for a8 head Status
col time_taken_mins for 99999 head "Time(Mins)"
col step format a60

set pages 100 time on timing on lines 150

spool get_details_for_wf_step

select ias.item_key,
       to_char(ias.begin_date,'DD-MON-RR HH24:MI:SS') begin_date,
       to_char(ias.end_date,'DD-MON-RR HH24:MI:SS') end_date,
       (ias.end_date-ias.begin_date)*24*60 time_taken_mins,
       pa.process_name||'->'||pa.instance_label step,
       ias.activity_status status,
       ias.activity_result_code result
from   wf_item_activity_statuses ias,
       wf_process_activities pa
where  ias.item_type = 'OEOL'
and    ias.process_activity = pa.instance_id
and    ias.begin_date >= sysdate-14
and    pa.process_name||'->'||pa.instance_label = 'XXONT_PROD_SEC_STANDARD_LINE->FULFILL_LINE'
and    ias.activity_status = 'COMPLETE'
--
-- and the step was not cancelled
--
and    ias.activity_result_code != '#FORCE'
--
-- it took more than so many mins to complete this step
--
and    (ias.end_date-ias.begin_date)*24*60 > 30
order by (ias.end_date-ias.begin_date)*24*60 desc, ias.begin_date
/

spool off

A SQL Tuning Case Study: Could we K.I.S.S. Please?

Gaurav Verma - Sat, 2008-03-01 01:47

Preface
They say it pays to K.I.S.S (Keep It Simple and Smart). The simplest ideas are the greatest ideas. This philosophy also applies to SQL coding, where we (indirectly) tell the Oracle SQL engine what to fetch and how to fetch it. There are great rewards for simple thinking.
Let's discuss this principle through a case study in which the user was trying to fetch very simple data through an overkill query involving RA_* views, which internally used some of the same tables. The SQL engine was therefore fetching and filtering far too much data from the same sources. It was certainly not efficient.
In this case study, we see how important it is to understand the "why" of a query rather than the "how". Many times, just rethinking the whole approach pays great dividends.
The Poorly Performing Query
The client had a custom concurrent program which had this query as the worst offender: it was taking 90% of the time, and the program used to take ~1 hour to finish.
The formatted query is listed below:
SELECT SUBSTR(A_CUST.CUSTOMER_NUMBER,1,8) CUST_NUMBER, 
       HZ_CUST_SITE_USE.LAST_UPDATE_DATE LAST_UPDATE_DATE,
       SUBSTR(A_CUST.CUSTOMER_NUMBER,1,8) BILL_CUST_ID
FROM
    RA_CUSTOMERS A_CUST,
    HZ_CUST_ACCT_SITES_ALL HZ_CUST_ACCT_SITE,
    HZ_CUST_SITE_USES_ALL HZ_CUST_SITE_USE,
    RA_ADDRESSES_ALL RAA
WHERE
  A_CUST.CUSTOMER_NUMBER = :B1
  AND A_CUST.CUSTOMER_ID =  HZ_CUST_ACCT_SITE.CUST_ACCOUNT_ID
  AND HZ_CUST_ACCT_SITE.CUST_ACCT_SITE_ID = 
      HZ_CUST_SITE_USE.CUST_ACCT_SITE_ID
  AND HZ_CUST_ACCT_SITE.PARTY_SITE_ID =  RAA.PARTY_SITE_ID
  AND HZ_CUST_ACCT_SITE.STATUS = 'A'
  AND HZ_CUST_SITE_USE.STATUS = 'A'
  AND HZ_CUST_SITE_USE.SITE_USE_CODE =  'BILL_TO'
  AND RAA.ADDRESS1 = 'ICC LEASE'
/
Let's look at the TKPROF stats and the runtime SQL plan used, along with the row counts.

It's interesting to observe that while only ~3000 rows are returned by this SQL, a huge number of rows were fetched from one of the HZ* tables.
Let's check the underlying tables used by the views RA_ADDRESSES_ALL and RA_CUSTOMERS. RA_ADDRESSES_ALL was a synonym for RA_ADDRESSES_MORG, while RA_CUSTOMERS was a synonym for RA_HCUSTOMERS.
  1  select referenced_name, referenced_type
  2  from dba_dependencies
  3* where name='RA_HCUSTOMERS'
SQL> /
REFERENCED_NAME         REFERENCED_TYPE
----------------------- -----------------
HZ_CUST_ACCOUNTS        TABLE
HZ_PARTIES              TABLE
SQL> l
  1  select referenced_name, referenced_type from dba_dependencies
  2  where name='RA_ADDRESSES_MORG'
  3* and referenced_type='TABLE'
SQL> /
REFERENCED_NAME                REFERENCED_TYPE
------------------------------ -----------------
HZ_PARTY_SITES                 TABLE
HZ_LOCATIONS                   TABLE
HZ_CUST_ACCT_SITES_ALL         TABLE
HZ_LOC_ASSIGNMENTS             TABLE
Let's also check the data volume involved in the base HZ* tables being referenced, and the estimation sample percentage (~70%) used by the Gather Statistics program:
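(The screenshot of that check is not reproduced here. A minimal sketch of the kind of dictionary query that shows row counts and the sample size used by Gather Statistics follows; the owner filter assumes the standard 11i TCA ownership under the AR schema.)

select table_name, num_rows, sample_size,
       round(100 * sample_size / nullif(num_rows,0)) pct_sample,
       last_analyzed
from   dba_tables
where  owner = 'AR'
and    table_name in ('HZ_PARTIES','HZ_CUST_ACCOUNTS','HZ_PARTY_SITES',
                      'HZ_LOCATIONS','HZ_LOC_ASSIGNMENTS',
                      'HZ_CUST_ACCT_SITES_ALL','HZ_CUST_SITE_USES_ALL');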
Do we need this complexity?
This is a good time for a reality check: this is certainly overkill.
Why would you want to join complex views that each refer to 3-4 base tables internally? Not an intelligent approach, shall we say? Better to query the same data by understanding the data-model relationships between the base tables involved and to see whether the same results can be achieved through them.
Spending time on checking the data volumes involved, the estimation percentage used by Gather Stats, index cardinality, etc. would not lead to anything conclusive or any direct inference in this case.
Another good learning from the http://etrm.oracle.com website (requires a Metalink login: Note:150230.1) regarding the RA_* views is that they should never be used directly in ANY customizations! All the more reason to eliminate the views from the query altogether and find an alternate way of getting the same result.

 The Better Way...
Only one column was really being queried. In spite of this, after understanding the object-model relationship diagrams for the HZ* tables, at least five HZ* tables still had to be queried for a supplied customer_number value.
What was the SQL trying to achieve?
For a particular customer_number, it wanted to find the last_update_date for an ACTIVE site which was being used as an ACTIVE BILL_TO location, with an address of ICC LEASE in its information.
Let's look at the index information:
select column_position, column_name
from dba_ind_columns
where index_name='&1'
/
Enter value for 1: HZ_LOCATIONS_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
              1 ADDRESS1
SQL> /
Enter value for 1: HZ_PARTY_SITES_U1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
              1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_PARTY_SITES_N2
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
              1 LOCATION_ID
SQL> /
Enter value for 1: HZ_CUST_ACCT_SITES_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
              1 PARTY_SITE_ID
SQL> /
Enter value for 1: HZ_CUST_SITE_USES_N1
COLUMN_POSITION COLUMN_NAME
--------------- ------------------------------
              1 CUST_ACCT_SITE_ID
              2 SITE_USE_CODE
              3 PRIMARY_FLAG
The same SQL query could be rewritten to fetch the same data from the base HZ_* tables, after connecting the foreign key relationships.
Such an approach had many advantages:

  • The RA_* views are avoided, as they are not supported outside the standard 11i products; besides, it is highly inefficient to access 7 tables/views to get 1 column
  • The new query fetches less than half the number of consistent-read buffers compared to the previous query (verified using the SET AUTOTRACE ON SQL*Plus command)
  • The new query has just 14 access steps, as opposed to 21 access steps for the previous un-tuned query
  • The new query uses the unique primary key indexes of three of the five tables and foreign key indexes for the remaining two, which is a really selective way of querying data
It simply cannot get better than that!
The performance improvement was very visible, as the improved query ran within 2 minutes. Compare that with its previous cousin, which used to run for ~1 hour.
SELECT SUBSTR(hca.account_NUMBER,1,8) CUST_NUMBER, 
       hcsu.LAST_UPDATE_DATE LAST_UPDATE_DATE,
       SUBSTR(hca.account_NUMBER,1,8) BILL_CUST_ID
FROM
   hz_cust_accounts       hca,
   hz_cust_acct_sites_all hcas,
   hz_cust_site_uses_all  hcsu,
   hz_party_sites         hps,
   hz_locations           loc
WHERE
--
-- Joining all the tables through primary keys
-- and foreign key relationships first
--
      hca.account_number = :B1
  and hca.cust_account_id = hcas.cust_account_id
  and hcas.party_site_id = hps.party_site_id
  and hca.party_id = hps.party_id
  and hps.location_id = loc.location_id
  and hcas.cust_acct_site_id = hcsu.cust_acct_site_id
--
-- the actual data query criteria starts here
--
  AND hcas.STATUS = 'A'
  AND hcsu.STATUS = 'A'
  AND hcsu.SITE_USE_CODE =  'BILL_TO'
  AND loc.ADDRESS1 = 'ICC LEASE'
--
-- Could perhaps add this criteria too ???
-- AND hcas.bill_to_flag = 'Y'
--
/
Before:

After:
Summary
It pays to ask the "why" questions sometimes, rather than focusing on the "how" and the "what" and going on a wild goose chase. The "why" questions can get you to the answer more quickly by rethinking the strategy.
Einstein himself said, "everything should be made as simple as possible, but no simpler."
The defence rests.
Related Links
Other Performance Related Articles

An Application Tuning Case Study: The Deadly Deadlock

Gaurav Verma - Sat, 2008-03-01 01:36

Preface

Let's face it: deadlock and latching issues have always been hard to debug. Many times, we don't focus enough on application tuning, even though it is always advisable to look at the application design and user access trends first. These things don't come to mind too readily.
Many times, we face situations where users are deadlocked, sitting and watching blue screens, or the application is freezing up. While there are many possible reasons for this - poor OS implementation (swapping/disk/memory issues), RDBMS bugs, RDBMS initialization parameter values, a lot of non-sharable SQL - a good share of cases are also caused by poor application coding done without understanding the sequence in which resources should be accessed or acted upon (updated/deleted, etc.).
This is one such case study, in which the problem was traced back to code that was unaware of the deadlock situation, and the solution was to enhance the code so that it would not cause deadlocks in the future. While the change was really simple, picturing the deadlock in action was really difficult.
This article also goes into the mechanics of reconstructing the deadlock situation in slow motion, which makes it an instructive case study to read through for future reference.
Getting the Bad Press...
The client had a big OLTP implementation of Order Management (order-to-cash business flow) on the 11i E-Business Suite, and their bread and butter was taking orders for their configurable products.
Many times during any business day, the customer service representatives (~200 people) would get blue screens or would hang while trying to work on orders. Frantic helpdesk calls from angry customer service reps were normal, and obviously this was under the management's scanner since it was highly visible.
Customer service reps would keep making small talk with the customers who called in to order chairs and the like, and eventually the customers would give up on the wait. As a fallout of this, the client was irate (they were losing their customers) and frustrated with the Order Management implementation.
Everybody in customer service would say: why is Oracle so slow?
Scratching the surface...
This continued for months, and it was normal for several myths about bad Oracle performance to float around. Worse, no one had any idea what was happening. Someone blamed workflow, someone blamed the network, and so on.
The customer had done a lot of customizations. Preliminary investigations revealed that a deadlock was often detected between a concurrent program session and a forms user session.
As a corollary, the custom "Order Acknowledgment" program would sometimes hang inordinately for more than half an hour, whereas it would otherwise "normally" finish in 1-2 minutes.
Analyzing wait events (using v$session_wait) and 10046 event session tracing didn't lead anywhere. The pattern was not predictable either. It was really puzzling behaviour.
Understanding the deadlock in slow motion...
The "XX Sales Order Acknowledgement XML publishing" program used to run every minute. There had to be some co-relation between the programs code (table/object access/locking pattern) and the activity being done on Order entry forms.
After analyzing the code and seeing how underlying tables were being accessed or updated, It became clear that there was a problem in the "XX Sales Order Acknowledgement XML publishing" code.
The complete picture of the deadlock pattern was available now. Lets see it in slow motion here: Deadlock Situation 1:
From the entire bunch of qualifying Order header records to be processed for Order Acknowledgements, a batch of 25 Orders were updated at a time. After the updation of 25 orders, a commit was issued, which then freed the locks.
Now, out of these 25 records, if "XX Sales Order Acknowledgement XML publishing" concurrent program (say A) was trying to update the first Order, and there were be a Customer Rep (say B) who was working on the very same order in OE screen, "A" HAD TO WAIT till B freed the record by either commit/rollback action in Order Entry form.
TILL THE TIME B freed the same record, "A" was not be able to proceed with the remaining 24 Orders in the batch and kept on waiting for the row level exclusive lock.
It eventually either timed out or acquired the lock on the row. Similarly, "A" was not be able to proceed till "B" was done with the updates on the Order Header in OEXOETEL (Order Entry) form.
The same applied for any other Order which A had to update. Thus the situation when a single run of "XX Sales Order Acknowledgement XML publishing" concurrent program would be taking 20 mins or so. During this time, one would witness a high Order Entry udpating activity at the Customer Support site.
Now, let's reverse the situation.
Deadlock Situation 2:
At any time, there might be, say, 20 customer reps (say "B") working on 25 or 30 orders. Therefore, in theory, around 20-30 order entry header rows might be locked in exclusive mode.
Now, every minute, the "XX Sales Order Acknowledgement XML publishing" concurrent program was trying to update these very order entry header rows and was having a hard time waiting for the row-level exclusive locks to be released.
When "B" saved its changes, the commit released the row-level exclusive locks. Only then was "A" able to proceed with its job. The real-time deadlocks seen during the day were caused by a mix of deadlock situations 1 and 2.
Both of the above situations were happening at the same time and vying for the same resource - the same row in the order entry headers table - to become free.
This was a vicious, endless cycle, and it caused locking grief for the people working on Order Entry.
Breaking the cycle
Once the nature of the application design and sequence of accessing resources were understood, it was clear that the vicious cycle had to be broken.
It was best to make the custom program more intelligent by first checking whether the order header record was locked (which can be done using the SELECT ... FOR UPDATE NOWAIT clause).
Any record that was already locked was skipped for update. The logic was that a skipped order header record would be picked up for processing by the custom program on a later run, AFTER the customer service rep finally committed or rolled back on the Order Entry form (the program was running every minute, remember).
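A minimal sketch of that pattern follows; the candidate criteria, batch size and the acknowledgement processing itself are purely illustrative placeholders, not the actual custom code:

declare
   resource_busy exception;
   pragma exception_init(resource_busy, -54);   -- ORA-00054: resource busy with NOWAIT
   l_header_id  oe_order_headers_all.header_id%type;
begin
   for r in (select header_id
             from   oe_order_headers_all
             where  open_flag = 'Y'              -- illustrative candidate criteria only
             and    rownum <= 25)                -- illustrative batch size
   loop
      begin
         -- probe the lock first; fail immediately instead of waiting on a form session
         select header_id
         into   l_header_id
         from   oe_order_headers_all
         where  header_id = r.header_id
         for update nowait;

         -- ... publish the acknowledgement / update the header here ...
      exception
         when resource_busy then
            null;   -- skip it; the next one-minute run will pick this order up
      end;
   end loop;
   commit;   -- one commit for the batch
end;
/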
Summary
After analyzing the table/object access/locking pattern and understanding the code, the problem turned out to lie in the custom application design. Essentially, the custom order acknowledgment program used to update a batch of 25 order header records in a sequential fashion and only then issue a commit.
The update statement in the culprit concurrent program never checked whether an order header record was already locked by another session; hence, as per the default behavior, it simply waited until the lock was released - a potentially very long wait unless the NOWAIT clause is specified.
This could happen for any of the 25 order header records it tried to update. When a customer service rep wanted to update the same order that was being processed by the order acknowledgment program, that person's form would hang, since the form session couldn't acquire a lock on the order header record until a commit was issued by the order acknowledgment program.
On the other hand, when a customer service rep updated an order through the Order Entry screen, they would lock up the order for a good 10-20 minutes (they kept doing things to it while talking to the customer on the phone), and the custom order acknowledgment program would keep waiting for the exclusive lock on that order header record.
Again, this could happen for any of those 25 records in the batch, as per the design. So the deadlock was happening both ways.
And it was really subtle. NO amount of tracing or latch analysis or metalink research would have brought the troubleshooter to this stage of understanding the real issue.
Related Links
Other Performance Related Articles

Yet Another Case Study: The over-commit trap!

Gaurav Verma - Sat, 2008-03-01 01:33

Introduction
It's always rewarding to talk about lesser-known performance tuning scenarios, which are not all that well appreciated in the DBA or developer world. Ask anyone what performance tuning they have worked on, and quite likely you will hear: ah well, I made sure the stats were freshly gathered, the right index was being used, or I used a hint to force the right index. How often do we look at other aspects, like the frequency of commits?
An atypical performance problem scenario
At a particular client site, a custom concurrent program had a history of taking more than 8-10 hours to complete.
The application load was a mixture of high OLTP during regular business-day hours and more than moderate batch processing from the evening to the early morning hours. Needless to say, a lot of long-running batch jobs were contending for shared resources like rollback and temp segments.
A lot of diagnosis was done as to why the custom concurrent program would take so long. Quite often, the program would terminate with the notorious ORA-1555 (snapshot too old) error, even though the data volume involved was quite small (7000 records) and there was abundant rollback segment space -- 120G.
Even setting the undo retention initialization parameter undo_retention as high as 8 hours did not help iron out the performance issue.
Examining the possibilities...
There could have been several reasons why the performance lag was manifesting:

  • The query had a bad plan and the compile-time plan was different from the runtime plan. This happens many times, but it did not seem to be the case here
  • A hint to use a unique index (instead of index range scans) to avoid the db file sequential read wait event seemed to make no difference
  • The level of DML activity against the tables within the process could simply be too much for the batch load to complete successfully in a suitable time window. This theory did not sound right, since the rollback retention period was ~8 hours
  • Was the program perhaps not committing or rolling back often enough to stay within the rollback retention period?
Foxy observations..
While SQL tracing (event 10046) identified that a simple select on a custom table with not much data was having a hard time, the corresponding SQL plan was using a unique index (!!). What more could one ask for? The situation was almost weird.
At the same time, PL/SQL profiling (dbms_profiler) identified that the FETCH phase of that very simple select was taking long. The HTML output generated from the PL/SQL profiler runs, using the profiler.sql script of Metalink Note 243755.1, was especially helpful in getting to this stage of the analysis and determining where the maximum delay in the entire process lay.
Getting to the bottom of it
Again, one more thing identified in the tkprof'ed output of the raw SQL trace - something that could easily be missed - was that the CR (consistent read) blocks read for the fetch phase of the cursor were high. The SQL engine had to read a huge number of blocks to form a read-consistent image for the FETCH phase. This aspect is not readily discernible, since the usual thought process is to look only at elapsed time and the SQL plan.
What could be the reason for such a high number of block reads to form a read-consistent image? Too few commits/rollbacks, or too many? Overdoing anything is usually not a rewarding experience.
Looking closer into the code...
On re-examining the PL/SQL profiler HTML output, it was seen that the program logic was executing a commit statement within a loop, for every input record processed.
This would obviously overtax the rollback segments and force the session to reconstruct a read-consistent image of the corresponding data for the FETCH phase over and over again.
Taking a cue from this observation, when commits were done in a batched fashion at a threshold, the entire load ran in 4 hours (input volume: 7000 lines).
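A minimal sketch of the change, using hypothetical names (xx_input_data stands in for the real driving table, and the commented-out call for the real per-record logic); the point is simply to move the commit from every record to every N records:

declare
   c_commit_threshold constant pls_integer := 1000;   -- commit every N records, not every record
   l_done             pls_integer := 0;
begin
   for r in (select * from xx_input_data)              -- hypothetical driving table
   loop
      -- process_record(r);   -- the real per-record business logic goes here
      l_done := l_done + 1;

      if mod(l_done, c_commit_threshold) = 0 then
         commit;                                       -- batched commit at the threshold
      end if;
   end loop;
   commit;                                             -- pick up the remainder
end;
/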

Conclusion
Neither is it good to commit too infrequently, nor is it good to over-commit. Moderation is the key to a happy medium, and at times it has a bearing on performance problems as well.
Who said software cannot learn from Philosophy?

A tuning case study: The goofy optimizer (9i.x RDBMS )

Gaurav Verma - Sat, 2008-03-01 01:23

Preface
They say: tune the application first. And yet, many times, the CBO behaviour is not predictable or favourable, in spite of its best intentions, its internal heuristic rules, or the availability of the latest statistics with a good sampling percentage.
This is one such example, showing how hints had to be used as one of the last options to improve the performance of a concurrent program.
Ideally speaking, the CBO chooses what is usually the best, or one of the best, available data access plans for executing the SQL and fetching its output. It should take into consideration all the underlying objects' statistics and use the least costly data access plan. Sometimes, it doesn't happen out of the box.
Note: Some seeded 11i Oracle Applications modules also depend heavily on hints to make the performance of their code acceptable. One such heavy user is the QP, or Advanced Pricing, module.
The Problem Scenario...
At a client site, a shipping-related custom concurrent program started taking ~5 hours to complete on average, rather than the usual ~3 hours. Preliminary investigation using RDBMS trace (event 10046) and the PL/SQL dbms_profiler showed that the maximum time (~4.5 hours) was taken up by the following innocent-looking SQL itself!!
SELECT HP.CATEGORY_CODE CATEGORY_CODE ,
       HC.SALES_CHANNEL_CODE  SALES_CHANNEL_CODE ,
       HC.CUSTOMER_CLASS_CODE CUSTOMER_CLASS_CODE ,
       SUM(L.SHIPPED_QUANTITY * L.UNIT_SELLING_PRICE)
         ORDER_SHIP_TOT ,
       L.ACTUAL_SHIPMENT_DATE SHIP_DATE ,
       DECODE (H.ORG_ID, 362, 'US', 365, 'CAD',
                         5624, 'MEX', H.ORG_ID) OPERATING_UNIT ,
       TT.NAME ORDER_TYPE
FROM
  ONT.OE_ORDER_HEADERS_ALL H ,
  APPS.OE_ORDER_LINES_ALL L ,
  APPS.HZ_PARTIES HP ,
  APPS.HZ_CUST_ACCOUNTS HC ,
  APPS.HZ_CUST_SITE_USES_ALL HCSU ,
  APPS.HZ_CUST_ACCT_SITES_ALL HCAS ,
  APPS.OE_ORDER_SOURCES OS ,
  APPS.OE_TRANSACTION_TYPES_TL TT ,
  XXIW.XXIW_BS_OM_ORDER_TYPES XOT
WHERE
     H.INVOICE_TO_ORG_ID = HCSU.SITE_USE_ID(+)
AND H.ORDER_SOURCE_ID = OS.ORDER_SOURCE_ID
AND OS.NAME <> 'Satcom Closed'
AND H.ORDER_TYPE_ID = TT.TRANSACTION_TYPE_ID
AND TT.LANGUAGE = 'US'
AND TT.NAME NOT LIKE 'INTERNAL%'
AND TRUNC(L.ACTUAL_SHIPMENT_DATE)
     BETWEEN NVL(:B2 , TRUNC(L.ACTUAL_SHIPMENT_DATE))
         AND NVL(:B1 , TRUNC(L.ACTUAL_SHIPMENT_DATE))
AND HCSU.CUST_ACCT_SITE_ID = HCAS.CUST_ACCT_SITE_ID(+)
AND HCAS.CUST_ACCOUNT_ID = HC.CUST_ACCOUNT_ID(+)
AND HC.PARTY_ID = HP.PARTY_ID(+)
AND UPPER(HC.CUSTOMER_CLASS_CODE) <> 'INTERNAL'
AND UPPER(HC.SALES_CHANNEL_CODE) <> 'INTERNAL'
AND H.FLOW_STATUS_CODE NOT IN ('CANCELLED','ENTERED')
AND H.BOOKED_DATE IS NOT NULL AND L.FLOW_STATUS_CODE = 'CLOSED'
AND L.LINE_CATEGORY_CODE <>'RETURN'
AND L.HEADER_ID = H.HEADER_ID
AND TT.NAME = XOT.ORDER_TYPE
AND XOT.TYPE = 'ALL'
GROUP BY HP.PARTY_NAME, HP.CATEGORY_CODE, HC.SALES_CHANNEL_CODE,
         HC.CUSTOMER_CLASS_CODE, H.ORDER_NUMBER, H.HEADER_ID,   
         H.BOOKED_DATE, L.ACTUAL_SHIPMENT_DATE,
         H.FLOW_STATUS_CODE, H.ORG_ID, OS.NAME, TT.NAME

The top time-taking code section from the PL/SQL dbms_profiler data is shown below. As can be seen, the maximum time is spent in the FETCH of the cursor executing this query:
Evaluating options
The first thing which comes to mind is: can we play with different sampling estimates, e.g. 10%, 50%, 70%? While the idea is appealing, it can destabilize many other SQL queries, and it is an uphill task to judge the impact of such a change on the rest of the application's performance. A heavy price to pay for tuning one lone SQL.
Again, the statistics on the objects were recent enough (stats gathered on a weekly basis as usual), and there had been no big changes in the data volumes as such.
Getting into the optimizer's thought process
From the SQL plan being used, it can readily be seen that the optimizer acted greedily, like this: it saw that there were several small HZ* tables which could easily fit into PGA/SGA memory, and so it thought -- cool, I can do a quick hash join of these folks (that would also take care of the outer-join predicate evaluation). After that, it proceeded to retrofit the remaining tables into the plan that was developing.
A fine approach. The only problem is that the resulting index range scan on the header_id column, using the non-unique index on the oe_order_lines_all table, simply kills the performance of the fetch phase. It is, very simply, very inefficient.
19101503 INDEX RANGE SCAN OE_ORDER_LINES_N1 (object id 334899)
A much better approach would have been to first filter the header_id values from the OE_ORDER_HEADERS_ALL table, so that the query would use the unique primary key index of OE_ORDER_HEADERS_ALL and join to the OE_ORDER_LINES_ALL table using a nested loop operation.
This by itself would filter off a huge chunk of the data (which is mostly centered around the OE_ORDER_LINES_ALL table). The remaining HZ* and miscellaneous data could easily be plugged in via hash joins. The idea was to drive the query from OE_ORDER_HEADERS_ALL in order to eliminate the biggest chunk of unneeded data first.

Also, the predicate AND h.booked_date IS NOT NULL was replaced with and h.booked_flag = 'Y'. Why use a NOT NULL clause when a flag is available in OE_ORDER_HEADERS_ALL?

Also, a custom index xxoe_order_lines_n11 was found to exist on the oe_order_lines_all.actual_shipment_date column. We thought we would make use of it too. The more the merrier.

So, after grouping the related predicates together, formatting the query, and putting these ideas into practice, the original query was changed as follows:
SELECT /*+ index (h,oe_order_headers_u1)
index (l,xxoe_order_lines_n11)
           use_nl(h,l)
       */
      hp.category_code  category_code
     ,hc.sales_channel_code sales_channel_code
     ,hc.customer_class_code customer_class_code
     ,SUM(l.shipped_quantity * l.unit_selling_price) order_ship_tot
     ,l.actual_shipment_date ship_date
     ,DECODE (h.org_id,
        362,  'US',
        365,  'CAD',
        5624, 'MEX',
        h.org_id)  operating_unit
     ,tt.NAME  order_type
   FROM
         ont.oe_order_headers_all  h
        ,apps.oe_order_lines_all   l
        ,apps.hz_parties           hp
     ,apps.hz_cust_accounts        hc
     ,apps.hz_cust_site_uses_all   hcsu
     ,apps.hz_cust_acct_sites_all  hcas
     ,apps.oe_order_sources        os
     ,apps.oe_transaction_types_tl tt
     ,xxiw.xxiw_bs_om_order_types  xot
   WHERE
          h.header_id                    = l.header_id
      AND h.ORDER_SOURCE_ID              = os.ORDER_SOURCE_ID
      AND h.ORDER_TYPE_ID                = tt.TRANSACTION_TYPE_ID
      AND h.flow_status_code not in ('CANCELLED','ENTERED')
      -- AND h.booked_date IS NOT NULL
      -- replaced with the booked_flag check below
      and h.booked_flag = 'Y'
  
      AND tt.LANGUAGE                    = 'US'
      AND tt.NAME                 not like 'INTERNAL%'
      AND l.flow_status_code             = 'CLOSED'
      AND l.LINE_CATEGORY_CODE          <>'RETURN'
      AND TRUNC(L.ACTUAL_SHIPMENT_DATE)
          BETWEEN NVL(:B2 , TRUNC(L.ACTUAL_SHIPMENT_DATE))
              AND NVL(:B1 , TRUNC(L.ACTUAL_SHIPMENT_DATE))
      AND h.invoice_to_org_id             = hcsu.site_use_id (+)
      AND hcsu.cust_acct_site_id = hcas.cust_acct_site_id(+)
      AND hcas.cust_account_id           = hc.cust_account_id(+)
      AND hc.party_id                    = hp.party_id(+)
      AND UPPER(hc.customer_class_code) <> 'INTERNAL'
      AND UPPER(hc.sales_channel_code)  <> 'INTERNAL'
      and tt.name                        = xot.order_type
      and xot.type                       = 'ALL'
      AND os.name                       <> 'Satcom Closed'
   GROUP BY
      hp.party_name,
      hp.category_code,
      hc.sales_channel_code,
      hc.customer_class_code,
      h.order_number,
      h.header_id,
      h.booked_date,
      l.actual_shipment_date,
      h.flow_status_code,
      h.org_id,
      os.name,
      tt.name;

The Improvement

The same query now executed in under 5 minutes!!! From ~4.5 hours down to ~5 minutes. That's a quantum leap.

The improved plan was as follows:
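(The plan screenshot is not reproduced here. For reference, a minimal sketch of how such a plan can be captured and displayed on 9.2, using a cut-down stand-in for the tuned statement - substitute the full query from above in practice:)

-- a cut-down stand-in for the tuned query; substitute the full statement from above
explain plan set statement_id = 'XXSHIP_TUNED' for
select /*+ index (h,oe_order_headers_u1) index (l,xxoe_order_lines_n11) use_nl(h,l) */
       count(*)
from   ont.oe_order_headers_all h,
       apps.oe_order_lines_all  l
where  h.header_id = l.header_id
and    h.booked_flag = 'Y'
and    l.flow_status_code = 'CLOSED';

-- dbms_xplan is available from 9.2 onwards
select * from table(dbms_xplan.display('PLAN_TABLE','XXSHIP_TUNED'));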

Conclusion

While it may not be ideal, at times it may be necessary to supplement or coach the CBO to achieve acceptable levels of application performance. The effect may be longer lasting if the "relative" data volume pattern of the underlying objects is not expected to change drastically in the system.

Case study: How to leverage +LargePages option of J2SE1.5 on Linux hugemem kernel for 11i Oracle Applications java servlet based modules

Gaurav Verma - Sat, 2008-03-01 00:54

Preface

This case study deals with the Oracle Configurator product (short code CZ) of 11i Oracle Applications and how to make it more stable and scalable by leveraging the LargePages option provided in J2SE 1.5. This paper is a continuation of the earlier article Common Incorrect beliefs about process address limits on 32bit platforms.
It is assumed that the maintenance release of 11i Oracle Applications is 11.5.10 as per certification specified in Note 304099.1 - Using J2SE Version 5.0 with Oracle E-Business Suite 11i. To find the latest version of Java certified with the E-Business Suite application tier, please see OracleMetaLink Note 300482.1, titled "Overview of Using Java with Oracle E-Business Suite Release 11i".
This study assumes RHAS3 as the running platform OS.

The general approach and the case for this study

To increase the stability and response time of a Java servlet based application (say, Oracle Configurator), at least these two conditions must be satisfied:

  • Garbage collection must be minimized - The larger the overall heap size, the less contention for space across different sessions, assuming the user load remains the same. The lower the demand for object or session allocation in the JVM heap, the less garbage collection there will be.
  • Response time must remain reasonably good - This, again, depends on garbage collection. During full garbage collection (GC) cycles, the application is not able to respond to user activity because it is doing background maintenance. Therefore, the less garbage collection, the better the response time.
There might also be a requirement for considerable caching of configurator models in the JVM heap, which adds to the memory overhead. Using the LargePages option can be useful in such a situation.
Installing the hugemem kernel

Based on the article Redhat Linux Kernels and process space limits using Java.. Some eye-openers, it is already established that the hugemem kernel allows addressing up to 3.4G of process address space.

Install the hugemem kernel rpm for the same Linux kernel level on the configurator tier. As an example, for the 2.4.21-40 level, kernel-hugemem-2.4.21-40.EL.i686.rpm is the corresponding rpm.

As root:

# rpm -Uvh kernel-hugemem-2.4.21-40.EL.i686.rpm

Since you will be persisting with the hugemem kernel across boots, make hugemem the default kernel by specifying the default=0 line in the boot loader configuration file. Assuming the boot loader is GRUB, the file would be /etc/grub.conf. default= specifies the order of the kernel listed in the loader configuration file, starting with 0 as the first; it might be set to a different value (1, 2, 3) depending on how many kernels are installed on the machine.

Note: Installing the hugemem kernel from the rpm does put the hugemem kernel as the first entry in the boot loader configuration file (/etc/grub.conf), therefore default=0 will need to be specified.


#more /etc/grub.conf

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes
# to this file
# NOTICE: You have a /boot partition. This means that
# all kernel and initrd paths are relative to /boot/, eg.
# root (hd0,0)
# kernel /vmlinuz-version ro root=/dev/rootvg/lv1
# initrd /initrd-version.img
#boot=/dev/sda
default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
title Red Hat Enterprise Linux ES (2.4.21-40.ELhugemem)
root (hd0,0)
kernel /vmlinuz-2.4.21-40.ELhugemem ro root=/dev/rootvg/lv1 hda=ide-scsi
initrd /initrd-2.4.21-40.ELhugemem.img
.....
.....


Create a symbolic link for System.map in /boot pointing to the System.map file specific to the hugemem kernel.

#pwd
/boot
# ls -l System.map
lrwxrwxrwx 1 root root 30 May 24 20:55 System.map -> System.map-2.4.21-40.ELhugemem

This is also a good time to install jdk-1_5_0_06-linux-i586.rpm (J2SE(TM) Development Kit 5.0 Update 6 from http://java.sun.com).


As root:

# rpm -Uvh jdk-1_5_0_06-linux-i586.rpm

Warn the Operating System that the limits will be challenged:


  • Add the following in /etc/security/limits.conf

applmgr soft nofile 65536
applmgr hard nofile 65536
applmgr soft nproc 16384
applmgr hard nproc 16384
applmgr soft stack unlimited
applmgr hard stack unlimited
applmgr soft memlock unlimited
applmgr hard memlock unlimited


  • Set the ulimits for the unix user running the JVM so that ulimit -a shows the following:

# ulimit -a
time(cpu-seconds) unlimited
file(blocks) unlimited
coredump(blocks) 0
data(kbytes) unlimited
stack(kbytes) unlimited
lockedmem(kbytes) unlimited
memory(kbytes) unlimited
nofiles(descriptors) 65536
processes 16384

Implement hugepage & supporting features in OS

Make kernel configuration changes for supporting Hugepages and shared memory segments. The example assumes that you would be locking 9GB of real memory on the machine with 12G of RAM.


  • Add this to /etc/sysctl.conf:

##############################################################
# Added for huge page access
#
# This is worth of 9GB locked in Real Memory
# This limit needs to be changed as per the available memory
# on the Box with 12g RAM. Change this as per your
# requirement
#
##############################################################
vm.hugetlb_pool=9216

##############################################################
# Added to keep maxsize of shared memory segment at 3.5g
#
# This is required to support LargePages feature of JDK1.5
# and to avoid the following message at JVM startup:
# Java HotSpot(TM) Server VM warning: Failed to reserve
# shared memory (errno = 22).
##############################################################
kernel.shmmax=3758096384
kernel.shmall=3758096384
##############################################################


  • Mount hugepage filesystem as root:

# mkdir -p /mnt/hugepages


  • The following entries should be there in /etc/fstab:

# vi /etc/fstab
..
..
##############################################################
# These entries are needed to make hugemem work
##############################################################
none /dev/shm tmpfs defaults,size=9G 0 0
##############################################################
# In the above setting, size should be modified as real
# memory being locked in sysctl.conf for vm.hugetlb_pool
##############################################################
nodev /mnt/hugepages hugetlbfs rw 0 0
##############################################################


  • Reboot the machine to boot into the hugemem kernel and for the hugepage allocation to be reflected in /proc/meminfo. After the reboot, it should look like this:

$ grep -i huge /proc/meminfo
HugePages_Total: 4608
HugePages_Free: 4608
Hugepagesize: 2048 kB


Note: For dual-core CPU Dell machines, it was found that even though Hugepagesize was 2MB, the number of pages specified in /etc/sysctl.conf had to be doubled to amount to a particular memory limit, say 9G. You are advised to experiment and verify with the output of the previous command on your machine to check whether this holds true for you too.



  • Check that the hugepagefs and tmpfs filesystems have been mounted with the required options:

# mount | grep huge
nodev on /mnt/hugepages type hugetlbfs (rw)

# mount | grep tmpfs
none on /dev/shm type tmpfs (rw,size=9G)

Prepare the Java servlet application for leveraging the LargePage JVM option


$ whoami
applmgr

$ vi $IAS_ORACLE_HOME/Apache/Apache/bin/java.sh

# ###############################################################
#
# This file is automatically generated by AutoConfig. It will be
# read and overwritten. If you were instructed to edit this file,
# or if you are not able to use the settings created by
# AutoConfig, refer to Metalink document 165195.1 for assistance.
#
# ###############################################################
# Added ulimit -s 8192 to make jdk1.5 work with modified native
# thread stack size ONLY for this servlet. This will keep the
# effect local.
#################################################################
ulimit -s 8192
JSERVJAVA=/usr/java/jdk1.5.0_06/jre/bin/java
#################################################################
ARGV="$*"
JVMLOGDIR=/ORACLE/prod2/9iAS/Apache/Jserv/logs/jvm
STDOUTLOG=$JVMLOGDIR/jserv.stdout
STDERRLOG=$JVMLOGDIR/jserv.stderr
...
...
...
#################################################################


  • In $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties, add the JVM switches directive line in the # Begin Customizations and # End Customizations block at the very end of the file. Remember to put all these JVM switches on the same line:


# vi $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties
...
...
...
# Begin Customizations
################################################################
# Added to make JVM1.5 work with Oracle Configurator
################################################################
wrapper.bin=/ORACLE/prod3/9iAS/Apache/Apache/bin/java1.5.sh
################################################################
# the max heapsize specified by -Xmx will have to be changed as
# per the max available RAM on the config machine e.g. if the
# RAM=12g and we are reserving 9g for JVM heaps, then each JVM
# can have a max heapsize of 2.5g or so:
# (2.7g + 300m -> for perm generation overhead) x 3 JVMs ~ 9g
# minus spare
################################################################
wrapper.bin.parameters=-verbose:gc -Xmx2750M -Xms2750m -Xss128k -XX:PermSize=128m -XX:MaxPermSize=256M -XX:NewRatio=3 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:ParallelGCThreads=4 -XX:+PrintGCTimeStamps -XX:+UseTLAB -XX:+UseLargePages -server
################################################################
# End Customizations


Note: It is important to keep these directives in # Begin & End Customization section so as to preserve these settings across AutoConfig sessions.


  • You will need to comment out the following 2 lines in the starting section of $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties :


#######################
# Execution parameters
#######################
# The Java Virtual Machine interpreter.
# Syntax: wrapper.bin=[filename] (String)
# Note: specify a full path if the interpreter is not visible in
# your path.
# wrapper.bin=/ORACLE/prod2/9iAS/Apache/Apache/bin/java.sh

# Arguments passed to Java interpreter (optional)
# Syntax: wrapper.bin.parameters=[parameters] (String)
# Default: NONE
################################################################
# comment out this line so that the java servlet will at least
# startup
################################################################
# wrapper.bin.parameters=-verbose:gc -Xmx1024M -Xms128M -XX:MaxPermSize=128M -XX:NewRatio=2 -XX:+PrintGCTimeStamps -XX:+UseTLAB
################################################################


Note: It should be remembered that the default wrapper.bin directive for java.sh and default JVM options will get uncommented after running Autoconfig. So you would have to remember to comment them again after running Autoconfig, or else the Apache servlet will not start.

Why don't we do it this way?
One good question would be: why don't we put the JVM switches in the AutoConfig token jvm_options in $CONTEXT_FILE? That would remove the need to manually update $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties after each AutoConfig run. For example:


<jvm_options oa_var="s_jvm_options" osd="Linux">-verbose:gc -Xmx1024M -Xms128M -XX:MaxPermSize=128M -XX:NewRatio=2 -XX:+PrintGCTimeStamps -XX:+UseTLAB</jvm_options>


The answer is simple. Setting the JVM options here will cause EACH servlet set up on the configurator tier to use the same switches. Assume there are 2 more servlets configured to run on the same machine (the iPayment and OXTA servlets); the total memory taken up by the servlets will then be 3 x 3g ~ 9g. If you were planning on running 3 configurator JVMs with the LargePages option, all your real memory is gone already.


Hence the manual change step. That said, if your IAS_ORACLE_HOME is set up in such a way (through the *.properties files) that ONLY the configurator servlet starts up, then this method would work as well. In that situation it is also a viable option, and probably the better one.

Reap the benefits
Now, when you start the configurator servlet using $SCRIPTS/adapcctl.sh, you can easily verify that the servlet JVM process has close to 3G of heap allocated and that the time spent in garbage collection is considerably reduced with the parallel GC options used above.
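One quick way to check the switches and memory of the running JVM process (a sketch only; the ps options and the grep pattern are assumptions that may need adjusting for your platform):

$ ps -eo pid,vsz,rss,args | grep -i jserv | grep -v grep

The -Xmx/-Xms switches should show up in the args column, and the VSZ/RSS figures give an idea of how much of the heap has actually been touched.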


........


........


45990.174: [ParNew 1640767K->1624634K(2457536K), 0.0106000 secs]


46718.218: [ParNew 1640888K->1624758K(2457536K), 0.0133860 secs]


46913.196: [ParNew 1641014K->1628789K(2457536K), 0.0291830 secs]


........


.........


The advantage of using large pages is that the heap memory is locked in RAM and is not eligible to be paged out to swap. Access to heap objects that are guaranteed to be in real memory is obviously faster.


Hence, the LargePages option is a good way to achieve performance goals, especially when the caching demands of the servlet are considerable and there is a chunk of RAM on the machine running the servlet that is not otherwise being put to any use.
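To check whether large pages are actually configured and being consumed on a 2.6-based kernel such as RHEL4 (a sketch; on older 2.4-based kernels the tunable is named differently):

$ grep -i huge /proc/meminfo
$ cat /proc/sys/vm/nr_hugepages

A drop in HugePages_Free after the JVM starts is a good indication that the heap really has been placed on large pages.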

Related articles and links

HugePages on Linux: What It Is... and What It Is Not... - Metalink Note: 361323.1


Bigpages vs. Hugetlb on RedHat Linux - Metalink Note: 261889.1


Java support for Large Memory Pages


Java Tuning White Paper

Redhat Linux Kernels and process space limits using Java.. Some eye-openers

Gaurav Verma - Sat, 2008-03-01 00:49

Different Kernels available and their memory limits using Java

There are at least 3 different kinds of kernels available in Red Hat Linux:
1) SMP kernel: This is the most commonly used variant and supports only up to 4GB of virtual memory (2**32 = 4GB linear addresses). It gives an address space of 3G of virtual memory per user process & 1G for the kernel, but after deducting stack overhead, the maximum user-addressable space comes to about 1.7g.
The split is:
1.0 GB Text/code
1.7 GB available process memory for mmap, malloc / address space (e.g. usable by shared pool)
0.3 GB Stack
1.0 GB Kernel
2) Enterprise Kernel: With this kernel, up to 64GB of memory can be addressed (2**36 = 64GB linear addresses). This relies on Physical Address Extension (PAE) and is also referred to as Very Large Memory (VLM) support.
The enterprise kernel provides VLM support out of the box, BUT exploiting it fully may require recompiling the kernel, which is a pretty unpopular option with Red Hat and usually the last resort of system administrators.
Note: You should be aware that Oracle Corp. will stop supporting any Oracle installation on a recompiled kernel, particularly on Red Hat.
3) Hugemem Kernel: Red Hat Enterprise Linux 3.0/4.0 includes a new kernel known as the hugemem kernel. This kernel supports a 4GB per-process user space (versus 3GB for the other kernels) and a 4GB direct kernel space.
The hugemem kernel is usually required in order to use all the memory in system configurations containing more than 16GB of memory. It can also benefit configurations running with less memory, if the application can make use of the larger per-process user space.
The virtual address space is split as:
1.0 GB Text/code
2.7 GB available process memory for mmap, malloc / address space (e.g. usable by shared pool)
0.3 GB Stack
4.0 GB Kernel

Understanding how one kernel is better than another...

It is important to understand the pros and cons of the SMP, Enterprise and Hugemem kernels. For applications with large virtual memory requirements, even the enterprise kernel could be a solution using the VLM feature in RHEL 3/4, but it would entail creating a very large buffer cache, using the options mentioned in Note 262004.1 for Oracle databases (one kind of application).
On top of the page-addressing problem for 64GB of linear addresses, there will also be a problem with kernel memory if the machine has a lot of RAM: the kernel has to maintain a page table entry (PTE) for every available memory page, and this restricts the memory we can address.
Typically, the 1gb of kernel memory is completely used up when it has to address just 24GB of memory. To overcome this problem we need a larger kernel memory, and this is where the hugemem kernel beats the Enterprise kernel, as it provides 4G of kernel memory.

Experimenting with process limits across different factors...

We can find out the maximum number of threads which can be spawned on a platform under any JVM version by running a small Java program (ThreadTest.java); the listing is available in Appendix A.
This program just keeps spawning threads and keeps them alive almost forever. When it hits java.lang.OutOfMemoryError while spawning a thread, it prints out the maximum number of threads it was able to spawn.
Based on this information, the native Java thread stack size specified at JVM startup, and the golden memory rule, we can calculate the total virtual address space of the JVM process.
For the listing of the Java source code, refer to Appendix A; for the unix shell script check_thread_limits.sh, refer to Appendix B-1. The results for the maximum threads spawned and the maximum user address space while varying the JDK version, kernel (SMP, Hugemem), ulimit and stack size are listed in Appendix B-2.
As can be seen from Appendix B-2:
For SMP kernels:
With ulimit -s unlimited:
(Total size of managed heap + native heap) + (number of threads X thread stack size) ~ 1.7G
With ulimit -s 10m or 8m or 6m or 4m or 2m:
(Total size of managed heap + native heap) + (number of threads X thread stack size) ~ 2.7G
For Hugemem kernels:
With ulimit -s unlimited:
(Total size of managed heap + native heap) + (number of threads X thread stack size) ~ 2.5G
With ulimit -s 10m or 8m or 6m or 4m or 2m:
(Total size of managed heap + native heap) + (number of threads X thread stack size) ~ 3.7G
Note: This also implies that the "unlimited" native stack size is not the same as the 10m suggested by the comments in /usr/src/linux/include/sched.h for the _STK_LIM constant; rather, it is some value much higher than 10m. Sometimes seeing and reading comments in source code files is not believing, as I found out.


Appendix A : Listing of ThreadTest.java

public class ThreadTest {

  public static void main(String[] pArgs) throws Exception {
    try {
      // keep spawning new threads forever
      while (true) {
        new TestThread().start();
      }
    }
    // when the out of memory error is reached, print out the number of
    // successfully spawned threads and exit
    catch (OutOfMemoryError e) {
      System.out.println(TestThread.CREATE_COUNT);
      System.exit(-1);
    }
  }

  static class TestThread extends Thread {
    private static int CREATE_COUNT = 0;

    public TestThread() {
      CREATE_COUNT++;
    }

    // make the thread wait for eternity after being spawned
    public void run() {
      try {
        sleep(Integer.MAX_VALUE);
      }
      // even if there is an interruption, don't do anything
      catch (InterruptedException e) {
      }
    }
  }
}
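
A typical way to exercise the listing (a sketch; the ulimit and -Xss values are only examples):

$ javac ThreadTest.java
$ ulimit -s 10240        # 10m native process stack limit, expressed in KB
$ java -Xss512k ThreadTest

The single number printed is the count of threads spawned before java.lang.OutOfMemoryError was hit.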

Appendix B-1 : Listing of check_thread_limits.sh
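
The original listing is not reproduced in this feed. A minimal, hypothetical sketch of a script along these lines, assuming the compiled ThreadTest class from Appendix A is in the current directory, would be:

#!/bin/sh
# probe the maximum thread count for a range of -Xss values
# under whatever stack limit is currently set with ulimit -s
for STACK in 256k 512k 1m 2m
do
    MAX=`java -Xss$STACK ThreadTest 2>/dev/null`
    echo "ulimit -s=`ulimit -s` Xss=$STACK max_threads=$MAX"
done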

Appendix B-2 : Results of check_threads_limits.sh across different Kernel, Java Version and native Java stack size

SMP Kernel results:

Hugemem Kernel results:

Common Incorrect beliefs about process address limits on 32bit platforms

Gaurav Verma - Sat, 2008-03-01 00:08

Preface

Most people believe that one cannot go beyond 1.2g-1.7g or so as a maximum heapsize for Java servlet-based applications deployed with 9iAS/10gAS. The ignorant would cite this as a limitation of the 32-bit Intel architecture.

In reality, this assumption is only partially true: it applies only to SMP kernels, and then only with the default unlimited native stack size per process. People might say "your results may vary", but the truth is that there are several factors (process address space and stack limits, kernel limits, shared memory support, etc.) which need to be understood to make an educated comment on the question: what is the best limit I can push the JVM heap to for my application, and how can I do it without compromising stability?

Kernel Limitations and known issues of JDK1.4 Heaps on 32-bit platform vis-a-vis 64-bit platforms

While it is generally believed that the maximum addressable memory space for a process on 32-bit platforms is 1.7g, this is only true for the SMP kernel.

Right? Well, only if an unlimited native stack size is used (usually the default, as shown by the $ ulimit -s command). If any lower value like 10m, 8m, 6m, 4m or 2m is used, the user address space increases to 2.7g.
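For example, to run with a 10m native stack limit instead of unlimited, set the limit in the same shell before starting the JVM (bash's ulimit -s takes kilobytes):

$ ulimit -s 10240    # 10m, expressed in KB
$ ulimit -s          # verify the new limit before starting the JVM from this shell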

There are white papers on Metalink detailing how the SGA (implemented as an OS shared memory area) of an Oracle database on a 32-bit platform can be pushed to 3.5g. This is achieved by lowering the starting address point in the process address space for the oracle executable and relinking it, so that the memory addressable space increases when a process is spawned from the oracle executable.

For JVMs, the application-specific golden limit really has to be found by trial and error, by tweaking ulimit -s and some other factors.

An easy way to find the maximum heapsize limit is to keep raising the -Xmx and -Xms switches at JVM startup:

# e.g. to see whether a heap of 1400m can be allocated and committed at startup

$ java -Xmx1400m -Xms1400m -version

If this doesn't fail and you get the Java version in the output, keep pushing the OS with higher heap sizes and see when the segmentation fault (Signal 11) errors start appearing. You have then reached the forbidden limit and would do well to stay ~300-400m below it to accommodate native stack space for Java threads. The native stack space requirement will vary with the concurrent user load of the application.
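A rough way to automate that probe (a sketch only; the starting point and step size are arbitrary, and the JVM under test is whatever java is first on the PATH):

#!/bin/sh
HEAP=1400
# keep asking for a bigger fully-committed heap until the JVM refuses to start
while java -Xmx${HEAP}m -Xms${HEAP}m -version > /dev/null 2>&1
do
    echo "heap of ${HEAP}m allocated OK"
    HEAP=`expr $HEAP + 100`
done
echo "allocation failed at ${HEAP}m; keep the working heap 300-400m below this"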

The golden memory rule:
The golden rule of thumb for max address space per process to be kept in mind is:


(Total size of managed heap + native heap) + (number of threads X thread stack size) = max process address space allowed by Kernel


Viewing it in a more holistic way,


Data segment (= total size of managed heap + native heap) + stack segment (= number of threads X thread stack size) = max user address space allowed by the kernel
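
For example (purely illustrative numbers): on an SMP kernel with roughly 1.7g of user address space, a servlet JVM running 500 threads with a 512k stack each consumes about 250m in thread stacks, which leaves only around 1.4g-1.5g for the managed Java heap plus the native heap.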

More possibilities with JDK1.5

There are some exciting things that can be done with JDK 1.5, like addressing 3G and more of heapsize and using the LargePages feature, with which it is possible to greatly reduce heavy swapping, iowait and garbage collection overhead, since the heap is kept in real memory.

It should be noted that JDK 1.5 is only supported at this time with Oracle Applications 11.5.10. It is not yet certified with 11.5.9, mostly for logistical reasons and to avoid duplicating regression testing effort with JDK 1.6, which Oracle supposedly plans to support.

Related Links and Articles

Case study: How to leverage LargePages option of J2SE1.5 on Linux hugemem kernel for 11i Oracle Apps

APEX 3.1 On XE

Jornica - Fri, 2008-02-29 17:28

On 29 February 2008, Oracle released Application Express 3.1. The page Oracle Database XE and Application Express 3.1 provides upgrade instructions.

Why TE.040 Test Scripts for UAT are a waste of time and money

Jo Davis - Thu, 2008-02-28 17:39
I’m all for testing… I love the stuff.. but seriously, spending weeks of effort putting together a test script in the format of:
  1. Enter "Jo's super Journal 290208" as the journal entry header
  2. Enter the journal source "Manual"
  3. Enter the journal category "Adjustment"
  4. .....
  5. ????
It seems I've spent weeks of my life over the last 11 years writing these things. Weeks I will never get back. (There are some nice ones uploaded by other nice humans over at http://www.oracleappsblog.com/index.php/weblog/comments/oracle-11i-test-scripts/ if you’re looking for a standard test script repository by the way).

Why I don’t like them, in a nutshell:
  • too easy to concentrate effort on the test script steps instead of taking a step back and looking at the types of testing you’re doing and what sort of coverage you’ve got across the applications, particularly with regard to negative testing and scenario-based integration testing.
  • users hate writing them and try to outsource it to the implementation partner, which is never going to work because seriously the only person who can effectively test whether you’ll still be able to do your job on Monday morning is you, not some hired gun.
  • they contribute to the intimidating thud factor. Let's face it: would a user rather face a 50-page test script or a one-page one? Keep in mind that he/she is also juggling their actual job (which is piling up on their desk while they're here testing this thing because no one actually back-fills the finance department)… talk about stress.
  • they get out of date very quickly and are never maintained, so when you go to do the next project you find the steps listed for 'Enter an invoice' are not the way Payables actually performs the task. They ought to be kept in line with the user procedures, but there's no mechanism to do so.
So - back to basics – what is the purpose of User Acceptance Testing? To ascertain whether the business can continue to function with the system as delivered. Whether it be a fresh implementation, an upgrade or a really big patch, the purpose of UAT is the same.
We need to spend time, effort and money on documenting the user procedures and the training. Put the effort in where it will really be used ongoing, not just used for 2-4 weeks of UAT every six months or so.

The test scripts can then just document the actual test case or scenario and refer out to the procedures/training guide for how to actually perform the task. If you really want a detailed, step-by-step test plan for UAT, use a product to document your procedures that will produce the test script steps automatically, e.g. Oracle's User Productivity Kit (UPK) or (my favourite) RWD's uPerform. If you get really excited, go have a look at Turnkey Solutions and see how you can have software document the testing for you, apparently (haven't tried it myself yet, but the demo looks promising). Then you can concentrate your effort on what you should be testing, rather than how you should be testing it.
