Friday

11.5.10.2 Upgrade

For the past six months or so we have been working on an 11.5.10.2 upgrade. Last weekend we finally upgraded production over a period of 48 hrs. (That includes rebuilding test environments for testing.) The upgrade went pretty smoothly and there haven't been any major issues on the techstack side. We have had the usual Jinitiator issues and I believe there have been some minor application issues, but otherwise I'd say a very successful upgrade.

Going into the upgrade we expected it to take about the same level of effort as our 11.5.8 to 11.5.9 upgrade. However, that wasn't the case, to say the least.

So hopefully I'll have more time to post in the new year...!

Thursday

Stuck Concurrent Requests

Every now and then users call us about a concurrent request that is running longer than normal and/or blocking other batch jobs because of incompatibilities. Upon investigation we'll see that there is no database session for the request. Since there isn't a database session, users may not be able to cancel the request themselves; the cancel button will be grayed out. The solution is to clean up the fnd_concurrent_requests table.

Background: Concurrent programs may be defined as incompatible with other programs, which means they cannot execute at the same time. If the stuck concurrent request has such rules defined, then the programs it is incompatible with will not run until the problem is solved.
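
Checking whether a running request still has a database session behind it can be done with a query along these lines (just a sketch; oracle_session_id maps to v$session.audsid, but verify the mapping in your release):

select fcr.request_id,
       fcr.oracle_process_id,
       (select count(*)
          from v$session s
         where s.audsid = fcr.oracle_session_id) db_sessions  -- 0 means no session
  from apps.fnd_concurrent_requests fcr
 where fcr.phase_code = 'R';  -- 'R' = Running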

There are two ways to do this: update the table manually or run the Oracle-provided cmclean.sql script. Depending on the method you choose, you'll need the request id. This can be provided by the user, or you can look at the running requests via Oracle Applications Manager (OAM). To navigate there, click on Site Map in the top left hand corner of the page. Under Concurrent Requests, click on Running.

Once you're in the Running requests screen you'll see which programs are currently being executed. With the help of your users, find the request id in question and make note of it. The recommended approach from Oracle is:

1. Kill the database sessions for the requests. (In our case there weren't any.)
2. Shut down the concurrent managers.
3. Run the cmclean.sql script. (Note: 134007.1)
4. Start your concurrent managers.

The other method is to update the bad rows in the fnd_concurrent_requests table manually.

update fnd_concurrent_requests set status_code='D', phase_code='C' where request_id=<request id>;

A STATUS_CODE of 'D' means Cancelled and a PHASE_CODE of 'C' means Completed. Don't forget to commit.

For a list of status and phase codes and what they mean, refer to Note: 297909.1.
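
If you'd rather pull the code meanings straight from the database, a lookup along these lines should work (the lookup type names are my assumption; the note above is the authoritative reference):

select lookup_type, lookup_code, meaning
  from apps.fnd_lookups
 where lookup_type in ('CP_STATUS_CODE', 'CP_PHASE_CODE')
 order by lookup_type, lookup_code;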

The benefit of updating the fnd_concurrent_requests table manually is that no downtime is required. If you are using cmclean.sql, remember to shut down the concurrent managers first!

Wednesday

Modifying the Listener Port for Database Control

A coworker was installing 10gR2 on a Windows server which already had an existing 9i environment. They wanted the database console to connect via the 10g listener (non-standard port 1522) and not the 9i one (standard port 1521), to keep the environments separate.

I was able to find the following document which lists all of the default ports used by Enterprise Manager (10g).

Subject: Overview of Default Ports Used by EM 10g Grid Control, DB Control and AS Control
Doc ID: Note:235298.1


Search for "SQL*Net Listener for the 10g DB" and it will list the modifications required to change the listener port. The steps in that section assume you need to change the listener's port as well. Since the listener had already been created for me, here are the steps I had to follow:

1. Stop DB Control by executing the command emctl stop dbconsole

2. Modify the agent to reflect the new port number. Edit $ORACLE_HOME/<hostname_sid>/sysman/emd/targets.xml

The instructions say you only need to modify the listener port value. However, I noticed that there are two entries in this file: one for the database and another for the listener, each pointing to the default port of 1521. I'm not sure if I needed to change the database port, but since the DBA wanted to keep both environments separate I changed it to be safe. I didn't test this procedure without making that change.

Here is a copy of the file (the username and password values have been modified, of course ;).


<Targets AGENT_SEED="39661000">
<Target TYPE="oracle_emd" NAME="myserver.mydomain:1830"/>
<Target TYPE="host" NAME="myserver.mydomain" DISPLAY_NAME="myserver.mydomain" VERSION="1.0"/>
<Target TYPE="oracle_database" NAME="mydb10g" VERSION="1.0">
<Property NAME="MachineName" VALUE="myserver"/>
<Property NAME="Port" VALUE="1522"/>
<Property NAME="SID" VALUE="mydb10g"/>
<Property NAME="OracleHome" VALUE="D:\oracle\ora10g"/>
<Property NAME="UserName" VALUE="1993cs3901f8d5af7" ENCRYPTED="TRUE"/>
<Property NAME="password" VALUE="8c35Da9iks4e07bab7b6c5906a837f4" ENCRYPTED="TRUE"/>
</Target>
<Target TYPE="oracle_listener" NAME="LISTENER_myserver" VERSION="1.0">
<Property NAME="Machine" VALUE="myserver"/>
<Property NAME="LsnrName" VALUE="LISTENER10g"/>
<Property NAME="Port" VALUE="1522"/>
<Property NAME="OracleHome" VALUE="D:\oracle\ora10g"/>
<Property NAME="ListenerOraDir" VALUE="D:\oracle\ora10g\network\admin"/>
</Target>
</Targets>


3. Modify $ORACLE_HOME/<hostname_sid>/sysman/config/emoms.properties to reflect the new port. Two entries have to be modified here as well: emdRepPort and emdRepConnectDescriptor should both reflect the new port number.

Here are the modified entries with 1522 replacing the standard port 1521.

oracle.sysman.eml.mntr.emdRepPort=1522
oracle.sysman.eml.mntr.emdRepConnectDescriptor=(DESCRIPTION\=(ADDRESS_LIST\=(ADDRESS\=(PROTOCOL\=TCP)(HOST\=myserver)(PORT\=1522)))(CONNECT_DATA\=(SERVICE_NAME\=mydb10g)))


4. Last but not least, start up Database Control with emctl start dbconsole and verify that you can log in successfully.

Tuesday

Error cloning a clone

A few posts ago I hit an issue with executing the runInstaller. It still surprises me that after hundreds of installs you can hit new errors. The other day I hit a new one while cloning an EBS environment. While preparing the environment for cloning (executing adpreclone.pl dbTier), the error below appeared:


RC-50409: Topology information required for cloning not found in the database. Make sure that
AutoConfig was run successfully prior to starting Rapid Clone, and try again
Raised by oracle.apps.ad.tools.AdCreateCtlFile
RC-50208: Exception in method gen_crctrfclone_sql
RC-50208: Exception in Constructor AdCreateCtlFile
Raised by oracle.apps.ad.tools.AdCreateCtlFile



This particular environment is a clone of our production environment. After it was built, users had performed some work and wanted it cloned to another environment. There are two solutions for this particular error, which are described in detail in Metalink Note: 376964.1.


The easiest solution is to run autoconfig on the dbTier. The curious part is that since this environment is a clone of production, autoconfig would have been executed as part of the post cloning steps (adcfgclone.pl dbTier). I checked my post cloning logs and indeed it completed successfully. I executed autoconfig ($ORACLE_HOME/appsutil/bin/adconfig.sh) on the database tier again and the preclone script finished without error.
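
If you want to confirm what AutoConfig has actually uploaded, the context files land in FND_OAM_CONTEXT_FILES. A sketch (the column names are from memory, so verify them first):

select name, version, status, last_update_date
  from apps.fnd_oam_context_files
 where name not in ('TEMPLATE', 'METADATA')
 order by last_update_date;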

Wednesday

HotSpot Virtual Machine Error : 11

I've been working with Oracle for quite a while now and have more than my share of installs under my belt. Today a co-worker emailed me the following error, produced by the 10.2.0.1 runInstaller:

All installer requirements met.

Preparing to launch Oracle Universal Installer from /tmp/OraInstall2007-10-23_11-51-47AM. Please wait ...$ Starting Installer in advanced mode ...
Oracle Universal Installer, Version 10.2.0.1.0 Production
Copyright (C) 1999, 2005, Oracle. All rights reserved.


Unexpected Signal : 11 occurred at PC=0xFECD9110
Function=[Unknown. Nearest: JVM_MonitorWait+0x1F24]
Library=/tmp/OraInstall2007-10-23_11-51-47AM/jre/1.4.2/lib/sparc/client/libjvm.so

Current Java thread:
at oracle.sysman.oii.oiip.osd.unix.OiipuUnixOps.chgrp(Native Method)

.
.
.

#
# HotSpot Virtual Machine Error : 11
# Error ID : 4F530E43505002EF 01
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_08-b03 mixed mode)
#
# An error report file has been saved as hs_err_pid27925.log.
# Please refer to the file for further information.


I'm still amazed that after hundreds of installs you can still hit a new error. In this case the oraInst.loc file, which points to the central inventory location, was not set up correctly. There was an old location which they had commented out, but they forgot to specify a new one.

Ex. Sample oraInst.loc (On unix servers this file is normally stored in /var/opt/oracle or /etc)

#Oracle Installer Location File Location
#Wed Apr 30 14:51:42 EDT 2007
inventory_loc=/home/oracle/oraInventory
inst_group=dba


What is the Central Inventory? The Central Inventory stores information for all Oracle products installed on the server. Inside the inventory location there is an xml file (inventory.xml) which lists all of the ORACLE_HOMEs.

Ex. <Central Inventory Path>/ContentsXML/inventory.xml (In this example, using the path from the oraInst.loc file above you would look in /home/oracle/oraInventory/ContentsXML/inventory.xml)

<?xml version="1.0" standalone="yes" ?>
<!-- Copyright (c) 2001 Oracle Corporation. All rights Reserved -->
<!-- Do not modify the contents of this file by hand. -->
<INVENTORY>
<VERSION_INFO>
<SAVED_WITH>2.2.0.19.0</SAVED_WITH>
<MINIMUM_VER>2.1.0.6.0</MINIMUM_VER>
</VERSION_INFO>
<HOME_LIST>
<HOME NAME="APPSDB_ORA" LOC="/app/oradb/920_64bit" TYPE="O" IDX="1"/>
<HOME NAME="APPSDB_ORA1" LOC="/app/oradb/1020_64bit" TYPE="O" IDX="2"/>
<HOME NAME="APPSIAS_ORA" LOC="/apps/appsora/oraora/iAS" TYPE="O" IDX="3"/>
</HOME_LIST>
</INVENTORY>


As you can see, this server has both 9i and 10g database software and an application server. Inside each of these ORACLE_HOMEs there is a Local Inventory. The local inventory stores the list of components, patchsets and interim patches installed in that particular ORACLE_HOME. The file in this case is named comps.xml

ex. <Local Inventory Location>/ContentsXML/comps.xml (This file is rather large so just a small clip. This is from the 920_64bit ORACLE_HOME.)

<?xml version="1.0" standalone="yes" ?>
<!-- Copyright (c) 2001 Oracle Corporation. All rights Reserved -->
<!-- Do not modify the contents of this file by hand. -->
<PRD_LIST>
<TL_LIST>
<COMP NAME="oracle.server" VER="9.2.0.1.0" BUILD_NUMBER="0" REP_VER="0.0.0.0.0" RELEASE="Production" INV_LOC="Components/oracle.server/9.2.0.1.0/1/" LANGS="en" XML_INV_LOC="Components21/oracle.server/9.2.0.1.0/" ACT_INST_VER="2.2.0.12.0" DEINST_VER="2.2.0.4.0" INSTALL_TIME="2004.Dec.16 14:43:30 EST" INST_LOC="/mnt/app/visdb/920_64bit/oracle.server">
<EXT_NAME>Oracle9i Database</EXT_NAME>
<DESC></DESC>
<DESCID>COMPONENT_DESC</DESCID>
<DEP_GRP_LIST>
<DEP_GRP NAME="Optional" TYPE="O"/>
<DEP_GRP NAME="Required" TYPE="R"/>
</DEP_GRP_LIST>
<DEP_LIST>
<DEP NAME="oracle.rdbms" VER="9.2.0.1.0" DEP_GRP_NAME="Optional" HOME_IDX="1"/>

By default Oracle looks for the oraInst.loc file in /var/opt/oracle or /etc, depending on your platform. You can change the location if you wish; you'll just need to remember to specify the -invPtrLoc parameter with opatch and the runInstaller.

ex. opatch lsinventory -invPtrLoc <path to your>/oraInst.loc
runInstaller -invPtrLoc <path to your>/oraInst.loc


This relates back to an earlier post I made which shows you how to recreate your central inventory in an EBS environment.

Thursday

RAC Crash Recovery

In a normal Oracle environment you can set the FAST_START_MTTR_TARGET parameter, which controls the amount of time (in seconds) required to recover from an instance or server failure. 10g has a new feature, automatic checkpoint tuning, which utilizes periods of low I/O to advance checkpoints. This feature requires that FAST_START_MTTR_TARGET be set to a non-zero value.

FAST_START_MTTR_TARGET includes both instance startup and crash recovery. If you have a RAC environment and there is a failure, the surviving instances will perform the recovery. Therefore, including the instance startup time in the mean time to recover calculation isn't accurate. So in the case of RAC environments you may choose to set the _FAST_START_INSTANCE_RECOVERY_TARGET parameter, which determines the amount of time for instance recovery. Note: if multiple instances fail, recovery may take longer than the value you specified for this parameter.

So in summary, FAST_START_MTTR_TARGET includes the time for startup, mount, crash recovery, and open. _FAST_START_INSTANCE_RECOVERY_TARGET includes the time for instance recovery only.
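
As a rough sketch of how the two are set and checked (the target values here are illustrative, and underscore parameters should only be set on the advice of Oracle Support):

-- 5 minute MTTR target; a non-zero value also enables automatic checkpoint tuning in 10g
alter system set fast_start_mttr_target = 300 scope=both;

-- RAC only: bound instance recovery separately (underscore parameter, Support's advice required)
alter system set "_fast_start_instance_recovery_target" = 60 scope=spfile sid='*';

-- Compare Oracle's current estimate against the target
select target_mttr, estimated_mttr from v$instance_recovery;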

Changing any parameter which affects checkpointing should be thoroughly tested to make sure you don't thrash your I/O subsystem. Setting either of these parameters will increase I/O.

Wednesday

11i and Listener Passwords

A standard practice for any Oracle database should be to enable passwords on your listeners. EBS is no exception, and Oracle recently published Metalink note: 386374.1 on how to do this for 11.5.x environments.

There are some pre-reqs, specifically:

3453499 (11i.ADX.F)
5225940 (Post ADX.F Fixes)
5107107 - AUTOCONFIG ENGINE & CONFIG TOOLS ROLLUP PATCH N or higher

The note has detailed steps on how to set the password at the application and database tiers.

In my opinion, just as important as setting a password is monitoring for invalid attempts afterwards. Otherwise, what's the point? At best you will only slow down a determined intruder. Once you have set a password, any attempt to execute a command via lsnrctl (without setting your password first) will result in the following error showing up in your listener log:

TNS-01169: The listener has not recognized the password

Each DBA has their preferred language for scripts. Some prefer shell, others Perl. Myself, I prefer Perl. Below is a Perl script I have used to monitor the listener log for the above error.


#!/usr/bin/perl -w

if ($#ARGV != 1) {
die "Usage: check_alert.pl <hostname> <listener_log> ex. check_alert.pl myhost /var/opt/oracle/listener.log\n";
}
sleep 2;

$interval=60; # How many seconds before we check to see if data has been written to the logfile;
$email_threshold=5; # How many errors within the interval before an email gets sent;
$hostname=$ARGV[0];
$file=$ARGV[1];
open(filePtr, $file) or die "Can't find $file\n";


for (;;) {
@errors=("Subject: Listener Password Errors for $hostname\n");
$currTime = localtime(time);
push(@errors,"Here are some errors found at $currTime for $hostname.\n");

while (<filePtr>) {
chop $_;
if (/TNS-01169/) {
push(@errors, "$_\n");
}
}

if ($#errors > $email_threshold) {
$rndExt = time;
$rndFile = "alert_errors_$rndExt";
open (TMPFILE, ">/tmp/$rndFile");

foreach $error (@errors) {
print TMPFILE $error;
}
close(TMPFILE);
system ("mail username\@mycompany.com < /tmp/$rndFile");
system ("rm /tmp/$rndFile");
}

sleep $interval;
seek filePtr, 0, 1;
}




The script takes two parameters: <hostname> <path_to_listener_log>

A couple of key things you may want to customize in the above script are the $interval and $email_threshold variables. As well, there is an email address towards the bottom where alerts will be sent.

Now, I am by no means a Perl expert, but the above works for me. I'm sure there are more efficient ways to monitor the log file; feel free to post a comment with your preferred method. If you cycle your logfiles, keep in mind you'll need to restart this script. I've monitored resource consumption and it's pretty minimal, but I know some DBAs will prefer not having a process run constantly.

Note: I've only tested this script on a 9i listener log. It's on my to-do list to test it on 8.0.6.

UPDATE: Thanks to Howard Rogers for informing me of the 10g changes. Security has been enhanced such that the TNS listener administrator has to be a member of the local dba group. Unless you plan on administering your listener remotely, you don't need a password. Please see the comments and Metalink note: 340009.1.

Most of our databases are still 9i, but I'll be sure to remove the listener passwords as part of the upgrades.

Tuesday

TIP: Tracing a session via a logon trigger

A while back I found a tip which has come in handy quite a few times. Just recently I was troubleshooting an ORA-03124 error which occurred during a run of a BI tool. The BI tool spawns many sessions while gathering and processing data, so it's not practical to sit there, grab the session information and turn on tracing manually.

From Oracle Magazine Tip for Week of June 26, 2006:
CREATE OR REPLACE TRIGGER ON_LOGON_SCOTT
AFTER LOGON ON DATABASE
WHEN ( USER = 'SCOTT' )
BEGIN
  execute immediate 'alter session set events ''10046 trace name context forever, level 12''';
EXCEPTION
  WHEN OTHERS THEN
    NULL;
END;
/


Of course, don't forget to disable it after!
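
Disabling it is a one liner:

ALTER TRIGGER on_logon_scott DISABLE;
-- or remove it entirely once you're done tracing:
DROP TRIGGER on_logon_scott;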

Thursday

Jinitiator Vulnerability

Great analysis over at Integrigy on the recent Jinitiator vulnerabilities. If you are running Jinitiator versions 1.1.8.3 to 1.1.8.25 then you should definitely take a look.

Integrigy Jinitiator Vulnerability Analysis

Wednesday

Port Pools

Every now and then I get a request asking which port a specific service is using. The server team may ask for a list of ports so they can be reserved, etc. The Metalink document below lists all the ports required and which services run on them. It also has a nice utility which will calculate the ports required depending on the port pool you choose when installing or cloning.

Subject: FAQ: Cloning Oracle Applications Release 11i Doc ID: Note:216664.1

See question number 21: What is the port pool? What if I want to give a specific value to a Server Port?

Note: the port calculations given by the utility are defaults. They may have been changed in your environment for various reasons. For example, we had to change the port listed below because it interfered with another application.

<jcache_port oa_var="s_java_object_cache_port" oa_type="PORT">12346 </jcache_port>

Friday

Rman and Windows

Thought I would point out a nice little utility I've used on Windows 2003 environments. The program is called linkd and it is part of the Windows Server 2003 Resource Kit. Linkd allows you to create junction points on a Windows server; a junction point is just the term used for a directory symbolic link.

Why is this handy? Recently I was asked to restore a Windows production database on a test server. As you may know, RMAN is very picky about paths. For example, if you place your backup in a directory called d:\backup\nightly and would like to restore it on another server, then you have to place the backup in the same path.

Unfortunately there wasn't enough free space on D: for me to place the backup. However, there was more than enough on the H: drive. I used the linkd utility to create a symbolic link from D:\backup\nightly to H:\backup\nightly (it can be any path) and ran the restore.

On Unix symbolic links are common knowledge; not so on Windows... In this case I took over from a couple of DBAs who had been trying to get this working for over a day (24 hrs straight). They opened a case with Oracle Support, and before I took it over they were trying to extract the datafiles manually from the backup. Yikes, painful. So hopefully this tool will become more well known.

Related links:

How to create and manipulate NTFS junction points - MS Support Site
Junction V1.05 - MS Technet
Windows 2003 Resource Kit download page

Tuesday

Training

Good news and bad news. The good news is that I have finally received approval to go on the 11i System Administrator Fundamentals Ed 3 course from Oracle. This will be my first bit of formal training on EBS. I'm sure it will provide me with some useful information and hopefully some interesting posts. The bad news is that Tom Kyte will be in town that week for a half day session on 11g. The course will be taking place in Toronto towards the end of September.

Friday

Cloning and your SGA

Finally back from a little vacation... It started off rocky. During a normal disk resize operation (SAN attached) the storage team encountered a bug (or feature, depending how you look at it) and it corrupted all of the databases on our dev server. Normally this wouldn't be a big deal, but we have some rather large projects ongoing at the moment and their environments are also on this server. So the first weekend of my vacation was spent recovering.
----
Cloning is the act of creating an identical copy of an EBS environment. That means everything from the number of httpd servers to the size of your SGA will be the same. This can be a problem if you're cloning from production, since its memory footprint is probably significantly larger than your dev/test environments can support, whether that's because the server(s) aren't physically large enough or because multiple environments are eating into the available memory. So during the cloning process, you may want to resize the database's SGA.

Note: There are probably a few ways to change the SGA size during cloning, so if you know of another method please leave a comment because I'd be interested in hearing it. For the sake of this article, let's assume the only parameter I want to change is sga_target.

Before you clone an environment you have to prepare the source system by running adpreclone.pl on both the appsTier and the dbTier. This process gathers information about your environment and creates template files. In the case of the database tier, these templates are stored under $ORACLE_HOME/appsutil/template. Two files in particular are:

$ORACLE_HOME/appsutil/template/afinit_db920.ora -- 9i Database
$ORACLE_HOME/appsutil/template/afinit_db102.ora -- 10G Database


If you look at the log file ($ORACLE_HOME/appsutil/log/ApplyDBTier*.log) created during the database post cloning ($ORACLE_HOME/appsutil/clone/bin/adcfgclone.pl), you will see the below:

instantiate file:
source : /envision1/app/oradb/1020_64bit/appsutil/template/afinit_db102.ora
dest : /envision1/app/oradb/1020_64bit/dbs/initORA.ora
backup : /envision1/app/oradb/1020_64bit/dbs/initORA.ora to /envision1/app/oradb/1020_64bit/appsutil/out/ORA_myserver/06220730/initORA.ora


In this case I have a 10g database, so the afinit_db102.ora file is being referenced to create the new parameter file. If you view this file it looks like a normal parameter file, except for the fact that not all the values are specified. For example:

sga_target = %s_db_sga_target% # Required 11i setting

So one way to modify SGA parameters would be to edit the template file for your database version and hard code the value you want, ie:

sga_target = 700M


Another option would be to modify the context file that is used, in conjunction with the template file, to build the parameter file. I'm not 100% sure that the file I'm about to reference is the correct one, but in my limited testing it has worked for me. During the preclone, along with the template files, I noticed some other files are created. One such file is:

$ORACLE_HOME/appsutil/clone/context/db/CTXORIG.xml

Changing the following line in this file from:

<sga_target oa_var="s_db_sga_target">800M</sga_target>

to
<sga_target oa_var="s_db_sga_target">600M</sga_target>


also produces the same result. Once adcfgclone.pl completes, the SGA_TARGET in the newly created parameter file will be set to 600M.

Tuesday

RC-10203 and directory permissions

This past week we hit a weird issue while trying to clone a test environment from production. It took a fair bit of effort to figure out the problem, so I thought I'd mention it in case you have a similar setup but haven't added this item to your checklist.

Almost a year ago we moved to a shared application filesystem, so as part of the cloning process we have to run the script:

$ cd <COMMON_TOP>/clone/bin
$ perl adclonectx.pl sharedappltop \
  contextfile=<applications context file for the existing node>

This configures the node to be able to access the shared setup. This time, the script errored out:

Choose a value which will be set as APPLPTMP value on the target node [1]:
AC-10006: Exception - java.io.FileNotFoundException: /admin/template/adxmlctx.tmp thrown while creating OAVars object for file: ../admin/template/adxmlctx.tmp
RC-50004: Error occurred in CloneContext:
RC-10203: The template provided does not match with the context file.
Check logfile /apps/appsora/oraappl/admin/ORA/out/CloneContext_07270115.log for details.
ERROR: context creation not completed successfully.
Please check /tmp/adclonectx.err file for errors


Our environment hasn't changed much and cloning is a pretty regular exercise, so my first thought was: what could be different now than before? To me this type of problem screams setup issue.

On our side everything seemed fine. It wasn't until we noticed the following that any progress was made:

$ pwd
/apps/appsora/oraappl
$ df -k .
df: cannot canonicalize .: Permission denied


Since this is a shared application file system, the file system is mounted via NFS. It turned out the permissions on the mount point (/apps/appsora) were 700, so it wasn't accessible by the group (in this case dba). I still wasn't sure why this was causing our problem... I could view, create and modify files, since the account I was using was the owner. (UIDs and GIDs are global.)

To fix the permissions problem the file system was unmounted and the directory permissions changed to 770. Once it was remounted, the adclonectx script worked properly. So you may want to add directory permissions to your checklist if you're using NFS and EBS.

AWR Operation failed

I know most people are talking about 11g, but we finally upgraded our production EBS database to 10g a few weeks ago. A few months ago, while a fellow DBA was on vacation, I used 10g's diagnostics features to troubleshoot a hanging issue in their data warehouse DB. The historical view, which allows you to see what was happening in a database during a window of time, is priceless and easily helped me resolve the issue. How often do users call you as they are experiencing performance problems? Never...! I'll be in a weekly meeting and users will say that they experienced issues 3 days ago! So I was looking forward to this ability.

Much to my surprise, though, when I clicked on the performance tab in Database Control, the data was a couple of days old. Ten snapshots after the upgrade to 10g finished, AWR had suddenly stopped working. I checked the alert log, trace files, etc. but the only error I could find occurred when I tried to change the AWR settings; i.e., changing the snapshot interval would consistently produce the following error:


ORA-13516: AWR Operation failed: only a subset of SQL can be issued ORA-06512: at
"SYS.DBMS_WORKLOAD_REPOSITORY", line 10 ORA-06512: at
"SYS.DBMS_WORKLOAD_REPOSITORY", line 33 ORA-06512: at line 1

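For the record, the snapshot interval is normally changed through dbms_workload_repository (the values below are illustrative, in minutes):

begin
  dbms_workload_repository.modify_snapshot_settings(
    interval  => 30,       -- snapshot every 30 minutes
    retention => 43200);   -- keep 30 days of snapshots
end;
/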

A search of Metalink didn't turn up any good hits, so I opened a TAR. An RDA and a few updates later, they asked me to execute the following query to see if AWR was in restricted mode. In our case, it was not:

SQL> select nam.ksppinm name, val.KSPPSTVL, nam.ksppdesc description
2 from x$ksppi nam, x$ksppsv val
3 where nam.indx = val.indx AND
4 nam.ksppinm = '_awr_restrict_mode'
5 order By 1
6 /

NAME
--------------------------------------------------------------------------------
KSPPSTVL
--------------------------------------------------------------------------------
DESCRIPTION
--------------------------------------------------------------------------------
_awr_restrict_mode
FALSE
AWR Restrict Mode


The only other suggestion was to restart the database. This past weekend we had to apply some one-off patches, so we took advantage of the maintenance window and restarted the database. Since then it has been working properly. I'm not a big fan of this kind of solution; I'd rather know the root cause and whether it's preventable. But in this case it seems Oracle Support is not sure either.

Monday

ORA-00376: file 6 cannot be read at this time

It seems that each night I try to hit the sack a little early (1am), I get paged shortly after. Last night was no exception, with a page coming in from one of our application support teams. Users in Australia were having issues logging into one of our customer support applications.

All of the environments I support are unix based, but our team does support some Windows servers. So while I'm on call I'm responsible for those as well.

The first thing I did was check the alert log, and I noticed the following errors repeating over and over:

Errors in file d:\orant9i\admin\orcl\udump\orcl_j000_2096.trc:
ORA-12012: error on auto execute of job 22
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: 'D:\ORCL\MYDATA01.DBF'
ORA-06512: at "SCHEMA.STOREDPROCEDURE", line 3
ORA-06512: at line 1


I seemed to recall some of our other Windows based database servers having issues in the past with backup software locking the files. So the next thing I did was call the Intel Support group in Australia and ask them if any backups had been executed that day. Indeed, this past weekend's full backups had failed and they had initiated some manual backups.

The support person confirmed that the OS backup had completed, so I restarted the database. The database opened successfully and I checked the alert logs to verify there were no errors. As in past occurrences, these steps resolved the issue.

Since I couldn't get back to sleep, I kept an eye on the alert log and surfed the web on my home PC. About 20 minutes later the error appeared again, followed by an email from a manager in Australia.

I was kind of surprised by this, so I checked v$datafile. The status of 3 files was "RECOVER". My first thought was: why do I always get the fun issues when I'm on pager?!
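
For the curious, a simple check along these lines shows the affected files:

select file#, name, status
  from v$datafile
 where status = 'RECOVER';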

This was a relatively small database, and before I attempt any recovery I always take a cold backup (if time allows). Once I restarted the database I executed the following commands for each datafile that was marked as needing recovery:

SQL> recover datafile 'c:\my\datafile.dbf';
SQL> alter database datafile 'c:\my\datafile.dbf' online;

Media recovery was successful and the problem did not reoccur. So, just a note for the Windows DBAs out there: if you have automated backup software, ensure that your datafile directories are excluded. While on unix the worst case would be an inconsistent backup, on Windows file locking behaves differently and could cause an outage like the one I experienced.

Well, I'm off to bed shortly and hopefully will get a good night's sleep!

Wednesday

Resetlogs and your Rman catalog

This is one of those "I knew it at one point, but it's been so long ago" items. Also, the manner in which it was discovered had us scratching our heads for a minute. I try not to blog about work specific items; you never know who is watching out there, and I'm sure we have all heard of bloggers who have been fired. Playing it safe, let's just say someone made a boo boo...

So just a reminder that when a database is opened with resetlogs you're creating a new incarnation of the database. If you're using an RMAN catalog, you will have to connect to the catalog and execute reset database so that the new incarnation becomes the current one. Otherwise you will get an error:

Recovery Manager: Release 9.2.0.8.0 - Production

Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.

connected to target database: MYDB (DBID=2508956331)
connected to recovery catalog database

RMAN> 2> 3> 4> 5> 6>
Starting backup at 11-JUL-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 07/11/2007 03:30:00
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20003: target database incarnation not found in recovery catalog
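
A quick way to see which incarnations the catalog knows about is to query the catalog schema directly. After the reset you should see the new resetlogs SCN and time listed:

select dbid, name, resetlogs_change#, resetlogs_time
  from rc_database_incarnation
 order by resetlogs_time;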

Friday

X Server and EBS

According to Note: 231137.1, Troubleshooting the Self Service Framework with Oracle Applications, you need access to an X server for dynamic image generation. These images are used for the buttons and menu bars that you see on self service pages. Before I get too far, I should probably add that if you're on Windows you don't need an X server.

Metalink note: 181244.1 lists the different options you have available and describes how to configure each. In summary, the 3 options are:

1. Use the X display of your server.
2. If your server doesn't have a display you can use 3rd party software such as Xvfb or VNC.
3. A desktop X display server - i.e., use 3rd party software such as Exceed on a PC.

If your environment is autoconfig enabled, you can look at your context file to see where your display is currently pointed:

ora@myserver=> grep display *.xml
ORA_myserver.xml: myserver:0.0

You can also see it in your $IAS_ORACLE_HOME/Apache/Jserv/etc/jserv.properties file

wrapper.env=DISPLAY=myserver:0.0


To test whether or not the application is able to access the X Server you can use the AOL test UI:

http://hostname:port/OA_HTML/jsp/fnd/aoljtest.jsp

The first thing you will be prompted for is your connection details.

After you click on the TEST button, scroll to the bottom of the screen and click on "Enter AOL/J Setup Test". On the left hand side of the following screen you'll see a summary of the environment you connected to and a list of tests. Scroll down on the left hand side until you see "X Server Accessibility". Once you click on that link, the right hand frame will show you whether or not the X server is accessible.

Note 231137.1 also describes some of the common errors you may see if the X server is not accessible. However, sometimes you will get rather cryptic errors if the application can't talk to an X server.

For the record, we use the X display of our server (option 1 above). In order to use this option the unix administrators have to type xhost + on the console to grant access to the X server. A caveat is that the unix administrators have to lock their terminal; if they log out, access is revoked. This past weekend we had some maintenance in which the servers needed to be rebooted, after which the unix admins logged out of the terminal. Sunday morning, users reported an error in iProcurement.

The error gave us very little to go on. Since this is a self-service web application, the first thing to check is the Apache and Jserv log files. In the jserv.log file was the following error:

[24/06/2007 14:38:08:378 EDT] OAButtonBean, localName='button': oracle.cabo.style: Could not initialize the graphical environment
. Please make sure that the DISPLAY environment variable is set correctly. Proceeding with image generation disabled...


Using the X server test via the AOL Test UI we verified that the application couldn't talk to the X server. After a call to the unix team, they granted us access, "locked" the screen, and once Apache was bounced the error was resolved. So, in a nutshell, keep that link above handy. We have hit numerous weird errors in the past, and sometimes it takes a lot of investigation to find out in the end that it was just the X server.

I believe in R12 the need for an X server is gone, but we are a long way from R12... For now we may investigate using option 2 or 3 above so that we can manage this piece ourselves.

Thursday

10.2.0.2 Rman duplicate issue

Each night we use RMAN's duplicate database feature to refresh a test environment to aid in troubleshooting production issues. This past weekend we upgraded our production database from 9i to 10gR2. We also applied a lot of patches, and the simplest way to update our test environment was to reclone it.

Last night we used our refresh scripts (RMAN duplicate) to rebuild the test database. Early this morning, users reported an application error.

Since this problem wasn't reproducible in production, we knew it had to be something with the refresh. The first thing we did was look for differences between the production and test databases. Some objects were invalid in test, but recompiling them didn't resolve the issue. Checking through the log files for the refresh, I noticed that autoconfig didn't complete successfully.


Uploading Context file /apps/appsora/oraappl/admin/ora_myserver.xml
Context File upload for /apps/appsora/oraappl/admin/ora_myserver.xml failed.
Check log file created by FNDCPUCF program.
. . .
[AutoConfig Error Report]
The following report lists errors AutoConfig encountered during each
phase of its execution. Errors are grouped by directory and phase.
The report format is:


[PROFILE PHASE]
AutoConfig could not successfully execute the following scripts:
Directory: /apps/appsora/oracomn/admin/install/ora_myserver
afcpctx.sh INSTE8_PRF 1


AutoConfig is exiting with status 1



afcpctx.sh uploads your context file to the database by calling FNDCPUCF. I assume FNDCPUCF stands for FND Concurrent Process Upload Context File, but I can't seem to find much information on it. Since it fires off a concurrent request, the next step was to check the log files generated for errors. The following error was in one of them:

Uploading context file...
Oracle error -1187: ORA-01187: cannot read from file 702 because it failed verification tests
ORA-01110: data file 702: '/oradata/dbf/oradata/temp02.dbf' has been detected in FND_GSM_UTIL.APPEND_CTX_FRAGMENT.


I hadn't encountered that error before, but a quick search on Metalink turned up the following note:


Subject: RMAN After 10.2.0.2 Duplicate Database Gets Ora-1187 And Ora-1110 on TempFiles
Doc ID: Note:374934.1


In 9i, tempfile information is removed from the database during the duplication process and you need to re-add the tempfiles manually. In 10g, tempfiles are included. However, if the files already exist on the OS then the tempfiles are not re-created, and you will see errors in the alert log. So, in order to resolve this issue we dropped the tempfiles and re-added them (as per the metalink note).

To drop tempfiles:

alter database tempfile '/u01/oracle/mydb/temp01.dbf' drop;

To add a new one (assuming your temporary tablespace is named temp):

alter tablespace temp add tempfile '/u01/oracle/mydb/temp01.dbf' size 500M;


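A quick sanity check that the tempfiles came back (v$tempfile is the 10g view for this):

select file#, name, bytes/1024/1024 mb, status
  from v$tempfile;
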
Once this was done, autoconfig completed without error and we restarted the application.

Friday

10g Upgrade issues

This weekend we are upgrading our E-Business Suite database from 9i to 10g (10.2.0.2 to be specific). Today, while performing some final checks, I noticed that an initialization parameter was not set properly in our test environment.

I had used the script bde_chk_cbo.sql from Metalink note: 174605.1, which builds a report consisting of the current, required and default values of your initialization parameters. The parameter in question was optimizer_secure_view_merging, and according to the report it was undefined, which means it was using the default value of TRUE. According to Note 216205.1, Database Initialization Parameters for Oracle Applications 11i, this parameter should be set to FALSE.

I checked my logs of the test upgrade and I did indeed set this parameter to FALSE. So I performed a search on Metalink and found Note 404646.1, How To Set Parameter optimizer_secure_view_merging = FALSE Using Autoconfig. In summary, autoconfig doesn't update your parameter file properly unless you're on version 115.7 of the template file $ORACLE_HOME/appsutil/template/afinit_db102.ora; we have version 115.3 installed. Setting this parameter to FALSE is needed to avoid some performance issues with dictionary queries. If this applies to your environment, you should review the metalink note for the workaround.
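
A quick way to double check the parameter after running autoconfig (isdefault tells you whether it was explicitly set):

select name, value, isdefault
  from v$parameter
 where name = 'optimizer_secure_view_merging';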

Over the past month or so we have upgraded many environments to 10g, reverted some back to 9i, etc. It started to get confusing as to which environments were running 10g and which were still on 9i. (We have about 14 dev/test/project environments right now, with more planned.) One of my coworkers started up a 9i database with the 10g Oracle Home and we noticed something interesting:

ALERT: Compatibility of the database is changed from pre-10i to 10.2.0.0.0.
Increased the record size of controlfile section 2 to 256 bytes
The number of logical blocks in section 2 remains the same
Increased the record size of controlfile section 4 to 428 bytes
Control file expanded from 1614 blocks to 1644 blocks
Increased the record size of controlfile section 9 to 56 bytes
Control file expanded from 1644 blocks to 1672 blocks
Increased the record size of controlfile section 10 to 200 bytes
Control file expanded from 1672 blocks to 1694 blocks
kcv_lh_or_upgrade: 10.2 upgrading 1 incarnations
Control file expanded from 1694 to 1990 blocks for upgrade.
Setting recovery target incarnation to 1
Wed Jun 20 00:11:34 2007
Successful mount of redo thread 1, with mount id 764203184
Wed Jun 20 00:11:34 2007
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Wed Jun 20 00:11:34 2007
ALTER DATABASE OPEN


As you can probably guess from the messages above, the control file was modified. When the database was shut down and started with the 9i software, the following happened:


ORA-00201: controlfile version 9.2.0.0.0 incompatible with ORACLE version
9.2.0.0.0


Yikes! On Monday I'm going to see if I can recreate the control files and start up the database, but right now I'm off to upgrade our production EBS database to 10g.

Patching and EBS

A couple of people [ Steven Chan , Niall Litchfield ] have blogged over the past week about patching an Oracle EBS environment.

Steve says, "I'm compelled to make an admittedly-quixotic attempt to persuade you that the benefits of keeping your E-Business Suite environment up-to-date far outweigh the costs."

In general I agree with him, based on my past experiences. A few times, due to a number of reasons, I haven't been able to do as much patching as I would have liked. That has come back and bitten us by hitting issues that we wouldn't have encountered otherwise. There can be a substantial cost to troubleshooting issues; it usually requires time from DBAs, developers, QC and end users, and that can add up pretty quickly.

For us, though, there are three reasons why we are not as current as I would like. First, for large efforts such as upgrades, it's very difficult to put a dollar value on being reactive instead of proactive. There is also no guarantee that being up to date will reduce the number of issues or resolve outstanding ones. That makes it difficult to make a case with upper management.

The second reason is testing requirements and/or testing resources. We have some custom modules and code, so there is a fairly rigorous testing cycle for any change that goes into our environment. We have 2 teams dedicated to testing, and at times we have to bring in end users. A testament to the testers, though, is that we rarely encounter issues after changes go live.

Lastly, since the business has a backlog of requests, our work often takes a back seat. It's a continuous fight to be able to put in some of the things we would like to do. I'm happy to say, though, that we have done a bit of work over the past year to upgrade our hardware, consolidate environments, and implement features such as a Shared Application Filesystem and merged patches to reduce downtime. We recently put in multiple application tiers, and without a shared filesystem maintenance would be a nightmare.

We have some very aggressive projects going on at the moment. We are upgrading to 10gR2 and implementing the Sales Compensation (OIC) and Projects modules. And if that wasn't enough, we're also upgrading to 11.5.10.2. I think it's safe to say that the next 6-8 months are going to be very busy.

Thursday

EBS 10g upgrade problem

Recently we tried to upgrade our database from 9.2.0.6 to 10.2.0.2 following Metalink Note: 362203.1. [FYI: 10.2.0.3 is now certified with EBS and the metalink note has been updated to reflect that.]

The upgrade was applied to a dev environment with no issues and passed the developers sanity checks. Next we upgraded our stage environment but an issue was flagged during user acceptance testing. The issue was isolated to one custom form.

What made this issue confusing was that the problem only existed in stage, not our dev environment. To cut a long story short, the problem was related to a known 10g upgrade issue.

Note: 345048.1
Subject: 'Group By' does not sort if you don't use order by in 10g.

The gist of this note is that in 10g the group by clause uses a hash algorithm to group data, while in 9i a sort algorithm was used. As all DBAs know, there is no guarantee that rows will be returned in the same order on each execution unless you use an order by clause. I guess at the time the form was created, the results happened to be returned in an order favorable to the developer.
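
A trivial illustration (emp is just the stock demo table): in 9i a query like this tended to come back ordered by deptno as a side effect of the sort-based group by, while 10g's hash group by returns rows in arbitrary order. Only the order by makes the ordering a guarantee:

select deptno, count(*)
  from emp
 group by deptno
 order by deptno;  -- required if anything depends on sorted output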

This doesn't answer why the problem wasn't consistent between the two upgraded environments. When comparing init parameters between both databases, we noticed that the optimizer_features_enable parameter was not the same. In the environment with the form issue the value was 10.2.0, while the other environment was set to 9.2.0. This is actually one of the workarounds stated in the note above.

As part of the 10g upgrade for EBS you have to modify the init parameters detailed in Note 216205.1, Database Initialization Parameters (init.ora settings) in Oracle Applications Release 11i. While upgrading the problem environment, the DBA had issues resetting a few parameters; optimizer_features_enable was one of them, so it remained at its 9.2.0 value.

The command used to reset an init parameter if you're using an spfile is:

SQL> alter system reset optimizer_features_enable scope=spfile sid='*';

In one environment it worked, but the problem environment returned an error:

ERROR at line 1:
ORA-32010: cannot find entry to delete in SPFILE


In the end the problem was solved, but if both upgrades had been identical, some of the confusion and a lot of work would have been avoided. The process of finding this solution was pretty lengthy.

As a result, we have asked our developers to go through custom code and verify that order by clauses are being used wherever a query also has a group by. It also highlights the importance of making sure identical steps are followed when promoting changes, and of investigating anything different.

Shared Application Filesystem

Recently we upgraded our production environment to a shared application file system. The intention of posts like these isn't to list step by step the tasks we followed while performing the upgrade, but to provide you with an overview of our approach and the issues you may encounter.

To start, it would probably be best to give you an overview of our architecture. Our production environment is in use by almost everyone in the company: if you have an expense, we use EBS; if you have to submit a timesheet, chances are you use EBS. Of course, it's mostly used as a financials environment, but by the end of this year more and more modules will be brought online.

We have a load balanced application tier, a single database node with a cold failover, and most of the environment resides on a SAN.

Details on how to implement a Shared Application filesystem can be found in Note: 233428.1

Here are the basic steps we followed:
  • Merge existing APPL_TOPs. The document above doesn't say which node should be the primary in a shared application filesystem. I did encounter one note, 372758.1, which mentions the WEB node must be primary. In the end we chose the admin node as primary, since it would cut down on the implementation time.

    When running perl adcfgclone.pl appsTier on the primary node after you copy the files from the secondary nodes, you have to say yes to all services.
  • Share the filesystem to the other nodes. Make sure the filesystem is mounted in nolock mode if this is a unix based environment, or you may encounter the problem described in Note: 390425.1 - Cannot Start Apache After Adding A Web Node On Shared APPL_TOP. For our environment, we shared a SAN mount point on the dbTier via NFS to the application tiers.
  • Add a node to the shared application filesystem. Note: you will need to create a location for each application tier to store configuration files, and it needs to be created before you run txkSOHM.pl. In our environment, we used local disk to avoid the issue above.


Issues Encountered
Even though we ran the adadmin option to check for missing files, there were still some missing after the merge. Files that were customized on the secondary nodes but not the primary node (pre-merge) did not get updated; one example is the appslogo.gif file. I'm not sure what logic Oracle follows when it determines which files need to be brought over from the secondary servers, so I'm assuming that since appslogo.gif already existed on the primary server, Oracle didn't replace it.

We also noticed that there were quite a few files missing under the $OA_HTML subdirectories.

We have SSL set up in our environment, so we had to re-enable it via the txkrun.pl script. I believe the reason we did this was so we didn't have to manually edit all the files it touches. You will need to copy your SSL certificates to the location you created on each server for configuration files.

Since we have a load balanced environment, we had to change the related context file values to the appropriate settings. It appears the clone conveniently resets those values (i.e., Web entry%, Active Web Port, Login page, etc.).

If you receive the infamous yellow security bar within forms after you regenerate the jar files, and you didn't choose the web tier as your primary server, then you may need to copy your adcert.txt, adsign.txt and adappltop.cer to the primary node.

Maintenance mode via adadmin only works on the primary server. I hit this issue once and haven't double checked it to confirm. Since all patching is performed on the primary server this really isn't an issue.

Each time we run autoconfig, a number of files have to be fixed up afterwards, such as adovars.env, as well as some context variables.

As I have mentioned above, on our application tier we have forms/web and on the database tier concurrent/admin. The problem with our approach is that all the services are enabled on the dbTier, so when you execute adstrtal.sh all of the services, even Apache, start up. Obviously this is not something you want. You can maintain your own start/stop scripts, remove the scripts for the services you don't need (and see some errors when running adstrtal.sh), or you can rerun adcfgclone.pl on the primary tier and disable the services you don't want.

When you execute adpatch, it checks the $APPL_TOP/admin/adconfig.txt file for which services are enabled on that tier. This file will still show that all services are available, so patching is not an issue. We tested this configuration by applying some patches in our sandbox and confirming that forms and web files were updated properly.

Monday

Recreating your Global inventory

As every Apps DBA knows, we apply a lot of patches. For some reason, every now and then, the global inventory gets corrupted (for lack of a better term) in our cloned environments. For example:

oraora@myserver=> opatch lsinventory

Oracle Interim Patch Installer version 1.0.0.0.55
Copyright (c) 2006 Oracle Corporation. All Rights Reserved..

We recommend you refer to the OPatch documentation under
OPatch/docs for usage reference. We also recommend using
the latest OPatch version. For the latest OPatch version
and other support related issues, please refer to document
293369.1 which is viewable from metalink.oracle.com

Oracle Home = /oradata/app/oradb/920_64bit
Location of Oracle Universal Installer components = /oradata/app/oradb/920_64bit/oui
Location of OraInstaller.jar = "/oradata/app/oradb/920_64bit/oui/jlib"
Oracle Universal Installer shared library = /oradata/app/oradb/920_64bit/oui/lib/solaris/liboraInstaller.so
Location of Oracle Inventory Pointer = /var/opt/oracle/oraInst.loc
Location of Oracle Inventory = /oradata/app/oradb/920_64bit/inventory
Path to Java = /oradata/app/oradb/920_64bit/jre/1.4.2/bin/java
Log file = /oradata/app/oradb/920_64bit/.patch_storage//*.log

Creating log file "/oradata/app/oradb/920_64bit/.patch_storage/LsInventory__03-05-2007_16-43-00.log"

Result:

LsInventory: OPatch Exception while accessing O2O


OPATCH_JAVA_ERROR : An exception of type "OPatchException" has occurred:
OPatch Exception:
OUI found no such ORACLE_HOME set in the environment
Can not get details for given Oracle Home
An exception occurs
null


There is no Interim Patch



OPATCH_JAVA_ERROR : An exception of type "OPatchException" has occurred:
Can not get a list of inventory on this home.

ERROR: OPatch failed because of Inventory problem.



If you do a search on Metalink you'll see a few potential resolutions, which range from verifying your oraInst.loc file to manually editing the inventory.xml file within your global inventory.

For 11i environments, Oracle provides us with a method to recreate the global inventory. For complete details you can view Note: 295185.1, but the key steps are:

1. Verify the inventory_loc within oraInst.loc.
2. Take a backup of your inventory location.
3. Make sure you are on OUI 2.2 or later.

If your iAS ORACLE_HOME is missing from the inventory.xml file:

4. Sync your apps environment file.
5. Change to $IAS_ORACLE_HOME/appsoui/setup and execute perl OUIsetup.pl

If your RDBMS ORACLE_HOME is missing from the inventory.xml file:

4. Sync your database environment file.
5. Change to $ORACLE_HOME/appsoui/setup and execute perl OUIsetup.pl

Ex. Executing on the database tier:


oraora@myserver=> . ./DBORA.env
oraora@myserver=> cd $ORACLE_HOME/appsoui/setup
oraora@myserver=> perl OUIsetup.pl
ORACLE_HOME : /oradata/app/oradb/920_64bit (APPSDB_ORA)
OUI Location : /oradata/app/oradb/920_64bit/appsoui/oui
CLASSPATH : /oradata/app/oradb/920_64bit/appsoui/setup/OUISetup.jar:/oradata/app/oradb/920_64bit/appsoui/oui/lib/OraInstaller.jar:/oradata/app/oradb/920_64bit/appsoui/oui/lib/xmlparserv2.jar
JRE_INST : /oradata/app/oradb/920_64bit/appsoui/jre/1.1.8

Executing OUI setup now:
/oradata/app/oradb/920_64bit/appsoui/jre/1.3.1/bin/java -classpath /oradata/app/oradb/920_64bit/appsoui/setup/OUISetup.jar:/oradata/app/oradb/920_64bit/appsoui/oui/lib/OraInstaller.jar:/oradata/app/oradb/920_64bit/appsoui/oui/lib/xmlparserv2.jar -DDEBUG_MODE=false oracle.apps.ad.util.OUISetup /oradata/app/oradb/920_64bit /oradata/app/oradb/920_64bit/appsoui/oui /oradata/app/oradb/920_64bit/appsoui/jre/1.1.8 APPSDB_ORA

Log file : /oradata/app/oradb/920_64bit/appsoui/setup/OUIsetup.log
Checking inventory ...
Updating registration of ORACLE_HOME in global inventory...
Oracle home registration completed.
Removing OUI entries from the global inventory: /oradata/home/oraagf/oraInventory/ContentsXML/comps.xml
Removing OUI entries from the global inventory completed successfully
Registering the OUI component with local inventory...
OUI 2.2 registration completed.


OPatch lsinventory should now work fine.