Oracle-L: RE: Database Monitoring tool

From: Stephen Lee <slee_at_dollar.com>
Date: Fri, 01 Nov 2002 09:15:23 -0800
Message-ID: <F001.004F9F4A.20021101091523@fatcity.com>

What I found happening with monitoring tools that use agents is that these do not connect to the database in the same way that the applications connect; and there were times when the agent thought everything was A-OK when they weren't. So I wrote a ksh script which has grown to 1402 lines. Without handing out a copy of
the entire script, I think I can give you a good idea of the approach I took. You will probably see things in the following script examples that are probably not necessary, but the result of extreme paranoia. My philosophy is: there is no such thing as too much paranoia. I included some script examples on what I thought might be some of the more obscure, less common stuff you see in a ksh script.

The high-level logic is:

Can I read my config files? Info on what databases to monitor; when to monitor; passwords to use; who to page; who to e-mail; status of each database after the last test; whether to send a page or just an e-mail or both when there is a problem with a database.
Is a copy of me already running?
Is my backup box up and running?
Is it time to do maintenance like take current log files and move them to a history location, and delete very old history, send out a summary e-mail of the current state of things (I do it at about 7:00 am).
Test the databases.
Send out any e-mail or pages.

Taking these in order:

It always good to write programs (contrary to the beliefs of some, scripts ARE programs) so that you avoid hard-coding configuration info. That way the script is as generic as possible, and all specific info is in config files. Unless you want to drive people crazy, you want to store the status of the databases at the last test and page only when statuses change rather keep sending out page after page for the same problem. The script also examines the config files themselves and reports any formatting errors therein.

When the script starts, it attempts to create a dummy soft link to a non-existent file. I believe this is an atomic operation. If this succeeds, we are good to go. If it fails, then it is almost certainly because the soft link already exists. So I use the ps command to parse running processes to see if another copy of this script is running. This looks like:

CREATE_LOCK_FILE () {
## The creation of the lock file is not necessary to check if another
copy of the script

## is running. But most of the time another copy will NOT be running, so
I think this

## will save the overhead of running ps and parsing its output. If the
lock file is not

## successfully created, then we run ps and parse its output to confirm
if another copy

## of the script is running.
## It also serves as an easy manual check that the script is running.

ln -s DO_BE_DO_DB_DOO "${ADMIN_DIR}/lock_file" 2> /dev/null if [ $? -ne 0 ]; then

      MY_NAME=`basename "$0"`
      MYPID=$$
      MYPPID=`ps -eo pid -o ppid -o args | sed 's/^ *//g;s/  */ /g' | awk
'$1 == PID {print $2}' PID=$MYPID`
      {
      ps -eo pid -o args | sed 's/^ *//g;s/  */ /g' | awk '
         $1 == PID {next}
         $1 == PPID {next}
         $2 == "vi" {next}
         $0 ~ "awk" {next}
         $0 ~ XXX {print "ALREADY RUNNING"}
      ' PID=$MYPID PPID=$MYPPID XXX=$MY_NAME
      } | while read LINE; do
         if [ "$LINE" = "ALREADY RUNNING" ]; then
            echo "`date '+%D %H:%M'`: Exiting. Already running."
            return 1
         fi
      done

return 0

}

Note: This is in a ksh function, hence the "return 1" rather than an exit. (Good, modular programming and all that, you know.)

Your script should have a trap function to remove the lock_file and any other needed cleanup if it gets a trappable signal. With the number of databases I need to check, and the fact that the free space check uses a significant amount of time, the script will usually run for about 2 minutes (give or take). But if the network is super-slow, or the target computers under extreme load, it is possible that the script will take longer. The script runs every 12 minutes (that's 5 times per hour) as a cron job.

In order to provide redundancy, the script is written so that when it gets its config info, it learns if it is on the primary box, or the backup box. In both cases, it tests the other box to see if it should continue with the database tests, or let the other box do it (since I am the secondary and the primary is up). It sends a notification if it sees a problem with the other box.

To test the database, the script connects to the database as system via sqlplus and selects the name of the database from v$database to see if it gets what it expects to get. The monitoring script runs on a box that is on a different subnet than any of the databases; so, in effect, I am implicitly monitoring the network (sort of) too. This also (sort of) monitors the listener setup on the remote box and the accuracy of my tnsnames.ora file. The philosophy here is one of trying to kill as many birds with one stone as I can. If I get the name of the database OK, then I run an anonymous block that looks at the free space to see if there are any tables with a NEXT extent greater than any free extent left in the tablespace. That is all that is monitored. You can get carried away with this and make the script so fat that it becomes unwieldy and difficult to manage. The test is as follows:

## Try to connect to database
## The sqlplus connection test is run in the background in case it hangs.
## Because it is running in the background, edits to the sid_status file
## by the background process cannot be done since we might have multiple
## background processes trying to edit the file at the same time.
## So we must APPEND to an intermediate test_results file.
## Then do the sid_status edits later.
(

      ## Start by assuming we have unknown error (status=2),
      ## then prove otherwise.
      STATUS=2
      {
      sqlplus -s <<-XXX
         ${LOCAL_DB_USER}/${LOCAL_DB_PASS}@${LOCAL_SID}
         whenever sqlerror exit failure
         whenever oserror exit failure
         set serveroutput on
         set heading off
         set feedback off
         select name from v\$database;
         declare
            cursor c1 is select tablespace_name from dba_tablespaces;
            v_maxspace number := 0;
            v_maxnext number := 0;
            v_maxnext_t number := 0;
            v_maxnext_i number := 0;
         begin
            dbms_output.enable(200000);
            for tspace in c1 loop
               select max(bytes) into v_maxspace from dba_free_space where
tablespace_name = tspace.tablespace_name;
               select nvl(max(next_extent),0) into v_maxnext_t from
dba_tables where tablespace_name = tspace.tablespace_name;
               select nvl(max(next_extent),0) into v_maxnext_i from
dba_indexes where tablespace_name = tspace.tablespace_name;
               if v_maxnext_t > v_maxnext_i then
                  v_maxnext := v_maxnext_t;
               else
                  v_maxnext := v_maxnext_i;
               end if;
               if v_maxnext > v_maxspace then
                  dbms_output.put_line('BLOWOUT IMMINENT');
               end if;
            end loop;
         end;
         /
         exit;
      XXX
      } | sed '/^$/d; s/^[    ]*//g' | while read LINE; do

		parse and analyze stuff .....

	done

If you have replication set up, you should probably monitor the status of that; but I use a separate, more basic script for that and run it much less frequently.

Since, for a number of reasons, sqlplus can hang (usually, database is hung or waaaaaaay slow, or on Linux, if network is broke or remote box is down), I take the entire chunk of the script that tests the database and run it in the background. These are kicked off one after the other, so all databases are tested simultaneously. Since the driving script can't get the output from all these, they write to an output file. Then, when all the background tests have finished, the script takes the output file and uses it to update the file that records the status of the database at the last test and the status at this test. Since, when I wrote the script, I did not know all the possible statuses I might get, these statuses are recorded as numbers which represent an index of a ksh array. As I discovered new things that could be wrong, I just added to the array. Currently, the array looks like:

   STATUS_LIST[0]='OK'
   STATUS_LIST[1]='TNSPING FAILED'
   STATUS_LIST[2]='UNSPECIFIED ERROR'
   STATUS_LIST[3]='SQLPLUS HUNG'
   STATUS_LIST[4]='NEXT EXTENT GREATER THAN AVAILABLE'
   STATUS_LIST[5]='ORACLE NOT AVAILABLE'
   STATUS_LIST[6]='FAILED TO ALLOCATE SHARED MEMORY'
   STATUS_LIST[7]='STARTUP OR SHUTDOWN IN PROGRESS'
   STATUS_LIST[8]='LISTENER COULD NOT RESOLVE SID'
   STATUS_LIST[9]='TNS COULD NOT RESOLVE NAME'
   STATUS_LIST[10]='NO LISTENER'
   STATUS_LIST[11]='PROTOCOL ADAPTER ERROR'
   STATUS_LIST[12]='MAX PROCESSES EXCEEDED'
   STATUS_LIST[13]='INVALID USERNAME/PASSWORD'
   STATUS_LIST[14]='NO APPROPRIATE SERVICE HANDLER'
   STATUS_LIST[15]='FAILED TO START DEDICATED SERVER'
   STATUS_LIST[16]='IO ERROR READING FILE'
   STATUS_LIST[17]='FILE CANNOT BE READ'
   STATUS_LIST[18]='ERROR WRITING TO AUDIT TRAIL'
   STATUS_LIST[19]='SNAPSHOT TOO OLD'
   STATUS_LIST[20]='SHUTDOWN IN PROGRESS'
   STATUS_LIST[21]='IMMEDIATE SHUTDOWN IN PROGRESS'
   STATUS_LIST[22]='END OF FILE ON COMMUNICATION'
   STATUS_LIST[23]='ARCHIVER ERROR. CONNECT INTERNAL ONLY'
   STATUS_LIST[24]='ORA-00600 INTERNAL ERROR'
   STATUS_LIST[96]='BLANK STATUS'
   STATUS_LIST[97]='INVALID SQLPLUS OPTION'
   STATUS_LIST[98]='JUST TESTING'
   STATUS_LIST[99]='CONFIG FILE ERROR'

The matching part of the script that conducts the tests has an awk script like:

         STATUS=`echo "$LINE" | awk '
            substr($0,1,3) == substr(LOCAL_SID,1,3) {print "0"; exit}
            $0 == "BLOWOUT IMMINENT" {print "4"; exit}
            /^ORA-01034/ {print "5"; exit}
            /^ORA-04031/ {print "6"; exit}
            /^ORA-01033/ {print "7"; exit}
            /^ORA-12505/ {print "8"; exit}
            /^ORA-12514/ {print "9"; exit}
            /^ORA-12541/ {print "10"; exit}
            /^ORA-12560/ {print "11"; exit}
            /^ORA-00020/ {print "12"; exit}
            /^ORA-01017/ {print "13"; exit}
            /^ORA-12519/ {print "14"; exit}
            /^ORA-12500/ {print "15"; exit}
            /^ORA-01115/ {print "16"; exit}
            /^ORA-00376/ {print "17"; exit}
            /^ORA-02002/ {print "18"; exit}
            /^ORA-01555/ {print "19"; exit}
            /^ORA-01090/ {print "20"; exit}
            /^ORA-01089/ {print "21"; exit}
            /^ORA-03113/ {print "22"; exit}
            /^ORA-00257/ {print "23"; exit}
            /^ORA-00600/ {print "24"; exit}
            /[Ee][Rr][Rr][Oo][Rr]/ {print "2"; exit}
            /^SP2-0306/ {print "97"; exit}
            ## If no matches above then keep current value of status.
            {print STATUS}
         ' LOCAL_SID="$LOCAL_SID" STATUS="$STATUS"`

Note: the awk script is part of that "parse and analyze" section of the while loop above.

If you're wondering about the "BLOWOUT IMMINENT", that's what the anonymous block returns that tests for free space.

After all these background processes are kicked off, the script runs the "jobs" command every 30 seconds, and keeps doing so until all the jobs have completed; or the maximum allotted time for the jobs has been reached; at which point, the jobs are killed and the status for those databases is logged as SQLPLUS HUNG. Here's what that looks like:

CHECK_JOBS_FOR_TIMEOUT () { MAX_INDEX=$1
CUR_INDEX=1

 ## Give sqlplus sessions plenty of time to do their thing.
 ## The background SQL stuff might take a while to complete.
 ## Adjust how long the script waits by changing the value in the while

condition

COUNT=1
while [ $COUNT -le 8 ]; do

## The first jobs command is to clear out any "jobs completed" messages.

      jobs > /dev/null
      if [ -z "`jobs`" ]; then break; fi
      sleep 30
      COUNT=$(( $COUNT + 1 ))

   done
   jobs > /dev/null
   echo "SLEPT $COUNT times" >> "$LOG_FILE"

## Kill any remaining jobs.

for JOB_NUMBER in `jobs | sed 's/$[^0-9]*$$[1-9][0-9]*$$.*$/\2/'`; do

      kill %${JOB_NUMBER}
      echo "killed job number $JOB_NUMBER" >> "$LOG_FILE"

done

## Then make note of any sqlplus sessions that did not appear to finish
## by checking to see if there is an entry in the test_results file.
## Note: Because some jobs can (and usually do) finish before other jobs
are

## started, the job numbers get recycled. So you can't match the index
in

## the JOB_SID array to its corresponding job number from the jobs
command.

while [ $CUR_INDEX -le $MAX_INDEX ]; do

      ## Read the test_results file to see if test for SID completed.
      SID_WAS_FOUND=`egrep -w \^${JOB_SID[$CUR_INDEX]} "$TEST_RESULTS"`
      ## If test got hung up and didn't complete, then record status 3.
      if [ -z "$SID_WAS_FOUND" ]; then
         ## Use status=3 for sqlplus hung.
         echo "${JOB_SID[$CUR_INDEX]} 3" >> "$TEST_RESULTS"
         echo "NOT DONE: Sqlplus did not complete on
${JOB_SID[$JOB_NUMBER]}" >> "$LOG_FILE"
      fi
      CUR_INDEX=$(( $CUR_INDEX + 1 ))

done

return 0
}

The majority of the script are checks looking for Murphy's Law (Paranoia is good!) rather than actually testing the databases. I'll include the "main" section of the script so you get a good idea of what is being done.

ONE VERY IMPORTANT NOTE: The box that runs this is an old 166 Mhz Pentium running Red Hat 7.1. The ksh that comes with Linux is crap. However, the fine folk in charge of the genuine ksh93 have made it available for download. Also, if you (as I did) write your script on one of the non-open source Unixes (i.e. Solaris, AIX, etc) DO NOT use good programming practice and use absolute paths for utility names. The egotistical propeller-heads in charge of Linux have decided, in their Great Wisdom, that the Unixes of the past didn't do things the way they should have; so now the mantle for fixing all the world's problems is resting on the shoulders of all-knowing, all-seeing, gurus of Linux. And you lowly pee-ons, who are just trying to get work done, should be a grateful recipient of their wisdom. If you are used to using nawk, stop. The unlimited intelligence of the Linux Propeller-heads has divined that you should no longer be using it ... sorry if it breaks your stuff ... but that's the price of "progress". (Hey guys. When, you take time off from sun-burning your tonsils long enough to fix ksh, get back to me. Oh, but dear me! I shouldn't be writing ksh scripts!)

######################### BEGIN MAIN ########################

SET_VARIABLES
if [ $? -ne 0 ]; then exit $?; fi

CREATE_LOCK_FILE
if [ $? -ne 0 ]; then exit $?; fi

CHECK_OTHER_BOX
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi

MAINTAIN_ATTIC
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi

MAINTAIN_OUTPUT_FILES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi

CHECK_CONFIG_FILES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi

SET_MORE_VARIABLES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi

#------------- begin loop to build SID_LIST

SID_LIST=''
## The following line creates environment variables DEFINED in config file (db_times)
## Note: the readability of the TIME_FILE was tested in CHECK_CONFIG_FILES eval `sed -n '/^DEFINE/s/$DEFINE *$$.*$/export \2/p' "$TIME_FILE"`

## BUILD LIST OF SIDS TO TEST
## The following sed statement is multiple lines. sed -n '

s/=/ /g
/^SID/s/$SID *$$.*$/\2/p
' "$TIME_FILE" | while read SID TIME_LIST; do

## If there is no entry for the SID in the sid_status file,
## create a default entry.

   ADD_TO_STATUS_FILE "$SID"
      if [ $? -ne 0 ]; then continue; fi
## Check that the monitoring time entries in the db_times file are legal
   VALIDATE_TIME_LIST $SID $TIME_LIST
      if [ $? -ne 0 ]; then continue; fi
## See if we are monitoring this SID at this time.
   GET_TIME_TO_MONITOR
      if [ $? -ne 0 ]; then continue; fi

## Skip any SIDS that have a config file error.
## Note that this is the only error that goes directly to the sid_status
file.

## This error is generated by the previous function calls.
## All other test output after this goes to the test_results first prior
## to being put into the sid_status file.
CONF_STAT=`awk '$1 == SID {print $3}' SID="$SID" "$SID_STATUS"` if [ "$CONF_STAT" = "99" ]; then continue; fi

## If we are monitoring this SID, then add it to the list.
SID_LIST="${SID_LIST} ${SID}" done
#------------- end loop to build SID_LIST

#------------- begin loop to test SIDs
JOB_INDEX=1
for SID in $SID_LIST; do

## Build an array of SIDs that we are testing.
   JOB_SID[$JOB_INDEX]="$SID"
   JOB_INDEX=$(( $JOB_INDEX + 1 ))
## This function will set the variable DB_PASS for the user SYSTEM.
   GET_SYSTEM_PASSWORD "$SID"
      if [ $? -ne 0 ]; then continue; fi

## This function will set the variable CONNECT which is the SQLNET
connect string.

GET_TNS_ENTRY "$SID"
if [ $? -ne 0 ]; then continue; fi

## See if we can sqlplus to the SID and do a simple select.
TEST_THE_SID "$DB_USER" "$DB_PASS" "$SID" if [ $? -ne 0 ]; then continue; fi

## Give the OS a little time between starting SQL Plus process.
## Maybe this will prevent bogus SQL Plus error messages (SP2-0306)
sleep 2

done
#------------- end loop to test SIDs

## See if there are any sqlplus sessions that did not complete by checking ## if there are test results for the session in the test_results file. MAX_INDEX=$(( $JOB_INDEX - 1 ))
CHECK_JOBS_FOR_TIMEOUT $MAX_INDEX
PUT_TEST_RESULTS_INTO_STATUS_FILE
CREATE_HTML_FILE
PAGE_AND_EMAIL_PEOPLE if [ $TIME -ge $MAINT_TIME -a $TIME -lt $(( $MAINT_TIME + $CRON_INTERVAL )) ]; then

SEND_DAILY_SUMMARY
fi

## If there is a secondary, fail-over box, update it with current info. if [ "`hostname`" = "$PRIMARY_BOX" -a -n "$SECONDARY_BOX" ]; then

RCP_SID_STATUS
fi

rm "${ADMIN_DIR}/lock_file"

exit 0

######################### END MAIN ########################

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Stephen Lee
  INET: slee_at_dollar.com

Fat City Network Services    -- 858-538-5051 http://www.fatcity.com
San Diego, California        -- Mailing list and web hosting services
---------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Fri Nov 01 2002 - 11:15:23 CST