Re: intermittent long "log file sync" waits

From: Chris Stephens <cstephens16_at_gmail.com>
Date: Wed, 29 Jan 2020 08:49:12 -0600
Message-ID: <CAEFL0szY4bMDek2qv4KFbtu6puEtqL00TWcYA4U8ZYmCZnikhg_at_mail.gmail.com>



holy shnickees, pSnapper is awesome. I had already found quite a few stuck backup-related processes with pstack before pulling down and trying pSnapper; after killing those sessions, the remaining RMAN processes showed progress, with pSnapper output like:

psnapper]# ./psn -p 874 -g comm,state,syscall

Linux Process Snapper v0.14 by Tanel Poder [https://tp.dev/psnapper]
Sampling /proc/syscall, stat for 5 seconds... finished.

Active Threads
 samples | avg_threads | comm            | state                  | syscall
------------------------------------------------------------------------------
      43 |        0.43 | (oracle_*_lsst) | Disk (Uninterruptible) | pread64
      31 |        0.31 | (oracle_*_lsst) | Disk (Uninterruptible) | lstat
       7 |        0.07 | (oracle_*_lsst) | Running (ON CPU)       | pread64
       6 |        0.06 | (oracle_*_lsst) | Running (ON CPU)       | fstatfs
       3 |        0.03 | (oracle_*_lsst) | Disk (Uninterruptible) | [running]
       3 |        0.03 | (oracle_*_lsst) | Disk (Uninterruptible) | open
       3 |        0.03 | (oracle_*_lsst) | Running (ON CPU)       | [running]
       1 |        0.01 | (oracle_*_lsst) | Disk (Uninterruptible) | fstatfs
       1 |        0.01 | (oracle_*_lsst) | Running (ON CPU)       | open
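
pSnapper can group by other /proc-derived columns too. something like this would show where in the kernel those D-state pread64 calls are actually blocked (just a sketch: the wchan/kstack column names are from psn's help output as i remember it, and kstack needs root plus a kernel that exposes /proc/<pid>/stack):

./psn -p 874 -g comm,state,syscall,wchan
./psn -p 874 -g comm,state,syscall,kstack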

it turns out we (I) copied the backup scripts and cron entries over to another cluster I built a few weeks ago. I thought I had replaced all references to the database experiencing the issues, but that was not the case. I killed all of the processes kicked off by the backup scripts on this new RAC that were connecting to the problem RAC, and now the ASH output is looking much better.
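
for anyone hitting something similar, a rough way to find the leftovers on the new cluster would be along these lines (sketch only; the script path and service name are just the ones from the backup command quoted further down in this thread):

# stale cron entries still pointing at the problem database
crontab -l | grep lsst2db
# stale connect strings left in the copied scripts
grep -rn lsst2db /home/oracle/scripts/rman/
# backup wrappers still running against it ([o] keeps grep from matching itself)
ps -ef | grep '[o]racle_backup.sh'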

I will continue to monitor for unreasonable "log file sync" waits, but I suspect this was the cause of the issue. Possibly locks on the controlfile (the CF enqueue)? Not sure why that wouldn't show up in ASH though.
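
in the meantime, a quick-and-dirty ASH check like this is easy to leave running (sketch only; assumes a local sysdba connection and the diagnostics pack license, adjust the window to taste):

#!/bin/bash
# count ASH samples per minute where a session was waiting on "log file sync"
# over the last hour
sqlplus -s "/ as sysdba" <<'EOF'
set pages 100 lines 200
select to_char(sample_time,'YYYY-MM-DD HH24:MI') sample_minute,
       count(*) ash_samples
from   gv$active_session_history
where  event = 'log file sync'
and    sample_time > sysdate - 1/24
group  by to_char(sample_time,'YYYY-MM-DD HH24:MI')
order  by 1;
exit
EOF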

On Wed, Jan 29, 2020 at 2:13 AM Stefan Koehler <contact_at_soocs.de> wrote:

> Hello Chris,
> as I already mentioned - please use pSnapper (
> https://github.com/tanelpoder/psnapper ) by Tanel Poder.
>
> pSnapper samples the /proc file system and does not suspend the process the
> way pstack does (pstack is just a wrapper around gdb, which uses ptrace()
> under the hood). Suspending a process can screw it up :)
>
> In addition, pSnapper provides more useful information (e.g. the process
> state, the C function running in kernel space for a specific syscall, etc.).
>
> Best Regards
> Stefan Koehler
>
> Independent Oracle performance consultant and researcher
> Website: http://www.soocs.de
> Twitter: _at_OracleSK
>
> > Chris Stephens <cstephens16_at_gmail.com> hat am 28. Januar 2020 um 23:49
> geschrieben:
> >
> > I was just getting ready to sign off and noticed that the archivelog
> backup scheduled to run every hour seems to be stuck, and has been for at
> least 10 mins:
> >
> > [oracle_at_lsst-oradb04 ~]$ ps -ef | grep oracle_backup.sh
> > oracle 13369 31167 0 16:43 pts/0 00:00:00 grep --color=auto
> oracle_backup.sh
> > oracle 14649 14645 0 16:00 ? 00:00:00 /bin/bash
> /home/oracle/scripts/rman/oracle_backup.sh -d lsst2db -s archivelog.rman -c
> lsst2db_rman -e cs2018_at_ncsa.illinois.edu -r iddsdba_rman
> > [oracle_at_lsst-oradb04 ~]$ pstack 15649
> > Process 15649 not found.
> > [oracle_at_lsst-oradb04 ~]$ pstack 14649
> > #0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
> > #1 0x0000000000440b84 in waitchld.isra.10 ()
> > #2 0x0000000000441e3c in wait_for ()
> > #3 0x0000000000433b0e in execute_command_internal ()
> > #4 0x0000000000433d2e in execute_command ()
> > #5 0x000000000041e365 in reader_loop ()
> > #6 0x000000000041c9ce in main ()
> > [oracle_at_lsst-oradb04 ~]$ pstack 14649
> > #0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
> > #1 0x0000000000440b84 in waitchld.isra.10 ()
> > #2 0x0000000000441e3c in wait_for ()
> > #3 0x0000000000433b0e in execute_command_internal ()
> > #4 0x0000000000433d2e in execute_command ()
> > #5 0x000000000041e365 in reader_loop ()
> > #6 0x000000000041c9ce in main ()
> > [oracle_at_lsst-oradb04 ~]$ pstack 14649
> > #0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
> > #1 0x0000000000440b84 in waitchld.isra.10 ()
> > #2 0x0000000000441e3c in wait_for ()
> > #3 0x0000000000433b0e in execute_command_internal ()
> > #4 0x0000000000433d2e in execute_command ()
> > #5 0x000000000041e365 in reader_loop ()
> > #6 0x000000000041c9ce in main ()
> > [oracle_at_lsst-oradb04 ~]$ pstack 14649
> > #0 0x00007f9a6537e41c in waitpid () from /lib64/libc.so.6
> > #1 0x0000000000440b84 in waitchld.isra.10 ()
> > #2 0x0000000000441e3c in wait_for ()
> > #3 0x0000000000433b0e in execute_command_internal ()
> > #4 0x0000000000433d2e in execute_command ()
> > #5 0x000000000041e365 in reader_loop ()
> > #6 0x000000000041c9ce in main ()
> > [oracle_at_lsst-oradb04 ~]$
> >
> >
> > On Tue, Jan 28, 2020 at 4:30 PM Chris Stephens < cstephens16_at_gmail.com>
> wrote:
> > > ok, I've had the following running since lunchtime and will let it run
> through the night in hopes that the issue occurs again:
> > >
> > > _at_snapper ash,stats,trace 10 999999 lgwr
> > >
> > > I will run the following script as well:
> > >
> > > #! /bin/bash
> > >
> > > while [ 1 ]
> > > do
> > > echo `date` >> /tmp/pstack.26552
> > > pstack 26552 >> /tmp/pstack.26552
> > > echo `date` >> /tmp/pstack.26556
> > > pstack 26556 >> /tmp/pstack.26556
> > > echo `date` >> /tmp/pstack.26560
> > > pstack 26560 >> /tmp/pstack.26560
> > > sleep 5
> > > done
> > >
> > > based off:
> > >
> > > [oracle_at_lsst-oradb05 bin]$ ps -ef | egrep "lg.*lsst2db2"
> > > oracle 26552 1 0 Jan17 ? 00:06:14 ora_lgwr_lsst2db2
> > > oracle 26556 1 0 Jan17 ? 00:00:13 ora_lg00_lsst2db2
> > > oracle 26560 1 0 Jan17 ? 00:00:12 ora_lg01_lsst2db2
> > >
> > > All sessions for this workload are connecting to a service w/ tracing
> enabled, so we'll have trace data as well.
> > >
> > > I will (hopefully) have updates in the morning.
>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Jan 29 2020 - 15:49:12 CET
