Bobby Durrett's DBA Blog

Oracle database performance

SQL Profile example when best plan not clear

Tue, 2020-10-20 12:54

I resolved another production performance issue with a SQL Profile yesterday. I have several posts about SQL Profiles, so I do not want to be redundant, but this case was a little different because it was not clear that I had a better plan. I want to document the challenge that I had deciding if I had the best plan and show the resolution.

On September 21, when I think I was on vacation or otherwise out of the office, there was a big performance issue on an older production database. The on-call DBA identified the sql_id of the top SQL statement as 30q69rbpn7g75, but he and an application support developer together could not connect that SQL statement back to the long-running reports that were impacting the business. Eventually the issue went away later that night. Here is some of the execution history from that original issue:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75      1463081877 21-SEP-20 10.00.04.123 AM                5         420798.922         144724    209959.863                  0                      0                      0           4824516.8           710748.4                            0                    870
30q69rbpn7g75      1732425686 21-SEP-20 10.00.04.123 AM               13         66.9426923     56.1538462    7.25523077                  0                      0                      0          6410.23077         249.769231                            0             414.538462
30q69rbpn7g75       592872749 21-SEP-20 10.00.04.123 AM                1           4144.437           3240       955.246                  0                      0                      0               63878              29725                            0                     89
30q69rbpn7g75      4175108531 21-SEP-20 10.00.04.123 AM               11         172328.731     92788.1818    62448.1775                  0                      0                      0             3463219         466919.273                            0             610.090909
30q69rbpn7g75      2823019537 21-SEP-20 11.00.17.704 AM               19         332354.362     212357.895     22580.778                  0                      0                      0          11145610.8         163882.105                            0             303.526316
30q69rbpn7g75      2908690533 21-SEP-20 11.00.17.704 AM                1          23377.169          13070     11680.972                  0                      0                      0             1582917              89479                            0                    258
30q69rbpn7g75       291832905 21-SEP-20 11.00.17.704 AM                2         40314.0255          15940      24490.36                  0                      0                      0             1831813             128525                            0                    248
30q69rbpn7g75      1467059382 21-SEP-20 11.00.17.704 AM                1          20179.636           5760     16155.407                  0                      0                      0              124599              84761                            0                    780
30q69rbpn7g75      1033740578 21-SEP-20 11.00.17.704 AM                1            1728.49           1570         2.906                  0                      0                      0                1525                 35                            0                     12
30q69rbpn7g75      4175108531 21-SEP-20 11.00.17.704 AM                8         147782.833        59617.5    65356.3268                  0                      0                      0             2280007          245985.25                            0                    402
30q69rbpn7g75      3938646712 21-SEP-20 11.00.17.704 AM                2         139722.393          55905     86105.482                  0                      0                      0             6080269             616766                            0                 1143.5
30q69rbpn7g75      2823019537 21-SEP-20 12.00.31.659 PM               48         238332.678     138706.875    19077.4738                  0                      0                      0          6928661.85         99573.2708                            0             145.395833
30q69rbpn7g75      2823019537 21-SEP-20 01.00.45.195 PM               64         147520.373     80835.1563    19092.0985                  0                      0                      0          4148771.28         106131.016                            0              79.890625
30q69rbpn7g75      2823019537 21-SEP-20 02.00.58.673 PM               58         180185.939     113102.931    14365.2987                  0                      0                      0          5926129.21         123920.569                            0             22.0344828
30q69rbpn7g75      2823019537 21-SEP-20 03.00.12.404 PM               37         307432.645     201436.216    22904.6901                  0                      0                      0          10204978.9         158950.973                            0             201.243243
30q69rbpn7g75      2823019537 21-SEP-20 04.00.26.543 PM               28         465140.082     326940.357    30687.9033                  0                      0                      0          16715547.3             263153                            0             460.571429
30q69rbpn7g75      2823019537 21-SEP-20 05.00.40.707 PM               14         934982.157     690958.571    41595.1995                  0                      0                      0          34940770.8         365038.357                            0             243.285714
30q69rbpn7g75      2823019537 21-SEP-20 06.00.54.453 PM               14         818768.534     640054.286    42596.9506                  0                      0                      0          33547406.9         451864.786                            0                  471.5
30q69rbpn7g75      2823019537 21-SEP-20 07.00.08.229 PM                4         2329248.39        2013515    75722.5718                  0                      0                      0           104343531         1027683.25                            0                    859
30q69rbpn7g75      2823019537 21-SEP-20 08.00.21.705 PM                1         4006478.22        3707840     88265.422                  0                      0                      0           186157328            1082000                            0                   2744
30q69rbpn7g75      2823019537 21-SEP-20 09.00.34.774 PM                1         1818375.63        1771470     20586.628                  0                      0                      0            88206433             374924                            0                   2692
30q69rbpn7g75      2823019537 21-SEP-20 10.00.49.484 PM                2            1742051        1399440     41061.122                  0                      0                      0          68750135.5             335797                            0                 1479.5
30q69rbpn7g75      2823019537 21-SEP-20 11.00.02.543 PM                0         3552963.71        3183770    142948.208                  0                      0                      0           154159601             633488                            0                      0

This output is from my sqlstat.sql script. Notice how earlier in the day there are several plans with widely varying elapsed times. Plan hash value 1732425686 averages 66 milliseconds during the hour ending at 10 am. I do not know why the query has so many different plans. Plan 2823019537 was the bad one, and it stayed locked in throughout the afternoon and into the night.
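For reference, the core of sqlstat.sql is just a join of DBA_HIST_SQLSTAT to DBA_HIST_SNAPSHOT. Here is a simplified sketch of the kind of query it runs; the real script in my GitHub repository pulls more columns. ELAPSED_TIME_DELTA is in microseconds, so dividing by 1000 gives milliseconds.

select ss.sql_id,
       ss.plan_hash_value,
       sn.end_interval_time,
       ss.executions_delta,
       ss.elapsed_time_delta/1000/
         decode(ss.executions_delta,0,1,ss.executions_delta) elapsed_avg_ms
from dba_hist_sqlstat ss, dba_hist_snapshot sn
where ss.sql_id = '30q69rbpn7g75'
and ss.snap_id = sn.snap_id
and ss.dbid = sn.dbid
and ss.instance_number = sn.instance_number
order by sn.end_interval_time, ss.plan_hash_value;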

Later in that same week I reviewed the plans and found that the longest running plans used certain indexes on the top table and the shorter ones used another index. I looked at the "Segments by Logical Reads" section of an AWR report from the problem time and found that the top segment was index X6_WFCTOTAL. I used my getplans.sql script to get all the plans for SQL id 30q69rbpn7g75 and found that the fastest ones used range scans against index X1_WFCTOTAL and the slow ones did range or skip scans against indexes X5_WFCTOTAL or X6_WFCTOTAL. So I picked one plan, 382022017, and used coe_xfr_sql_profile.sql to force 30q69rbpn7g75 to always run with plan 382022017, which used index X1_WFCTOTAL (a sketch of those commands follows the listing below). Here is some execution history of the plan I picked:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75       382022017 24-SEP-20 09.00.13.658 AM                2          1713.2625            865         5.805                  0                      0                      0                2135               76.5                            0                    141
30q69rbpn7g75       382022017 24-SEP-20 10.00.26.654 AM                2            355.329            120      265.0765                  0                      0                      0                8183             1324.5                            0                    673
30q69rbpn7g75       382022017 24-SEP-20 11.00.39.519 AM                1            190.386            180        28.981                  0                      0                      0               17505               1759                            0                   1878
30q69rbpn7g75       382022017 24-SEP-20 01.00.05.144 PM                4          217.37625            200        20.723                  0                      0                      0            21009.25             392.25                            0                   1865
30q69rbpn7g75       382022017 24-SEP-20 04.00.45.160 PM                5           507.1578            114      432.2858                  0                      0                      0              7076.6              936.2                            0                  407.8
30q69rbpn7g75       382022017 24-SEP-20 05.00.58.322 PM                3         47.5793333             10    35.1866667                  0                      0                      0          504.333333         81.3333333                            0                     54
30q69rbpn7g75       382022017 24-SEP-20 06.00.11.202 PM                1            313.107            110       229.071                  0                      0                      0                8178                399                            0                    396
30q69rbpn7g75       382022017 25-SEP-20 04.00.21.833 AM                3         30.7433333     16.6666667    15.0446667                  0                      0                      0                 927                 89                            0             110.666667
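For reference, forcing the plan with coe_xfr_sql_profile.sql is a two-step process, something like this; the second script name is generated by the first from the sql_id and plan hash value:

SQL> @coe_xfr_sql_profile.sql 30q69rbpn7g75 382022017
SQL> @coe_xfr_sql_profile_30q69rbpn7g75_382022017.sql

The first call builds a script containing the plan's outline hints, and running that generated script creates the SQL Profile.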

This plan was averaging less than 1800 milliseconds. But after I put in the supposedly better plan, it seemed to have worse execution times than the other plans. Also, the plan hash value was different from 382022017. Here is the execution history from when I put this SQL Profile in place:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75       653445232 28-SEP-20 12.00.42.470 PM                1          92139.015          26930     61332.901                  0                 20.998                      0             2202073            2197909                            0                    171
30q69rbpn7g75       653445232 28-SEP-20 01.00.55.521 PM                9         83373.8866     25152.2222    56386.2852                  0             14.6918889                      0          2102264.11            2065530                            0             2613.88889
30q69rbpn7g75       653445232 28-SEP-20 02.00.08.369 PM                5         101830.312          29264    70157.0998                  0                 9.1714                      0           2439219.6          2432260.2                            0                    348
30q69rbpn7g75       653445232 28-SEP-20 03.00.21.759 PM                9          89705.846     26606.6667    61716.6917                  0             14.2046667                      0          2212464.67         2196829.33                            0             825.111111
30q69rbpn7g75       653445232 28-SEP-20 04.00.34.724 PM                2          91173.826          26425     63832.838                  0                22.1385                      0             2181959          2195820.5                            0                  310.5
30q69rbpn7g75       653445232 28-SEP-20 05.00.47.682 PM                4         90240.0368          26430    62587.1345                  0                 16.558                      0             2200555          2197764.5                            0                 168.75
30q69rbpn7g75       653445232 28-SEP-20 07.00.13.544 PM                3         107541.972     26816.6667     75418.071                  0                 16.164                      0             2193977         2193579.33                            0             39.3333333
30q69rbpn7g75       653445232 29-SEP-20 02.00.45.539 AM                1          92416.895          26410     66108.169                  0                 21.449                      0             2194591            2193764                            0                    158
30q69rbpn7g75       653445232 29-SEP-20 03.00.30.438 AM                1         103773.265          26510     75455.905                  0                 13.887                      0             2200242            2198725                            0                    122
30q69rbpn7g75       653445232 29-SEP-20 05.00.56.488 AM               13         74637.6784     20391.5385    51139.8206                  0             16.9292308                      0          1686133.69         1684935.15                            0             24.2307692
30q69rbpn7g75       653445232 29-SEP-20 06.00.09.283 AM               10         105894.074          28882    72971.0734                  0                20.1169                      0           2418827.8          2417314.8                            0                     56
30q69rbpn7g75       653445232 29-SEP-20 07.00.22.558 AM                4         89408.2108          26080    61537.7755                  0                10.0275                      0             2174791            2169846                            0                 421.75
30q69rbpn7g75       653445232 29-SEP-20 08.00.35.596 AM                7         71644.7906     20574.2857    48465.1234                  0                  8.331                      0             1697739         1694385.43                            0             232.857143
30q69rbpn7g75       653445232 29-SEP-20 09.00.48.530 AM                2         113993.942          32580    72589.2515                  0                 6.2465                      0             2672441            2667206                            0                    145
30q69rbpn7g75       653445232 29-SEP-20 10.00.01.174 AM                6         99793.2442          28600    69181.7687                  0             4.65783333                      0          2394135.83         2387505.17                            0                    246
30q69rbpn7g75       653445232 29-SEP-20 11.00.14.240 AM               10         94787.9044          28367    65535.8735                  0                10.6895                      0           2353904.6          2336951.3                            0                  476.8
30q69rbpn7g75       653445232 29-SEP-20 12.00.27.167 PM                5           89167.86          26462     61499.462                  0                14.4808                      0           2200557.8            2195895                            0                  201.8

Notice first that the plan hash value is 653445232 instead of 382022017. I usually see things like this when the plan has system-generated temporary table names, but I have not seen that in this plan. There must be something like that going on, or else the SQL Profile just locks in a slightly different plan. Anyway, 653445232 is the plan caused by the SQL Profile. Notice how the average elapsed time hangs around 90,000 to 100,000 milliseconds, while 382022017 had elapsed times under 1800 milliseconds. Seeing these results, I dropped the SQL Profile. Kind of like a doctor who swears to "do no harm," I did not want to slow down queries that were running very efficiently in my attempt to prevent another system slowdown or outage like we had on September 21. I dropped the SQL Profile on September 29.
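Dropping the profile is a one-liner. Something like this, assuming the profile got the usual coe_ name that coe_xfr_sql_profile.sql generates; DBA_SQL_PROFILES shows the exact name:

select name, status, created from dba_sql_profiles order by created desc;

exec dbms_sqltune.drop_sql_profile('coe_30q69rbpn7g75_382022017');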

Then yesterday, October 19, the problem recurred. This time the bad plan was 3908549093 and used a skip scan on X6_WFCTOTAL. I seem to see a lot of bad plans with skip scans. Here was the execution history before I put the SQL Profile back in:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75      3908549093 19-OCT-20 11.00.07.389 AM               16         832180.951      450528.75    68650.1103                  0                      0                      0          21204921.8           495900.5                            0                    562
30q69rbpn7g75      3908549093 19-OCT-20 12.00.21.033 PM               33         392068.144     194336.364    47412.7197                  0                      0                      0          9271475.06         221593.545                            0             309.454545
30q69rbpn7g75      3908549093 19-OCT-20 01.00.34.570 PM                4         3543778.15        1793980    261653.391                  0                      0                      0          82176276.8            1088971                            0                 1036.5

Between 12 and 1 pm the system was really bogged down, with executions of 30q69rbpn7g75 taking almost an hour: 3,543,778 milliseconds on average. So, I put the SQL Profile back in. I had the script sitting there from the first time I tried it; these scripts are all kept in the sqlt/utl directory. After putting it in, things cleared up and the backlog of reports slowly emptied. I think we killed one long-running session and had the user rerun the report with the SQL Profile in place.

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms CPU Average ms IO Average ms Cluster Average ms Application Average ms Concurrency Average ms Average buffer gets Average disk reads Average disk write megabytes Average rows processed
------------- --------------- ------------------------- ---------------- ------------------ -------------- ------------- ------------------ ---------------------- ---------------------- ------------------- ------------------ ---------------------------- ----------------------
30q69rbpn7g75       653445232 19-OCT-20 02.00.48.739 PM               27         106287.519     26936.2963     61603.699                  0             7.41166667                      0          2226367.89         2204705.19                            0             1358.25926
30q69rbpn7g75      3908549093 19-OCT-20 02.00.48.739 PM                0         9380042.72        6465310    308289.185                  0                      0                      0           294469425            1581492                            0                   7067
30q69rbpn7g75       653445232 19-OCT-20 03.00.01.772 PM               80         103522.356       26259.25    67322.2418                  0             21.1012125                      0           2159835.1         2142347.26                            0               914.2625
30q69rbpn7g75       653445232 19-OCT-20 04.00.14.862 PM               31         99252.0065          26990    66069.6995                  0             24.5769355                      0          2227363.45         2213021.26                            0             885.709677
30q69rbpn7g75       653445232 19-OCT-20 05.00.27.768 PM                5         94587.0244          26988    64440.3338                  0                24.1514                      0           2223326.4          2204672.6                            0                 1553.6
30q69rbpn7g75       653445232 19-OCT-20 06.00.40.629 PM                5         93301.2074          26690    65105.9732                  0                14.0782                      0           2213653.6            2203033                            0                  736.4
30q69rbpn7g75       653445232 19-OCT-20 07.00.53.923 PM                3         101117.233     27193.3333    72020.9813                  0             15.4373333                      0          2225218.33            2207644                            0                   1623
30q69rbpn7g75       653445232 20-OCT-20 12.00.59.660 AM                1           92061.35          26550     64953.945                  0                 22.245                      0             2210157            2206170                            0                    235
30q69rbpn7g75       653445232 20-OCT-20 02.00.25.964 AM                1          92872.242          26470     66092.822                  0                 11.999                      0             2208305            2206231                            0                    158
30q69rbpn7g75       653445232 20-OCT-20 05.00.36.773 AM                2         88107.2095          26075     61670.129                  0                17.2175                      0           2205332.5            2203981                            0                    116
30q69rbpn7g75       653445232 20-OCT-20 06.00.49.657 AM                1          91007.493          26210     64276.474                  0                  9.972                      0             2208516            2206310                            0                    177
30q69rbpn7g75       653445232 20-OCT-20 07.00.02.613 AM                4         101878.314          26940     65491.475                  0                 30.476                      0          2210945.25         2204828.75                            0                  461.5
30q69rbpn7g75       653445232 20-OCT-20 08.00.15.526 AM               10         97441.3635          26496    67549.4579                  0                  3.178                      0           2197412.4          2192467.5                            0                  478.1
30q69rbpn7g75       653445232 20-OCT-20 09.00.28.457 AM                5         59362.9672          17038    41573.7714                  0                  7.767                      0           1416804.4            1416061                            0                   10.2
30q69rbpn7g75       653445232 20-OCT-20 10.00.41.653 AM                6         108681.505     30798.3333    75082.4997                  0             10.5146667                      0          2558854.67         2549531.83                            0                    405
30q69rbpn7g75       653445232 20-OCT-20 11.00.54.611 AM                1         186021.778          53440    131017.985                  0                 18.976                      0             4285650            4202806                            0                    814

But with the SQL Profile in place it still runs consistently around 90,000 milliseconds, even today. There are no executions under 1800 milliseconds like before. It is a puzzle, but it got us through the issue. I am planning to leave this in place to prevent another production outage, even though I suspect that these reports may often be running longer than they need to. At least they are consistent and the business needs are being met.
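One way to confirm that the profile is actually being picked up is to check the SQL_PROFILE column in V$SQL, something like this:

select sql_id, plan_hash_value, sql_profile, executions
from v$sql
where sql_id = '30q69rbpn7g75';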

So, this is another example of a SQL Profile coming to the rescue in a production database performance issue. But this one was different because it was not clear that the one plan was always the best one. I went so far as to put it in, take it out, and then put it in again. I ended up leaving it in because it resolved a performance issue that we hit twice and do not want to hit again. Hopefully any less-than-optimal performance outside of these peak times will be worth it, since we are preventing an outage with significant impact to the business.

Bobby


Upgrade Data Migration Choices – HP to Linux

Thu, 2020-10-15 18:40
1. Introduction

I just finished moving and upgrading an Oracle database from 11.2.0.4 on HP Unix Itanium to 19c on x86-64 Linux on VMware, along with an army of coworkers on our project team. I want to use this post to describe the choices I considered for migrating the data from the old system to the new one. I mention my coworkers because this was a major collaborative effort and I do not want to take credit for the work of others. But I did have to think through the best way to move this big database cross-platform, and my evaluation of those options in dialog with the project team was one way that I contributed. So, I can blog about it. Also, I think the subject is worth posting about because the choice of data migration method was challenging and interesting to me. If I can communicate about it clearly enough, it might have value to others, and even to myself if I come back to this later.

First, I think I should explain some of the parameters for this upgrade/migration. The database files total about 15 terabytes of space. I guess actual data is 7-8 TB but everything else together adds up to 15. The database has tens of thousands of partitions and subpartitions. These partitions/subpartitions are manipulated by batch processes on the weekend and in the weekday evenings. Some of the tables are partitioned by date but others by columns that are not date related. The source system is running 11.2.0.4 and the target system 19c. The source and target platforms have different endianness. At the beginning of the project our goal was to minimize downtime for the final cutover weekend, so I tried to find methods of cutting over from the old to new database with only an hour or two of downtime max. The system has an online web front end that business users use during the weekdays as well as very resource intensive batch jobs on the weekend and on some weeknights. One goal of the project besides upgrading to a newer platform was to improve performance. On some weekends the batch processes did not complete in time for the users to see the updated data on Monday morning. The much faster x86-64 processors on the new system should speed up the weekend batch processes to meet the business need to complete before the Monday workday. CPU was pegged for several hours each weekend on the older HP Itanium system so we knew that newer, faster CPUs would help.

2. Transportable Tablespaces and Cross Platform Incremental Backup

Next, I want to talk about the different options we considered. Another DBA on my team had worked on an earlier proof of concept for this project to show its potential benefits. At some point she tried out the ideas in an earlier version of this Oracle support document:

V4 Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 2471245.1)

This is a fairly complicated procedure designed to speed up the move from Unix to Linux by letting you use Transportable Tablespaces to move your data over, convert it to the new endianness, and then apply incremental backups of the source system to catch up any changes, converting the endianness of the backups as you go. Transportable Tablespaces are kind of a pain, but I have worked with them before and they can be done. The key limitation of this process is that just before you go live on the new system you must use Data Pump to import all the metadata from your source database. Since we have tens of thousands of partitions and subpartitions, the time to import the metadata could be several hours. So, I felt that we could use this approach, but with all the complexity and risk we still might have a lengthy period of downtime, and another approach might work better. I think the key lesson from this approach is how hard it is to change endianness using a physical backup when you have a huge data dictionary. If we were moving from Linux to Linux, we probably could have used some sort of physical copy along with the application of incremental changes without the big metadata import at the end. But with the big data dictionary and the cross-platform nature of the upgrade, the physical copy option (transportable tablespaces) seemed too slow.

One reason I liked this option was that it would work well with NOLOGGING changes. I was concerned early on that I could not use replication methods that pulled changes from the redo logs because some updates would not be logged. But incremental backups do pick up blocks that are updated even if the updates are not written to the redo logs. Eventually I did a bunch of research and found that I could turn on FORCE LOGGING on the source database and that opened up the option of using the redo logs. The problem of the long time to export/import the metadata with the Transportable Tablespaces option pushed me to pursue the FORCE LOGGING research to make sure we could use it.
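Checking and turning on FORCE LOGGING is simple; it amounts to something like:

select force_logging from v$database;

alter database force logging;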

3. Pre-Loading Static Historical Data

The second approach I considered, and discussed with the team early on, was exporting and importing historical data in advance and only migrating the actively updated data during the cutover window. If most of your data is in tables partitioned by a date column, you could potentially pre-load the older static data on your new system and then migrate only a small percentage of your 15-terabyte database during the cutover window. I found that only about 4 terabytes had not been updated in the past 90 days, and that made me stop pursuing this option. Here is the script I used: statictabsum.sql. I just looked at the last_analyzed date on tables, indexes, partitions, and subpartitions. Not perfect, but it gave me some idea of how much data was static. Slightly edited output:

SQL> select sum(bytes)/(1024*1024*1024*1024) total_tb
  2  from withlastanalyzed
  3  where
  4  last_analyzed < sysdate - &&DAYSOLD;

  TOTAL_TB
----------
4.34381223
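The withlastanalyzed view comes from the script itself. The real statictabsum.sql (linked above) also covers indexes, partitions, and subpartitions, but a rough reconstruction of the idea for tables alone would be something like:

with withlastanalyzed as
(select s.bytes, t.last_analyzed
 from dba_segments s, dba_tables t
 where s.owner = t.owner
 and s.segment_name = t.table_name)
select sum(bytes)/(1024*1024*1024*1024) total_tb
from withlastanalyzed
where last_analyzed < sysdate - &&DAYSOLD;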

The system that I was upgrading is a pretty complicated system with many batch jobs and many interfaces with external systems. I would have to work with the development team to understand what data was truly static and there would be the chance that we would make a mistake and not copy something that we should. I would hate to go live and find that I missed some data that should have been updated on the partitions that I thought were static. Yikes! With my imperfect estimate of how much data could be pre-loaded and knowing the complexity of the system I felt that it was not worth trying the pre-load of the static data.

4. Oracle GoldenGate Replication

The third thing I tried, and thought would work really well for us, was Oracle GoldenGate replication. Several years ago, a coworker of mine had upgraded and migrated the database for a transactional system using GoldenGate with minimal downtime. Oracle has a white paper about how to do this:

Zero Downtime Database Upgrade Using Oracle GoldenGate

The idea is to start capturing changes to your source system and then do a full export/import to your target system. Then you apply all the changes that happened on the source database during the export/import to the target database using GoldenGate. We had this set up and began testing it, but we had failures because of changes to table partitions and subpartitions made by the application batch jobs. We had set up GoldenGate to replicate DML but not DDL. We researched it in the manuals and found that to replicate DDL (partition drops and adds) in the most current and desirable way, the source system had to be on 11.2.0.4 or later and have compatible set to 11.2.0.4 or later. See this section of the 12.2.0.1 installation manual:

13.1.1 Support for DDL capture in integrated capture mode

Unfortunately, even though our source system was on 11.2.0.4 the compatible parameter was set to 11.2.0, the default. I wonder why 11.2.0.4 defaults to compatible=11.2.0? Sure, we could change the parameter, but you cannot easily back out a change to compatible. And we would essentially have to retest our entire application before changing it. Given that our project was running out of time and money that was not an option. So, that led to our final choice.
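For what it is worth, checking and raising the parameter is trivial; the problem is that the change is effectively one-way and forces a full regression test. Something like:

show parameter compatible

-- we did NOT run this; raising compatible cannot easily be backed out
alter system set compatible = '11.2.0.4' scope=spfile;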

5. Data Pump Export and Import

The last migration method that we considered, and the one we just used in the production migration, was Oracle Data Pump export and import: the expdp and impdp utilities. This is the simplest and most widely used method for copying data from one Oracle database to another, and it works great across platforms with different endianness and different versions of Oracle. We used this method to populate our new databases for testing. Originally we thought about using the same data migration method for testing that we planned to use for the final production cutover, but we did not have the time and money to get GoldenGate set up before all the testing began, so we were not able to use it to populate our test databases. We had to use the simpler Data Pump utility. We had a whole crew of people lined up to work on the project, so we had to simply do a full export/import and cut them loose. As the project went on we streamlined our export/import method, including working with Oracle support to get a fix for a bug that affected our data migrations. As the project dragged on for a number of months beyond our initial deadline, and as we discovered that GoldenGate was not going to work for us without a lot more work, I started to think about using Data Pump export/import, our last resort. At the time it was taking 50 hours to do the full export/import, and I asked if we could get approval for two extra days of downtime – two full working days of application downtime. To me, 50 hours to migrate a 15-terabyte database was not bad. I had expected it to take several days based on prior experiences populating dev and test databases from this same source system. The HP Unix hardware was older and had slower network adapters, so that was part of the reason for the long data migration time.

Once we got the approval to take the extra downtime, I had to fend off suggestions to go back to preloading the historical data or to ignore tables that we thought were obsolete. By this time, I was convinced that a full export/import made the most sense because all our testing was done on systems that were populated with the same method. We had the business approval for the downtime, and we knew that we had been testing for months on databases that had been populated with this method. Finally, our Unix/Linux/Storage team came up with a clever way to cut our export/import time almost in half without changing anything on the database side. Essentially, they figured out how to move the filesystem that we were exporting to onto an HP Unix blade with a faster network adapter. The original system had a 1 gigabit network adapter and the new system had 10 gigabit. Also, they set up an NFS-mounted filesystem so we could import straight over the network rather than copying all the files to the target server and consuming all that duplicate storage space. We tested this approach several times, and then the real thing went in without issues and even a little faster than we expected.
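At its heart the process was just a full expdp on the source and an impdp on the target. A minimal sketch with hypothetical directory and file names:

$ cat expfull.par
full=y
directory=MIG_DUMP_DIR
dumpfile=expfull%U.dmp
parallel=8
logfile=expfull.log

$ expdp system parfile=expfull.par

On the target side, the impdp directory object pointed at the NFS-mounted filesystem, so the dump files never had to be copied over.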

6. Conclusion

My main concern for this project was how to migrate such a large database and change the endianness of the data in a short enough time to meet the needs of our users and our company. This concern drove me to investigate several potentially complicated approaches to this data migration. But in the end, we used the simplest and surest method that we had already used to populate our test databases. We took several steps to tune our full Data Pump export/import process. The seven largest tables were broken out into their own parfile and exported to disk uncompressed in parallel. Their indexes were rebuilt parallel nologging. Then the Unix/Linux/Storage team did their magic with the faster network adapter. But even with these helpful performance enhancements our approach was still simple – Data Pump export and import. The more exotic methods that we tried were thwarted by the nature of the database we were upgrading. It had too many subpartitions. The application manipulated the subpartitions during the week. We had the wrong compatible value. Finally, budget and schedule concerns forced the decision to go with what worked, export/import. And in the end, it worked well.

I want to conclude this post with the high-level lessons that I learned from this process. There is value in simplicity. Do not use a more complex solution when a simpler one meets your needs. The Transportable Tablespaces and GoldenGate options were both cool to work on and think about. But it takes time to dig into things, and complexity adds risk of failure. If a simpler solution is safer, meets the business need, and takes less time and money, why not use it? I guess the other high-level lesson is that it is easier to get approval for downtime when your project is behind schedule and over budget. When push came to shove, taking extra downtime mattered less to the business than finally going live. Maybe if I had pushed for the longer downtime from the beginning and planned all along to use export/import, the project would have been shorter. But I did not know that GoldenGate would run into the issues it did, so it was harder to insist up front that there was no way to eliminate substantial downtime. I also did not know at the beginning of the project that the export/import method could be streamlined to run in one day instead of several. Maybe the real lesson is that you have to work through these data migration options with your project team and make the best choices that you can at the time, based on the technical issues that you discover and the business needs as they evolve with the project. This project ended with a simple data migration approach that met the company's needs, but future projects may have different technical and business parameters, and the more complex approaches may be more appropriate in those situations.

Bobby


Elementary, my dear DBA!

Tue, 2020-09-01 17:09

I read all the Arthur Conan Doyle Sherlock Holmes stories and have recently watched a couple of Star Trek: The Next Generation episodes with the android Data pretending to be Holmes, and it gave me the idea for this post about diagnosing an Oracle hang/bug that has been giving me fits lately. The logic involved in solving my Oracle problem reminds me of this often-quoted Sherlock Holmes saying:

“…when you have eliminated the impossible, whatever remains, however improbable, must be the truth…”

Sherlock Holmes

I have often had moments in diagnosing Oracle problems that resemble the idea behind this quote. I think this happens when the information I have in front of me does not directly lead to a solution. Most of the time it does. You get an error, you look it up, and you fix the problem. But sometimes things do not make sense. It looks like the database or hardware is not working correctly even though it is. Somehow, I have to discipline my mind to believe that the system is working correctly and follow the logical conclusions. Well, maybe I should talk about my example and it will be clearer.

I spent a week moving my daughter into her college dorm and then visiting my parents in Alabama. It was a lot of travel along with the COVID-19 fears, although it all went well. (I wore my mask everywhere.) But then I got back in the office (online from home) and found that the project I was working on had a batch job hung up. I was also on call last week, so it was a little stressful. The hung session was a query coordinator process for a parallel INSERT statement, and it was hung on a PX Deq: Join ACK wait. It was hung for several days until we killed it, so it wasn’t just slow. We had already tested this job and every other batch job with no problems. Most were running significantly faster on our newer, faster x86-64 processors. But now we were doing a final batch schedule test that runs several jobs at the same time, trying to mimic production. No problem, I thought. I will just head to Oracle’s support site, look up the wait event, and see what to do. But my search turned up empty. If I had been on 10.2 or some old version of RAC, maybe, but there was nothing for 19c non-RAC. I tried Google too, but nothing. Hmm. So, I opened an SR with Oracle. I made sure to get a system state dump and hanganalyze output while the session was still hung up. I figured no problem: Oracle will peek at the dumps, give me a patch, and we will be on our way. But no deal. Oracle support was helpful, and I was getting them the information they asked for, but nothing jumped out at us. Plus, I was doing my own investigation, looking at logs, trace files, DBA_HIST views, etc. I was getting frustrated. This upgrade is supposed to go live soon, and now all of a sudden, at the last minute, I am getting this weird wait event that no one seems to have heard of, even though we have tested on this system for months with nothing like this.

I think this is where the Sherlock Holmes logic of deduction applies. Underlying my frustration was an assumption about reality. My assumption was that this horrible unknown wait was showing up under the same conditions as all our successful testing. The idea that the current testing was running in a radically different configuration fell into the "improbable but possible" category. In this case we have two database VMs set up with the same number of cores and RAM, the same version of Oracle including patches, close to the same data, etc. We have a new production database and a performance database, the idea being that development could do their final testing in a performance database that matches production. As it turns out, all the init.ora parameters on the two databases are the same except for the ones that size the SGA and PGA memory. We are moving from an HP Unix system with a lot more physical memory than we have on these VMs, so the SGA_TARGET and PGA_TARGET settings on that system together exceed the memory available on the new VMs, but we had copied the parameters from the HP system to the new Linux ones. Evidently, I correctly resized these down on the production system but never on the performance system. So, somewhere in our full-scale batch testing, with a bunch of parallel SQL statements overloading the system and a ton of memory swapping caused by the wrong parameters, some parallel server process died, leaving my query coordinator hung on PX Deq: Join ACK. I do not really know if this is true. But it fits the facts. I synced these memory parameters up with the working production database, we reran the problem job, and it ran like clockwork. It really helps to have free memory.
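The sanity check I should have run on the performance database was something like this, comparing the results to the VM's physical memory (note that the actual parameter name is PGA_AGGREGATE_TARGET):

select name, display_value
from v$parameter
where name in ('memory_target','sga_target','pga_aggregate_target');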

In some ways my example is simple. You cannot set SGA_TARGET and PGA_TARGET to more than your total memory and expect good things to happen. DBA 101. But the more interesting thing, at least to me, is the mindset that I need to have when facing a difficult Oracle problem. Here are some of the truths in my mind during this problem:

  1. Our new production database works great and testing has gone very well for months
  2. Our new performance database is very similar to our new production database
  3. Our batch jobs ran fast and error free individually in production
  4. Run together on perf our batch jobs hung up with a weird wait
  5. Lots of things I looked at including AWR reports, alert logs, etc. looked weird

I guess I would like my brain to more quickly jump to questioning my assumption that the performance database is like production. Sure, running a bunch of batch jobs at the same time will put more load on the system and may create problems. But suddenly hitting some weird bug that no one in the world has ever hit doesn’t really make sense. There must be some reasonable explanation that makes sense without assuming some terrible thing is wrong with our database despite months of success using this version of Oracle on this data and code. Other people would be reporting the same thing and blogging about it.

I am not sure how clear this is. I have something I am trying to say about logic and problem solving with Oracle databases but do not know if I have said it in a coherent way. A working database and hardware do not suddenly start doing weird things for no reason. After eliminating that virtual impossibility, you are left with possible solutions that you have not considered because they seem so unlikely. Elementary…

Bobby


Simple Performance Tuning Example

Tue, 2020-07-07 10:41

I helped resolve an Oracle database performance problem yesterday and I thought I would capture what I did in this blog post. It is a “simple” problem in that the solution was easy once I understood the issue and because this sort of problem happens often. But the process to get to the solution is not completely simple. I think documenting the process may have more value than writing down the details of this specific problem.

Today is Tuesday. Last Thursday our development team was doing some testing, and performance of the system was terrible. I was busy working on another project that is my primary assignment right now, so I did not look at it. It was also the Thursday before the July 4th holiday weekend, and we were all mentally and physically checking out a bit early. Some key people were already off on vacation. So, we shut it all down until Monday (yesterday). Also, late Thursday there was some thought that there was a system issue, maybe with our virtual machines' host or networking. So, we let it go. Monday our dev team said they thought the slowness was caused by bad data, including duplicates that were throwing errors on a target system (non-Oracle) in the cloud. So, they cleaned up the data and started a new run. But it was slow as molasses, and the application front end was also super slow.

In retrospect, looking at the paragraph above, I bet a lot of DBAs with performance tuning experience would jump to the solution. I can think off the top of my head of a few types of things that might cause these symptoms. But I did not immediately think of it so I went through a process to get there. No leap of intuition was required except maybe at the very end.

Since the problem query was currently running, the first step was to look at the session monitor in Toad. I may have also looked at HP Unix's Glance tool, which is like top, to see the CPU usage and top processes at the OS level. What I saw was a parallel query running with 8 parallel child processes; the query had been running about 30 seconds.

Step two was to take an AWR report. First, I ran a command like this to get a snapshot after the current activity:

exec dbms_workload_repository.create_snapshot;

Then I ran an AWR report for the last two snapshots.

The “SQL ordered by Elapsed Time” part of the report looked like this:

[AWR report screenshot: SQL ordered by Elapsed Time]

The top query, azrtxpx1g6mxy, was the one I saw in Toad. The second query is a subselect of the top one. With the top query averaging 51.23 seconds and spending 46.77% of its time on I/O, I eventually went where I usually go with AWR reports: the "Segments by Logical Reads" section. Here is part of that section:

[AWR report screenshot: Segments by Logical Reads]

I hid the table names, but the point is that the top segment was doing 99.67% of the logical I/Os. So, the issue had to do with I/O. But looking at the "Top 5 Timed Foreground Events" section, I/O seemed very efficient.

[AWR report screenshot: Top 5 Timed Foreground Events]

The top wait was the I/O event "direct path read", which can be deceptively low on systems that use asynchronous I/O. But this system uses HP Unix filesystem I/O, which does not support asynchronous I/O, so the direct path read number is correct. One of the possible causes on my list before I started was some sort of I/O contention, but this report pointed away from the I/O hardware.

So, the AWR report and my look at the Toad session monitor pointed to a SQL statement doing parallel full scans of a table, with executions running tens of seconds in length. So, I looked at the execution plan next. I like to use my getplans.sql SQL*Plus script to extract a query's plan from the AWR using its SQL_ID. I ran this in SQL*Plus:

@getplans azrtxpx1g6mxy

Sure enough, the plan showed a parallel full table scan of the same top table.

Somewhere in this process I did some counts. I think the simplicity of select count(*) queries may keep people from appreciating them, but count(*) queries are one of my biggest helps in tuning a SQL statement. I did two types of counts for this case. I did a select count(*) on the top table and found out that it had 160 million rows. The count took 45 seconds even though it was doing a fast full scan of an index and not the full table. Then I queried grouping by the column that joins the table to its parent table. This is a typical header-detail join where the child table joins the parent on the parent's unique key. Then I got the idea that the query might run faster if there were an index on the joining column.
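The counts were simple queries along these lines, with made-up table and column names since I am hiding the real ones:

select count(*) from detail_table;

select header_id, count(*)
from detail_table
group by header_id
order by count(*) desc;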

I ran my optimizer statistics scripts to check the index columns and to look at the column statistics to see how selective the joining column is. I found that there was no index on the column that the header table used to join to the problem detail table. So, I recommended that the dev team add an index on that column, including the other columns in the where clause for good measure. They might have to try two or three different column choices and orders to get the best index.
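The recommendation amounted to something like this, again with made-up names:

create index detail_header_ix on detail_table (header_id) parallel 8;

alter index detail_header_ix noparallel;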

But the dev team was skeptical. This same query runs fine in production. It is not new, so what changed? I tried running my sqlstat.sql script to get a history of the execution of SQL_ID azrtxpx1g6mxy, but there was no history on our test system going back before July 2nd. I ran it on production and there was no history at all. I found out that this SQL had not run for two months on production, and since we only keep six weeks of history in our AWR, any execution history of azrtxpx1g6mxy had fallen off the list.

Finally, I had a moment of insight. If I could have seen a faster plan on production, no insight would have been required. It would have been an almost mechanical process of comparing the good and bad plans. But I finally got the idea of running my same optimizer statistics scripts against production and looking at the indexes. Sure enough, production had an index on the joining column. Duh! I probably should have checked for differences between the tables and indexes in test and prod first, but I had no reason to suspect a difference. So, a coworker put the indexes back in test and everything was fine.

So, it is a simple thing. Indexes were dropped on a test system. Happens all the time. Not profound. But maybe the process of driving from problem statement through the different kinds of evidence and tools to the solution is more profound. It gets results. What does the AWR report point to? The top segments pointed to a table. The counts and column statistics suggested the need for an index. Finally, a compare of prod and test showed that an index was missing.

Bobby


Fixed issue with DBA_HIST_SQLSTAT script

Tue, 2020-04-14 18:58

I got rid of the old sqlstat.sql, sqlstat2.sql and sqlstat3.sql scripts in my GitHub repository and fixed an issue that was in sqlstat3.sql. Now the current version is just called sqlstat.sql. This one script should work well in all the situations where the other three were useful. (One script to rule them all…)

The old version of sqlstat3.sql was showing rows for plans that were not actually being run, so I added a condition to only show plans that had some elapsed time. Slowly getting more useful.
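The added condition was essentially one more predicate in the script's where clause, along these lines:

and ss.elapsed_time_delta > 0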

Old output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      2903317857 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440
g0wqzdhxy62mc      1569680226 14-APR-20 01.00.04.320 PM                0
g0wqzdhxy62mc      2845678604 14-APR-20 01.00.04.320 PM                0

New output:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA
------------- --------------- ------------------------- ----------------
g0wqzdhxy62mc      4109593498 14-APR-20 01.00.04.320 PM              440

I really was not using the other two scripts anymore and was only using sqlstat3.sql. They are in the GitHub history if anyone needs to revisit them. Going forward I will use the new sqlstat.sql.

Bobby


How to email from Python

Tue, 2020-04-07 15:25

I got an email asking how to send an alert email from a Python script. Here is an example based on a script I have set up:
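This is a minimal sketch of the idea using Python's standard smtplib and email modules; the addresses here are placeholders:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Oracle database alert'
msg['From'] = 'dba@yourcompany.com'
msg['To'] = 'oncall@yourcompany.com'
msg.set_content('Something needs attention on the database.')

# connect to the mail server and send the message
smtp = smtplib.SMTP('PUT YOUR SMTP SERVER HERE')
smtp.send_message(msg)
smtp.quit()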

You need to replace "PUT YOUR SMTP SERVER HERE" with the host name of your SMTP server.

Bobby


January 2020 Patches Not In Recommended Patch Advisor

Mon, 2020-03-30 10:58

Every quarter I download the latest patch set from Oracle's support site using the Recommended Patch Advisor. But the January 2020 patch sets did not show up: the Recommended Patch Advisor (RPA) still recommended the October 2019 patch sets. The RPA is one of the options under the Patches and Updates tab on Oracle's support site. I tried searching Oracle's support site and Googling, but did not find any answers. I am sure the information is out there somewhere, but the way I was searching did not find it. So, I thought I would post a quick note about it. The short answer is that you should use this Oracle support document to find the latest patch sets, not the Recommended Patch Advisor:

Master Note for Database Proactive Patch Program (Doc ID 756671.1)

To find this out I had to open an SR with Oracle Support. They quickly identified this document which explained the situation:

My Oracle Support: Patch and Health Recommendations No Longer Supported (Doc ID 2644769.1)

Note 2644769.1 seems to say that the Recommended Patch Advisor will continue to refer to the October 2019 patches until Oracle removes the RPA from their support site in May.

The funny thing is that I knew we had an internal document that a coworker and I had both contributed to that describes how to download the latest patch set and apply it to a database to create a tar or zip that we could use to deploy a new database. I decided to modify that document to refer to note 756671.1 and remove references to the RPA. To my surprise my coworker had already included note 756671.1 months ago in our instructions and not the RPA. I think that Oracle support must have recommended the note some time ago, but I was still using the advisor. I guess you learn something new every day.

Anyway, note to self: don’t use the Recommended Patch Advisor to find the quarterly patch updates. Use note 756671.1.

Bobby

Categories: DBA Blogs

Python Script to Scan Listener Log For Incoming Connections

Thu, 2020-02-27 15:55

Wrote a quick and dirty script to scan the listener log for incoming connections and produce a count of how many connections came from which host. Here is the URL:

https://github.com/bobbydurrett/miscpython/blob/master/gethostcounts.py

All it does is hack out the IP address from the second HOST= entry of the lines in the listener log that have HOST= in them. Then it uses the Python function socket.gethostbyaddr to get the hostname for the addresses that can be looked up this way.

It counts the number of lines for each IP address and sorts the results from most to least connections. Might be useful to someone.
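The core idea is something like this sketch (not the actual script; see the GitHub link above for the real thing):

import re
import socket
from collections import Counter

counts = Counter()

with open('listener.log', errors='ignore') as f:
    for line in f:
        # connection lines have two HOST= entries; the second is the client
        hosts = re.findall(r'HOST=([\d.]+)', line)
        if len(hosts) >= 2:
            counts[hosts[1]] += 1

# print hostname (when resolvable) and count, most connections first
for ip, count in counts.most_common():
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        name = ip
    print(name, count)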

Bobby

Categories: DBA Blogs

Bounce Listener and Database After Adding oracle User to Group

Tue, 2020-02-25 11:37

I keep forgetting that I need to bounce both the listener and the database after changing the oracle user’s group membership, so I am writing a quick note to myself here. This is the second time I have been tripped up by something like this and I spent way too much time figuring it out again.

We have some PL/SQL code that writes to files on a database server. The most recent issue was using DBMS_XSLPROCESSOR.CLOB2FILE to write a clob to a file in a certain directory. This got the following error:

ORA-22288: file or LOB operation FILEOPEN failed
Permission denied

I tried writing to the same directory using UTL_FILE.FOPEN with write permission to create a new file. I got this error:

ORA-29283: invalid file operation: cannot open file [29435]

I got a Linux admin to add the oracle Linux user to the correct group, but after I bounced the database the error did not go away. Then I bounced the listener and the database, and the errors were resolved.

NOTE TO SELF: if you add the oracle user to a new group to allow PL/SQL to write to a directory using group permissions, bounce both the listener and the database!
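In commands, the bounce is something like this (listener first – see the P.P.S. below):

lsnrctl stop
lsnrctl start

sqlplus / as sysdba

shutdown immediate
startup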

Bobby

P.S. This is on Oracle 19c on Redhat Linux 7 but should apply to a wide range of Oracle versions and Unix like platforms.

P.P.S. Bounce the listener first.

Categories: DBA Blogs

Plan Change Monitor goes off on SQL Analyze query

Fri, 2020-02-21 16:17

Last night, the plan change monitor script we have set up on one of our production databases alerted us about a query with a SQL Analyze hint. It looked weird.

The query has sql_id 48tfz5d37xk28 and starts like this:

/* SQL Analyze(3141,1) */ select ...

The weird thing is the output of my sqlstat3.sql script for this sql_id value:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
48tfz5d37xk28      2311822442 20-FEB-20 10.00.28.096 PM                1          15836.016
48tfz5d37xk28      3597180003 20-FEB-20 10.00.28.096 PM                1           4398.618
48tfz5d37xk28      3992115155 20-FEB-20 11.00.01.054 PM                4         57198.8163
48tfz5d37xk28      3597180003 20-FEB-20 11.00.01.054 PM                3          74803.004
48tfz5d37xk28      2311822442 20-FEB-20 11.00.01.054 PM                4         65471.2243
48tfz5d37xk28      2172763952 20-FEB-20 11.00.01.054 PM                4         57206.2353
48tfz5d37xk28      2331870706 20-FEB-20 11.00.01.054 PM                4         57479.7398

It looks like one of the database’s built-in processes was running this query with various plans. Plan_hash_value 3992115155 tripped the alert. I guess it ran about 10 times as long as plan 3597180003. The plan change monitor script could be improved in many ways to eliminate false positives like this, but I just added a condition to exclude all queries run by user SYS:

258c258,259
< m.sql_id not in ('dkqs29nsj23jq','58r2wgh5xx1bg','3m2su959r6tp8','3z60x9386n1v8','bx81s3m9zxutb','czwbdwxx5aqsk','001wbydnwaa7n','6quugay8yt73u')
---
> m.sql_id not in ('dkqs29nsj23jq','58r2wgh5xx1bg','3m2su959r6tp8','3z60x9386n1v8','bx81s3m9zxutb','czwbdwxx5aqsk','001wbydnwaa7n','6quugay8yt73u')
> and s.username <> 'SYS'

This is the first time I remember seeing something like this cause an alert, but the workaround was easy.

Bobby

Categories: DBA Blogs

AWR records top 30 SQLs by default

Fri, 2020-02-14 16:22

I forget that Oracle’s AWR only records the top 30 SQL statements in each snapshot by default. I am not sure how long this link will last, but here is a 19c manual page describing the default: 19c manual – see the topnsql setting. A lot of my query tuning assumes that the problem query is in the AWR, but very efficient queries on active systems may mysteriously disappear from, or never appear in, the AWR. It sometimes takes me a while to remember that the snapshots only include a fixed number of SQLs.
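If you want to check or raise the limit, something like this should work:

-- current AWR settings, including how many top SQLs are kept per snapshot
select snap_interval, retention, topnsql from dba_hist_wr_control;

-- keep the top 100 SQL statements per snapshot instead of the default 30
execute dbms_workload_repository.modify_snapshot_settings(topnsql => 100);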

I use my sqlstat3.sql query to look at a history of a particular sql_id’s executions. Often it shows the query running faster on one plan_hash_value than another. Then I look at why the sql_id changed plans. But what about when the good plan does not show up at all? Several times I have looked at sqlstat3.sql output and thought that a query had never run with an efficient plan in the past even though it had. It had run so efficiently that it was not on the report, so it looked like the query was a new, slow SQL statement.

Often I will fix a query’s plan with a SQL Profile and rerun sqlstat3.sql on a busy system after manually running dbms_workload_repository.create_snapshot to capture the most recent activity, and the problem query with the new plan will not show up. Usually I remember that it is not in the top 30 queries and that is why it is missing, but sometimes I forget. Here is a partial sqlstat3.sql output showing a long running SQL disappearing after I fixed its plan on Wednesday:

SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Ave ms
------------- --------------- --------------------- ---------------- ------------------
acn0557p77na2      3049654342 12-FEB-20 05.00.01 AM                1          16733.256
acn0557p77na2      3049654342 12-FEB-20 06.00.03 AM                2           49694.32
acn0557p77na2      3049654342 12-FEB-20 07.00.53 AM                6          47694.527
acn0557p77na2      3049654342 12-FEB-20 08.00.54 AM               11         50732.0651
acn0557p77na2      3049654342 12-FEB-20 09.00.33 AM               15         53416.5183
acn0557p77na2      3049654342 12-FEB-20 10.00.43 AM               21         86904.4385
acn0557p77na2      3049654342 12-FEB-20 11.00.02 AM               27          84249.859
acn0557p77na2      3049654342 12-FEB-20 12.00.20 PM               27         125287.757
acn0557p77na2      3049654342 12-FEB-20 01.00.36 PM               69         156138.176

Sometimes I query the V$ tables to verify it is currently running a good plan. Here is example output from vsqlarea.sql showing the good plan running today.

LAST_ACTIVE         SQL_ID        PLAN_HASH_VALUE Avg Elapsed ms
------------------- ------------- --------------- --------------
2020-02-14 16:11:40 acn0557p77na2       867392646             14

This is just a quick note to me as much as anyone else. A query that is missing from an AWR report or my sqlstat3.sql report may not have run at all, or it may have run so well that it is not a top 30 query.

Bobby

Categories: DBA Blogs

DBMS_UTILITY.FORMAT_CALL_STACK Change in 12.2 and later

Thu, 2020-02-13 15:47

Quick note. During the 11.2.0.4 to 19c upgrade that I have been writing about, we found a difference in the behavior of DBMS_UTILITY.FORMAT_CALL_STACK. I tested it on several versions, and the behavior switched in 12.2. Now it puts the procedure name within the package in the stack.

Old output:

----- PL/SQL CALL STACK -----
  OBJECT      LINE  OBJECT
  HANDLE    NUMBER  NAME
0X15BFA6930         9  PACKAGE BODY MYUSER.MYPKG
0X10C988058         1  ANONYMOUS BLOCK

New output:

----- PL/SQL CALL STACK -----
  OBJECT      LINE  OBJECT
  HANDLE    NUMBER  NAME
0XA796DF28         9  PACKAGE BODY MYUSER.MYPKG.MYPROC
0X7ADFEEB8         1  ANONYMOUS BLOCK

Test code:

select * from v$version;

CREATE OR REPLACE PACKAGE MYPKG
AS

PROCEDURE MYPROC;

END MYPKG;
/
SHOW ERRORS

CREATE OR REPLACE PACKAGE BODY MYPKG
AS

PROCEDURE MYPROC
IS

BEGIN

DBMS_OUTPUT.PUT_LINE(UPPER(dbms_utility.format_call_stack));

END MYPROC;

END MYPKG;
/
SHOW ERRORS;

set serveroutput on
execute mypkg.myproc;
show errors;

Might be useful to someone else. We had some code that depended on the package name being the last thing on its line, but the new version includes the name of the procedure after the package name.

Bobby

Categories: DBA Blogs

Python 3.8.1 Linux Install Without Root

Tue, 2020-02-11 11:40

I wanted to install the latest Python on Linux in a low-powered user’s home directory and not have the install interfere with the Python that comes with Linux. I wanted my own local copy of Python that I could install Python packages into without interfering with anything else. I was willing to use root to install the needed Linux packages, but everything else was done with my low-powered user.

I did a yum update on my Oracle Enterprise Linux 7.7 install before I started to get everything on the latest version. As root I used yum to install the Linux packages I needed to install Python 3.8.1 from source:

[root@pythonvm ~]# yum install gcc openssl-devel bzip2-devel libffi-devel tcl* tk* -y
Loaded plugins: langpacks, ulninfo
mysql-connectors-community                                                   | 2.5 kB  00:00:00
mysql-tools-community                                                        | 2.5 kB  00:00:00
mysql57-community                                                            | 2.5 kB  00:00:00
ol7_UEKR4                                                                    | 2.5 kB  00:00:00
ol7_latest                                                                   | 2.7 kB  00:00:00
Package gcc-4.8.5-39.0.3.el7.x86_64 already installed and latest version
Package 1:openssl-devel-1.0.2k-19.0.1.el7.x86_64 already installed and latest version
Package bzip2-devel-1.0.6-13.el7.x86_64 already installed and latest version
Package libffi-devel-3.0.13-18.el7.x86_64 already installed and latest version
Package 1:tcl-devel-8.5.13-8.el7.x86_64 already installed and latest version
Package 1:tcl-8.5.13-8.el7.x86_64 already installed and latest version
Package tcl-pgtcl-2.0.0-5.el7.x86_64 already installed and latest version
Package 1:tk-devel-8.5.13-6.el7.x86_64 already installed and latest version
Package 1:tk-8.5.13-6.el7.x86_64 already installed and latest version
Nothing to do

I must have already installed these on this VM. I did the rest of these steps as my low-powered user “bobby”. The next thing I did was download the latest Python source, the 3.8.1 version. I had to use the -k option to get around certificate issues:

[bobby@pythonvm ~]$ curl -k -O https://www.python.org/ftp/python/3.8.1/Python-3.8.1.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.8M  100 22.8M    0     0  7830k      0  0:00:02  0:00:02 --:--:-- 7828k

For some reason I had to do this twice. The first download failed. Next I gunzipped and untarred this archive under my home directory:

tar zxfv Python-3.8.1.tgz

Python-3.8.1/Objects/sliceobject.c
Python-3.8.1/Objects/listobject.c
Python-3.8.1/Objects/typeslots.inc
Python-3.8.1/Objects/weakrefobject.c
Python-3.8.1/Objects/unicodeobject.c
Python-3.8.1/Objects/complexobject.c
Python-3.8.1/Objects/picklebufobject.c
Python-3.8.1/Objects/odictobject.c
Python-3.8.1/Objects/genobject.c
[bobby@pythonvm ~]$

I created a directory called “python” under my home directory to use as the top-level directory for my Python install. This directory tree will hold the binaries and any Python packages that I install.

[bobby@pythonvm ~]$ mkdir python

After changing directory to where I untarred the source files I configured the Python make and install to use the directory I just created as a “prefix”:


[bobby@pythonvm ~]$ cd Python-3.8.1

[bobby@pythonvm Python-3.8.1]$ pwd
/home/bobby/Python-3.8.1

./configure --prefix=/home/bobby/python

config.status: creating Misc/python-embed.pc
config.status: creating Misc/python-config.sh
config.status: creating Modules/ld_so_aix
config.status: creating pyconfig.h
creating Modules/Setup.local
creating Makefile


If you want a release build with all stable optimizations active (PGO, etc),
please run ./configure --enable-optimizations

Then I did the make and make altinstall from the same directory:

make
...
renaming build/scripts-3.8/idle3 to build/scripts-3.8/idle3.8
renaming build/scripts-3.8/2to3 to build/scripts-3.8/2to3-3.8
/usr/bin/install -c -m 644 ./Tools/gdb/libpython.py python-gdb.py
gcc -pthread -c -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall    -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration  -I./Include/internal  -I. -I./Include    -DPy_BUILD_CORE -o Programs/_testembed.o ./Programs/_testembed.c
gcc -pthread     -Xlinker -export-dynamic -o Programs/_testembed Programs/_testembed.o libpython3.8.a -lcrypt -lpthread -ldl  -lutil -lm   -lm
sed -e "s,@EXENAME@,/home/bobby/python/bin/python3.8," < ./Misc/python-config.in >python-config.py
LC_ALL=C sed -e 's,\$(\([A-Za-z0-9_]*\)),\$\{\1\},g' < Misc/python-config.sh >python-config
[bobby@pythonvm Python-3.8.1]$

make altinstall
...
Looking in links: /tmp/tmpdrnzr3vb
Collecting setuptools
Collecting pip
Installing collected packages: setuptools, pip
Successfully installed pip-19.2.3 setuptools-41.2.0
[bobby@pythonvm Python-3.8.1]$

So, now Python 3.8.1 is installed in /home/bobby/python but I want to put the bin directory in the path so I can run python or pip and by default have them be this version. The install created python and pip as python3.8 and pip3.8 but I created links to them so that I could access them without 3.8 at the end of their names:

[bobby@pythonvm Python-3.8.1]$ cd /home/bobby/python/bin
[bobby@pythonvm bin]$ ls -al
total 16704
drwxr-xr-x. 2 bobby bobby     4096 Feb 10 15:36 .
drwxrwxr-x. 6 bobby bobby       52 Feb 10 15:36 ..
-rwxrwxr-x. 1 bobby bobby      109 Feb 10 15:36 2to3-3.8
-rwxrwxr-x. 1 bobby bobby      249 Feb 10 15:36 easy_install-3.8
-rwxrwxr-x. 1 bobby bobby      107 Feb 10 15:36 idle3.8
-rwxrwxr-x. 1 bobby bobby      231 Feb 10 15:36 pip3.8
-rwxrwxr-x. 1 bobby bobby       92 Feb 10 15:36 pydoc3.8
-rwxr-xr-x. 1 bobby bobby 17075312 Feb 10 15:35 python3.8
-rwxr-xr-x. 1 bobby bobby     3095 Feb 10 15:36 python3.8-config

[bobby@pythonvm bin]$ ln -s python3.8 python
[bobby@pythonvm bin]$ ln -s pip3.8 pip

Then I added /home/bobby/python/bin to the front of the path to keep these new python and pip links ahead of the python links or binaries in the rest of the path.

[bobby@pythonvm bin]$ cd
[bobby@pythonvm ~]$ vi .bashrc

export ORACLE_HOME=/home/oracle/app/oracle/product/12.1.0/client_1
export PATH=$PATH:$ORACLE_HOME/bin
export LD_LIBRARY_PATH=/home/oracle/app/oracle/product/12.1.0/client_1/lib
export PATH=/home/bobby/python/bin:$PATH

~
~
~
~
~
~
".bashrc" 17L, 451C written     

I logged out and in as bobby and tried python and pip to see that they were the correct version:

[bobby@pythonvm ~]$ python
Python 3.8.1 (default, Feb 10 2020, 15:33:01)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39.0.3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
[bobby@pythonvm ~]$ pip list
Package    Version
---------- -------
pip        19.2.3
setuptools 41.2.0
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
[bobby@pythonvm ~]$   

Finally, I updated pip just to show that we can update something in this new install. Then I installed numpy to see if I could install a new package.

[bobby@pythonvm ~]$ pip install --upgrade pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl (1.4MB)
     |████████████████████████████████| 1.4MB 863kB/s
Installing collected packages: pip
  Found existing installation: pip 19.2.3
    Uninstalling pip-19.2.3:
      Successfully uninstalled pip-19.2.3
Successfully installed pip-20.0.2

[bobby@pythonvm ~]$ pip list
Package    Version
---------- -------
pip        20.0.2
setuptools 41.2.0

[bobby@pythonvm ~]$ pip install numpy
Collecting numpy
  Downloading numpy-1.18.1-cp38-cp38-manylinux1_x86_64.whl (20.6 MB)
     |████████████████████████████████| 20.6 MB 62 kB/s
Installing collected packages: numpy
Successfully installed numpy-1.18.1
[bobby@pythonvm ~]$ pip list
Package    Version
---------- -------
numpy      1.18.1
pip        20.0.2
setuptools 41.2.0

This is a quick and dirty post, but I want to remember these steps for myself and it might be helpful for someone who uses Redhat/Centos/OEL.

Bobby

Categories: DBA Blogs

AZORA with Viscosity lunch and learn Thursday 2/13/20

Mon, 2020-02-03 17:51

Quick note. The next AZORA meetup will be next Thursday, 2/13/2020.

Kickstart to the Cloud Begin Your Journey and Optimize Along the Way

Thursday, Feb 13, 2020, 11:00 AM

The Henry
4455 E Camelback Rd Phoenix, AZ


Register for one of two Lunch & Learns at The Henry in Phoenix, or stay for both presentations! Seating is limited and an RSVP is required. To RSVP for Session 1, Session 2 or the entire event go to https://tips.viscosityna.com/modernize2020-phoenix/ and complete the form and make your selection. Oracle continues to be the fastest, state-of-the-art…


Be sure to RSVP on Viscosity’s web site: https://tips.viscosityna.com/modernize2020-phoenix/

I don’t know about you, but we are busy moving to or implementing on 19c, so these sessions on 19c and related topics should be valuable.

Bobby

Categories: DBA Blogs

Query Plan Change Diagnosis Example

Tue, 2020-01-28 15:02

This week I investigated an issue with a query that was suddenly a lot slower in one test environment than another. It runs in about 2 seconds in the good case, as it does in production. But it was now running more than 10 seconds, which is painfully slow through the GUI and was impacting the team’s ability to finish a software release. I was able to figure out that the query was running a worse plan, get it to run the good plan, and help the team finish their testing.

I thought I would write this blog post documenting the tools that I used and my reasoning while trying to resolve the issue. This post may be redundant with some of my other posts because they document similar tools and approaches, but I thought it could not hurt to put together a complete example. I considered just writing about one part of the process to focus in on that, but I thought I would write a longer post instead that shows the overall process with details along the way.

This blog is my experience as an Oracle Database Administrator/Architect, and I work for an individual company, so my experiences may not apply to everyone else’s situation. You may work somewhere that has licensed different versions or features of the Oracle database, so your choices would be different. But I still can put out there what I am doing and let others apply it to their situation as they can. Also, there are lots of very sharp and hardworking Oracle people out there that write about these same topics and have contributed their own tools to the Oracle community, so I wonder about the value of putting my tools on GitHub and writing about how I use them. Why not use their tools and their approaches? Maybe they will work better? Sure, but I know that my approach and tools work well for me. Also, most of the ideas I use came from books, talks, and posts by the same sharp people that I have in mind, so my approach really reflects my translation of ideas that are out there in the Oracle community, with additions of my own. It is not a choice of me versus them.

Anyway, I thought I would work through this example in this post so others can review it and find value where they can. Inevitably I will get a lot out of writing it down myself and it will help me remember what I was thinking at this time when I read this post in the future. Hopefully others will benefit as well.

Here is what I knew when I first looked at the problem. A coworker emailed me the SQL for the problem query and verified that the tables in both test environments were about the same size. Sometimes our test environments can be much smaller than production and different from each other in terms of data volume. Usually, when I get a performance question, I ask for two things: the name of the database and the date and time that the issue started and stopped, including time zone. In this case I knew the database name but not the time frame. But I figured that they had been working on it during the current day, so I just got an AWR report for that day from 8 am to present, assuming they worked roughly normal working hours.

Fortunately, the AWR report had the problem query as the top query so I did not have trouble finding it. It helped that I already knew the text of the SQL statement. The main thing I got from the AWR report at first is the SQL_ID value. I already had the SQL text, but it helps to get the SQL_ID to plug into other scripts I used.

Trimmed down version of AWR report showing problem SQL_ID 4n5ssud3fgsyq

My next step was to run my sqlstat.sql script to see if the problem query, 4n5ssud3fgsyq, had changed plans. (Not sure why I did not use the updated version, sqlstat3.sql.) I manually updated the query with the SQL_ID value:

where ss.sql_id = '4n5ssud3fgsyq'

and then I looked for plan changes in the output:


SQL_ID        PLAN_HASH_VALUE END_INTERVAL_TIME         EXECUTIONS_DELTA Elapsed Average ms
------------- --------------- ------------------------- ---------------- ------------------
4n5ssud3fgsyq        65249283 08-JAN-20 05.00.05.769 AM               12         47.4083333
4n5ssud3fgsyq        65249283 08-JAN-20 06.00.25.733 AM               21         44.0260952
4n5ssud3fgsyq        65249283 08-JAN-20 08.00.05.250 AM               16         54.5971875
4n5ssud3fgsyq        65249283 08-JAN-20 11.00.04.846 AM               22         165.703818
4n5ssud3fgsyq        65249283 08-JAN-20 12.00.24.960 PM               23          30.143913
4n5ssud3fgsyq        65249283 08-JAN-20 02.00.05.436 PM                1            298.585
4n5ssud3fgsyq        65249283 08-JAN-20 04.00.49.696 PM                3         323.666667
4n5ssud3fgsyq      2414495514 08-JAN-20 05.00.14.441 PM                4          14582.158
4n5ssud3fgsyq      2414495514 08-JAN-20 06.00.34.711 PM                6           9716.264
4n5ssud3fgsyq      2414495514 08-JAN-20 11.00.16.958 PM                8         10078.6125
4n5ssud3fgsyq      2414495514 09-JAN-20 12.00.36.802 AM               20         10876.6251
4n5ssud3fgsyq      2414495514 09-JAN-20 01.00.56.759 AM               10         10751.3919
4n5ssud3fgsyq      2414495514 09-JAN-20 02.00.16.085 AM               13          10297.929
4n5ssud3fgsyq      2414495514 09-JAN-20 03.00.36.804 AM                1          14824.517

I use the PLAN_HASH_VALUE as a name for the plan. I verified that the good plan, 65249283, was run on the other test server that was fine and that it was used in production. So, I had to figure out why the plan changed to 2414495514 on this one test server.

At some point in this process I paused and considered using a SQL Profile to just force the plan back to the good plan. I have written a number of posts about SQL Profiles, but this did not seem like a situation where one made sense. Why not spend a couple hours trying to figure out why the plan changed? If I use a SQL Profile, then I will have to migrate the profile to production. And if the query changes the tiniest bit in a future release the SQL Profile will no longer be valid.

I have one situation where we have a partitioned table with some almost empty partitions and queries that access the table with the partitioning column passed into the where clause as bind variables. I have been using SQL Profiles to handle that situation on this particular set of production and test databases, but it is a big pain. The last release I had to redo 7-9 SQL Profiles. I’m probably going to have to bite the bullet and find another way around it rather than updating profiles when a release modifies the affected queries. Probably I will have to fake the stats on the empty partitions or see if they can be removed. I doubt I will be able to get development to take the partitioning columns out of bind variables. I guess some set of hints might work as well.

But anyway, the point is that in the situation this particular post is about, I decided to dig into it and try to find the underlying problem instead of using a SQL Profile band-aid.

Having decided to move forward with figuring out why the plan changed I next looked at the two plans. I used my getplans.sql script to extract the plans from the AWR based on SQL_ID and PLAN_HASH_VALUE. I just ran this when logged into the problem database using sqlplus:

@getplans 4n5ssud3fgsyq

I noticed that the slow plan did full table scans on two tables where the fast plan used index lookups.

SQL_ID 4n5ssud3fgsyq
--------------------
...
Plan hash value: 65249283 - GOOD PLAN

----------------------------------------------------------
| Id  | Operation                             | Name     |
----------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |
|  17 |      INDEX RANGE SCAN                 | TABLE1_2 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |
|  48 |            INDEX UNIQUE SCAN          | TABLE2_0 |


Plan hash value: 2414495514 - BAD PLAN

----------------------------------------------------------
| Id  | Operation                             | Name     |
----------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1   |
...						       
|  48 |           TABLE ACCESS FULL           | TABLE2 	 |
----------------------------------------------------------  
						       

I also looked at the Segments part of the AWR report and TABLE2 was at the top of the logical and physical reads reports.

Edited picture of AWR report. Top Table is TABLE2 in execution plans.

So, at this point in my attempt to understand why the plan changed, I had the clue that the bad plan has full table scans and that one of the two tables with the full scans has a substantial percentage of the overall I/O on the database. So, I had two ideas in mind. First, I wanted to check that the two tables had the same indexes as on the working system. Maybe the indexes got dropped during some development work. If the indexes were there, I could just look at the optimizer statistics for the two tables and compare them to what they have on the working test system and production.

I use my optimizer statistics scripts to look at indexes and information that the optimizer uses to choose the best plan. I know that there are other ways to get the same information, such as in a graphical program like Toad or OEM, but I like having scripts that give me exactly the information that I know is important. They are a little clunky to use but not difficult. The top-level script is all.sql, and I usually comment out the calls to scripts that produce an excessive amount of output. I probably should publish all.sql on GitHub in this commented-out form:

@tablelist.sql
--@colpartstats.sql
@columnstats.sql
--@histograms.sql
@indexcolumns.sql
@indexstats.sql
--@indpartstats.sql
--@parthists.sql
@tablestats.sql
@tabpartstats.sql
@tabsubpartstats.sql

Seeing all the histogram information for every partition on a table with thousands of partitions is a lot of output and not really needed. If you have many subpartitions you might comment out tabsubpartstats.sql also. To use these scripts you just put your table owners and names in tablelist.sql and run all.sql. The script indexstats.sql listed out the indexes and statistics on them and indexcolumns.sql showed what columns the indexes were on. Unfortunately, in this case the indexes and their columns were the same for the two full scan tables on both test environments and on production. So the switch of the plan to full scans was not due to the lack of indexes. These tables were not partitioned so the tablestats.sql and columnstats.sql reports were the main ones I looked at to compare statistics. Maybe one of them had very out of date or even missing statistics. Unfortunately, the statistics on the two tables looked pretty similar on all three databases. So, my look at the indexes and statistics of the two tables with full scans on the bad plan did not explain the change in plan.

At some point in this process I decided to run the problem query to see if it gave me clues about the plan change. I am putting this post together remembering the steps I went through, but I may not have the exact chronological order. But I should be able to capture the main things I thought about in an order that makes sense. In order to test the query, I needed to come up with values for the bind variables. I used my bind2.sql script to extract some of the bind variable values on the problem test database. I edited bind2.sql to have the problem SQL_ID, 4n5ssud3fgsyq. I chose the most recent two sets of bind variable values to plug into the problem query. I extracted the query text using dumpsql.sql and used Toad to format it nicely. Then I plugged the formatted query into my test2.sql script. To use test2.sql you need to replace the select statement that is there on GitHub:

SELECT /*+gather_plan_statistics*/ * from dual where dummy <> ' ';

with your own query. But you have to make sure that the gather_plan_statistics hint stays in the select statement. There are different ways to handle bind variables and their values in a test script like this, but in this case I manually edited the test script, replacing the bind variables with literals that I got from bind2.sql. I chose two sets because one set had nulls for a couple of values, so I thought the nulls might cause the bad plan. There might be some odd choice of bind variable values on this test system that was causing the bad plan, or so I thought. But I ran the test2.sql script as described and it ran slow on the one test system for both sets of bind variable values and fast on the other.

My tests with the bind variable values showed that the problem was not related to those sets of bind variable values and that even with literals in place the problem query runs slow on the problem test database. The plan was not the same as the bad plan I had extracted for the original query, but it still had the two full scans. In the same way, my tests had index lookups on the two tables on the other test environment even with the literals replacing the bind variables. So, all this really proved so far was that the two sets of bind variable values that I chose did not affect the problem.

To go further I started looking at the output of test2.sql. The way I get the plan here shows estimated and actual rows. If I find discrepancies in estimated and actual row counts that could point me to where the problem lies. This query in test2.sql dumps out the plan in a format that shows estimated and actual rows:

select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

Here is the edited output for the bad plan for the two full scan tables:

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Starts | E-Rows | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL        | TABLE1  |      2 |   1593K|   3186K|00:00:01.36 |
...
|  48 |           TABLE ACCESS FULL  | TABLE2  |      2 |   1628K|   3344K|00:00:15.13 |
----------------------------------------------------------------------------------------

But I am not sure what this is telling me. For TABLE1 the optimizer expected 1593K rows and got 3186K. It expected 1.5 million rows but got 3 million. Similarly for TABLE2: about 1.6 million expected and 3.3 million actual. But how does this help me figure out what the problem is? I looked at the good plan in the same way but was equally puzzled:

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |      0 |      1 |      0 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |      0 |      1 |      0 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |      0 |      1 |      0 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |      0 |      1 |      0 |
-------------------------------------------------------------------------------------

Here the estimated row count is 1 but the actual count is 0. But how do I use that to figure out what changed the plan on the one test environment? Then I got the key insight that solved this whole thing. It was not new to me, but I don’t do this every day, and it was a good reminder and I wanted to blog about it. I considered blogging about just this one point and maybe I should edit this post down to just this idea. But then I wouldn’t put out the overall steps and tools I used to get to this point.

The key insight was to use a full set of outline hints to force the bad plan and the good plan to run in both environments and to compare the estimated and actual rows for each plan. I am not sure how this post will end, but that is the most important point. Presumably the optimizer considered both plans, gave each of them a cost, and chose the lowest cost plan to run. On our problem system the much slower plan has a lower cost than the faster one. The opposite is true on the working systems. What we want is to force the query to run the bad plan in both places and see where the estimated rows differ. Then do the same for the good plan.

A key idea that I learned about query tuning from people like Tom Kyte, Jonathan Lewis, and others is that the optimizer chooses a bad plan because of bad row estimates or bad estimates of the cost per row of certain operations. Most of the time it comes down to bad row estimates. So, which table is getting the bad row estimates and why? My way of finding out in this case was to use outline hints to force the same plan to run on both databases and then use the test2.sql output to look at estimated and actual rows.

I was not sure how to extract the plans from the AWR and get the outline hints. I’m sure there is a way (see the note below), but in my hurry I ended up just using my plan.sql script to get the plan with EXPLAIN PLAN and pull the outline hint from there. I checked the plans to make sure they had the same index versus full scan differences that I had been seeing. At the end of the output for plan.sql is an outline hint section that starts like this:

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA

You just copy and paste everything from /*+ through */ as your hint. I put this in my test2.sql script that I had used before with one set of the bind variable values as constants and I merged the gather_plan_statistics hint with the outline hint so the query started like this:

 SELECT /*+gather_plan_statistics 
      BEGIN_OUTLINE_DATA

This way the gather_plan_statistics hint collects the actual and estimated rows and the outline hint forces the plan to be used.
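As a side note, dbms_xplan can also pull a plan with its outline straight out of the AWR. I did not do it this way at the time, but something like this should show the Outline Data section for the bad plan:

select * from table(dbms_xplan.display_awr('4n5ssud3fgsyq',2414495514,null,'ADVANCED'));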

At first, I looked at the estimated and actual rows on the good and bad plans on both systems for the two tables that switched to full scans on the problem system.

Good DB Bad Plan Full Scan Tables

------------------------------------------------------------------------------------
| Id  | Operation                             | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1  |      1 |   1590K|   1593K|
...
|  48 |           TABLE ACCESS FULL           | TABLE2  |      1 |   1957K|   1957K|
------------------------------------------------------------------------------------

Bad DB Bad Plan Full Scan Tables

Plan hash value: 1397976319

------------------------------------------------------------------------------------
| Id  | Operation                             | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS FULL                 | TABLE1  |      1 |   1593K|   1593K|
...
|  48 |           TABLE ACCESS FULL           | TABLE2  |      1 |   1628K|   1672K|
------------------------------------------------------------------------------------

Good DB Good Plan Full Scan Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |     16 |      1 |     18 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |     16 |      1 |     18 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |     12 |      1 |     12 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |     12 |      1 |     12 |
-------------------------------------------------------------------------------------

Bad DB Good Plan Full Scan Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  16 |     TABLE ACCESS BY INDEX ROWID       | TABLE1   |     16 |      1 |     18 |
|* 17 |      INDEX RANGE SCAN                 | TABLE1_2 |     16 |      2 |     18 |
...
|  47 |           TABLE ACCESS BY INDEX ROWID | TABLE2   |     12 |      1 |     11 |
|* 48 |            INDEX UNIQUE SCAN          | TABLE2_0 |     12 |      1 |     11 |
-------------------------------------------------------------------------------------

No obvious pattern shows up on these to me. It seems like the optimizer has the same estimates for these two tables on both databases for both plans. But then I got the second key idea: look at the other tables. It is typical when a plan switches to a full scan for there to be something wrong with that table’s stats. But nothing pointed to that in this situation. So, I got the idea of stepping back and looking at the other tables. That led to me finding the actual issue. Two tables that were not the ones with the full scans had different degrees of discrepancies between actual and estimated rows on the bad and good databases.

Good DB Bad Plan Other Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID    | TABLE3   |      1 |      3 |      6 |
|* 13 |         INDEX RANGE SCAN              | TABLE3_1 |      1 |      3 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID      | TABLE4   |      6 |     55 |     11 |
|* 15 |       INDEX RANGE SCAN                | TABLE4_1 |      6 |     55 |     11 |
...
-------------------------------------------------------------------------------------

Bad DB Bad Plan Other Tables

-------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Starts | E-Rows | A-Rows |
-------------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID    | TABLE3   |      1 |     22 |      6 |
|* 13 |         INDEX RANGE SCAN              | TABLE3_1 |      1 |     22 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID      | TABLE4   |      6 |    153 |     11 |
|* 15 |       INDEX RANGE SCAN                | TABLE4_1 |      6 |    153 |     11 |
...
-------------------------------------------------------------------------------------

Good DB Good Plan Other Tables

----------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID | TABLE3   |      1 |      3 |      6 |
|* 13 |         INDEX RANGE SCAN           | TABLE3_1 |      1 |      3 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID   | TABLE4   |      6 |     55 |     11 |
|* 15 |       INDEX RANGE SCAN             | TABLE4_1 |      6 |     55 |     11 |

----------------------------------------------------------------------------------

Bad DB Good Plan Other Tables

----------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------
...
|  12 |        TABLE ACCESS BY INDEX ROWID | TABLE3   |      1 |     22 |      6 |
|* 13 |         INDEX RANGE SCAN           | TABLE3_1 |      1 |     22 |      6 |
|  14 |      TABLE ACCESS BY INDEX ROWID   | TABLE4   |      6 |    153 |     11 |
|* 15 |       INDEX RANGE SCAN             | TABLE4_1 |      6 |    153 |     11 |
...
----------------------------------------------------------------------------------

Looking at these two tables the bad database had significantly higher row estimates than the good database and the actual rows was the same on both. Errors in estimated rows multiply through the plan as joins are considered. So, at this point I thought it could be that the estimates on these tables (TABLE3 and TABLE4) led to the switch to full scans on TABLE1 and TABLE2. So, I went back to my optimizer statistics scripts and took TABLE1 and 2 out of tablelist.sql and put TABLE3 and 4 in and looked for differences. I found the difference in the columnstats.sql output. The NUM_BUCKETS value for the first column of TABLE3 and TABLE4 was 1 on the bad database and 254 or 255 on the good. So, column statistics had been gathered differently on these two tables on the bad database at some point. Maybe in the process of testing or developing on this system statistics were gathered in a non-standard way. The database in question uses Oracle’s default statistics job. (This is 11.2.0.3, HP Unix Itanium by the way). So, I thought I would try gathering statistics on the two tables with dbms_stats.gather_table_stats with default options because that would be similar to what the default stats job would do. I did something like this:

execute dbms_stats.gather_table_stats('MYSCHEMA','TABLE3');

execute dbms_stats.gather_table_stats('MYSCHEMA','TABLE4');

After gathering stats on the two tables I verified that the first column of each had a 254 or 255 bucket histogram. Then I just tested the problem query on both, and they ran the same fast plan. Then the development team tested through the application and verified that it was back to normal. I flushed the shared pool before turning it over to them in case the bad plan was cached.
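The histogram check was essentially a query like this against the data dictionary (table names made up as before):

select column_name, num_distinct, num_buckets, histogram
from dba_tab_col_statistics
where owner = 'MYSCHEMA'
and table_name in ('TABLE3','TABLE4');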

This problem was a little tricky because the stats were wrong on two tables that were not the ones that switched to full scans. A lot of times the situation is simpler, such as a table missing statistics altogether. But I thought it would be helpful to work through a description of the process that I took to get to that point and the tools that I used. The key point is the one I put in bold above, which is to look at the estimated and actual rows for the same plan on both databases and look for differences. Once I did that it was just a matter of finding the tables with the different estimates on the two databases. That led me to look at their statistics and to regather them in a better way.

It took a little time to figure this out, but it was nice to find the problem and make the least amount of change necessary to resolve the issue instead of just randomly trying things. Sometimes people will just gather stats on all the tables in the query. That probably would have worked here if they gathered them in the right way. Or you could just use a SQL Profile and forget trying to figure out why the new plan popped up. But it was worthwhile to understand why the plan changed in this case. For one thing, I was concerned whether the same plan change would happen in production, so I wanted to know why it changed on the test environment. Knowing that the two tables had statistics gathered on them in a way that we would not use in production, I have more confidence that prod is fine.

There really is value in digging into why a query’s plan has changed, but it does take some work. Often the bad plan is a result of a bad row estimate in some part of the plan, and it makes sense to track down which table has the bad row estimate and dig into why the estimate was off. In many cases bad row estimates relate to how and when statistics were gathered on the problem table, as was the case with our two tables in this situation.

Categories: DBA Blogs

Datapump Import Partitioned Tables ORA-00600 qesmaGetPamR-NullCtx

Wed, 2019-12-18 10:28

I have not yet had time to build a test case and prove this out, but I wanted to document one last bug that we have found in our 11.2.0.4 to 19c upgrade. We tried copying a bunch of partitioned tables from our source database to the new one using Datapump Import (impdp) over a database link. We got a boatload of errors like this:

ORA-00600: internal error code, arguments: [qesmaGetPamR-NullCtx], 

There are many Oracle bugs like this, but they seem to have been fixed in 11.2.0.4. For example:

Bug 12591399 – ORA-600[qesmagetpamr-nullctx] / ORA-14091 with distributed query with local partition table (Doc ID 12591399.8)

Puzzling. We ended up just exporting to disk and that has worked well, so no big deal, but I wonder if this is some sort of regression of a fixed bug.

Anyway, I am off for the rest of the year. This should be my last post unless I mess with Nethack over vacation and post something about that. I hope everyone out there has a good new year.

Bobby

P.S. Created a simple partitioned table with 2 partitions and 100 rows in each one. I got the error importing over a link from 11.2.0.4 to 19c. It worked perfectly going from 11.2.0.4 to 11.2.0.4. Same source table. Parfile:

$ cat bobby_link_test.par
userid=MYUSER/MYPASSWORD
JOB_NAME=BOBBY_TEST
DIRECTORY=BOBBY_DIR
NETWORK_LINK=MYLINK
LOGFILE=bobby_link_test.log
tables=TEST

Table:

CREATE TABLE test
(
  PART_COL              NUMBER,
  data                  NUMBER
)
PARTITION BY RANGE (PART_COL)
(  
  PARTITION PART_COL_1 VALUES LESS THAN (100),  
  PARTITION PART_COL_2 VALUES LESS THAN (200)
)
;

P.P.S. Works fine going from 11.2.0.4 to 18c. Going to try a different 19c database just to be sure it isn’t the one that has the problem.

P.P.P.S. Definitely a 19c bug. It fails on two different 19c databases but not on 18c. In every case the source is the same 11.2.0.4 database and the same small partitioned table. Does anyone have time to file the bug report?

Categories: DBA Blogs

db_securefile PREFERRED results in ORA-60019 with small uniform extents

Tue, 2019-12-17 17:35

One last 19c upgrade issue. As we worked on our new 19c database, several things died with errors like this:

SQL> execute DBMS_STATS.CREATE_STAT_TABLE ('MYSCHEMA','MYSTATTAB','MYTS');
BEGIN DBMS_STATS.CREATE_STAT_TABLE ('MYSCHEMA','MYSTATTAB','MYTS'); END;

*
ERROR at line 1:
ORA-60019: Creating initial extent of size 14 in tablespace of extent size 8
ORA-06512: at "SYS.DBMS_STATS", line 20827
ORA-06512: at "SYS.DBMS_STATS", line 20770
ORA-06512: at "SYS.DBMS_STATS", line 20765
ORA-06512: at line 1

Our tablespaces had small uniform extents, and our 19c database had defaulted the parameter db_securefile to PREFERRED. We bumped our uniform extent sizes up to 1 megabyte and the problem went away. Setting db_securefile to PERMITTED also resolved the issue.
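db_securefile is a dynamic parameter, so if you go the parameter route something like this should do it:

alter system set db_securefile=PERMITTED scope=both;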

Oracle’s support site has a bunch of good information about this. This might be a relevant bug:

Bug 9477178 : ORA-60019: CREATING INITIAL EXTENT OF SIZE X IN TABLESPACE FOR SECUREFILES

Bobby

Categories: DBA Blogs

Datapump Import Fails on Tables With Extended Statistics

Tue, 2019-12-17 17:11

Quick post before I leave on vacation. We used Datapump to import a schema from an 11.2 HP-UX database to a 19c Linux database and got errors on a few tables like these:

ORA-39083: Object type TABLE:"MYSCHEMA"."TEST" failed to create with error:
ORA-00904: "SYS_STU0S46GP2UUQY#45F$7UBFFCM": invalid identifier

Failing sql is:
ALTER TABLE "MYSCHEMA"."TEST"  MODIFY ("SYS_STU0S46GP2UUQY#45F$7UBFFCM" NUMBER GENERATED
ALWAYS AS (SYS_OP_COMBINED_HASH("COL1","COL2","COL3")) VIRTUAL )

The workaround was to create the table ahead of time, empty with no indexes, constraints, etc., and then import. Today I was trying to figure out why this happened. Apparently, the table has extended statistics on the three primary key columns. I found a post by Jonathan Lewis that shows a virtual column like the one this table has with extended statistics. The error is on the Datapump import, impdp, of the table that has extended statistics. This error is similar to some Oracle documented issues such as:

DataPump Import (IMPDP) Raises The Errors ORA-39083 ORA-904 Due To Virtual Columns Dependent On A Function (Doc ID 1271176.1)

But I could not immediately find something that says that extended statistics cause a table to not be importable using Datapump impdp.

If you want to recreate the problem, try adding extended stats like this (which I derived from Jonathan Lewis’s post):

select dbms_stats.create_extended_stats(NULL,'TEST','(COL1, COL2, COL3)') name from dual;

select * from user_tab_cols where table_name='TEST';

Then export the table from 11.2 and import it to a 19c database using Datapump. Anyway, posting here for my own memory and in case others find it useful. Maybe this is a bug?

Bobby

Categories: DBA Blogs

Merge Always Updates Sequence Number

Tue, 2019-12-17 11:55

This is nothing new, but I wanted to throw out a quick post to document it. If you have a sequence.nextval in the insert part of a merge statement, the merge calls nextval for all the updated rows as well.

Oracle has a bug report about this from a 9.2 issue, so it has been around a long time:

Bug 6827003 : SEQUENCE # IN MERGE BEING UPDATED FOR BOTH INSERT AND UPDATE

I created a couple of test cases if you want to try them: sequencewithmerge.zip

Oracle’s bug report says you can work around the issue by encasing the sequence.nextval call in a function, so I tried it and it works.
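The workaround looks something like this sketch (all names made up; not the exact test case in the zip):

create sequence test_seq;

create or replace function next_test_seq return number
as
begin
  -- nextval here only fires when the insert branch of the merge runs
  return test_seq.nextval;
end;
/

merge into target t
using source s
on (t.id = s.id)
when matched then
  update set t.data = s.data
when not matched then
  insert (id, data, seq_num)
  values (s.id, s.data, next_test_seq);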

Anyway, you can’t count on the sequence being advanced only for inserted rows if you include sequence.nextval in the insert part of a merge statement.

Bobby

Categories: DBA Blogs

Database Link to 9.2 Database from 19c

Fri, 2019-12-13 15:12

I have mentioned in previous posts that I am working on migrating a large 11.2 database on HP Unix to 19c on Linux. I ran across a database link to an older 9.2 database in the current 11.2 database. That link does not work in 19c, so I thought I would blog about my attempts to get it working. It may not be that useful to other people because it is a special case, but I want to remember it for myself if nothing else.

First, I’ll just create a test table in my own schema on a 9.2 development database:

SQL> create table test as select * from v$version;

Table created.

SQL> 
SQL> select * from test;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
PL/SQL Release 9.2.0.5.0 - Production
CORE	9.2.0.6.0	Production
TNS for HPUX: Version 9.2.0.5.0 - Production
NLSRTL Version 9.2.0.5.0 - Production

Next, I will create a link to this 9.2 database from a 19c database. I will hide the part of the link creation that has my password and the database details, but they are not needed.

SQL> create database link link_to_92
... removed for security reasons ...

Database link created.

SQL> 
SQL> select * from test@link_to_92;
select * from test@link_to_92
                   *
ERROR at line 1:
ORA-03134: Connections to this server version are no longer supported.

So I looked up ways to get around the ORA-03134 error. I can’t remember all the things I checked but I have a note that I looked at this one link: Resolving 3134 errors. The idea was to create a new database link from an 11.2 database to a 9.2 database. Then create a synonym on the 11.2 database for the table I want on the 9.2 system. Here is what that looks like on my test databases:

SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
... removed for brevity ...

SQL> create database link link_from_112
... removed for security ...

Database link created.

SQL> create synonym test for test@link_from_112;

Synonym created.

SQL> 
SQL> select * from test;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

Now that I have the link and synonym on the 11.2 middleman database, I go back to the 19c database and create a link to the 11.2 database and query the synonym to see the original table:

SQL> select * from v$version;

BANNER                                                                           ...
-------------------------------------------------------------------------------- ...
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production           ...
...										    

SQL> create database link link_to_112
...

Database link created.
...
SQL> select * from v$version@link_to_112;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
...

SQL> select * from test@link_to_112;

BANNER
----------------------------------------------------------------
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production

So far so good. I am not sure how clear I have been, but the point is that I could not query the table test on the 9.2 database from a 19c database without getting an error. By going through an 11.2 database I can now query it. But, alas, that is not the end of my problems with this remote 9.2 database table.

When I first started looking at these remote 9.2 tables in my real system, I wanted to get an execution plan for a query that used them. The trick of linking through an 11.2 database let me query the tables, but it did not let me get a plan for the query.

SQL> truncate table plan_table;

Table truncated.

SQL> 
SQL> explain plan into plan_table for
  2  select * from test@link_to_112
  3  /

Explained.

SQL> 
SQL> set markup html preformat on
SQL> 
SQL> select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Error: cannot fetch last explain plan from PLAN_TABLE

SQL> 
SQL> select object_name from plan_table;

OBJECT_NAME
------------------------------------------------------------------------------

TEST

Kind of funky, but not the end of the world. Only a small number of queries use these remote 9.2 tables, so I should be able to live without explain plan.
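If I really needed a plan, one thing that might work (I have not tried it against this setup, so treat it as a sketch) is to run the query and then pull the runtime plan out of the shared pool with dbms_xplan.display_cursor:

-- run the query with a marker comment so the cursor is easy to find
select /* remote92test */ count(*) from test@link_to_112;

-- look up the cursor in the shared pool; the second condition
-- keeps this query from matching itself
select sql_id, child_number
from v$sql
where sql_text like '%remote92test%'
and sql_text not like '%v$sql%';

-- display the runtime plan for that cursor
select * from table(dbms_xplan.display_cursor('<sql_id from above>',0,'ADVANCED'));

Next, I needed to use the remote table in a PL/SQL package. For simplicity, I will show it in a procedure: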

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test@link_to_112;
  9  
 10  END BOBBYTEST ;
 11  /

Warning: Procedure created with compilation errors.

SQL> SHOW ERRORS;
Errors for PROCEDURE BOBBYTEST:

LINE/COL ERROR
-------- -----------------------------------------------------------------
6/3      PL/SQL: SQL Statement ignored
6/3      PL/SQL: ORA-00980: synonym translation is no longer valid

I tried creating a synonym for the remote table but got the same error:

SQL> create synonym test92 for test@link_to_112;

...

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test92;
  9  
 10  END BOBBYTEST ;
 11  /

Warning: Procedure created with compilation errors.

SQL> SHOW ERRORS;
Errors for PROCEDURE BOBBYTEST:

LINE/COL ERROR
-------- -----------------------------------------------------------------
6/3      PL/SQL: SQL Statement ignored
6/3      PL/SQL: ORA-00980: synonym translation is no longer valid
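
Before settling on the view that finally worked, another angle that might avoid this compile-time synonym check is native dynamic SQL, since EXECUTE IMMEDIATE defers parsing of the statement until runtime. I did not test this, so it is only an untested sketch:

CREATE OR REPLACE PROCEDURE BOBBYTEST
AS
ver_count number;

BEGIN
  -- the dynamic statement is parsed at runtime, not when the
  -- procedure is compiled, so the ORA-00980 synonym check
  -- should not fire at compile time (untested assumption)
  execute immediate 'select count(*) from test92' into ver_count;

END BOBBYTEST ;
/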

Finally, by chance I found that wrapping the remote synonym in a local view let the procedure compile:

SQL> create view test92 as select * from test@link_to_112;

View created.

...

SQL> CREATE OR REPLACE PROCEDURE BOBBYTEST
  2  AS
  3  ver_count number;
  4  
  5  BEGIN
  6    SELECT
  7    count(*) into ver_count
  8    FROM test92;
  9  
 10  END BOBBYTEST ;
 11  /

Procedure created.

SQL> SHOW ERRORS;
No errors.
SQL> 
SQL> execute bobbytest;

PL/SQL procedure successfully completed.

SQL> show errors
No errors.

Now for one last thing to check: will explain plan work with the view?

SQL> explain plan into plan_table for
  2  select * from test92
  3  /

Explained.

SQL> select * from table(dbms_xplan.display('PLAN_TABLE',NULL,'ADVANCED'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------
Error: cannot fetch last explain plan from PLAN_TABLE

Sadly, the view was not a cure-all. So, here is a summary of what to do if you have a procedure on a 19c database that needs to access a table on a 9.2 database (a consolidated sketch of the statements follows the list):

  • Create a link on an 11.2 database to the 9.2 database
  • Create a synonym on the 11.2 database pointing to the table on the 9.2 database
  • Create a link on the 19c database to the 11.2 database
  • Create a view on the 19c database that queries the synonym on the 11.2 database
  • Use the view in your procedure on the 19c database
  • Expect that explain plan may not work with SQL that uses the view
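
Putting the whole chain together, the statements look something like this; the connection credentials and TNS aliases below are made-up placeholders:

-- on the 11.2 middleman database
create database link link_from_112
connect to user92 identified by password92
using 'TNS_ALIAS_FOR_92_DB';

create synonym test for test@link_from_112;

-- on the 19c database
create database link link_to_112
connect to user112 identified by password112
using 'TNS_ALIAS_FOR_112_DB';

create view test92 as select * from test@link_to_112;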

Bobby
