Tuning Goldilocks SQL

Comments

First, I think this article is brilliant; it taught me a lot. But second, I did wonder about a couple of other approaches. Am I right in thinking that the problem in case 3 is partly due to the reference table blocks not being cached, so that each INDEX UNIQUE SCAN is followed by a TABLE ACCESS BY INDEX ROWID that requires a physical read? Parallel query speeds things up by brute force. However, it requires Enterprise Edition licences, and I'm not sure it would scale up too well when you consider the impact on CPU and I/O. How about these alternatives:

Case 5: restructure the reference tables as IOTs. I can see that this would not be appropriate if the reference tables have many columns, but if they are sufficiently few (or if sensible use can be made of an overflow segment) there might be a considerable saving from cutting out the TABLE ACCESS BY INDEX ROWID.
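To illustrate Case 5 (a sketch only - the payload column and threshold are placeholder assumptions, borrowing the test_join_ref1 name from the test case):

CREATE TABLE test_join_ref1_iot (
    pk1      NUMBER        NOT NULL,
    payload  VARCHAR2(100),
    CONSTRAINT test_join_ref1_iot_pk PRIMARY KEY (pk1)
)
ORGANIZATION INDEX
PCTTHRESHOLD 20    -- rows wider than 20% of a block spill to the overflow segment
OVERFLOW;

A PK lookup then resolves entirely within the index structure: an INDEX UNIQUE SCAN with no follow-up TABLE ACCESS BY INDEX ROWID.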

Case 6: restructure the reference tables as single table hash clusters, which would replace both the INDEX UNIQUE SCAN and the TABLE ACCESS BY INDEX ROWID with TABLE ACCESS HASH.

If one has EE licences, use parallel query as well! If it is feasible to provide the data sets, I would be interested in doing the test myself - though I'm not sure I could obtain a suitable environment just now.

The Enterprise Edition licence issue didn't occur to me; I've always been lucky enough to work for large, wealthy clients who buy Oracle with all the trimmings. But the point is well taken. I am pretty confident neither alternative would come close to Parallel, but it is still theoretically possible to get improvements approaching 50%.

I would be very wary of an IOT approach. They are great for PK access - agreed - but they are punishingly slow for selective access on alternate indexes. I think there would be unsatisfyingly few situations where IOTs would do more good than harm. I'm also wary of setting precedents that other people copy without understanding the caveats.

The hash cluster possibility is more interesting. Coincidentally, I built a Data Mart a while ago where I built all of the dimensions as Hash Clusters for just this purpose. I went to write up the experience for an article on this site, but was unable to create artificial test cases where the hash clusters materially outperformed unique indexes. In the end, the article morphed into a completely different topic (http://www.orafaq.com/node/1446): World's fastest scalable join.

During that exercise, I suspect that I did not adequately manage the physical disk cache and the buffer cache, and my volumes were too small. Certainly my buffer cache hit ratio was too high to see a difference. It would be interesting to have another go with the Goldilocks example.

My only reservations on the cluster example are:
- You cannot partition a hash cluster
- INSERT performance is impacted
You'd need to evaluate these costs when deciding to hash cluster a table as part of a Goldilocks solution.

I might have a go at this. Otherwise, stay tuned and I will see if I can post the test data generators.

Thanks for the feedback. Nice to know someone's reading out there.

Here is the output from the hash example. I set up both REF tables in Hash Clusters with a key size of 1 block and 100,000 hashkeys (these tables previously occupied about 70K blocks).
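The DDL was along these lines (a reconstructed sketch - assuming an 8K block size, with an illustrative payload column; one cluster per REF table):

CREATE CLUSTER test_join_ref1_clu (pk1 NUMBER)
  SIZE 8192          -- one block per hash key, assuming an 8K block size
  SINGLE TABLE
  HASHKEYS 100000;

CREATE TABLE test_join_ref1 (
    pk1      NUMBER PRIMARY KEY,
    payload  VARCHAR2(100)
)
CLUSTER test_join_ref1_clu (pk1);

Here is the trace: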

SELECT    /*+ first_rows*/ *
FROM      test_join_src  src
LEFT JOIN test_join_ref1 ref1 ON ref1.pk1 = src.fk1
LEFT JOIN test_join_ref2 ref2 ON ref2.pk2 = src.fk2
LEFT JOIN test_join_tgt  tgt  ON tgt.pk = src.pk
where rownum > 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.21       0.37         86        183          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1     10.06     306.30      29100      69768          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3     10.27     306.67      29186      69951          0           0

Misses in library cache during parse: 1
Optimizer mode: FIRST_ROWS
Parsing user id: 140

Rows     Row Source Operation
-------  ---------------------------------------------------
      0  COUNT  (cr=69768 pr=29100 pw=0 time=306303222 us)
      0   FILTER  (cr=69768 pr=29100 pw=0 time=306303206 us)
  10000    NESTED LOOPS OUTER (cr=69768 pr=29100 pw=0 time=321080054 us)
  10000     NESTED LOOPS OUTER (cr=49887 pr=19531 pw=0 time=218657909 us)
  10000      NESTED LOOPS OUTER (cr=29997 pr=10034 pw=0 time=103099253 us)
  10000       TABLE ACCESS FULL TEST_JOIN_SRC (cr=88 pr=6 pw=0 time=58280 us)
   9907       TABLE ACCESS BY INDEX ROWID TEST_JOIN_TGT (cr=29909 pr=10028 pw=0 time=93153961 us)
   9907        INDEX UNIQUE SCAN TEST_JOIN_TGT_PK (cr=20002 pr=9942 pw=0 time=92078523 us)(object id 142723)
  10000      TABLE ACCESS HASH TEST_JOIN_REF2 (cr=19890 pr=9497 pw=0 time=105084541 us)
  10000     TABLE ACCESS HASH TEST_JOIN_REF1 (cr=19881 pr=9569 pw=0 time=106784586 us)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                     29101        0.50        299.55
  SQL*Net message to client                       1        0.00          0.00
  db file scattered read                          2        0.00          0.01
  SQL*Net message from client                     1       13.89         13.89

So it compares favourably with the indexed nested loops version (484 sec) - a 37% improvement. I think if I had clustered TEST_JOIN_TGT as well, it would have been freakishly close to my estimate of 50% improvement.

Verdict: Hash Clusters - definitely a tool to drop in the Goldilocks kit.

Ross,

Very informative article. Can you comment on the degree of parallelism and how to determine the most appropriate setting for that, for your test case?

--Andrew

Thanks for reading and responding, Andrew.

I have not done much experimenting with degrees of parallelism, and I certainly did not experiment with this test case.

My approach to parallelism has been to enable it on tables and indexes, and let Oracle sort out the number of parallel threads it uses. On my system - when nothing else is contending for parallel threads - Oracle allocates 128 to a SQL, with a maximum of 640 for the instance. In my parallel test case above it used 128 threads and the server was otherwise idle.
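By "enable it" I just mean setting the default PARALLEL attribute and leaving the degree to Oracle - for example (using the test-case names):

ALTER TABLE test_join_src    PARALLEL;  -- degree chosen by Oracle at run time
ALTER TABLE test_join_ref1   PARALLEL;
ALTER TABLE test_join_ref2   PARALLEL;
ALTER INDEX test_join_tgt_pk PARALLEL;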

If you would like me to speculate (and let's face it, it's pretty hard to stop me), I would say that you could still get massive performance gains in this test case from far fewer slaves. My thinking here is:
- The critical path is the disk latency (round trip to disk). This represented 476 seconds out of a 484 second query.
- We mitigate this by parallelising those round trips.
- Too much parallelism will clog either the network (my storage is NAS) or the disk.
- If the disk were infinitely scalable, we would halve the disk latency with 2 slaves, cut it to one third with 3 slaves, and so on.
- With my parallel example, the 128 slaves spent 20 seconds acquiring the data. Based on test case 3, I expect most of this was disk latency.
- 20 seconds is about 1/24th of the non-parallel 476 seconds spent retrieving from disk, whereas each thread processed 1/128th of the data.

My guess is that the disk-latency-related performance gain from parallelism tails off on my system at around 20 parallel slaves.

So, I could probably be more frugal in my use of parallel resources by overriding the degree of parallelism to (say) 24. But this is a special case where we are dealing with disk latency. If the parallelism were being used to break down a GROUP BY - or some other intensive process - then the ideal degree might change.
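For example, either at the object level or per statement (a sketch - the hint form uses the src alias from the query above):

ALTER TABLE test_join_src PARALLEL 24;

SELECT /*+ parallel(src, 24) */ COUNT(*)
FROM   test_join_src src;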

About all I can say about the degree of parallelism is that Oracle seems to manage it pretty well. If there is a lot of competition for parallel slaves, then the initialisation parameter parallel_adaptive_multi_user can be used to automatically limit the number of slaves allocated to each process.
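For example:

ALTER SYSTEM SET parallel_adaptive_multi_user = TRUE;  -- I believe this is already the 10g default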

I haven't yet found a need - in 10g - to manually throttle parallelism in Oracle.