Star Joins

What is a Star Join?

A Star Join is where you join a really big fact table (or a really big subset of an enormous table) to a number of smaller tables. They are not to be confused with a Star Query. A Star Query also joins a big (fact) table to small (dimension) tables, but it will use WHERE predicates to constrain the rows selected to a small volume of data. A Star Query should use indexes, so is not technically a High Volume SQL.

A typical use for a Star Join would be the creation of an aggregated fact table from an atomic fact table. A typical use for a star query would be an end-user query on any fact table.

Consider the following:

SELECT ...
FROM   really_big_fact f    -- 50 million rows
,      dim1 d1              -- 50 rows
,      dim2 d2              -- 100 rows
,      dim3 d3              -- 10 rows
,      dim4 d4              -- 15 rows
WHERE  f.key1 = d1.key1
AND    f.key2 = d2.key2
AND    f.key3 = d3.key3
AND    f.key4 = d4.key4

The best way to join this table would be to hash join f to d1, hash join the result to d2, hash join the result to d3, hash join the result to d4. Since d1, d2, d3, and d4 are all small, the hash joins are really efficient (hash joins work particularly well with one small table). The four dimension tables are all hashed into memory, and then the fact table is processed row-by-row, looking up the matching values from the four in-memory hash tables.

If Oracle is choosing an alternate plan, it is easy to enforce the above approach using the ORDERED hint (ensure the fact table is listed first in the FROM clause) and a USE_HASH hint that lists all of the table aliases. eg. /*+ ORDERED USE_HASH(f,d1,d2,d3,d4)*/

This method does not work when one of the dimensions is large (>100,000 rows). If the smaller of two tables in a hash join is still too big to hash in memory, then Hash Joins become much less efficient. Worse, the process described above where the join performs a single pass of the fact table no longer works; both the fact and large dimension tables must be pre-processed into TEMP space for the hash-join. This effectively doubles the IO on the two largest tables in the join.

The best way to deal with this problem is to treat the large dimension as a Sparse Join.