Creating Large Cluster Table More Efficiently (Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit)
Creating Large Cluster Table More Efficiently [message #438218] Thu, 07 January 2010 09:57
annagel
Messages: 220
Registered: April 2006
Senior Member
I am experimenting with using clusters for table storage on a new project I am working on. From everything I have read they seem like a good idea. My project has:


*A large data-set (about 6 million rows in one table and 10 million in a second) based on a monthly snapshot; the data is never updated once it is created each month
*The data-set consists of two tables that are always accessed together
*In almost all cases I will be accessing multiple rows from each of the tables


Based on everything I have read this seems like a good setup for using clusters, so I decided to give it a try and create a test version of the tables I would be using, so I could do some comparisons between clusters and straight index access methods.

A skeleton of the code I used to create the cluster follows:

CREATE CLUSTER cluster1 (col1 NUMBER(8));

CREATE INDEX cluster1_idx ON CLUSTER cluster1;

CREATE TABLE cluster_table1
CLUSTER cluster1 (col1) AS
SELECT col1, col2, col3, col4, col5, col6
FROM   source1;  -- source1 has about 6 million rows

CREATE TABLE cluster_table2
CLUSTER cluster1 (col1) AS
SELECT col1, col2, col3, col4, col5, col6
FROM   source2;  -- source2 has about 10 million rows


EDIT: After posting it occurred to me that this might matter: though I have simplified the load procedure a bit here, source1 and source2 are actually pulled from two tables with a set of access conditions. The 6 and 10 million rows are what I expect to get out of those joins; the tables themselves are about 33% larger than that.

Table 1 has been loading for a couple of hours and, looking at v$session_longops, it looks like it has another hour and a half to go. I can see why loading a cluster table would take significantly longer than loading a normal table, but having no experience with them previously I have no idea if this kind of load time is normal or if I am doing something wrong. I didn't set any options on the cluster, the cluster index, or the table; could looking into these increase my load speed? Or is this just the load time I am going to have to live with once a month to get the benefit of less IO for the rest of the month?

[Updated on: Thu, 07 January 2010 10:02]


Re: Creating Large Cluster Table More Efficiently [message #438222 is a reply to message #438218] Thu, 07 January 2010 10:12
BlackSwan
Messages: 24917
Registered: January 2009
Senior Member
SQL> desc V_$SESSION_LONGOPS
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 SID						    NUMBER
 SERIAL#					    NUMBER
 OPNAME 					    VARCHAR2(64)
 TARGET 					    VARCHAR2(64)
 TARGET_DESC					    VARCHAR2(32)
 SOFAR						    NUMBER
 TOTALWORK					    NUMBER
 UNITS						    VARCHAR2(32)
 START_TIME					    DATE
 LAST_UPDATE_TIME				    DATE
 TIMESTAMP					    DATE
 TIME_REMAINING 				    NUMBER
 ELAPSED_SECONDS				    NUMBER
 CONTEXT					    NUMBER
 MESSAGE					    VARCHAR2(512)
 USERNAME					    VARCHAR2(30)
 SQL_ADDRESS					    RAW(4)
 SQL_HASH_VALUE 				    NUMBER
 SQL_ID 					    VARCHAR2(13)
 QCSID						    NUMBER


Querying the view above might provide clues. EXPLAIN PLAN might provide clues as well.
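
For example, a quick progress check might look like this (a sketch; add SID/SERIAL# predicates to isolate the loading session):

SELECT sid, opname, sofar, totalwork,
       ROUND(sofar / totalwork * 100, 1) AS pct_done,
       time_remaining, elapsed_seconds
  FROM v$session_longops
 WHERE totalwork > 0
   AND sofar < totalwork;  -- only operations still in flight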

Next time, wrap the load in a SQL trace:
ALTER SESSION SET SQL_TRACE=TRUE;

CREATE TABLE cluster_table1
CLUSTER cluster1 (col1) AS
SELECT col1, col2, col3, col4, col5, col6
FROM   source1;  -- source1 has about 6 million rows

CREATE TABLE cluster_table2
CLUSTER cluster1 (col1) AS
SELECT col1, col2, col3, col4, col5, col6
FROM   source2;  -- source2 has about 10 million rows

ALTER SESSION SET SQL_TRACE=FALSE;

tkprof will show where the time is being spent.
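
A sketch of how that might look; tracefile_identifier just makes the trace file easier to find in user_dump_dest, and the file and report names below are illustrative:

-- before the load:
ALTER SESSION SET tracefile_identifier = 'cluster_load';

-- after the load, from the OS prompt (actual trace file name will vary):
--   tkprof orcl_ora_12345_cluster_load.trc cluster_load.prf sys=no sort=exeela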
Re: Creating Large Cluster Table More Efficiently [message #438223 is a reply to message #438222] Thu, 07 January 2010 10:18
annagel
Messages: 220
Registered: April 2006
Senior Member
Thanks for the input. The explain plan is pretty simple: a full table scan on both of the source tables and a hash join of the results for the insert. I will try running a trace on insert #2 when this one finishes up and we will see what falls out.
Re: Creating Large Cluster Table More Efficiently [message #438268 is a reply to message #438218] Thu, 07 January 2010 14:01
John Watson
Messages: 6403
Registered: January 2010
Location: Global Village
Senior Member
I used this to set up a test:
create table source1(col1 number(8),col2 timestamp,
col3 varchar2(20),col4 varchar2(20),col5 varchar2(20),
col6 varchar2(20));
create table source2(col1 number(8),col2 timestamp,
col3 varchar2(20),col4 varchar2(20),col5 varchar2(20),
col6 varchar2(20));

begin
  for i in 1..6000000 loop
    insert into source1 values(i,systimestamp,'source1','source1','source1','source1');
    -- source2 is loaded twice per key, roughly doubling its volume:
    insert into source2 values(i,systimestamp,'source2','source2','source2','source2');
    insert into source2 values(i,systimestamp,'source2','source2','source2','source2');
  end loop;
  commit;
end;
/

Then I used your code to create the objects and insert the rows. And yes, it was going to take about two hours for the first table. The wait events were buffer busy wait and db file sequential read. So I tried it with a hash cluster:
CREATE CLUSTER cluster1 (col1 NUMBER(8)) HASHKEYS 100000;

(I specified a million hashkeys first, but that failed with an ORA-600) and this was much worse. According to v$session_longops, it was about 15 times worse. The wait event was almost always db file sequential read.
If you take your tests further, I would very much like to know your results. I'm using hash clusters on a project at the moment, where they have solved buffer contention problems for concurrent inserts perfectly. This problem you've found is rather worrying! BTW, my tests were on 11.1.0.7.
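
For reference, a sized hash cluster declaration would look something like this (both numbers are illustrative and need to be derived from your own data):

CREATE CLUSTER cluster1 (col1 NUMBER(8))
  SIZE     1024       -- expected bytes of data per cluster key
  HASHKEYS 100000;    -- Oracle rounds this up to the next prime

-- a hash cluster needs (and allows) no cluster index; the tables
-- are created against the cluster exactly as in the original post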
Re: Creating Large Cluster Table More Efficiently [message #438298 is a reply to message #438218] Thu, 07 January 2010 23:02
ayush_anand
Messages: 417
Registered: November 2008
Senior Member
As the data volume is large, I would advise using the SIZE parameter to set the size of the cluster.
The SIZE parameter controls the maximum number of cluster keys per block. It is the single largest influence on the space utilization of your cluster. Set the size too high
and you'll get very few keys per block and you'll use more space than you need. Set the size too low and you'll get excessive chaining of data, which offsets the purpose of the cluster to store all of the data together on a single block. It is the most important parameter for a cluster.
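
For the index cluster in the first post, that just means adding SIZE to the CREATE CLUSTER (the 1024 here is only illustrative; work it out from your average bytes per key):

-- SIZE reserves space per cluster key; with an 8K block this works
-- out to about 7 keys per block once block overhead is deducted:
CREATE CLUSTER cluster1 (col1 NUMBER(8)) SIZE 1024;

CREATE INDEX cluster1_idx ON CLUSTER cluster1;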
I found a clue to your problem in Expert One-on-One:

begin
    for x in ( select * from scott.dept )
    loop
        insert into dept
        values ( x.deptno, x.dname, x.loc );
        insert into emp
        select *
        from scott.emp
        where deptno = x.deptno;
    end loop;
end;
/

PL/SQL procedure successfully completed.


You might be asking yourself 'Why didn't we just insert all of the DEPT data and then all of the EMP data, or vice-versa? Why did we load the data DEPTNO by DEPTNO like that?' The reason is in the design of the cluster. I was simulating a large, initial bulk load of a cluster. If I had loaded all of the DEPT rows first, we definitely would have gotten our 7 keys per block (based on the SIZE 1024 setting we made) since the DEPT rows are very small, just a couple of bytes. When it came time to load up the EMP rows, we might have found that some of the departments had many more than 1,024 bytes of data. This would cause excessive chaining on those cluster key blocks. By loading all of the data for a given cluster key at the same time, we pack the blocks as tightly as possible and start a new block when we run out of room. Instead of Oracle putting up to seven cluster key values per block, it will put as many as can fit.

Follow these guidelines when deciding whether to cluster tables:

*Cluster tables that are accessed frequently by the application in join statements.
*Do not cluster tables if the application joins them only occasionally or modifies their common column values frequently. Modifying a row's cluster key value takes longer than modifying the value in an unclustered table, because Oracle might need to migrate the modified row to another block to maintain the cluster.
*Do not cluster tables if the application often performs full table scans of only one of the tables. A full table scan of a clustered table can take longer than a full table scan of an unclustered table. Oracle is likely to read more blocks, because the tables are stored together.
*Cluster master-detail tables if you often select a master record and then the corresponding detail records. Detail records are stored in the same data block(s) as the master record, so they are likely still to be in memory when you select them, requiring Oracle to perform less I/O.
*Store a detail table alone in a cluster if you often select many detail records of the same master. This measure improves the performance of queries that select detail records of the same master, but does not decrease the performance of a full table scan on the master table. An alternative is to use an index-organized table.
*Do not cluster tables if the data from all tables with the same cluster key value exceeds more than one or two Oracle blocks. To access a row in a clustered table, Oracle reads all blocks containing rows with that value. If these rows take up multiple blocks, then accessing a single row could require more reads than accessing the same row in an unclustered table.
*Do not cluster tables when the number of rows for each cluster key value varies significantly. This wastes space for the low-cardinality key values and causes collisions for the high-cardinality key values. Collisions degrade performance.
*Consider the benefits and drawbacks of clusters with respect to the needs of the application. For example, you might decide that the performance gain for join statements outweighs the performance loss for statements that modify cluster key values. You might want to experiment and compare processing times with the tables both clustered and stored separately.

[Updated on: Fri, 08 January 2010 00:44]


Re: Creating Large Cluster Table More Efficiently [message #438491 is a reply to message #438298] Sat, 09 January 2010 17:15
annagel
Messages: 220
Registered: April 2006
Senior Member
Update on my "problem": I was indeed doing it wrong, and the key to the answer is in Ayush's post, which for some reason I never got an email notification about... anyway, I read pretty much the exact same passage this morning in Oracle Database Architecture (which unfortunately sat on my bookshelf at home, not at work) and decided to give it a try.

I set my SIZE value based on the average data to be inserted, inserted the data in a loop by key value rather than in bulk, and dropped the need for an external join by pre-compiling my data for insert. I also added a second set of loads to throw data into a hash cluster for purposes of comparison (and I may give a sorted hash cluster a try as well, but I have not decided yet).

With these changes my load time for all 4 tables (master and detail for both the index and hash clusters) was about the same as my load time on just the first detail table in my initial try (about 3:30 - that's hours:minutes), and my space usage dropped by 75%.
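
In skeleton form the new load looks something like this (staged_master and staged_detail are placeholder names for the pre-compiled sources mentioned above, both indexed on col1):

BEGIN
  FOR k IN (SELECT DISTINCT col1 FROM staged_master ORDER BY col1)
  LOOP
    -- all rows for one cluster key go in together, master then
    -- detail, so each key's data is packed into the same block(s):
    INSERT INTO cluster_table1
      SELECT col1, col2, col3, col4, col5, col6
        FROM staged_master
       WHERE col1 = k.col1;

    INSERT INTO cluster_table2
      SELECT col1, col2, col3, col4, col5, col6
        FROM staged_detail
       WHERE col1 = k.col1;
  END LOOP;
  COMMIT;
END;
/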
Re: Creating Large Cluster Table More Efficiently [message #438502 is a reply to message #438491] Sat, 09 January 2010 20:37
rleishman
Messages: 3724
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Before you commit to clusters, do some benchmarking first against some other architectures. You might find that clusters are not the best option.

With all this talk of loading and static data, it sounds like you are developing a Data Warehouse application or similar. If the cluster key between the two tables is also a foreign key, you could consider denormalising the parent table columns into the child table. This could be done during the ETL, or even with a fast-refresh Materialized View after the data is loaded.
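
A sketch of the Materialized View variant, with hypothetical table names; a fast-refresh join MV needs MV logs WITH ROWID and the ROWID of every base table in its select list:

CREATE MATERIALIZED VIEW LOG ON master_tab WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON detail_tab WITH ROWID;

CREATE MATERIALIZED VIEW detail_denorm
  REFRESH FAST ON DEMAND
AS
SELECT d.ROWID AS d_rid,    -- base-table rowids make it fast-refreshable
       m.ROWID AS m_rid,
       d.col1, d.col2, d.col3,
       m.master_name        -- the denormalised parent column
  FROM detail_tab d, master_tab m
 WHERE d.col1 = m.col1;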

Another option for very large tables that are commonly joined in large joins is to equi-partition the tables. I did some benchmarking a couple of years ago and found partitioned hash joins significantly faster than cluster joins over large volumes.
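
Equi-partitioning is just hash-partitioning both tables the same way on the join key, which lets the optimizer run a full partition-wise join. A sketch, with illustrative names and partition count:

CREATE TABLE master_tab (
  col1 NUMBER(8),
  col2 VARCHAR2(20)
)
PARTITION BY HASH (col1) PARTITIONS 16;

CREATE TABLE detail_tab (
  col1 NUMBER(8),
  col3 VARCHAR2(20)
)
PARTITION BY HASH (col1) PARTITIONS 16;  -- same key, same partition count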

What I took out of this investigation is that clusters are better suited to OLTP applications that have hundreds of small-volume joins over different subsets of the data, e.g. accessing an invoice and its invoice lines to display on a user's screen. This could also be beneficial for a reporting application if the usage was characterised by frequent small-volume joins.

The one occasion where I found clusters well-suited to a Data Mart was a Basket Analysis application: e.g. for all sales of beer, what is the most popular snack included in the same sale? By clustering rows on the invoice/sale id, we could ensure that after finding the beer sales, all related rows of interest would be found in the blocks already accessed. This type of problem could not be solved by partitioning or Materialised Views.

Back to your problem, it makes sense that it would be more efficient to load data in order of the cluster key. That way you could fill a block quickly and then age it out of the Buffer Cache.
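
In SQL terms that is just an ORDER BY on the bulk load, something like the following (a sketch; note that direct-path/APPEND inserts are not supported for clustered tables, so this stays a conventional insert):

INSERT INTO cluster_table1
SELECT col1, col2, col3, col4, col5, col6
  FROM source1
 ORDER BY col1;  -- rows arrive in cluster-key order, filling each block once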

Ross Leishman

[Updated on: Sat, 09 January 2010 20:40]


Re: Creating Large Cluster Table More Efficiently [message #438536 is a reply to message #438502] Sun, 10 January 2010 13:01
annagel
Messages: 220
Registered: April 2006
Senior Member
Ross,
In the majority of cases the number of records retrieved from this system is going to be very small: on average, for a group of records (of which there are about 450,000), there are 4 records in table 1 and, for each of these 4 records, 3 in table 2. We are going to be displaying this data in two ways:

1. The first x records from table 1 and all the associated records from table 2 will be displayed in a predefined order, with more records available upon request (a pageable grid, basically).

2. A data dump of all records in both tables (an XML or Excel dump).

I am still early in the planning stages on this, and the reason I am creating so many versions of the same tables is for comparative purposes, to find the best solution for our problem.