Oracle FAQ Your Portal to the Oracle Knowledge Grid
 

Home -> Community -> Usenet -> c.d.o.server -> Re: partitioning option not worth it?

Re: partitioning option not worth it?

From: Keith Boulton <kboulton_at_ntlworld.com>
Date: Fri, 1 Mar 2002 08:03:26 -0000
Message-ID: <tCGf8.70406$Ah1.9004839@news2-win.server.ntlworld.com>

Mark Townsend <markbtownsend_at_attbi.com> wrote in message news:B8A4390D.143A4%markbtownsend_at_attbi.com...
>
> Well not to belabour the point, but some additional comments inline
>
> in article a4kf8.61658$Ah1.7832759_at_news2-win.server.ntlworld.com, Keith
> Boulton at kboulton_at_ntlworld.com wrote on 2/27/02 10:24 PM:
>
> > As I said, one day I'll sit down with the partitioning manual however...
> >
> >
> > Mark Townsend <markbtownsend_at_attbi.com> wrote in message
> > news:B8A2EE6E.13F17%markbtownsend_at_attbi.com...
> >> For small amounts of data building your own partitioning scheme IS
> >> trivial. But how much data are you addressing ?
> > The quantity of data is irrelevant. The number of partitions to be
> > supported is an issue.
>
> In YOUR situation the quantity of data is irrelevant. For many PO users,
> it is not. There are many, many sites that will have single tables
> exceeding 1 Tb in size. Partitioning provides true benefits to these
> sites in terms of simply dividing and conquering the management costs
> large databases bring to bear. For instance, with 1 TB of data, how are
> you going to manage IO balancing across the 1000's of 9 Gb drives you
> are going to use ? Are you going to back this all up at once ?
>

The issues are the same with multiple tables as with multiple partitions. Even if the partitioning option supports "backup partition" syntax, you can address the issue by putting each table in its own tablespace. This is a minor administrative overhead. As I keep repeating, the point is not whether PO makes things easier, but whether it is worth the money.
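As a sketch of the tablespace-per-table approach (all object and file names here are hypothetical, not from the thread): each "partition" table gets its own tablespace, which can then be backed up on its own.

```sql
-- One tablespace per manually partitioned table (names are illustrative).
CREATE TABLESPACE sales_1999q1_ts DATAFILE 'sales_1999q1.dbf' SIZE 500M;

CREATE TABLE sales_1999q1 (
  sale_date DATE NOT NULL,
  amount    NUMBER
) TABLESPACE sales_1999q1_ts;

-- Hot backup of just that tablespace:
ALTER TABLESPACE sales_1999q1_ts BEGIN BACKUP;
-- copy sales_1999q1.dbf at the OS level, then:
ALTER TABLESPACE sales_1999q1_ts END BACKUP;
```

Each quarter's data can then be backed up on its own schedule, which is the same effect "backup partition" syntax would give.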

> >> Doing multiple full table scans on what is basically a union all view
> >> is easy when you only have 10 Mb of data
> > As I said, the point actually is to perform searches on only the
> > relevant tables. This is easy to achieve but tedious unless you design
> > your application with an architecture in which the runtime query is
> > generated from a single component. In any case, this is a good idea.
> >
> >> address +500 Gb of data in a single table. And your solution doesn't
> >> support
> >
> >> secondary indices, prune partitions
> > If by pruning partitions you mean excluding tables not required for a
> > query, that is exactly what my solution does.
>
> Yes, but in your situation you have a very simple case - either the entire
> table, or only 1 partition. Say you are partitioning on date - how will
> you do a month-to-month comparison and only access the months you want.
> Your solution also doesn't prune partitions on a join, or do a parallel
> range scan across multiple partitions. With a little data, these are not
> issues. With screeds of data, these are very big issues. Let me give you
> a business case

To modify my solution to scan multiple tables would be trivial, and the same goes for a date range: you look up the tables required from a reference table and build a union all statement that merges them.
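A minimal sketch of that lookup (table and column names are my own illustration, not from the thread): a reference table maps each date range to its backing table, the application queries it for the tables covering the requested range, then concatenates one union all branch per table.

```sql
-- Hypothetical reference table mapping date ranges to backing tables.
CREATE TABLE sales_tables (
  table_name VARCHAR2(30) NOT NULL,
  start_date DATE NOT NULL,
  end_date   DATE NOT NULL
);

-- Application looks up the tables overlapping the requested range:
SELECT table_name
FROM   sales_tables
WHERE  start_date <  TO_DATE('1999-07-01', 'YYYY-MM-DD')
AND    end_date   >= TO_DATE('1999-01-01', 'YYYY-MM-DD');

-- ...and generates one branch per returned table:
SELECT sale_date, amount FROM sales_1999q1
UNION ALL
SELECT sale_date, amount FROM sales_1999q2;
```

Only the tables named by the lookup appear in the generated statement, which is the manual equivalent of partition pruning over a date range.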

As I said with regard to partitioning in joins, that is also a straightforward matter of replacing
'select ... from table1, table2 where ...' with 'select ... from table1a, table2 where ... union all select ... from table1b, table2 where ...'

In most cases, I suspect you will be partitioning only, e.g., the fact table in a data mart. Where multiple tables are addressed, often the partition key will be the same. In a data warehouse, you should avoid joins to the partitioned table in any case.

Parallel query is a different issue from partitioning. I've never checked whether parallel query will split up a union all - it should, but without testing, who knows. I think it's even part of the standard edition now.
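One way to run that test would be with per-branch parallel hints - the hint syntax is standard Oracle, though the table names are made up and whether PQ actually parallelizes each branch of the union all is exactly the open question:

```sql
-- Request degree-4 parallelism on each branch of the union all
-- (whether the branches run in parallel would need to be verified,
-- e.g. by checking the session's PQ statistics after the run).
SELECT /*+ PARALLEL(s1, 4) */ COUNT(*) FROM sales_1999q1 s1
UNION ALL
SELECT /*+ PARALLEL(s2, 4) */ COUNT(*) FROM sales_1999q2 s2;
```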

> A telco I worked with suddenly got new competition from an overseas telco
> company. The new entrant into the market did not own any local networks
> (they actually rented these from the incumbent telco), but did have a very
> good international network in place. So part of their marketing campaign
> was a 5 cent weekend. Every weekend, they would choose a new overseas
> destination, and let people call these locations for 5 cents a minute.
>
> The incumbent telco had to respond - but they needed to work out what
> matching this offer would do to their existing business (i.e.
> cannibalization) by looking at the call volumes over the last 3, 6 and 12
> months and determining the probable level of cannibalization. Their
> existing Billing DW was not on Oracle, and was designed to do 2 things -
> find a customer's monthly spend, or a customer's spend over the entire 12
> months (i.e. the 1 or all partition model you have built). The new query
> took 48 hours to run. The telco could not match the offer because by the
> time they had worked out the cost, the weekend had already passed.
>
> We took the data and put it up on Oracle8, with range partitioning. Kicked
> the query off at lunchtime one Friday, and came back in 40 minutes. We had
> a result (we then spent the rest of the weekend verifying that the query
> was indeed correct, but that's another story).
>
> So my point - if your business model changes, your design will need to
> change as well. With partitioning, you will need to change your design
> less often.
>

Sometimes it may be worth spending the money. However, the fact that you solved the problem using partitioning does not show that partitioning was required to solve it.

> You also conveniently left out secondary indexes.

I didn't *conveniently* leave it out - I didn't know what it was, nor can I find a reference to secondary indexes in the Data Warehousing Guide. The other Oracle books I looked at mentioned secondary indexes only with regard to IOTs. If you mean additional indexes on the table, then I don't see a great disadvantage to having that index on each table/partition.
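Under manual partitioning, an additional (non-partition-key) index is simply created once per table. A sketch, with hypothetical table and column names:

```sql
-- One index per manually partitioned table:
CREATE INDEX sales_1999q1_cust_ix ON sales_1999q1 (customer_id);
CREATE INDEX sales_1999q2_cust_ix ON sales_1999q2 (customer_id);

-- The built-in equivalent on a genuinely partitioned table would be a
-- LOCAL index, one segment per partition:
--   CREATE INDEX sales_cust_ix ON sales (customer_id) LOCAL;
```

The per-table indexes behave like the local index's per-partition segments; what the manual scheme cannot reproduce is a single GLOBAL index spanning all the tables.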

>
> >> determine the best optimizer plan from both table and partition level
> >> statistics
> > Under what circumstances would this be an issue? Almost certainly, if
> > the partition key was given as part of the query, a partition level
> > search is most efficient.
>
> Not correct - it depends on how many partitions are involved, and the
> relevant data skew between these partitions. In some circumstances, a
> parallel range scan could be much more efficient than a partition index
> probe and lookup, especially in a DW, where most operations are performed
> against a set of data, instead of a single row. Stats at the table level
> will allow the optimizer to determine this.

The term "partition level search" includes both scans and index searches.

>
> Alternatively, partition level stats allow the optimizer to determine the
> best access plan on a partition by partition basis, once again when
> multiple partitions are involved.

And how often is its decision different from what I suggested? And how often does it matter? We're getting into the (annoying for me) area of theoretical benefits.
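Whether partition-level statistics actually change the chosen plan is at least testable: gather both global and partition stats and compare the plans. A sketch using the DBMS_STATS package (the owner and table names are hypothetical):

```sql
-- Gather both global (table-level) and partition-level statistics,
-- then compare execution plans with and without the partition stats.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'DW',
    tabname     => 'SALES',
    granularity => 'ALL');  -- table-level plus per-partition stats
END;
/
```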

>
> >> perform partition-wise joins or
> > I agree this is a more interesting problem, but, as I said, it can be
> > addressed by using templated sql statements. In any case, for a
> > reporting database, where partitioning is most likely to be used, you
> > are, I believe, generally better off not joining the partitioned table
> > to anything else, but performing joins only after aggregation for
> > reporting purposes has taken place.
>
> Hmm - I'd have to disagree on that one - in fact a typical star schema
> will preclude this type of operation altogether.

The fact that things are generally done one way doesn't mean it's a good way to do them.
However, I will backtrack on this one: it depends on the type of query being run.
It is not necessary to join just to expand dimension codes/ids into labels or additional descriptive information - it is usually better to cache that information in the client. If you're already doing that, then you don't need to join to the dimension tables to look up individual dimension codes. If you had a very complicated query, you might well be better off joining to other tables.

>
> >> prevent you from updating rows so that they no longer belong to the
> >> correct
> > Check constraint
> >
> >> (or move the rows if you do).
> > Agreed. But then, I believe in most cases you have a bulk data load
> > with more appends than updates. After all, what are you going to
> > partition on? Normally it will be a date or business unit.
>
> My point is what if somebody makes a mistake, and a row ends up in the
> wrong 'table'. Queries from your application would never see this errant
> row, yet

I'll say "check constraint" again.
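Concretely, a check constraint ties each table to its range of the partition key, so a misplaced row is rejected at insert or update time rather than silently lost. A sketch with illustrative names and dates:

```sql
-- Each "partition" table carries a check constraint on the partition key.
CREATE TABLE sales_1999q1 (
  sale_date DATE NOT NULL,
  amount    NUMBER,
  CONSTRAINT sales_1999q1_ck CHECK (
    sale_date >= TO_DATE('1999-01-01', 'YYYY-MM-DD') AND
    sale_date <  TO_DATE('1999-04-01', 'YYYY-MM-DD'))
);

-- An insert or update that would misplace a row now fails with
-- ORA-02290 (check constraint violated) instead of vanishing from queries.
```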

> you would quite happily get results (albeit wrong). And unless you went
> looking for misplaced rows, you would never know you had a problem.
> Constraints don't address the problem either, because in many large VLDBs
> the constraints are not actually enforced on anything other than the load
> cycle as they are too expensive.

When that has been tested and proven on a constraint along the lines of "column = '1'", then I'll care. In most cases I don't believe that a single-column constraint like that is going to make a big difference. If it mattered 18 months ago, it doesn't matter now - my CPU has just got twice as fast.

Again, the fact that something is generally done is irrelevant to any particular case. And if you think that things are generally done a given way because that is the best way to do them, then I assume you also believe that C was a very good programming language, and that VHS was better than Betamax.

> When we used to do manual partitioning
> before the 7.3 days, this was a huge problem - eventually it would happen

Sometimes your actions have consequences. This is an issue of code/testing quality and (in my opinion) an unreasonable decision to disable constraints.

By the time you get to terabyte data warehouses, of course, the money wasted on high-end Unix hardware and overpaid DBAs almost certainly swamps the cost of the partitioning option. You should always address the big costs first, and that includes perpetual costs like ongoing support contracts.

But then I am a cheapskate/skinflint.

Received on Fri Mar 01 2002 - 02:03:26 CST
