REPOST: Re: Interesting warehouse design challenge (long)
As I said, I'm no expert, but I had a similar requirement recently and the
two things that made the biggest difference were:
This was an interactive reporting tool with drill-down through categories, and the above changes took the per-query (= per-drill-down) time from 20-30 seconds - which was not acceptable - to about 0.3 seconds, which is reasonable.
markoos <no_at_spam.com> wrote in message
news:6_p48.51619$d16.8708171_at_typhoon.nyc.rr.com...
> We have a 9i system functioning as a data warehouse on Win2k. We're stuck
> with a bad old design of a previous warehouse and need to optimize things.
>
> We have a table with 700,000 records and 10 columns that tracks our effort
> data (person works n hours on project 'abc' performing function 'def').
> Lets call it tbl_effort.
>
> We have ~10 manhour reports that run against this table from a CF web
> application. Each report can be run by any one of multiple user
> departments. The issue is that each department has a completely different
> set of criteria in which to query this table. Some departments filter by
> one column, others another column, others combinations of columns.
> Sometimes there are OR clauses as the rules change between date ranges.
> There is no direct rhyme or reason to these filters, they are just the way
> the business rules work. It's making it very difficult to index.
>
> Unfortunately we're applying these filters at runtime. The web application
> pulls the 'where clause' from tbl_where_clause, and dynamically constructs
> the query (not my idea, trust me). Each department has one and only one
> where clause. Some effort records could be used by two departments, this is
> many-to-many.
>
> The reports run between 10 seconds-2 minutes, often kicking off full table
> scans. I would like to improve the speed of this system by pre-filtering
> the data at night, creating any other tables, views, indexes, or
> materialized views as necessary to optimize the runtime speed.
>
> One of the requirements is that departments and where clauses can be added
> or edited mid-day, and the system needs to reflect the new filter
> immediately.
>
>
> Option 1:
>
> My thought is that we add a new table that joins to tbl_effort on effort_ID,
> call it tbl_effort_join.
>
> CREATE TABLE tbl_effort_join (
> effort_ID INT NOT NULL,
> dept_ID INT NOT NULL
> );
> --and the appropriate indexes.
>
> At night we loop through a cursor of the departments, applying each where
> clause and inserting one record in the join table for each record returned
> from the effort table. This could be implemented as a procedure that takes in
> one department ID, dynamically constructs the query against tbl_effort, and
> runs an INSERT INTO tbl_effort_join SELECT (dynamic query). I haven't done
> this yet but I don't know why it wouldn't work.
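
A minimal sketch of the per-department refresh procedure the poster describes. All names beyond tbl_effort/tbl_effort_join are assumptions: in particular, tbl_where_clause is guessed to hold the raw predicate text in a column called where_text.

```sql
-- Hypothetical sketch: rebuild one department's rows in tbl_effort_join.
-- Assumes tbl_where_clause(dept_ID, where_text) stores the raw predicate.
CREATE OR REPLACE PROCEDURE refresh_dept_join (p_dept_id IN NUMBER) AS
  v_where  VARCHAR2(4000);
BEGIN
  SELECT where_text INTO v_where
    FROM tbl_where_clause
   WHERE dept_ID = p_dept_id;

  DELETE FROM tbl_effort_join WHERE dept_ID = p_dept_id;

  -- Dynamic SQL: the stored predicate is appended verbatim, so the
  -- where-clause table must be trusted (otherwise this is injectable).
  EXECUTE IMMEDIATE
    'INSERT INTO tbl_effort_join (effort_ID, dept_ID) ' ||
    'SELECT effort_ID, :dept FROM tbl_effort WHERE ' || v_where
    USING p_dept_id;

  COMMIT;
END refresh_dept_join;
/
```

Calling this once per department from the nightly cursor loop, and once on demand when a clause is edited, covers both the batch rebuild and the mid-day-change requirement.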
>
> We then could tweak the end-user queries to use this table joined (or create
> this as a view). All queries simply filter on dept=dept_ID instead of the
> long-winded where clause.
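
The "create this as a view" idea could look like the following (view name assumed), so report SQL only ever filters on dept_ID:

```sql
-- Hypothetical view: effort rows pre-tagged with the departments they match.
CREATE OR REPLACE VIEW v_effort_by_dept AS
SELECT e.*, j.dept_ID
  FROM tbl_effort e
  JOIN tbl_effort_join j ON j.effort_ID = e.effort_ID;
```

Reports would then run SELECT ... FROM v_effort_by_dept WHERE dept_ID = :dept, and the long-winded predicates live only in the nightly rebuild.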
>
> If a where clause changes during the day, we call the procedure once for
> that department. That procedure would delete & re-insert rows (or do
> something more efficient, if possible).
>
>
>
> Option 2:
>
> My second design candidate is to have only one table. That table is created
> with the same schema as tbl_effort, but with dept_ID added. Any of the
> original effort records that crossed depts would now have duplicate records
> with different dept_IDs. Storage space is not an issue.
>
> We construct the table in the same way as option 1, with a procedure and a
> loop through the dept cursor. Also, if a where clause is added or updated, we
> run the procedure which deletes & reinserts the records.
>
> I'm thinking that this is a stronger design, as having one table will be
> easier to index and create materialized views against. Also I'm thinking
> the queries may potentially run faster without having to join.
>
> I would consider index organizing this table if that sped up the queries.
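
For option 2, an index-organized table keyed on (dept_ID, effort_ID) would cluster each department's rows together, which suits the dept_ID = :dept access path. A sketch (table name assumed, column list abbreviated since the full tbl_effort schema isn't shown):

```sql
-- Hypothetical IOT version of the denormalized option-2 table.
CREATE TABLE tbl_effort_dept (
  dept_ID    INT NOT NULL,
  effort_ID  INT NOT NULL,
  hours      NUMBER,          -- ...remaining tbl_effort columns...
  CONSTRAINT pk_effort_dept PRIMARY KEY (dept_ID, effort_ID)
) ORGANIZATION INDEX;
```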
>
>
>
> So the question is, which design is better? Should I opt for full
> normalization or go with one repetitive table? Is there a third design
> candidate I'm missing? Am I missing anything else?
>
>
> Thanks for any suggestions!
Received on Sun Jan 27 2002 - 05:09:10 CST