COUNT(DISTINCT) on summarized data

From: Stacy White <harsh_at_computer.org>
Date: Sun, 22 Feb 2004 20:08:21 GMT
Message-ID: <V_7_b.99601$jk2.468011@attbi_s53>

I'm working on a system to report on large amounts of transactional data
(I'm expecting at least a few hundred million transactions in a couple
years).

On other projects I've been getting good performance by aggregating the data at multiple levels (totals by product, by supplier, by retailer, by division, etc), and grabbing data from the smallest table/materialized-view that can satisfy the query.

But on this project, our requirements include reporting on the total number of unique customers for a user-selectable subset of the data. So, for instance, the end-users should be able to run a report that gives them
(among other things) the total number of unique customers in sales region A
that ordered product B over a user-definable date range.

If I were only showing total transactional information, I might build a table or materialized view keyed on product #/region #/day, which included fields for total_revenue/total_order_count, etc. We've been able to build our reports in real time on other systems by summarizing data like this.

But I can't add a 'unique_customers' field, and 'SELECT SUM(unique_customers)' since that would count some customers twice.

I'm having trouble coming up with a cunning solution to the problem, and I'm starting to think that a relational database might not be the right approach.

Any suggestions? Received on Sun Feb 22 2004 - 14:08:21 CST