Real world issue: How to "split" queries in presence of record replication and replication sensitive aggregate functions ?

From: <pamelafluente_at_libero.it>
Date: 14 Sep 2006 01:54:44 -0700
Message-ID: <1158224084.007892.62800_at_d34g2000cwd.googlegroups.com>

I 'd like to repropose my original problem which got lost among vain fights
and pollution.

I am infact trying to write as an exercise a "reporting" program. Or more
precisely a program which is capable to generate queries useful to make reports.

This can be useful in OLAP systems as well as for classical dbms's.

I'd like to get 2 pieces of information:

First what is the relevant literature and how this problem is called in the theory
Most importantly, I need the actual "algorithm", that is a systematic way to produce these queries

I report here my part of my original posting and some answer.

---
Assume you have a few tables. Assume that there are a
few relationships and that for instance if you have 2 table A, B  in
1-N relationship, you have on table A defined some function that is,
say, "replication
sensitive", such as count (not count distinct) or sum. When whe make
some join of these tables and compute such functions we obtain an
incorrect calculation of the functions due to record replication.
Many softwares such as Business Objects are able to devise some union
of subquery to avoid that replication problem. Experiment show that
such software are usually able to deal with that if functions are on
"fact" table (say on the N side of the relationship), but they seem to
have big problems when the functions are applied on the dimension table

(say on side 1 of relationship).


Would you be able to suggest an optimal split/union algorithm for this
situation. I do have some "empirical" ideas on how to do that in
general,
but I would like to confront it with you.

---

I make a first example to start discussion. Let take 5 simple tables.
Let's
avoid talking about name, key issue or pathological design
issue. Let's just focus on the abstract reporting problem.


CLIENT
Name
U1
U2
U3


LINE_A
ItemA
A_1
A_2
A_3
A_4


LINE_B
ItemB
B_1
B_2
B_3


TRANS_A
Name    ItemA   Amount
U1      A_1     10
U1      A_3     20
U1      A_1     20
U1      A_3     10
U1      A_1     20
U3      A_2     10


TRANS_B
Name    ItemB   Amount
U1      B_1     20
U1      B_1     10
U1      B_3     10
U1      B_1     20
U1      B_2     10
U3      B_1     20


Assume one wants this simple report (given the above):


Name | ItemA | ItemB | TotalAmountItemA | TotalAmountItemB


One first question is what is the best way to create this kind of
result. And
for a general scheme which might incorporate similar problems what is
the best strategy to build a query that provides the expected info.
Other examples might include functions on "dimension" tables.

-----


Alexandr Savinov's reply

You can order your tables as follows:

LINE_A   CLIENT    LINE_B
      \  /      \   /
     TRANS_A    TRANS_B


If a table references another table then it is positioned below. It is
a
6-dimensinoal model (2 numeric dimensions for amounts are not shown).
If
you formally add bottom table then it will have the following
structure:


ItemA | Name | Name | ItemB | AmountItemA | AmountItemB


You see that two tables TRANS_A and TRANS_B are formally independent
because they do not have a common subtable. So in order to get the
desired result one can perform two steps:


1. Produce two aggregate queries for TRANS_A and TRANS_B by grouping
records and summing up the amount.


2. Combine them into one result set using the same client name as a
criterion. (However, I seriously doubt if such a result will be
meaningful.)

--------


Hi Alexandr ,

That's fine and correct. But I need the general logic (that's is what I

meant for "algorithm") to deal with *any* arrangement of tables and
relationship. Again my perspective is a programmer's one. One or more
istances of a problem are good for discussions and counterexamples, but

I need the general logic to deal with any set of "tables" (can be views

or whatever actually) and relationships.


The information I assume to be known is:


- table structure for each table
- relationships
- functions applied to each field and their properties(row function/
aggregate function / replication sensitive or not)


Then the "algorithms" (a systematic method to perform this task) must
be able to make automatically the partioning and union of subqueries,
according to some logic, which is the object of my question.


PS





> LINE_A   CLIENT    LINE_B


>       \  /      \   /

>      TRANS_A    TRANS_B



1. I am not sure to understand this notation. What do you mean by
"reference". You mean is in "relationship with" ?

Intuitively I see it as (I am "inventing" a representation)


                                           m / 1    LINE_A
                1 / m    TRANS_A
  CLIENT
                1 \ m    TRANS_B    m \ 1
                                                        LINE_B


1/m  means the kind of relationship (one to many).


On the observation that this may not be very easy to do in general,
I noticed that:

 All decent reporting programs are able to deal with any database.
Since I was thinking to create as an exercise a reporting tool, I was
trying
 to understand how to do that. Programs must be able to deal with
 anything.


 A relatively naive reporting program would probably spit out something

 like:


SELECT
       C.[Name] AS [Name],
       sum(T1.Amount) AS [Tot_A],
       sum(T.Amount) AS [Tot_B],
       T.ItemB AS [ItemB],
       T1.ItemA AS [ItemA]
FROM
       Trans_A T1,
       Client C,
       Trans_B T
WHERE
       C.[Name] = T1.[Name] AND
       C.[Name] = T.[Name]
GROUP BY
       C.[Name],
       T.ItemB,
       T1.ItemA;


 where instead, if I understand your suggestion, it would be better to
 give:


SELECT
       C.[Name] AS [Name],
       sum(T1.Amount) AS [Tot_A],
       NULL AS [Tot_B],
       NULL AS [ItemB],
       T1.ItemA AS [ItemA]
FROM
       Trans_A T1,
       Client C
WHERE
       C.[Name] = T1.[Name]
GROUP BY
       C.[Name],
       T1.ItemA;
UNION


SELECT
       C.[Name] AS [Name],
       NULL AS [Tot_A],
       sum(T.Amount) AS [Tot_B],
       T.ItemB AS [ItemB],
       NULL AS [ItemA]
FROM
       Trans_B T,
       Client C
WHERE
       C.[Name] = T.[Name]
GROUP BY
       C.[Name],
       T.ItemB;


 or perhaps the possibility to execute the 2 subqueries separately


 I need a systematic method to split set of relationships and make the
subqueries.

-P

Received on Thu Sep 14 2006 - 10:54:44 CEST

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message