Real world issue: How to "split" queries in presence of record replication and replication sensitive aggregate functions ?
From: <pamelafluente_at_libero.it>
Date: 14 Sep 2006 01:54:44 -0700
Message-ID: <1158224084.007892.62800_at_d34g2000cwd.googlegroups.com>
I 'd like to repropose my original problem which got lost among vain fights
and pollution. Most importantly, I need the actual "algorithm", that is a
systematic way to produce
these queries
Date: 14 Sep 2006 01:54:44 -0700
Message-ID: <1158224084.007892.62800_at_d34g2000cwd.googlegroups.com>
I 'd like to repropose my original problem which got lost among vain fights
and pollution.
I am infact trying to write as an exercise a "reporting" program. Or
more
precisely a program which is capable to generate queries useful to make
reports.
This can be useful in OLAP systems as well as for classical dbms's.
I'd like to get 2 pieces of information:
- First what is the relevant literature and how this problem is called in the theory
I report here my part of my original posting and some answer.
--- Assume you have a few tables. Assume that there are a few relationships and that for instance if you have 2 table A, B in 1-N relationship, you have on table A defined some function that is, say, "replication sensitive", such as count (not count distinct) or sum. When whe make some join of these tables and compute such functions we obtain an incorrect calculation of the functions due to record replication. Many softwares such as Business Objects are able to devise some union of subquery to avoid that replication problem. Experiment show that such software are usually able to deal with that if functions are on "fact" table (say on the N side of the relationship), but they seem to have big problems when the functions are applied on the dimension table (say on side 1 of relationship). Would you be able to suggest an optimal split/union algorithm for this situation. I do have some "empirical" ideas on how to do that in general, but I would like to confront it with you. --- I make a first example to start discussion. Let take 5 simple tables. Let's avoid talking about name, key issue or pathological design issue. Let's just focus on the abstract reporting problem. CLIENT Name U1 U2 U3 LINE_A ItemA A_1 A_2 A_3 A_4 LINE_B ItemB B_1 B_2 B_3 TRANS_A Name ItemA Amount U1 A_1 10 U1 A_3 20 U1 A_1 20 U1 A_3 10 U1 A_1 20 U3 A_2 10 TRANS_B Name ItemB Amount U1 B_1 20 U1 B_1 10 U1 B_3 10 U1 B_1 20 U1 B_2 10 U3 B_1 20 Assume one wants this simple report (given the above): Name | ItemA | ItemB | TotalAmountItemA | TotalAmountItemB One first question is what is the best way to create this kind of result. And for a general scheme which might incorporate similar problems what is the best strategy to build a query that provides the expected info. Other examples might include functions on "dimension" tables. ----- Alexandr Savinov's reply You can order your tables as follows: LINE_A CLIENT LINE_B \ / \ / TRANS_A TRANS_B If a table references another table then it is positioned below. It is a 6-dimensinoal model (2 numeric dimensions for amounts are not shown). If you formally add bottom table then it will have the following structure: ItemA | Name | Name | ItemB | AmountItemA | AmountItemB You see that two tables TRANS_A and TRANS_B are formally independent because they do not have a common subtable. So in order to get the desired result one can perform two steps: 1. Produce two aggregate queries for TRANS_A and TRANS_B by grouping records and summing up the amount. 2. Combine them into one result set using the same client name as a criterion. (However, I seriously doubt if such a result will be meaningful.) -------- Hi Alexandr , That's fine and correct. But I need the general logic (that's is what I meant for "algorithm") to deal with *any* arrangement of tables and relationship. Again my perspective is a programmer's one. One or more istances of a problem are good for discussions and counterexamples, but I need the general logic to deal with any set of "tables" (can be views or whatever actually) and relationships. The information I assume to be known is: - table structure for each table - relationships - functions applied to each field and their properties(row function/ aggregate function / replication sensitive or not) Then the "algorithms" (a systematic method to perform this task) must be able to make automatically the partioning and union of subqueries, according to some logic, which is the object of my question. PSReceived on Thu Sep 14 2006 - 10:54:44 CEST
> LINE_A CLIENT LINE_B
> \ / \ /
> TRANS_A TRANS_B
1. I am not sure to understand this notation. What do you mean by "reference". You mean is in "relationship with" ? Intuitively I see it as (I am "inventing" a representation) m / 1 LINE_A 1 / m TRANS_A CLIENT 1 \ m TRANS_B m \ 1 LINE_B 1/m means the kind of relationship (one to many). On the observation that this may not be very easy to do in general, I noticed that: All decent reporting programs are able to deal with any database. Since I was thinking to create as an exercise a reporting tool, I was trying to understand how to do that. Programs must be able to deal with anything. A relatively naive reporting program would probably spit out something like: SELECT C.[Name] AS [Name], sum(T1.Amount) AS [Tot_A], sum(T.Amount) AS [Tot_B], T.ItemB AS [ItemB], T1.ItemA AS [ItemA] FROM Trans_A T1, Client C, Trans_B T WHERE C.[Name] = T1.[Name] AND C.[Name] = T.[Name] GROUP BY C.[Name], T.ItemB, T1.ItemA; where instead, if I understand your suggestion, it would be better to give: SELECT C.[Name] AS [Name], sum(T1.Amount) AS [Tot_A], NULL AS [Tot_B], NULL AS [ItemB], T1.ItemA AS [ItemA] FROM Trans_A T1, Client C WHERE C.[Name] = T1.[Name] GROUP BY C.[Name], T1.ItemA; UNION SELECT C.[Name] AS [Name], NULL AS [Tot_A], sum(T.Amount) AS [Tot_B], T.ItemB AS [ItemB], NULL AS [ItemA] FROM Trans_B T, Client C WHERE C.[Name] = T.[Name] GROUP BY C.[Name], T.ItemB; or perhaps the possibility to execute the 2 subqueries separately I need a systematic method to split set of relationships and make the subqueries. -P