Re: Computing truncated mean

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: 7 Feb 2007 09:51:43 -0800
Message-ID: <1170870703.311008.270730@h3g2000cwc.googlegroups.com>

On Feb 7, 12:29 pm, Walt <walt_ask..._at_SHOESyahoo.com> wrote:
> Walt wrote:
> > John K. Hinsdale wrote:
> >>> On Feb 7, 7:50 am, "Charles Hooper" <hooperc2..._at_yahoo.com> wrote:
>
> > Thanks guys. I'm testing your candidate scripts with the following data:
>
> > INSERT INTO T1 VALUES (1);
> > INSERT INTO T1 VALUES (2);
> > INSERT INTO T1 VALUES (3);
> > INSERT INTO T1 VALUES (4);
> > INSERT INTO T1 VALUES (5);
> > INSERT INTO T1 VALUES (10);
> > INSERT INTO T1 VALUES (10);
> > INSERT INTO T1 VALUES (10);
> > INSERT INTO T1 VALUES (10);
> > INSERT INTO T1 VALUES (10);
>
> > A 20% trim should include the middle six values and return a trimmed
> > average of 42/6 == 7.
>
> > So far, the only "formula" I've tried that gives the right answer is the
> > first one John posted (actually a slight modification thereof). The
> > others (see below) appear to have problems. I'll try to understand them
> > further to pinpoint the error.
>
> Ok, here's the deal. Trimming according to cume_dist or percent_ranking
> trims by value, not by record. That is, all the records with n=10 come
> and go as a group - either they're in the sample used to compute the
> average or they're not.
>
> John's first formula trims the top and bottom 20% of records, the others
> trim all records with a value between the 20th percentile and 80th
> percentile. Not the same thing (although it is if the values are
> distinct, as with some of the test data.)
>
> Perhaps this is due to some ambiguity in the way I stated the problem.
> If so, I apologize. A 20% truncated mean is the average over the middle
> 60% of *records*, not over all the records whose *values* fall in the
> the 20th and 80th percentile range.
>
> In my particular situation, I'm crunching some numbers from a survey
> where the only possible responses are 1 through 5. Trimming by value
> will result in all the 1s and 5s disappearing from the average, which is
> not at all what the client is looking for.
>
> //Walt- Hide quoted text -
>
> - Show quoted text -

Your clarification helps. What we need to do is something like this, which is similar to the PERCENT_RANK(), but 1 is added to the COUNT of the records:

SELECT
  N,
  (ROW_NUMBER() OVER (ORDER BY N))/(COUNT(*) OVER (PARTITION BY 1)+1) POSITION_RANKING
FROM
  T1;

N POSITION_RANKING

---------- ----------------
         1       .090909091
         2       .181818182
         3       .272727273
         4       .363636364
         5       .454545455
        10       .545454545
        10       .636363636
        10       .727272727
        10       .818181818
        10       .909090909

Now, sliding into an inline view like before: SELECT
   AVG(N) AVERAGE,
   COUNT(*) NUM_RECORDS,
   MIN(N) MINIMUM,
   MAX(N) MAXIMUM,
   PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY N) MIDPOINT_1,    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY N DESC) MIDPOINT_2 FROM
  (SELECT
    N,
    (ROW_NUMBER() OVER (ORDER BY N))/(COUNT(*) OVER (PARTITION BY 1)+1) POSITION_RANKING
  FROM
    T1)
WHERE
  POSITION_RANKING BETWEEN .20 AND .80;    AVERAGE NUM_RECORDS MINIMUM MAXIMUM MIDPOINT_1 MIDPOINT_2 ---------- ----------- ---------- ---------- ---------- ----------

7 6 3 10 5 10

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc. Received on Wed Feb 07 2007 - 11:51:43 CST