# Re: Pearson-r in SQL

From: Alan <alan_at_erols.com>
Date: Wed, 22 Dec 2004 15:52:39 -0500
Message-ID: <32u54nF3qdo9bU1_at_individual.net>

> I was playing with doing som stats in SQL, and I need some feedback.
> What should you do with NULLs in a computation of Pearson's r? Let me
> do a "cut & paste" of the definition of r and some SQL code at the
> bottom of this posting.
>
> If (x, y) = (NULL, NULL) then my query will drop them both and the
> answer stays the same. That seems to follow the usual SQL aggregate
> function rules for singleton stats like AVG()about dropping NULLs
> before computing. That was the reason for using COUNT(x) and not
> COUNT(*).
>
> But what is the correct (or reasonable) behavior if (x,y) has one and
> only one NULL in the pair? My initial thoughts are:
>
> 1) Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL
> AND y IS NOT NULL" clause.
>
> 2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or
> something that will smooth out the missing values.
>
> 3) Should the code use COUNT(*) instead of COUNT(x) to show that the
> missing values are affecting the answer?
>
> ================
> The strength of the linear association between two variables is
> quantified by the correlation coefficient. Given a set of observations
> (x1, y1), (x2,y2),...(xn,yn), the formula for computing the correlation
> coefficient is given by this query:
>
> CREATE TABLE Foobar
> (x REAL NOT NULL,
> y REAL NOT NULL);
>
> INSERT INTO Foobar VALUES (1.0, 2.0);
> INSERT INTO Foobar VALUES (2.0, 5.0);
> INSERT INTO Foobar VALUES (3.0, 6.0);
> -- r= 0.9608
>
> SELECT
> ((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
> / (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x)))
> * (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x))))))
> AS pearson_r
> FROM Foobar;
>
> The correlation coefficient always takes a value between -1 and 1, with
> 1 or -1 indicating perfect correlation (all points would lie along a
> straight line in this case). A positive correlation indicates a
> positive association between the variables (increasing values in one
> variable correspond to increasing values in the other variable), while
> a negative correlation indicates a negative association between the
> variables (increasing values is one variable corresponding to
> decreasing values in the other variable). A correlation value close to
> 0 indicates no association between the variables.
>
> Since the formula for calculating the correlation coefficient
> standardizes the variables, changes in scale or units of measurement
> will not affect its value. For this reason, the correlation coefficient
> is often more useful than a graphical depiction in determining the
> strength of the association between two variables.
>

I am not familiar with the Pearson-r function, so, FWIW, I believe the answer lies in the answer to the question, "What does null mean in your data?" If "null" represents a valid state or value, then it has meaning and should be included. To do so, if possible (I don't know if 0 is already a valid value for you), you could use NVL(x,0) in place of x (this would cause any null to be represented by a zero), and likewise for y. If "null" connotes, "I don't know so I can't say for sure", then I would drop the pair, but report on existence of these pairs seperately. BTW, your table does not allow nulls. Received on Wed Dec 22 2004 - 21:52:39 CET

Original text of this message