# Re: Pearson-r in SQL

Date: Wed, 22 Dec 2004 15:52:39 -0500

Message-ID: <32u54nF3qdo9bU1_at_individual.net>

"-CELKO-" <jcelko212_at_earthlink.net> wrote in message
news:1103746401.990096.37280_at_f14g2000cwb.googlegroups.com...

> I was playing with doing som stats in SQL, and I need some feedback.

*> What should you do with NULLs in a computation of Pearson's r? Let me
**> do a "cut & paste" of the definition of r and some SQL code at the
**> bottom of this posting.
**>
**> If (x, y) = (NULL, NULL) then my query will drop them both and the
**> answer stays the same. That seems to follow the usual SQL aggregate
**> function rules for singleton stats like AVG()about dropping NULLs
**> before computing. That was the reason for using COUNT(x) and not
**> COUNT(*).
**>
**> But what is the correct (or reasonable) behavior if (x,y) has one and
**> only one NULL in the pair? My initial thoughts are:
**>
**> 1) Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL
**> AND y IS NOT NULL" clause.
**>
**> 2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or
**> something that will smooth out the missing values.
**>
**> 3) Should the code use COUNT(*) instead of COUNT(x) to show that the
**> missing values are affecting the answer?
**>
**> ================
**> The strength of the linear association between two variables is
**> quantified by the correlation coefficient. Given a set of observations
**> (x1, y1), (x2,y2),...(xn,yn), the formula for computing the correlation
**> coefficient is given by this query:
**>
**> CREATE TABLE Foobar
**> (x REAL NOT NULL,
**> y REAL NOT NULL);
**>
**> INSERT INTO Foobar VALUES (1.0, 2.0);
**> INSERT INTO Foobar VALUES (2.0, 5.0);
**> INSERT INTO Foobar VALUES (3.0, 6.0);
**> -- r= 0.9608
**>
**> SELECT
**> ((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
**> / (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x)))
**> * (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x))))))
**> AS pearson_r
**> FROM Foobar;
**>
**> The correlation coefficient always takes a value between -1 and 1, with
**> 1 or -1 indicating perfect correlation (all points would lie along a
**> straight line in this case). A positive correlation indicates a
**> positive association between the variables (increasing values in one
**> variable correspond to increasing values in the other variable), while
**> a negative correlation indicates a negative association between the
**> variables (increasing values is one variable corresponding to
**> decreasing values in the other variable). A correlation value close to
**> 0 indicates no association between the variables.
**>
**> Since the formula for calculating the correlation coefficient
**> standardizes the variables, changes in scale or units of measurement
**> will not affect its value. For this reason, the correlation coefficient
**> is often more useful than a graphical depiction in determining the
**> strength of the association between two variables.
**>
*

I am not familiar with the Pearson-r function, so, FWIW, I believe the answer lies in the answer to the question, "What does null mean in your data?" If "null" represents a valid state or value, then it has meaning and should be included. To do so, if possible (I don't know if 0 is already a valid value for you), you could use NVL(x,0) in place of x (this would cause any null to be represented by a zero), and likewise for y. If "null" connotes, "I don't know so I can't say for sure", then I would drop the pair, but report on existence of these pairs seperately. BTW, your table does not allow nulls. Received on Wed Dec 22 2004 - 21:52:39 CET