Correlation Analysis of Two Numeric Variables

Last published: 2022-08-25 14:38:21


==Computing the correlation==

“R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial.” (“R in action”, p. 153)

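Available coefficients: Pearson, Spearman, Kendall, partial, polychoric, polyserial. The Pearson correlation coefficient $r$ of two variables $x$ and $y$ with $n$ observations is:
$r=\frac{\sum^{n}_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum^{n}_{i=1}(x_i-\bar{x})^2\sum^{n}_{i=1}(y_i-\bar{y})^2}}$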

> states <- state.x77[,1:6]
> head(states, 3)  # show only the first three rows
               Population Income Illiteracy Life Exp Murder HS Grad
Alabama              3615   3624        2.1    69.05   15.1    41.3
Alaska                365   6315        1.5    69.31   11.3    66.7
Arizona              2212   4530        1.8    70.55    7.8    58.1

> cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000

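> # states is a matrix, so extract the two columns as vectors first
> Population <- states[, "Population"]
> Income <- states[, "Income"]
> sum( (Population-mean(Population)) * (Income-mean(Income)) ) /
  sqrt(sum( (Population-mean(Population))^2 ) * sum( (Income-mean(Income))^2 ))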
[1] 0.2082276
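
The rank-based coefficients named in the quote above can be obtained from the same `cor()` function through its `method` argument. The following is a minimal sketch, assuming the same built-in `state.x77` data; the variable names `x`, `y`, `rho_spearman`, `tau_kendall` and `rho_from_ranks` are chosen here only for illustration. It also checks that Spearman's coefficient is the Pearson coefficient computed on ranks.

```r
# Assumption: same built-in data as above (datasets::state.x77).
states <- state.x77[, 1:6]
x <- states[, "Population"]
y <- states[, "Income"]

# Rank-based coefficients via the `method` argument of cor().
rho_spearman <- cor(x, y, method = "spearman")
tau_kendall  <- cor(x, y, method = "kendall")

# Spearman's rho is Pearson's r computed on the ranks of the data.
rho_from_ranks <- cor(rank(x), rank(y))

print(c(spearman = rho_spearman, from_ranks = rho_from_ranks,
        kendall = tau_kendall))
```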

==Testing the correlation coefficient==

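“Because of sampling randomness and the effect of sample size, the simple correlation coefficient only reflects the correlation shown in the sample. Whether the population from which the sample was drawn is correlated or not still requires a test of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 68). The test statistic $t$ for the simple correlation coefficient is computed as follows: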

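$t_r = \frac{r-\rho}{s_r} \sim t(n-2)$, where $\rho$ is the population correlation coefficient and $s_r$ is the sampling error (standard error) of $r$. Under the null hypothesis $\rho = 0$, with $s_r = \sqrt{(1-r^2)/(n-2)}$, this gives: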

$t= \frac{r-0}{\sqrt{\frac{1-r^2}{n-2}}}= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$

> cor.test(Population,Income)                                                           

        Pearson's product-moment correlation

data:  Population and Income
t = 1.475, df = 48, p-value = 0.1467
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07443435  0.45991855
sample estimates:
      cor 
0.2082276 

> r = 0.2082276
> n = nrow(states) # 50
> ( r*sqrt(n-2) ) / ( sqrt(1-r^2) )
[1] 1.474974
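
The remaining numbers in the `cor.test()` output can be reproduced by hand in the same way. The sketch below assumes the same `state.x77` columns and uses illustrative variable names; it derives the two-sided p-value from the t distribution with $n-2$ degrees of freedom, and the 95% confidence interval from Fisher's z transformation, which is how `cor.test()` builds its interval for the Pearson coefficient.

```r
# Assumption: Population and Income are taken from datasets::state.x77.
states <- state.x77[, 1:6]
x <- states[, "Population"]
y <- states[, "Income"]

r <- cor(x, y)
n <- length(x)

# Test statistic and two-sided p-value from the t distribution with n - 2 df.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation (atanh),
# using the normal approximation with standard error 1 / sqrt(n - 3).
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]))
```

The printed values should agree with the `t`, `p-value` and confidence interval reported by `cor.test(Population, Income)`.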

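“The partial correlation coefficient is the correlation coefficient between two numeric variables computed while controlling for other numeric variables (called control variables), thereby removing the influence of those other variables on the value of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 70)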
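
A partial correlation can be computed with base R alone. The sketch below is only an illustration: the choice of `Illiteracy` and `HS Grad` as control variables is an assumption made here, not something the source prescribes. It computes the partial correlation of `Population` and `Income` in two equivalent ways, from regression residuals and from the inverse of the correlation matrix.

```r
# Assumption: illustrate with datasets::state.x77, controlling for Illiteracy
# and HS Grad (the choice of control variables is only an example).
states <- as.data.frame(state.x77[, 1:6])
names(states) <- make.names(names(states))  # "HS Grad" -> "HS.Grad", etc.

# Method 1: correlate the residuals left after regressing each variable
# on the control variables.
res_pop <- resid(lm(Population ~ Illiteracy + HS.Grad, data = states))
res_inc <- resid(lm(Income ~ Illiteracy + HS.Grad, data = states))
pcor_resid <- cor(res_pop, res_inc)

# Method 2: use the inverse of the correlation matrix of all four variables.
vars <- c("Population", "Income", "Illiteracy", "HS.Grad")
P <- solve(cor(states[, vars]))
pcor_inv <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

print(c(residual_method = pcor_resid, inverse_method = pcor_inv))
```

Both methods return the same value; the residual form makes the "controlling for" interpretation explicit, while the matrix form scales better to many variable pairs.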

==Visualizing the correlation between two numeric variables==

> plot(Population~Income,states)
> lm_fit <- lm(Population~Income,as.data.frame(states))
> lm_fit

Call:
lm(formula = Population ~ Income, data = as.data.frame(states))

Coefficients:
(Intercept)       Income  
  -2464.492        1.513 
  
> abline(lm_fit,col="red")
> coef(lm_fit)
 (Intercept)       Income 
-2464.491538     1.512898 
> # equivalent to abline(lm_fit): draw the same line from intercept and slope
> abline(-2464.491538, 1.512898, col="red")
> loess_fit <- loess(Population~Income,as.data.frame(states))
> loess_fit
Call:
loess(formula = Population ~ Income, data = as.data.frame(states))

Number of Observations: 50 
Equivalent Number of Parameters: 5.46 
Residual Standard Error: 4447 

> ord <- order(loess_fit$x)
> lines(loess_fit$x[ord], loess_fit$fitted[ord], col="blue")

library(ggplot2)
# scatter plot with a loess curve (blue) and a linear fit (red)
ggplot(as.data.frame(states), aes(Income, Population)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, colour = "blue") +
  geom_smooth(method = "lm", se = FALSE, colour = "red")

Correlation Analysis of Two Categorical Variables

“The tool for describing the correlation between two categorical variables is the contingency table” (“基于R的统计分析与数据挖掘”, p. 73)
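
As a small illustration of building a contingency table in R, the sketch below uses the built-in `HairEyeColor` dataset. This dataset is an assumption chosen here for convenience and does not come from the cited book.

```r
# Assumption: datasets::HairEyeColor is used only as a convenient example.
# Collapse the 3-way table (Hair x Eye x Sex) to a 2-way Hair x Eye table.
hair_eye <- margin.table(HairEyeColor, c(1, 2))

print(hair_eye)                           # observed counts
print(addmargins(hair_eye))               # counts with row and column totals
print(round(prop.table(hair_eye, 1), 3))  # row proportions
```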

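“Testing the correlation between two categorical variables builds on the contingency table and uses its data to analyze the population-level correlation between the two categorical variables in the table. The method used is the chi-square test. The null hypothesis of the chi-square test is that the two categorical variables in the contingency table are independent” (“基于R的统计分析与数据挖掘”, p. 76)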
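
The chi-square test of independence is available in base R as `chisq.test()`. The sketch below again assumes the built-in `HairEyeColor` data purely as an example, and recomputes the statistic by hand from the expected counts under independence to show what the function does.

```r
# Assumption: datasets::HairEyeColor is used only as a convenient example.
hair_eye <- margin.table(HairEyeColor, c(1, 2))  # Hair x Eye counts

# Built-in test; the null hypothesis is that Hair and Eye are independent.
test <- chisq.test(hair_eye)

# Manual computation: expected count = row total * column total / n.
expected <- outer(rowSums(hair_eye), colSums(hair_eye)) / sum(hair_eye)
chi_sq   <- sum((hair_eye - expected)^2 / expected)
df       <- (nrow(hair_eye) - 1) * (ncol(hair_eye) - 1)
p_value  <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(test)
print(c(chi_sq = chi_sq, df = df, p = p_value))
```

A small p-value leads to rejecting the null hypothesis of independence. Note that `chisq.test()` applies a continuity correction only to 2 x 2 tables, so the manual statistic matches for this 4 x 4 table.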