Differential expression analysis

Last published: 2023-02-23 09:05:28

Resources

RNAdiffAPP
https://uclouvain-cbio.github.io/WSBIM2122/sec-rnaseq.html

^[1]

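## https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
library(DESeq2)

# Input data: RNAseqObj is a user-defined object whose slots hold the raw
# count matrix (genes x samples) and the sample metadata (one row per sample)
dds <- DESeqDataSetFromMatrix(
  countData = RNAseqObj@count,
  colData = RNAseqObj@metadata,
  design = ~ group)

# Differential expression analysis, step by step
dds <- estimateSizeFactors(dds)   # median-of-ratios size factors
dds <- estimateDispersions(dds)   # gene-wise dispersion estimates and shrinkage
dds <- nbinomWaldTest(dds)        # GLM fit and Wald test

# The DESeq() wrapper runs the same three steps in one call:
# dds <- DESeq(dds, parallel = TRUE)

# resultsNames(dds) lists the valid coefficient names;
# AAA and BBB stand for the two levels of `group`
results(dds, name = "group_AAA_vs_BBB")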

DESeq2 theory

Size factor estimation (median of ratios)
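The median-of-ratios idea behind `estimateSizeFactors()` can be sketched in base R. This is a minimal illustration, not the DESeq2 implementation: the toy count matrix and the helper name `median_of_ratios` are made up for the example. Each gene's geometric mean across samples serves as a pseudo-reference, and a sample's size factor is the median of its count-to-reference ratios over the genes that have no zero count.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns); values are made up.
counts <- matrix(
  c(100, 200, 400,
     50, 100, 200,
     10,  20,  40,
      0,   5,   8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper, written for illustration only.
median_of_ratios <- function(counts) {
  # Per-gene geometric mean across samples, computed on the log scale.
  log_geo_mean <- rowMeans(log(counts))
  # A zero count in any sample gives log_geo_mean = -Inf; drop those genes.
  keep <- is.finite(log_geo_mean)
  # Size factor of a sample = median over genes of count / geometric mean.
  apply(counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(log(col) - log_geo_mean[keep]))
  })
}

size_factors <- median_of_ratios(counts)

# Normalized counts: divide each sample (column) by its size factor.
normalized <- sweep(counts, 2, size_factors, "/")
```

In this toy matrix the three samples differ only in depth, so dividing by the size factors puts every retained gene on the same scale across samples. Working with the median rather than the total count keeps a few very highly expressed genes from dominating the normalization.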

Count modeling

counts distribution for a typical RNAseq sample

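In the mean-variance plot, each point represents one gene.
The mean is not equal to the variance (the points do not fall on the diagonal).
For genes with high mean expression, the variance across replicates tends to exceed the mean (the points lie above the red line). A variance larger than the mean is called overdispersion.
For genes with low mean expression, we see considerable scatter. This is usually called heteroscedasticity: for a given expression level in the low range, the observed variance values differ widely.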

Note

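If the proportions of mRNA stayed exactly constant between the biological replicates of a sample group, we could expect a Poisson distribution (where mean = variance).
Between biological replicates, however, there is always some degree of variability.
If we kept adding replicates (i.e. n > 20), we should eventually see the scatter start to shrink and the points for highly expressed genes move closer to the red line.
So in theory, with enough replicates, we could use the Poisson distribution.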
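The difference between the two count models can be checked with a small base R simulation. This is only a sketch: the mean `mu` and the dispersion `alpha` are assumed values chosen for illustration, and `rnbinom()` is parameterized so that `size = 1 / alpha`.

```r
set.seed(1)
n_rep <- 10000   # many simulated replicates, so the sample moments are stable
mu    <- 100     # assumed mean count
alpha <- 0.1     # assumed dispersion

# Poisson counts: the variance equals the mean.
pois <- rpois(n_rep, lambda = mu)

# Negative binomial counts with the same mean: variance = mu + mu^2 / size.
nb <- rnbinom(n_rep, mu = mu, size = 1 / alpha)

c(poisson_mean = mean(pois), poisson_var = var(pois))
c(nb_mean = mean(nb), nb_var = var(nb))
```

Both samples have a mean close to `mu`, but only the Poisson sample has a variance close to its mean. The negative binomial variance is several times larger, which is the pattern seen for real replicates.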

Dispersion estimation

Final dispersion estimate

DESeq2 Generalized linear model

$$
\begin{aligned}
K_{ij} &\sim \textrm{NB}(\mu_{ij}, \alpha_i) \\
\mu_{ij} &= s_j q_{ij} \\
\log_2(q_{ij}) &= x_{j.} \beta_i
\end{aligned}
$$

where counts $K_{ij}$ for gene $i$, sample $j$ are modeled using a negative binomial distribution with fitted mean $\mu_{ij}$ and a gene-specific dispersion parameter $\alpha_i$. The fitted mean is composed of a sample-specific size factor $s_j$ and a parameter $q_{ij}$ proportional to the expected true concentration of fragments for sample $j$. The coefficients $\beta_i$ give the log2 fold changes for gene $i$ for each column of the model matrix $X$. Note that the model can be generalized to use sample- and gene-dependent normalization factors $s_{ij}$.
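How the model matrix, the coefficients and the size factors combine into fitted means can be sketched in base R for a single gene. All numbers here are assumptions made up for illustration: a two-group design with A as the reference level, coefficients `beta_i`, and size factors `s`.

```r
# Hypothetical design: two samples in group A (reference), two in group B.
coldata <- data.frame(group = factor(c("A", "A", "B", "B")))

# Model matrix X for design ~ group; columns: (Intercept), groupB.
X <- model.matrix(~ group, data = coldata)

# Assumed coefficients for one gene i:
# log2 baseline level = 5, log2 fold change of B versus A = 1.
beta_i <- c(5, 1)

# Assumed size factors s_j, one per sample.
s <- c(0.5, 1, 2, 1)

# log2(q_ij) = x_j. beta_i, hence q_ij = 2^(x_j. beta_i).
q_i <- 2^as.vector(X %*% beta_i)

# mu_ij = s_j * q_ij
mu_i <- s * q_i
```

With a log2 fold change of 1, `q_i` is twice as large in group B as in group A, while `mu_i` also carries each sample's sequencing depth through `s`.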

The dispersion parameter $\alpha_i$ defines the relationship between the variance of the observed count and its mean value. In other words, it describes how far we expect the observed count to be from the mean value, which depends both on the size factor $s_j$ and the covariate-dependent part $q_{ij}$ as defined above.

$$
\textrm{Var}(K_{ij}) = E[ (K_{ij} - \mu_{ij})^2 ] = \mu_{ij} + \alpha_i \mu_{ij}^2
$$
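Dividing the variance by $\mu_{ij}^2$ gives the squared coefficient of variation, which makes the role of $\alpha_i$ easier to read. The numbers below are assumed values chosen for illustration.

```latex
\begin{aligned}
\textrm{CV}^2(K_{ij}) &= \frac{\textrm{Var}(K_{ij})}{\mu_{ij}^2}
                       = \frac{1}{\mu_{ij}} + \alpha_i \\
\mu_{ij} = 10,\ \alpha_i = 0.01:\quad
  \textrm{CV}^2 &= 0.1 + 0.01 = 0.11, & \textrm{CV} &\approx 0.33 \\
\mu_{ij} = 1000,\ \alpha_i = 0.01:\quad
  \textrm{CV}^2 &= 0.001 + 0.01 = 0.011, & \textrm{CV} &\approx 0.10
\end{aligned}
```

For low counts the Poisson term $1/\mu_{ij}$ dominates, while for high counts the coefficient of variation approaches $\sqrt{\alpha_i}$, so the dispersion sets the floor of the relative variability between replicates.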

Final estimate of logarithmic fold changes

$$
\begin{aligned}
K_{ij} &\sim \textrm{NB}(\mu_{ij}, \alpha_i) \\
\mu_{ij} &= s_j q_{ij} \\
\log_2(q_{ij}) &= x_{j.} \beta_i
\end{aligned}
$$

References

[1] RNA sequencing: the teenage years
