Biomedical Knowledge Mining using GOSemSim and clusterProfiler
超几何检验(Hypergeometric Test)是一种常用于基因富集分析(如 GO/KEGG)的统计方法。下面是它的核心统计理念:
零假设 H_0:你观察到的基因富集(例如某通路中的命中基因数)是完全随机的,不存在富集现象。换句话说,你的目标基因集在注释类(如某 GO term)中的出现频率,与整个背景基因集中的出现频率没有显著差异。
超几何检验的统计量是一个离散随机变量 X,它表示:
从总共 N 个基因中,随机抽取 n 个基因,在一个已知包含 K 个目标基因的功能集合中,命中了 k 个目标基因的概率。
X \sim \text{Hypergeometric}(N, K, n)
你实际计算的是:
P(X \geq k) = \sum_{i = k}^{\min(n, K)} \frac{\binom{K}{i} \binom{N-K}{n-i}}{\binom{N}{n}}
代表:在零假设成立的前提下,观察到至少 k 个命中的概率。
假设你有:
就可以用超几何检验求 P(X \geq 20)
# 计算 P(X >= k) phyper(q = k - 1, m = K, n = N - K, k = n, lower.tail = FALSE)
如果你需要 Python 实现或对这部分代码更感兴趣,我也可以提供对应代码。是否需要?
keggapi
devtools::install_github("BioinfoFungi/enrichplot2") library(clusterProfiler) library(org.Mm.eg.db) library(KEGG.db) annotations_orgDb <- AnnotationDbi::select(org.Mm.eg.db, keys = rownames(wt_control_deg), columns = c("ENTREZID","GENENAME"), keytype = "SYMBOL") kegg<- clusterProfiler::enrichKEGG(gene =annotations_orgDb$ENTREZID, organism = "mmu", keyType = "SYMBOL", pvalueCutoff = 0.05, pAdjustMethod = "BH", minGSSize = 10, maxGSSize = 500, qvalueCutoff = 10000, use_internal_data = TRUE) kegg <- setReadable(kegg, 'org.Mm.eg.db', 'ENTREZID') png(filename = paste0(wt_control_vs_ko_control,"/KEGG.png"),width = 7,height = 6,res = 300,units = "in") dotplot(kegg,showCategory=20,color="pvalue") dev.off() png(filename = paste0(wt_control_vs_ko_control,"/heat.png"),width = 10,height =5,res = 300,units = "in") heatplot(kegg,showCategory=20,color="pvalue") dev.off() write_tsv(arrange(kegg@result,pvalue), file=paste0(wt_control_vs_ko_control,"/KEGG.tsv"))
注意pvalueCutoff=10000,dotplot默认使用在enrichKEGG函数中设置的pvalueCutoff = 0.05筛选显著富集的通路。及使用0.05的阈值筛选矫正后的pvalue,进行可视化的。因此如果基因太少则不能富集到通路,此时可以将改值调大,或者修改dotplot函数的源码实现根据pvalue进行矫正。有关dotplot源码相关内容见enrichplot源码阅读。
pvalueCutoff=10000
dotplot
enrichKEGG
pvalueCutoff = 0.05
suppressMessages(library(KEGG.db)) suppressMessages(library(clusterProfiler)) suppressMessages(library(org.Mm.eg.db)) ls("package:KEGG.db") frame = toTable(KEGGPATHID2EXTID) x <- as.list(KEGG.db::KEGGPATHID2EXTID) x["mmu04020"][[1]] |> length() annotations <- AnnotationDbi::select(org.Mm.eg.db, keys = x["mmu04020"][[1]] , columns = c("ENTREZID","SYMBOL"), keytype = "ENTREZID") |> na.omit() annotations$SYMBOL |> length() intersect(rownames(deg),annotations_foldchanges$SYMBOL) ggvenn( list("Calcium-signaling-pathway" = annotations_foldchanges$SYMBOL ,"mcherry-PKH vs mcherry deg" = rownames(deg) ), stroke_size = 0.6, set_name_size = 2.8, text_size = 5,show_percentage=F )
library(clusterProfiler) library(org.Hs.eg.db) entrez_id <- bitr(top20_gene, fromType = "SYMBOL", toType = c("ENTREZID"), OrgDb = org.Hs.eg.db)%>% pull("ENTREZID")%>%unique() kegg<- clusterProfiler::enrichKEGG(gene =entrez_id,organism = 'hsa',pAdjustMethod="none") kegg <- setReadable(kegg, 'org.Hs.eg.db', 'ENTREZID') png(filename = "figure/KEGG.png",width = 7,height = 6,res = 300,units = "in") dotplot(kegg,showCategory=20) dev.off() ego<-enrichGO(gene = entrez_id,OrgDb = org.Hs.eg.db,keyType = 'ENTREZID',ont = 'all',readable = T ,pAdjustMethod="BH") png(filename = "figure/GO.png",width = 10,height = 9,res = 300,units = "in") colorSel="qvalue" dotplot(ego,showCategory = 10,split="ONTOLOGY",orderBy = "GeneRatio", color = colorSel) + facet_grid(ONTOLOGY~., scale='free') dev.off()
https://github.com/YuLab-SMU/createKEGGdb
remotes::install_github("YuLab-SMU/createKEGGdb") library(createKEGGdb) species <-c("hsa","rno","mmu") create_kegg_db(species) install.packages("./KEGG.db_1.0.tar.gz", repos=NULL) library(KEGG.db) library(clusterProfiler) data(geneList, package="DOSE") gene <- names(geneList)[abs(geneList) > 2] kk <- enrichKEGG(gene = gene, organism = 'hsa', pvalueCutoff = 0.05, qvalueCutoff = 0.05, use_internal_data =T)
ALL
BP