Pathview: pathway based data integration and visualization

Last published: 2022-11-14 17:12:47

pathview[1] is a toolset for pathway-based data integration and visualization.
First, install the pathview R package.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("pathview")

Quick start with the built-in data

library(pathview)
# Example breast cancer microarray data shipped with pathview
data(gse16873.d)
# Map the first sample onto the human cell cycle pathway (hsa04110)
pv.out <- pathview(gene.data = gse16873.d[, 1], pathway.id = "04110",
                   species = "hsa", gene.idtype = "entrez",
                   out.suffix = "gse16873")
  • gene.data: either a vector (single sample) or matrix-like data (multiple samples). A vector should be numeric with gene IDs as names, or it may be a character vector of gene IDs. A character vector is treated as discrete or count data. A matrix-like data structure has genes as rows and samples as columns; row names should be gene IDs. Here "gene ID" is a generic concept, covering multiple types of gene, transcript and protein IDs uniquely mappable to KEGG gene IDs. KEGG ortholog IDs are also treated as gene IDs so that metagenomic data can be handled. Check the Details section of the function help for mappable ID types.
  • pathway.id: character vector, the KEGG pathway ID(s), usually 5 digits; may also include the 3-letter KEGG species code.
  • species: character, either the KEGG code, the scientific name or the common name of the target species. This applies to both the pathway and gene.data or cpd.data. When a KEGG ortholog pathway is considered, species="ko". Default species="hsa", which is equivalent to "Homo sapiens" (scientific name) or "human" (common name).
  • gene.idtype: character, the ID type used for gene.data, case insensitive. Default gene.idtype="entrez", i.e. Entrez Gene, which is the primary KEGG gene ID for many common model organisms. For other species, gene.idtype should be set to "KEGG", as KEGG uses other types of gene IDs. For the common model organisms (to check the list, do: data(bods); bods), you may also specify other types of valid IDs. To check the ID list, do: data(gene.idtype.list); gene.idtype.list.
  • out.suffix: character, the suffix added after the pathway name as part of the output graph file name. Sample names or column names of gene.data or cpd.data are also added when there are multiple samples. Default out.suffix="pathview".
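
The gene.data shapes described in the list can be sketched in base R. This is a minimal illustration with made-up values; the object names (single_sample, multi_sample, discrete_genes) are hypothetical and only the Entrez Gene IDs come from this post's example.

```r
# Single sample: a numeric vector named by Entrez Gene ID (values are made up).
single_sample <- c("1029" = 0.13, "51343" = -0.40, "4171" = -0.42)

# Multiple samples: genes as rows, samples as columns, gene IDs as row names.
multi_sample <- cbind(sample1 = single_sample,
                      sample2 = c(0.25, -0.10, 0.05))

# A character vector of gene IDs is the discrete form of gene.data.
discrete_genes <- c("1029", "51343")

str(single_sample)
dim(multi_sample)
rownames(multi_sample)
```

Any of the three objects could then be passed as gene.data, provided the IDs match the chosen gene.idtype.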

[Figure: pathview output for pathway hsa04110 with the gse16873 gene data overlaid. Image source: 生信小木屋]

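In the figure above, red marks genes that are up-regulated relative to the control group and green marks genes that are down-regulated; the deeper the color, the larger the fold change.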
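
To make the coloring rule concrete, here is a rough base-R sketch of mapping log fold changes onto a green-gray-red scale. It is not pathview's internal code: the saturation limit of 1 and the 10 color steps mirror pathview's documented defaults but are treated as assumptions here, and the names (log_fc, node_colors) are hypothetical. The exact hex codes it produces need not match pathview's.

```r
# Rounded log fold changes taken from this post's example output.
log_fc <- c(CDKN2A = 0.129, FZR1 = -0.404, ORC1 = 0.986)

limit  <- 1   # assumed saturation point: values beyond +/-1 get the extreme color
n_bins <- 10  # assumed number of color steps

# Green for down-regulation, gray for no change, red for up-regulation.
bin_colors <- colorRampPalette(c("green", "gray", "red"))(n_bins)

# Clamp to [-limit, limit], then assign each value to one of the bins.
clamped   <- pmax(pmin(log_fc, limit), -limit)
bin_index <- cut(clamped,
                 breaks = seq(-limit, limit, length.out = n_bins + 1),
                 include.lowest = TRUE, labels = FALSE)

node_colors <- setNames(bin_colors[bin_index], names(log_fc))
print(node_colors)
```

Binning rather than interpolating continuously keeps the legend readable: a node's color identifies a range of fold changes, not an exact value.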

Next, look up the KEGG pathway name that corresponds to KEGG ID 04110:

data(paths.hsa)
paths.hsa["hsa04110"]
# hsa04110: 'Cell cycle'

The data in gse16873.d[, 1] look like this:

head(data.frame(gse16873.d[, 1]))

# 	gse16873.d[, 1]
# 10000	-0.30764480
# 10001	0.41586805
# 10002	0.19854925
# 10003	-0.23155297
# 100048912	-0.04490724
# 10004	-0.08756237

The first column (the row names) holds the Entrez Gene IDs; the second column holds the logFC values.

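The output of the pathview function, pv.out, looks as follows (this is its gene table, pv.out$plot.data.gene); each row is a mapped gene/compound node.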

kegg.names  labels  all.mapped                                   type  x    y    width  height  mol.data            mol.col
1029        CDKN2A  1029                                         gene  532  124  46     17      0.129198738972622   #BEBEBE
51343       FZR1    51343                                        gene  919  536  46     17      -0.404325630326951  #5FDF5F
4171        MCM2    4171,4172,4173,4174,4175,4176                gene  553  556  46     17      -0.420218063479512  #5FDF5F
4998        ORC1    4998,4999,5000,5001,23594,23595              gene  494  556  46     17      0.986487281754076   #FF0000
996         CDC27   996,8697,8881,10393,25847,25906,29882,51433  gene  919  297  46     17      0.936301774095574   #FF0000
996         CDC27   996,8697,8881,10393,25847,25906,29882,51433  gene  919  519  46     17      0.936301774095574   #FF0000
  • kegg.names: standard KEGG IDs/Names for mapped nodes. It's Entrez Gene ID or KEGG Compound Accessions.
  • labels: Node labels to be used when needed.
  • all.mapped: All molecule (gene or compound) IDs mapped to this node.
  • type: node type, currently 4 types are supported: "gene","enzyme", "compound" and "ortholog".
  • x: x coordinate in the original KEGG pathway graph.
  • y: y coordinate in the original KEGG pathway graph.
  • width: node width in the original KEGG pathway graph.
  • height: node height in the original KEGG pathway graph.
  • other columns: columns of the mapped gene/compound data and corresponding pseudo-color codes for individual samples
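
As a small illustration of working with these columns, the sketch below rebuilds a few rows of the table above as a plain data frame and pulls out the strongly up-regulated nodes. The object names (nodes, up_nodes, up_genes) and the 0.5 cutoff are hypothetical choices; with a real run the same subsetting would be applied to the gene table returned in pv.out.

```r
# Mock of the gene table, copied (and rounded) from the output shown in this post.
nodes <- data.frame(
  kegg.names = c("1029", "51343", "4171", "4998", "996", "996"),
  labels     = c("CDKN2A", "FZR1", "MCM2", "ORC1", "CDC27", "CDC27"),
  mol.data   = c(0.1292, -0.4043, -0.4202, 0.9865, 0.9363, 0.9363),
  mol.col    = c("#BEBEBE", "#5FDF5F", "#5FDF5F", "#FF0000", "#FF0000", "#FF0000"),
  stringsAsFactors = FALSE
)

threshold <- 0.5  # arbitrary cutoff chosen for illustration
up_nodes  <- nodes[nodes$mol.data > threshold, ]

# The same gene can be drawn on several nodes, so collapse to unique labels.
up_genes <- unique(up_nodes$labels)
print(up_genes)
```

CDC27 appears twice in the table because it is drawn at two positions in the pathway graph, which is why the labels are de-duplicated.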

Compound and gene data

In the examples above, we viewed gene data with canonical signaling pathways. We frequently want to look at metabolic pathways too. Besides gene nodes, these pathways also have compound nodes. Therefore, we may integrate or visualize both gene data and compound data with metabolic pathways. Here gene data is a broad concept that includes genes, transcripts, proteins, enzymes and their expression, modifications and any measurable attributes. The same holds for compound data, which includes metabolites, drugs, and their measurements and attributes.[1]

Here we use the breast cancer microarray dataset as gene data. We then generate simulated compound or metabolomic data, and load proper compound ID types (with sufficient number of unique entries) for demonstration.

# Gene ID types supported for the common model organisms
data(gene.idtype.list)
gene.idtype.list
# The common model organisms known to pathview
data(bods)
bods
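
The simulated compound data mentioned above could be sketched as follows. This is an assumption-laden illustration, not the post's own code: the three KEGG compound accessions are picked purely for illustration, the values are random, and pathway 00640 (propanoate metabolism) is an arbitrary metabolic pathway choice. pathview also ships a sim.mol.data helper for simulating data; it is not used here so that the first part runs without the package.

```r
set.seed(42)  # reproducible simulated values

# Illustrative KEGG compound accessions; the measurements are random, not real data.
cpd_ids <- c("C00022", "C00024", "C00163")
sim_cpd <- setNames(rnorm(length(cpd_ids)), cpd_ids)
print(sim_cpd)

# The pathview call needs the package and network access to KEGG, so it is guarded.
if (requireNamespace("pathview", quietly = TRUE)) {
  library(pathview)
  data(gse16873.d)
  pv.cpd <- try(pathview(gene.data = gse16873.d[, 1], cpd.data = sim_cpd,
                         pathway.id = "00640", species = "hsa",
                         out.suffix = "gse16873.cpd"))
}
```

Compound IDs that do not occur in the chosen pathway are simply not drawn, so a realistic run would use many more compounds than this toy vector.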

  1. http://www.bioconductor.org/packages/release/bioc/vignettes/pathview/inst/doc/pathview.pdf