功能分析(HUMAnN)

最后发布时间 : 2025-03-27 13:46:30 浏览量 :

HUMAnN2方法的概述

分层组学搜索

生信小木屋

HUMAnN2进行宏基因组的功能分析主要包括四步:

  • 第一步:鉴定该样本中存在的物
  • 第二步:映射reads到鉴定到物种的泛基因组
  • 第三步:使用blastp在蛋白数据库搜没有映射到泛基因组的reads

重要参数

  • --bypass-prescreen:添加了该参数,HUMAnN就不执行上述第一步了,这样在第二步鉴定宏基因组测序的reads中的基因时,就需要使用完整的ChocoPhlAn数据库。之前是只将reads mapping到鉴定的物种的泛基因组上的,加上该参数后,就mapping到所有泛基因组上,具体结果文件如下:
    文件XXX_2_genefamilies.tsvXXX_3_reactions.tsvXXX_4_pathabundance.tsv内容
# XXX_2_genefamilies.tsv
UniRef90_A0A174UN13	4316.9913461275
UniRef90_A0A174UN13|unclassified.t__SGB2091	2539.7728292147

# XXX_3_reactions.tsv
2.7.13.3-RXN	2935.7107442709
2.7.13.3-RXN|unclassified.t__SGB1871	2378.7765338079

# XXX_4_pathabundance.tsv
PWY-6527: stachyose degradation	212.0226556742
PWY-6527: stachyose degradation|unclassified.t__SGB1871	129.4247760629

由于与泛基因组数据库比对,因此结果中包含了菌信息(也就是核苷酸对应的基因组信息)
注意unclassified.t_SGB1871代表一个菌,ChocoPhlAn数据库有其对应的基因组信息

  • --bypass-nucleotide-search: 跳过第一步和第二步,直接使用reads搜索蛋白数据库,输出格式与--bypass-prescreen区别是只包含unclassifiedUniRef90以及pathway的关系:
    文件XXX_2_genefamilies.tsvXXX_3_reactions.tsvXXX_4_pathabundance.tsv内容
# XXX_2_genefamilies.tsv
UniRef90_A0A3R4DA77	5982.0488794472
UniRef90_A0A3R4DA77|unclassified	5982.0488794472

# XXX_3_reactions.tsv
3.2.1.21-RXN	308.4762686802
3.2.1.21-RXN|unclassified	308.4762686802

# XXX_4_pathabundance.tsv
PWY-6527: stachyose degradation	130.1038499309
PWY-6527: stachyose degradation|unclassified	130.1038499309

可以看到如果让reads只与蛋白数据苦苦比对的话,是没有菌信息的

一个例子

humann --input examples/demo.fastq --output $OUTPUT_DIR   --bypass-prescreen
Output files will be written to: /workspaces/humann/output

Creating custom ChocoPhlAn database ........


Running bowtie2-build ........


Running bowtie2 ........

Total bugs from nucleotide alignment: 3
unclassified.t__SGB1871: 78602 hits
unclassified.t__SGB2091: 27259 hits
unclassified.t__SGB2301: 8688 hits

Total gene families from nucleotide alignment: 3163

Unaligned reads after nucleotide alignment: 19.5532020984 %


Running diamond ........


Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

Total bugs after translated alignment: 4
unclassified.t__SGB1871: 78602 hits
unclassified.t__SGB2091: 27259 hits
unclassified.t__SGB2301: 8688 hits
unclassified: 5610 hits

Total gene families after translated alignment: 3384

Unaligned reads after translated alignment: 15.6161555154 %


Computing gene families ...

Computing reaction and pathway abundance ...

Output files created: 
/workspaces/humann/output/demo_2_genefamilies.tsv
/workspaces/humann/output/demo_3_reactions.tsv
/workspaces/humann/output/demo_4_pathabundance.tsv
/workspaces/humann/output/demo_0.log
humann --input examples/demo.fastq --output $OUTPUT_DIR --bypass-nucleotide-search
Output files will be written to: /workspaces/humann/output

Running diamond ........


Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

Total bugs after translated alignment: 1
unclassified: 63367 hits

Total gene families after translated alignment: 1359

Unaligned reads after translated alignment: 56.6194492629 %


Computing gene families ...

Computing reaction and pathway abundance ...

Output files created: 
/workspaces/humann/output/demo_2_genefamilies.tsv
/workspaces/humann/output/demo_3_reactions.tsv
/workspaces/humann/output/demo_4_pathabundance.tsv
/workspaces/humann/output/demo_0.log

可以看到如果直使用全部的核苷酸数据库搜素(--bypass-prescreen)、接跳过核苷酸搜索(--bypass-nucleotide-search)区别如下:

--bypass-prescreen--bypass-nucleotide-search
Total bugs from nucleotide alignment(核苷酸比对的细菌数)30
Total gene families from nucleotide alignment(比对上的基因家族数)31630
Unaligned reads after nucleotide alignment(核苷酸比对后没有比对上核苷酸的reads数)19.5532020984 %0
Total gene families after translated alignment(翻译比对后的总基因家族)3163+221=33841359
Unaligned reads after translated alignment(翻译比对后没有比对上的reads)15.6161555154 %56.6194492629 %

可以看到使用全部的核苷酸数据库搜素后有使用蛋白搜索找到221个基因,加上核苷酸搜索一共3384个基因;直接使用蛋白搜索找到1359个基因。

基因通路定量

生信小木屋

数据库

查看数据目录命令: humann_config --print

database_folders : nucleotide = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO
database_folders : protein = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO
database_folders : utility_mapping = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO

查看核酸数据库内容

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO
$ ls
# SGB1871_pangenome90.fna.gz  SGB2091_pangenome90.fna.gz  SGB2301_pangenome90.fna.gz
$ zcat SGB1871_pangenome90.fna.gz | grep ">" | head
# >SGB1871|SGB1871|UniRef90_A0A0P0EZQ9|UniRef50_A0A0P0L0B2|954
# >SGB1871|SGB1871|UniRef90_A0A1C7GUW0|UniRef50_A0A0P0FA44|474

可以看到核酸数据库一个文件就是一个菌,每个文件里报告了该菌的所有基因

查看蛋白数据库

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO
$ ls
# demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

蛋白数据库包含一个dmnd的索引文件,使用改文件可以将demo_diamond_aligned.tsv转换成demo_diamond_aligned.tsv

$ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_unaligned.fa
# >61NLYAAXX100508:5:100:10009:13087/1
# ACCACCGGCATAGTGGTCACATATATCATAGGGCTTTTCTATGGCATGGCTCCCATGGATGGTGTCAAATTGCTCGGTGACGGTATTGACTCTATGGGCAA
# >61NLYAAXX100508:5:100:10018:9949/1
# GATCCGTTCGTTCCCGGTGACGGGAATCCTACACGATCCTGGTAAGTATTATTCCCCCACAAAGAGGTAATTTTGGAAGAAGAATGGGATAGACGCATGGT
$ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_aligned.tsv
# 61NLYAAXX100508:5:100:10001:14485/1|101 UniRef90_A0A2A3N3F7|1467        72.7    33      9       0       1       99      383     415     # 1.53e-11        52.4
# 61NLYAAXX100508:5:100:10006:12287/1|101 UniRef90_A0A125MG82|1782        72.7    33      9       0       101     3       175     207     # 2.41e-15        63.2

关于utility数据库

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO
$ ls
# README.txt                  map_go_name.txt.gz      map_level4ec_name.txt.gz     map_pfam_name.txt.gz                                   # mpa_vJan21_CHOCOPhlAnSGB_202103.tsv  vOct22_SGB_mapping.tsv
# map_eggnog_name.txt.gz      map_ko_name.txt.gz      map_metacyc-pwy_name.txt.gz  metacyc_pathways_structured_filtered_v24_subreactions  unipathway_pathways
# map_eggnog_uniref90.txt.gz  map_ko_uniref90.txt.gz  map_metacyc-rxn_name.txt.gz  metacyc_reactions_level4ec_only.uniref.bz2             unipathway_uniprots.uniref.bz2

utility数据库是将UniRef90的ID转为其他Id的对应关系(例如KO、GO)

参考












本文要讨论的是在研究宏基因组数据时,要解决的问题"What are they doing?"。我们很容易想到从meta'omic数据分析微生物群落的功能,包括以下几个方式:

  • 使用balstp在蛋白数据库中搜索所有reads

HUMAnN通过分层搜索生成宏基因组的reads已知未知物种分类的基因序列映射,这些映射按照质量和序列长度进行加权,以估计单个物种或者整个群落基因家族的丰度。根据基因家族丰度可以重新组合到其他功能系统(例如 COGs, KOs, Pfam domains, 和GO terms)。最终基因家族被注释为代谢酶,对代谢酶进一步分析,以重建和量化每个物种和群落的完整的代谢途径(默认通过MetaCyc)。
HUMAnN命令如下:

humann  --nucleotide-database /data/databases/humann/chocophlan_v31_201901 \
	--protein-database /data/databases/humann/uniref90_v201901_ec_filtered \
	--threads 10 \
	--remove-temp-output  \
	--input test.clean.gz \
	--output test   \
	--metaphlan-options="--index mpa_vJan21_CHOCOPhlAnSGB_202103     \
		--bowtie2db /data/databases/metaphlan_databases_202103"

数据库文件包括3个文件:

  • chocophlan_v31_201901
  • uniref90_v201901_ec_filtered
  • metaphlan_databases_202103

chocophlan_v31_201901的目录如下:

生信小木屋

生信小木屋

uniref90_v201901_ec_filtered的目录如下:

生信小木屋

metaphlan_databases_202103的目录如下:
生信小木屋

结果文件如下:

  • test_genefamilies.tsv
  • test_pathabundance.tsv
  • test_pathcoverage.tsv

生信小木屋

生信小木屋

生信小木屋

根据基因家族丰度可以重新组合到KO系统

humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_go       -u N -p N -o   ${meta.id}/go/${meta.id}.GO.txt
humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_ko       -u N -p N -o   ${meta.id}/${meta.id}.KO.txt

HUMAnN(HMP Unified Metabolic Analysis Network),用于从metagenomicsmetatranscriptomics量化microbial metabolic pathwaysother molecular functions

从HUMAnN的源代码开始学习

git clone git@github.com:biobakery/humann.git
cd humann
pip install .
.
├── genefamilies.tsv
├── humann_temp_y68c605p
├── pathabundance.tsv
├── pathcoverage.tsv
├── eggnog
│   ├── eggnog.rpkm_stratified.txt
│   ├── eggnog.rpkm.txt
│   ├── eggnog.rpkm_unstratified.txt
│   └── eggnog.txt
├── gbm
│   ├── coverage-plot.svg
│   ├── Abundance-RPKs.modules
│   └── GBM.txt
├── genefamilies
│   ├── genefamilies.rpkm_stratified.tsv
│   ├── genefamilies.rpkm.tsv
│   ├── genefamilies.rpkm_unstratified.tsv
│   ├── genefamilies_stratified.tsv
│   └── genefamilies_unstratified.tsv
├── gmm
│   ├── coverage-plot.svg
│   ├── Abundance-RPKs.modules
│   └── GMM.txt
├── go
│   ├── go.rpkm_stratified.txt
│   ├── go.rpkm.txt
│   ├── go.rpkm_unstratified.txt
│   └── go.txt
├── ko
│   ├── ko.rpkm_stratified.txt
│   ├── ko.rpkm.txt
│   ├── ko.rpkm_unstratified.txt
│   └── ko.txt
├── level4ec
│   ├── level4ec.rpkm_stratified.txt
│   ├── level4ec.rpkm.txt
│   ├── level4ec.rpkm_unstratified.txt
│   └── level4ec.txt
├── module
│   └── Module.txt
├── pathabundance
│   ├── pathabundance.rpkm_stratified.tsv
│   ├── pathabundance.rpkm.tsv
│   ├── pathabundance.rpkm_unstratified.tsv
│   ├── pathabundance_stratified.tsv
│   └── pathabundance_unstratified.tsv
├── pathway
│   └── Pathway.txt
├── pfam
│   ├── pfam.rpkm_stratified.txt
│   ├── pfam.rpkm.txt
│   ├── pfam.rpkm_unstratified.txt
│   └── pfam.txt
└── rxn
    ├── rxn.rpkm_stratified.txt
    ├── rxn.rpkm.txt
    ├── rxn.rpkm_unstratified.txt
    └── rxn.txt

pathabundance.tsv

# PathwayDM1_1_Abundance
UNMAPPED21221155.3616584
UNINTEGRATED5690441.85241983
UNINTEGRATED|g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_3_1878965.418971247
UNINTEGRATED|g__Mucispirillum.s__Mucispirillum_schaedleri682884.014131736
UNINTEGRATED|unclassified564228.771907507
UNINTEGRATED|g__Muribaculum.s__Muribaculum_intestinale438537.115838629

pathcoverage.tsv

# PathwayDM1_1_Coverage
UNMAPPED1
UNINTEGRATED1
UNINTEGRATED|g__Acutalibacter.s__Acutalibacter_muris1
UNINTEGRATED|g__Adlercreutzia.s__Adlercreutzia_equolifaciens1
UNINTEGRATED|g__Akkermansia.s__Akkermansia_muciniphila1
UNINTEGRATED|g__Asaccharobacter.s__Asaccharobacter_celatus1

genefamilies.tsv

# Gene FamilyDM1_1_Abundance-RPKs
UNMAPPED57762775
UniRef90_A7V4G289245.5797634043
UniRef90_A7V4G2|g__Bacteroides.s__Bacteroides_uniformis89245.5797634043
UniRef90_A0A3A9B61739787.5705471666
UniRef90_A0A3A9B617|g__Muribaculum.s__Muribaculum_intestinale39787.5705471666
UniRef90_V2QIK634333.3333333333