功能分析(HUMAnN)

最后发布时间 : 2025-08-12 15:38:37 浏览量 :

HUMAnN2方法的概述

分层组学搜索

HUMAnN2进行宏基因组的功能分析主要包括四步：

第一步:鉴定该样本中存在的物
第二步:映射reads到鉴定到物种的泛基因组
第三步:使用blastp在蛋白数据库搜没有映射到泛基因组的reads

重要参数

--bypass-prescreen：添加了该参数，HUMAnN就不执行上述第一步了，这样在第二步鉴定宏基因组测序的reads中的基因时，就需要使用完整的ChocoPhlAn数据库。之前是只将reads mapping到鉴定的物种的泛基因组上的，加上该参数后，就mapping到所有泛基因组上，具体结果文件如下:
文件XXX_2_genefamilies.tsv、XXX_3_reactions.tsv和XXX_4_pathabundance.tsv内容

# XXX_2_genefamilies.tsv
UniRef90_A0A174UN13	4316.9913461275
UniRef90_A0A174UN13|unclassified.t__SGB2091	2539.7728292147

# XXX_3_reactions.tsv
2.7.13.3-RXN	2935.7107442709
2.7.13.3-RXN|unclassified.t__SGB1871	2378.7765338079

# XXX_4_pathabundance.tsv
PWY-6527: stachyose degradation	212.0226556742
PWY-6527: stachyose degradation|unclassified.t__SGB1871	129.4247760629

由于与泛基因组数据库比对，因此结果中包含了菌信息(也就是核苷酸对应的基因组信息)
注意unclassified.t_SGB1871代表一个菌，ChocoPhlAn数据库有其对应的基因组信息

--bypass-nucleotide-search：跳过第一步和第二步，直接使用reads搜索蛋白数据库，输出格式与--bypass-prescreen区别是只包含unclassified与UniRef90以及pathway的关系:
文件XXX_2_genefamilies.tsv、XXX_3_reactions.tsv和XXX_4_pathabundance.tsv内容

# XXX_2_genefamilies.tsv
UniRef90_A0A3R4DA77	5982.0488794472
UniRef90_A0A3R4DA77|unclassified	5982.0488794472

# XXX_3_reactions.tsv
3.2.1.21-RXN	308.4762686802
3.2.1.21-RXN|unclassified	308.4762686802

# XXX_4_pathabundance.tsv
PWY-6527: stachyose degradation	130.1038499309
PWY-6527: stachyose degradation|unclassified	130.1038499309

可以看到如果让reads只与蛋白数据苦苦比对的话，是没有菌信息的

一个例子

humann --input examples/demo.fastq --output $OUTPUT_DIR   --bypass-prescreen

Output files will be written to: /workspaces/humann/output

Creating custom ChocoPhlAn database ........


Running bowtie2-build ........


Running bowtie2 ........

Total bugs from nucleotide alignment: 3
unclassified.t__SGB1871: 78602 hits
unclassified.t__SGB2091: 27259 hits
unclassified.t__SGB2301: 8688 hits

Total gene families from nucleotide alignment: 3163

Unaligned reads after nucleotide alignment: 19.5532020984 %


Running diamond ........


Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

Total bugs after translated alignment: 4
unclassified.t__SGB1871: 78602 hits
unclassified.t__SGB2091: 27259 hits
unclassified.t__SGB2301: 8688 hits
unclassified: 5610 hits

Total gene families after translated alignment: 3384

Unaligned reads after translated alignment: 15.6161555154 %


Computing gene families ...

Computing reaction and pathway abundance ...

Output files created: 
/workspaces/humann/output/demo_2_genefamilies.tsv
/workspaces/humann/output/demo_3_reactions.tsv
/workspaces/humann/output/demo_4_pathabundance.tsv
/workspaces/humann/output/demo_0.log

humann --input examples/demo.fastq --output $OUTPUT_DIR --bypass-nucleotide-search

Output files will be written to: /workspaces/humann/output

Running diamond ........


Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

Total bugs after translated alignment: 1
unclassified: 63367 hits

Total gene families after translated alignment: 1359

Unaligned reads after translated alignment: 56.6194492629 %


Computing gene families ...

Computing reaction and pathway abundance ...

Output files created: 
/workspaces/humann/output/demo_2_genefamilies.tsv
/workspaces/humann/output/demo_3_reactions.tsv
/workspaces/humann/output/demo_4_pathabundance.tsv
/workspaces/humann/output/demo_0.log

可以看到如果直使用全部的核苷酸数据库搜素(--bypass-prescreen)、接跳过核苷酸搜索(--bypass-nucleotide-search)区别如下:

	--bypass-prescreen	--bypass-nucleotide-search
Total bugs from nucleotide alignment(核苷酸比对的细菌数)	3	0
Total gene families from nucleotide alignment(比对上的基因家族数)	3163	0
Unaligned reads after nucleotide alignment(核苷酸比对后没有比对上核苷酸的reads数)	19.5532020984 %	0
Total gene families after translated alignment(翻译比对后的总基因家族)	3163+`221`=3384	1359
Unaligned reads after translated alignment(翻译比对后没有比对上的reads)	15.6161555154 %	56.6194492629 %

可以看到使用全部的核苷酸数据库搜素后有使用蛋白搜索找到221个基因，加上核苷酸搜索一共3384个基因；直接使用蛋白搜索找到1359个基因。

基因通路定量

数据库

查看数据目录命令: humann_config --print

database_folders : nucleotide = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO
database_folders : protein = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO
database_folders : utility_mapping = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO

查看核酸数据库内容

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO
$ ls
# SGB1871_pangenome90.fna.gz  SGB2091_pangenome90.fna.gz  SGB2301_pangenome90.fna.gz
$ zcat SGB1871_pangenome90.fna.gz | grep ">" | head
# >SGB1871|SGB1871|UniRef90_A0A0P0EZQ9|UniRef50_A0A0P0L0B2|954
# >SGB1871|SGB1871|UniRef90_A0A1C7GUW0|UniRef50_A0A0P0FA44|474

可以看到核酸数据库一个文件就是一个菌，每个文件里报告了该菌的所有基因

查看蛋白数据库

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO
$ ls
# demo_proteins_uniref90_uniref50_like_v2019_06.dmnd

蛋白数据库包含一个dmnd的索引文件，使用改文件可以将demo_diamond_aligned.tsv转换成demo_diamond_aligned.tsv

$ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_unaligned.fa
# >61NLYAAXX100508:5:100:10009:13087/1
# ACCACCGGCATAGTGGTCACATATATCATAGGGCTTTTCTATGGCATGGCTCCCATGGATGGTGTCAAATTGCTCGGTGACGGTATTGACTCTATGGGCAA
# >61NLYAAXX100508:5:100:10018:9949/1
# GATCCGTTCGTTCCCGGTGACGGGAATCCTACACGATCCTGGTAAGTATTATTCCCCCACAAAGAGGTAATTTTGGAAGAAGAATGGGATAGACGCATGGT
$ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_aligned.tsv
# 61NLYAAXX100508:5:100:10001:14485/1|101 UniRef90_A0A2A3N3F7|1467        72.7    33      9       0       1       99      383     415     # 1.53e-11        52.4
# 61NLYAAXX100508:5:100:10006:12287/1|101 UniRef90_A0A125MG82|1782        72.7    33      9       0       101     3       175     207     # 2.41e-15        63.2

关于utility数据库

$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO
$ ls
# README.txt                  map_go_name.txt.gz      map_level4ec_name.txt.gz     map_pfam_name.txt.gz                                   # mpa_vJan21_CHOCOPhlAnSGB_202103.tsv  vOct22_SGB_mapping.tsv
# map_eggnog_name.txt.gz      map_ko_name.txt.gz      map_metacyc-pwy_name.txt.gz  metacyc_pathways_structured_filtered_v24_subreactions  unipathway_pathways
# map_eggnog_uniref90.txt.gz  map_ko_uniref90.txt.gz  map_metacyc-rxn_name.txt.gz  metacyc_reactions_level4ec_only.uniref.bz2             unipathway_uniprots.uniref.bz2

utility数据库是将UniRef90的ID转为其他Id的对应关系(例如KO、GO)

参考

本文要讨论的是在研究宏基因组数据时，要解决的问题"What are they doing?"。我们很容易想到从meta'omic数据分析微生物群落的功能，包括以下几个方式:

使用balstp在蛋白数据库中搜索所有reads

HUMAnN通过分层搜索生成宏基因组的reads到已知或未知物种分类的基因序列映射，这些映射按照质量和序列长度进行加权，以估计单个物种或者整个群落基因家族的丰度。根据基因家族丰度可以重新组合到其他功能系统(例如 COGs, KOs, Pfam domains, 和GO terms)。最终基因家族被注释为代谢酶，对代谢酶进一步分析，以重建和量化每个物种和群落的完整的代谢途径(默认通过MetaCyc)。
HUMAnN命令如下：

humann  --nucleotide-database /data/databases/humann/chocophlan_v31_201901 \
	--protein-database /data/databases/humann/uniref90_v201901_ec_filtered \
	--threads 10 \
	--remove-temp-output  \
	--input test.clean.gz \
	--output test   \
	--metaphlan-options="--index mpa_vJan21_CHOCOPhlAnSGB_202103     \
		--bowtie2db /data/databases/metaphlan_databases_202103"

数据库文件包括3个文件:

chocophlan_v31_201901
uniref90_v201901_ec_filtered
metaphlan_databases_202103

chocophlan_v31_201901的目录如下:

uniref90_v201901_ec_filtered的目录如下:

metaphlan_databases_202103的目录如下:

结果文件如下:

test_genefamilies.tsv
test_pathabundance.tsv
test_pathcoverage.tsv

根据基因家族丰度可以重新组合到KO系统

humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_go       -u N -p N -o   ${meta.id}/go/${meta.id}.GO.txt
humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_ko       -u N -p N -o   ${meta.id}/${meta.id}.KO.txt

HUMAnN(HMP Unified Metabolic Analysis Network)，用于从metagenomics或metatranscriptomics量化microbial metabolic pathways和other molecular functions

从HUMAnN的源代码开始学习

git clone git@github.com:biobakery/humann.git
cd humann
pip install .

.
├── genefamilies.tsv
├── humann_temp_y68c605p
├── pathabundance.tsv
├── pathcoverage.tsv
├── eggnog
│   ├── eggnog.rpkm_stratified.txt
│   ├── eggnog.rpkm.txt
│   ├── eggnog.rpkm_unstratified.txt
│   └── eggnog.txt
├── gbm
│   ├── coverage-plot.svg
│   ├── Abundance-RPKs.modules
│   └── GBM.txt
├── genefamilies
│   ├── genefamilies.rpkm_stratified.tsv
│   ├── genefamilies.rpkm.tsv
│   ├── genefamilies.rpkm_unstratified.tsv
│   ├── genefamilies_stratified.tsv
│   └── genefamilies_unstratified.tsv
├── gmm
│   ├── coverage-plot.svg
│   ├── Abundance-RPKs.modules
│   └── GMM.txt
├── go
│   ├── go.rpkm_stratified.txt
│   ├── go.rpkm.txt
│   ├── go.rpkm_unstratified.txt
│   └── go.txt
├── ko
│   ├── ko.rpkm_stratified.txt
│   ├── ko.rpkm.txt
│   ├── ko.rpkm_unstratified.txt
│   └── ko.txt
├── level4ec
│   ├── level4ec.rpkm_stratified.txt
│   ├── level4ec.rpkm.txt
│   ├── level4ec.rpkm_unstratified.txt
│   └── level4ec.txt
├── module
│   └── Module.txt
├── pathabundance
│   ├── pathabundance.rpkm_stratified.tsv
│   ├── pathabundance.rpkm.tsv
│   ├── pathabundance.rpkm_unstratified.tsv
│   ├── pathabundance_stratified.tsv
│   └── pathabundance_unstratified.tsv
├── pathway
│   └── Pathway.txt
├── pfam
│   ├── pfam.rpkm_stratified.txt
│   ├── pfam.rpkm.txt
│   ├── pfam.rpkm_unstratified.txt
│   └── pfam.txt
└── rxn
    ├── rxn.rpkm_stratified.txt
    ├── rxn.rpkm.txt
    ├── rxn.rpkm_unstratified.txt
    └── rxn.txt

pathabundance.tsv

# Pathway	DM1_1_Abundance
UNMAPPED	21221155.3616584
UNINTEGRATED	5690441.85241983
UNINTEGRATED\|g__Lachnospiraceae_unclassified.s__Lachnospiraceae_bacterium_3_1	878965.418971247
UNINTEGRATED\|g__Mucispirillum.s__Mucispirillum_schaedleri	682884.014131736
UNINTEGRATED\|unclassified	564228.771907507
UNINTEGRATED\|g__Muribaculum.s__Muribaculum_intestinale	438537.115838629

pathcoverage.tsv

# Pathway	DM1_1_Coverage
UNMAPPED	1
UNINTEGRATED	1
UNINTEGRATED\|g__Acutalibacter.s__Acutalibacter_muris	1
UNINTEGRATED\|g__Adlercreutzia.s__Adlercreutzia_equolifaciens	1
UNINTEGRATED\|g__Akkermansia.s__Akkermansia_muciniphila	1
UNINTEGRATED\|g__Asaccharobacter.s__Asaccharobacter_celatus	1

genefamilies.tsv

# Gene Family	DM1_1_Abundance-RPKs
UNMAPPED	57762775
UniRef90_A7V4G2	89245.5797634043
UniRef90_A7V4G2\|g__Bacteroides.s__Bacteroides_uniformis	89245.5797634043
UniRef90_A0A3A9B617	39787.5705471666
UniRef90_A0A3A9B617\|g__Muribaculum.s__Muribaculum_intestinale	39787.5705471666
UniRef90_V2QIK6	34333.3333333333

Plotting stratified functions

ref
我们将使用人类微生物组计划中约 400 个样本的通路丰度表
hmp_pathabund.pcl

FEATURE \ SAMPLE                                                                                      SRS011084    SRS011086      SRS011090
STSite                                                                                                Stool        Tongue_dorsum  Buccal_mucosa
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis                                                  0.000498359  0.000628096    0.000304951
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Acidovorax.s__Acidovorax_ebreus               0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Bacteroides.s__Bacteroides_finegoldii         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Haemophilus.s__Haemophilus_influenzae         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Klebsiella.s__Klebsiella_pneumoniae           0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Staphylococcus.s__Staphylococcus_epidermidis  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Veillonella.s__Veillonella_sp_oral_taxon_158  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|unclassified                                     7.89502e-06  1.60101e-05    0

我们可以看到来自三个不同身体部位的三个样本（通过 STSite metadatum）。我们可以看到一条途径，即 N10-甲酰-四氢叶酸生物合成(N10-formyl-tetrahydrofolate biosynthesis)（1CMET2-PWY）及其分层。在这三个样本中，该途径只能归入未分类层，而且丰度非常低。在第三个样本（SRS011090）中，仅在群落水平上观察到该途径（如何观察到的？）

利用 MaAsLin 2.0（或其他与微生物组数据进行统计关联的方法），我们可以将 HUMAnN 表中的群落总丰度与样本元数据进行比较。在本例中，将群落总途径丰度与样本身体部位相关联的模型显示，( L-homoserine )L-高丝氨酸和 ( L-methionine )L-蛋氨酸生物合成（METSYN-PWY）在三个口腔部位（颊粘膜、舌头和牙菌斑）具有很强的富集性。

现在我们已经确定了这种富集，使用 grep 命令从 hmp_pathabund.pcl 文件中探索该通路的分层情况。

grep 'METSYN-PWY' hmp_pathabund.pcl | less -S

HUMAnN 包含一个工具 humann_barplot，用于帮助可视化分层输出结果。让我们用这个工具来绘制上文确定的差异丰度途径图：

python humann_barplot.py  --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot1.png --focal-feature METSYN-PWY

让我们尝试在分层丰度总和上添加 "排序"：

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot2.png --focal-feature METSYN-PWY --sort sum

一种模式已经开始出现：我们可以清楚地看到，在图的左端（高）端，口腔站点富含。我们可以通过在身体站点（元数据）和“ logStack”缩放（即按照社区丰度的日志成比例地缩放堆叠条的顶部，同时在总条形高度范围内缩放物种贡献），可以继续进行此分析线（即，将堆叠条的顶部缩放到堆叠条的顶部）：

humann_barplot --input hmp_pathabund.pcl --focal-metadata STSite --last-metadata STSite \
    --output plot3.png --focal-feature METSYN-PWY --sort sum metadata --scaling logstack

我们现在可以看到，该途径在口腔粘膜( buccal mucosa)中的富集程度最高，几乎完全由链球菌(Streptococcus species)贡献。该途径在其他两个口腔部位的相对丰富度普遍较低（约为一个数量级，如对数刻度所示），而且在舌体(tongue )部位更常见的是 "未分类"。

MAG基因预测、去冗余和定量耐药基因(抗性基因)