HUMAnN2方法的概述
--bypass-prescreen
HUMAnN
ChocoPhlAn
XXX_2_genefamilies.tsv
XXX_3_reactions.tsv
XXX_4_pathabundance.tsv
# XXX_2_genefamilies.tsv UniRef90_A0A174UN13 4316.9913461275 UniRef90_A0A174UN13|unclassified.t__SGB2091 2539.7728292147 # XXX_3_reactions.tsv 2.7.13.3-RXN 2935.7107442709 2.7.13.3-RXN|unclassified.t__SGB1871 2378.7765338079 # XXX_4_pathabundance.tsv PWY-6527: stachyose degradation 212.0226556742 PWY-6527: stachyose degradation|unclassified.t__SGB1871 129.4247760629
由于与泛基因组数据库比对,因此结果中包含了菌信息(也就是核苷酸对应的基因组信息)注意unclassified.t_SGB1871代表一个菌,ChocoPhlAn数据库有其对应的基因组信息
unclassified.t_SGB1871
--bypass-nucleotide-search
unclassified
UniRef90
pathway
# XXX_2_genefamilies.tsv UniRef90_A0A3R4DA77 5982.0488794472 UniRef90_A0A3R4DA77|unclassified 5982.0488794472 # XXX_3_reactions.tsv 3.2.1.21-RXN 308.4762686802 3.2.1.21-RXN|unclassified 308.4762686802 # XXX_4_pathabundance.tsv PWY-6527: stachyose degradation 130.1038499309 PWY-6527: stachyose degradation|unclassified 130.1038499309
可以看到如果让reads只与蛋白数据苦苦比对的话,是没有菌信息的
humann --input examples/demo.fastq --output $OUTPUT_DIR --bypass-prescreen
Output files will be written to: /workspaces/humann/output Creating custom ChocoPhlAn database ........ Running bowtie2-build ........ Running bowtie2 ........ Total bugs from nucleotide alignment: 3 unclassified.t__SGB1871: 78602 hits unclassified.t__SGB2091: 27259 hits unclassified.t__SGB2301: 8688 hits Total gene families from nucleotide alignment: 3163 Unaligned reads after nucleotide alignment: 19.5532020984 % Running diamond ........ Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd Total bugs after translated alignment: 4 unclassified.t__SGB1871: 78602 hits unclassified.t__SGB2091: 27259 hits unclassified.t__SGB2301: 8688 hits unclassified: 5610 hits Total gene families after translated alignment: 3384 Unaligned reads after translated alignment: 15.6161555154 % Computing gene families ... Computing reaction and pathway abundance ... Output files created: /workspaces/humann/output/demo_2_genefamilies.tsv /workspaces/humann/output/demo_3_reactions.tsv /workspaces/humann/output/demo_4_pathabundance.tsv /workspaces/humann/output/demo_0.log
humann --input examples/demo.fastq --output $OUTPUT_DIR --bypass-nucleotide-search
Output files will be written to: /workspaces/humann/output Running diamond ........ Aligning to reference database: demo_proteins_uniref90_uniref50_like_v2019_06.dmnd Total bugs after translated alignment: 1 unclassified: 63367 hits Total gene families after translated alignment: 1359 Unaligned reads after translated alignment: 56.6194492629 % Computing gene families ... Computing reaction and pathway abundance ... Output files created: /workspaces/humann/output/demo_2_genefamilies.tsv /workspaces/humann/output/demo_3_reactions.tsv /workspaces/humann/output/demo_4_pathabundance.tsv /workspaces/humann/output/demo_0.log
可以看到如果直使用全部的核苷酸数据库搜素(--bypass-prescreen)、接跳过核苷酸搜索(--bypass-nucleotide-search)区别如下:
221
可以看到使用全部的核苷酸数据库搜素后有使用蛋白搜索找到221个基因,加上核苷酸搜索一共3384个基因;直接使用蛋白搜索找到1359个基因。
查看数据目录命令: humann_config --print
humann_config --print
database_folders : nucleotide = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO database_folders : protein = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO database_folders : utility_mapping = /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO
$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/chocophlan_DEMO $ ls # SGB1871_pangenome90.fna.gz SGB2091_pangenome90.fna.gz SGB2301_pangenome90.fna.gz $ zcat SGB1871_pangenome90.fna.gz | grep ">" | head # >SGB1871|SGB1871|UniRef90_A0A0P0EZQ9|UniRef50_A0A0P0L0B2|954 # >SGB1871|SGB1871|UniRef90_A0A1C7GUW0|UniRef50_A0A0P0FA44|474
可以看到核酸数据库一个文件就是一个菌,每个文件里报告了该菌的所有基因
$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/uniref_DEMO $ ls # demo_proteins_uniref90_uniref50_like_v2019_06.dmnd
蛋白数据库包含一个dmnd的索引文件,使用改文件可以将demo_diamond_aligned.tsv转换成demo_diamond_aligned.tsv
demo_diamond_aligned.tsv
$ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_unaligned.fa # >61NLYAAXX100508:5:100:10009:13087/1 # ACCACCGGCATAGTGGTCACATATATCATAGGGCTTTTCTATGGCATGGCTCCCATGGATGGTGTCAAATTGCTCGGTGACGGTATTGACTCTATGGGCAA # >61NLYAAXX100508:5:100:10018:9949/1 # GATCCGTTCGTTCCCGGTGACGGGAATCCTACACGATCCTGGTAAGTATTATTCCCCCACAAAGAGGTAATTTTGGAAGAAGAATGGGATAGACGCATGGT $ head /workspaces/humann/bypass-translated-search-output/demo_humann_temp/demo_diamond_aligned.tsv # 61NLYAAXX100508:5:100:10001:14485/1|101 UniRef90_A0A2A3N3F7|1467 72.7 33 9 0 1 99 383 415 # 1.53e-11 52.4 # 61NLYAAXX100508:5:100:10006:12287/1|101 UniRef90_A0A125MG82|1782 72.7 33 9 0 101 3 175 207 # 2.41e-15 63.2
$ cd /usr/local/python/3.12.1/lib/python3.12/site-packages/humann/data/utility_DEMO $ ls # README.txt map_go_name.txt.gz map_level4ec_name.txt.gz map_pfam_name.txt.gz # mpa_vJan21_CHOCOPhlAnSGB_202103.tsv vOct22_SGB_mapping.tsv # map_eggnog_name.txt.gz map_ko_name.txt.gz map_metacyc-pwy_name.txt.gz metacyc_pathways_structured_filtered_v24_subreactions unipathway_pathways # map_eggnog_uniref90.txt.gz map_ko_uniref90.txt.gz map_metacyc-rxn_name.txt.gz metacyc_reactions_level4ec_only.uniref.bz2 unipathway_uniprots.uniref.bz2
utility数据库是将UniRef90的ID转为其他Id的对应关系(例如KO、GO)
本文要讨论的是在研究宏基因组数据时,要解决的问题"What are they doing?"。我们很容易想到从meta'omic数据分析微生物群落的功能,包括以下几个方式:
HUMAnN通过分层搜索生成宏基因组的reads到已知或未知物种分类的基因序列映射,这些映射按照质量和序列长度进行加权,以估计单个物种或者整个群落基因家族的丰度。根据基因家族丰度可以重新组合到其他功能系统(例如 COGs, KOs, Pfam domains, 和GO terms)。最终基因家族被注释为代谢酶,对代谢酶进一步分析,以重建和量化每个物种和群落的完整的代谢途径(默认通过MetaCyc)。HUMAnN命令如下:
宏基因组的reads
已知
未知
humann --nucleotide-database /data/databases/humann/chocophlan_v31_201901 \ --protein-database /data/databases/humann/uniref90_v201901_ec_filtered \ --threads 10 \ --remove-temp-output \ --input test.clean.gz \ --output test \ --metaphlan-options="--index mpa_vJan21_CHOCOPhlAnSGB_202103 \ --bowtie2db /data/databases/metaphlan_databases_202103"
数据库文件包括3个文件:
chocophlan_v31_201901的目录如下:
uniref90_v201901_ec_filtered的目录如下:
结果文件如下:
根据基因家族丰度可以重新组合到KO系统
KO系统
humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_go -u N -p N -o ${meta.id}/go/${meta.id}.GO.txt humann_regroup_table --input ${meta.id}_genefamilies.tsv -g uniref90_ko -u N -p N -o ${meta.id}/${meta.id}.KO.txt
HUMAnN(HMP Unified Metabolic Analysis Network),用于从metagenomics或metatranscriptomics量化microbial metabolic pathways和other molecular functions
metagenomics
metatranscriptomics
microbial metabolic pathways
other molecular functions
从HUMAnN的源代码开始学习
git clone git@github.com:biobakery/humann.git cd humann pip install .
. ├── genefamilies.tsv ├── humann_temp_y68c605p ├── pathabundance.tsv ├── pathcoverage.tsv ├── eggnog │ ├── eggnog.rpkm_stratified.txt │ ├── eggnog.rpkm.txt │ ├── eggnog.rpkm_unstratified.txt │ └── eggnog.txt ├── gbm │ ├── coverage-plot.svg │ ├── Abundance-RPKs.modules │ └── GBM.txt ├── genefamilies │ ├── genefamilies.rpkm_stratified.tsv │ ├── genefamilies.rpkm.tsv │ ├── genefamilies.rpkm_unstratified.tsv │ ├── genefamilies_stratified.tsv │ └── genefamilies_unstratified.tsv ├── gmm │ ├── coverage-plot.svg │ ├── Abundance-RPKs.modules │ └── GMM.txt ├── go │ ├── go.rpkm_stratified.txt │ ├── go.rpkm.txt │ ├── go.rpkm_unstratified.txt │ └── go.txt ├── ko │ ├── ko.rpkm_stratified.txt │ ├── ko.rpkm.txt │ ├── ko.rpkm_unstratified.txt │ └── ko.txt ├── level4ec │ ├── level4ec.rpkm_stratified.txt │ ├── level4ec.rpkm.txt │ ├── level4ec.rpkm_unstratified.txt │ └── level4ec.txt ├── module │ └── Module.txt ├── pathabundance │ ├── pathabundance.rpkm_stratified.tsv │ ├── pathabundance.rpkm.tsv │ ├── pathabundance.rpkm_unstratified.tsv │ ├── pathabundance_stratified.tsv │ └── pathabundance_unstratified.tsv ├── pathway │ └── Pathway.txt ├── pfam │ ├── pfam.rpkm_stratified.txt │ ├── pfam.rpkm.txt │ ├── pfam.rpkm_unstratified.txt │ └── pfam.txt └── rxn ├── rxn.rpkm_stratified.txt ├── rxn.rpkm.txt ├── rxn.rpkm_unstratified.txt └── rxn.txt
pathabundance.tsv
pathcoverage.tsv
genefamilies.tsv