分箱挖掘单菌基因组

最后发布时间 : 2023-08-06 23:30:21 浏览量 :

学习资料

EasyMetagenome
测试数据

使用软件

metaWRAP，Usage_tutorial
三大主流binning程序
- cococt，测试数据及完整代码
- maxbin2
- metabat2

CONCOCT

通过使用核苷酸组成、多个样本中的覆盖率数据和来自配对末端读数的连锁数据来实现宏基因组序列的无监督分类的程序。

# Run concoct
concoct --composition_file contigs.fa --coverage_file coverage_table.tsv -b concoct_output/
# Merge subcontig clustering into original contig clustering
merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv
# Extract bins as individual FASTA
extract_fasta_bins.py contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins

MetaWRAP混合样本分箱

主要使用MetaWRAP，演示基于官方测试数据
主页：
挖掘单菌基因组，需要研究对象复杂度越低、测序深度越大，结果质量越好。要求单样本6GB+，复杂样本如土壤推荐数据量30GB+，至少3个样本
上面的演示数据12个样仅140MB，无法获得单菌基因组，这里使用官方测序数据演示讲解
软件和数据库布置需2-3天，演示数据分析过程超10h，标准30G样也需3-30天，由服务器性能决定。

4.1.1 准备数据和环境变量

    # 流程: https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md
    > 
    # 输入数据：质控后的FASTQ序列，文件名格式必须为*_1.fastq和*_2.fastq
    #           C1_1_kneaddata_paired_1.fastq  -> C1_1_1.fq
    #           C1_1_kneaddata_paired_2.fastq  -> C1_1_2.fq
    #           放置到 binning/temp/qc 目录下
    
    # 拼装获得的contig文件：result/megahit/final.contigs.fa
    #           放置到 binning/temp/megahit 目录下
    #        
    
    # 中间输出文件：
    #     Binning结果：binning/temp/binning
    #     提纯后的Bin统计结果：binning/temp/bin_refinement/metawrap_50_10_bins.stats
    #     Bin定量结果文件：binning/temp/bin_quant/bin_abundance_heatmap.png
    #                      binning/temp/bin_quant/bin_abundance_table.tab (数据表)
    #     Bin物种注释结果：binning/temp/bin_classify/bin_taxonomy.tab
    #     Prokka基因预测结果：binning/temp/bin_annotate/prokka_out/bin.10.ffn 核酸序列
    #     Bin可视化结果：binning/temp/bloblogy/final.contigs.binned.blobplot (数据表)
    #                    binning/temp/bloblogy/blobplot_figures (可视化图)

    # 准备原始数据从头分析，详见公众号或官网教程
    # 这里我们从质控后数据和拼接结果开始
    cd ${wd}
    mkdir -p binning && cd binning
    mkdir -p temp && cd temp
    # 这里基于质控clean数据和拼接好的contigs，自己链接自上游分析
    # 7G质控数据，输入数据文件名格式必须为*_1.fastq和*_2.fastq
    mkdir -p seq
    cd seq
    # 方法1. 下载测序数据
    # for i in `seq 7 9`;do
    #    wget -c http://210.75.224.110/share/meta/metawrap/ERR01134${i}_1.fastq.gz
    #    wget -c http://210.75.224.110/share/meta/metawrap/ERR01134${i}_2.fastq.gz
    # done
    # gunzip *.gz # 解压文件
    # rename .fq .fastq *.fq # 批量修改扩展名
    # 方法2. 复制准备好的数据
    ln -sf ${db}/metawrap/*.fastq ./
    cd ..
    # megahit拼接结果
    mkdir -p megahit
    cd megahit
    # wget -c http://210.75.224.110/share/meta/metawrap/final.contigs.fa.gz
    # gunzip *.gz
    ln -s ${db}/metawrap/*.fa ./
    cd ../..
    
    # 加载运行环境
    cd ${wd}/binning
    conda activate metawrap

4.1.2 运行三种分箱软件

    metawrap -v
    # 输入文件为contig和clean reads
    # 调用三大主流binning程序cococt, maxbin2, metabat2
    # 8p线程2h，24p耗时1h
    # nohup 和 & 保证任务在后台不被中断，且记录输出内容到 nohup.out(可选)
    nohup metawrap binning -o temp/binning -t 1 -a temp/megahit/final.contigs.fa \
      --metabat2 --maxbin2 --concoct temp/seq/ERR*.fastq &
    # 用自己的文件，替换输出文件名为 *1_kneaddata_paired*.fastq 
	# 如果想接上上面的流程使用自己的文件做分析，则把ERR*.fastq替换为 *1_kneaddata_paired*.fastq
    # 输出文件夹 temp/binning 包括3种软件结果和中间文件

    # 8线程2h， 24p 1h
    cd ${wd}/binning
    # rm -rf temp/bin_refinement
    metawrap bin_refinement \
      -o temp/bin_refinement \
      -A temp/binning/metabat2_bins/ \
      -B temp/binning/maxbin2_bins/ \
      -C temp/binning/concoct_bins/ \
      -c 50 -x 10 -t 2
    # 查看高质量Bin的数量，10个，见temp/bin_refinement/metawrap_50_10_bins.stats目录
    wc -l temp/bin_refinement/metawrap_50_10_bins.stats
    # 结果改进程度见temp/bin_refinement/figures/目录

Bin定量

    # 使用salmon计算每个bin在样本中相对丰度
    # 耗时3m，系统用时10m，此处可设置线程，但salmon仍调用全部资源
	
    # 需要指定输出文件夹，包括4.3中的参数的输出目录
    metawrap quant_bins -b temp/bin_refinement/metawrap_50_10_bins -t 8 \
      -o temp/bin_quant -a temp/megahit/final.contigs.fa temp/seq/ERR*.fastq
    # 文件名字改变
    # 结果包括bin丰度热图`temp/bin_quant/bin_abundance_heatmap.png`
    # 如果想自己画图，原始数据位于`temp/bin_quant/bin_abundance_table.tab`
    ls -l temp/bin_quant/bin_abundance_heatmap.png

分箱注释Bin

    # Taxator-tk对每条contig物种注释，再估计bin整体的物种，11m (用时66 min)
    metawrap classify_bins -b temp/bin_refinement/metawrap_50_10_bins \
      -o temp/bin_classify -t 2 &
    # 注释结果见`temp/bin_classify/bin_taxonomy.tab`

    # export LD_LIBRARY_PATH=/conda2/envs/metagenome_env/lib/:${LD_LIBRARY_PATH}
     # 这是动态链接库找不到时的一个简单的应急策略
    ln -s /conda2/envs/metagenome_env/lib/libssl.so.1.0.0 .
    ln -s /conda2/envs/metagenome_env/lib/libcrypto.so.1.0.0 .
	
    # 基于prokka基因注释，4m
    metaWRAP annotate_bins -o temp/bin_annotate \
      -b temp/bin_refinement/metawrap_50_10_bins  -t 1
    # 每个bin基因注释的gff文件bin_funct_annotations, 
    # 核酸ffn文件bin_untranslated_genes，
    # 蛋白faa文件bin_translated_genes

MetaWRAP单样本分别组装和分箱

多样本受硬件、计算时间限制无法完成时，需要单样本组装、分析。或想进一步提高组装质量，减少污染和杂合度，也可以单样本组装。

参数设定

    # 样本名
    i=ERR011347
    # 线程数
    p=1
    # 任务数
    j=2
    # 定义完整度和污染率的阈值(50, 5; Finn NBT 2020; 50, 10, Bowers NBT 2017)
    c=50
    x=10

输和文件在seq目录

    mkdir -p seq
    ln -s `pwd`/temp/seq/*.fastq seq/

1 megahit组装

单样本并行组装，13m，314m

    rm -rf temp/megahit_*
    time parallel -j ${j} \
    "metawrap assembly \
        -1 seq/{}_1.fastq \
        -2 seq/{}_2.fastq \
        -o temp/megahit_{} \
        -m 100 -t ${p} --megahit" \
     ::: `ls seq/|cut -f1 -d '_'|uniq`

2 运行三种bin软件

    # 192p, 15m (concoct会使用所有线程)
    parallel -j ${j} \
    "metawrap binning \
        -o temp/binning_{} -t ${p} \
        -a temp/megahit_{}/final_assembly.fasta \
        --metabat2 --maxbin2 --concoct \
        seq/{}_*.fastq" \
    ::: `ls seq/|cut -f1 -d '_'|uniq`

3 Bin提纯

    # 24p，10h
    parallel -j ${j} \
    "metawrap bin_refinement \
      -o temp/bin_refinement_{} -t ${p} \
      -A temp/binning_{}/metabat2_bins/ \
      -B temp/binning_{}/maxbin2_bins/ \
      -C temp/binning_{}/concoct_bins/ \
      -c ${c} -x ${x}" \
    ::: `ls seq/|cut -f1 -d '_'|uniq`

4.2 dRep去冗余种/株基因组集

    # 进入虚拟环境，没有用conda安装
    
    # conda activate drep
    source ${soft}/bin/activate drep
    cd ${wd}/binning

合并所有bin至同一目录

    mkdir -p temp/drep_in
    # 混合组装分箱并重命名
    ln -s `pwd`/temp/bin_refinement/metawrap_50_10_bins/bin.* temp/drep_in/
    rename 'bin' 'mix_all' temp/drep_in/bin.*
    
    # 单样品组装分箱结果重命名
    for i in `ls seq/|cut -f1 -d '_'|uniq`;do
       ln -s `pwd`/temp/bin_refinement_${i}/metawrap_50_10_bins/bin.* temp/drep_in/
       rename "bin." "s_${i}" temp/drep_in/bin.*
    done
    # 统计混合和单样本来源数据，10个混，5个单
    ls temp/drep_in/|cut -f 1 -d '_'|uniq -c
    # 统计混合批次/单样本来源
    ls temp/drep_in/|cut -f 2 -d '_'|cut -f 1 -d '.' |uniq -c

按种水平去冗余：15个为10个，8个来自混拼，2个来自单拼

    mkdir -p temp/drep95
    # 15个，40min
    dRep dereplicate temp/drep95/ \
      -g temp/drep_in/*.fa \
      -sa 0.95 -nc 0.30 -comp 50 -con 10 -p 3

主要结果：

非冗余基因组集：dereplicated_genomes/*.fa
聚类信息表：data_tables/Cdb.csv
聚类和质量图：firgures/clustering

(可选)按株水平汇总

    # 20-30min
    mkdir -p temp/drep95
    dRep dereplicate temp/drep95/ \
      -g temp/drep_in/*.fa \
      -sa 0.99 -nc 0.30 -comp 50 -con 10 -p 24

宏基因组组装物种注释和进化树