去宿主

最后发布时间 : 2025-05-03 17:33:55 浏览量 :

宏基因组数据,使用bowtie2去宿主的命令如下

 bowtie2 \
        -x ${reference} \
        -1 ${reads[0]} -2 ${reads[1]} \
        --threads ${task.cpus} \
        --un-conc-gz  ${meta.id}.unmapped.fastq.gz \
        2> >(tee ${meta.id}.bowtie2.log >&2) \
        | samtools view --threads $task.cpus -bS -o ${meta.id}.bam

在paired-end reads中,比对到参考基因组的情况分为三种:

  • reads1和reads2都可以回帖到参考基因组,并且是一致性比对
    • (55742462+10320643)*2 = 132126210
  • reads1和reads2都可以回帖到参考基因组,不是一致性比对
    • 1489759*2 = 2979518
  • reads1和reads只有一条可以回帖到参考基因组
    • 499862+775454 = 1275316

在下面的这个例子中,总共有read1和reads2合起来总共有 69711744*2=139423488条reads。比对上的reads数为132126210+2979518+1275316=136381044,那么未比对上的reads为139423488-136381044=3042444

69711744 reads; of these:
  69711744 (100.00%) were paired; of these:
    3648639 (5.23%) aligned concordantly 0 times
    55742462 (79.96%) aligned concordantly exactly 1 time
    10320643 (14.80%) aligned concordantly >1 times
    ----
    3648639 pairs aligned concordantly 0 times; of these:
      1489759 (40.83%) aligned discordantly 1 time
    ----
    2158880 pairs aligned 0 times concordantly or discordantly; of these:
      4317760 mates make up the pairs; of these:
        3042444 (70.46%) aligned 0 times
        499862 (11.58%) aligned exactly 1 time
        775454 (17.96%) aligned >1 times
97.82% overall alignment rate

解释

使用参数--un-conc-gz

参数--un-conc-gz的结果是未比对上的reads,那么该参数输出的reads,也就是未必对到参考基因组上的reads是aligned 0 times还是aligned concordantly 0 times呢?

  • aligned 0 times代表单端未比对上的reads
  • aligned concordantly 0 times代表不一致比对的reads

--un-conc-gz输出的是aligned concordantly 0 times的reads,在这个例子中未比对上的reads数为3648639*2=7297278
--un-conc-gz的结果与samtools view -F2 SAMPLE_mapped_and_unmapped.bam的结果相同,-F2表示过滤掉所有设置了 FLAG 值为 2 的比对记录‌(即排除“正确配对的 reads”)。
reads1和reads2不一致比对的reads,以及reads1与reads2没有同时比对上的reads,都包含在去宿主的结果里。

从bam文件提取未比对上的reads

 samtools view -b -f 12 -F 256 \
   SAMPLE_mapped_and_unmapped.bam \
   > SAMPLE_bothReadsUnmapped.bam 
  • -f 12:筛选满足‌FLAG=12‌的reads,即同时满足0x4(未比对到参考序列),0x8(配对的另一条reads未比对到参考序列),提取‌双端均未比对成功‌的reads。
    • Extract only ( -f ) alignments with both reads unmapped:
  • -F 256:排除‌FLAG含0x100‌的reads(即过滤掉SECONDARY比对,保留主比对),避免重复统计多重比对结果
    • Do not ( -F ) extract alignments which are:

按reads名称 ( -n ) 对 bam 文件排序,使成对reads相邻(2 个并行线程,每个线程使用多达 5G 内存)

samtools sort -n -m 5G -@ 2 SAMPLE_bothReadsUnmapped.bam -o SAMPLE_bothReadsUnmapped_sorted.bam

samtools fastq -@ 8 SAMPLE_bothReadsUnmapped_sorted.bam \
  -1 SAMPLE_host_removed_R1.fastq.gz \
  -2 SAMPLE_host_removed_R2.fastq.gz \
  -0 /dev/null -s /dev/null -n
  • -n:按照reads名称进行排序

参考