--un-conc-gz
宏基因组数据,使用bowtie2去宿主的命令如下
bowtie2 \ -x ${reference} \ -1 ${reads[0]} -2 ${reads[1]} \ --threads ${task.cpus} \ --un-conc-gz ${meta.id}.unmapped.fastq.gz \ 2> >(tee ${meta.id}.bowtie2.log >&2) \ | samtools view --threads $task.cpus -bS -o ${meta.id}.bam
在paired-end reads中,比对到参考基因组的情况分为三种:
在下面的这个例子中,总共有read1和reads2合起来总共有 69711744*2=139423488条reads。比对上的reads数为132126210+2979518+1275316=136381044,那么未比对上的reads为139423488-136381044=3042444
69711744*2=139423488
132126210+2979518+1275316=136381044
139423488-136381044=3042444
69711744 reads; of these: 69711744 (100.00%) were paired; of these: 3648639 (5.23%) aligned concordantly 0 times 55742462 (79.96%) aligned concordantly exactly 1 time 10320643 (14.80%) aligned concordantly >1 times ---- 3648639 pairs aligned concordantly 0 times; of these: 1489759 (40.83%) aligned discordantly 1 time ---- 2158880 pairs aligned 0 times concordantly or discordantly; of these: 4317760 mates make up the pairs; of these: 3042444 (70.46%) aligned 0 times 499862 (11.58%) aligned exactly 1 time 775454 (17.96%) aligned >1 times 97.82% overall alignment rate
解释见
参数--un-conc-gz的结果是未比对上的reads,那么该参数输出的reads,也就是未必对到参考基因组上的reads是aligned 0 times还是aligned concordantly 0 times呢?
aligned 0 times
aligned concordantly 0 times
--un-conc-gz输出的是aligned concordantly 0 times的reads,在这个例子中未比对上的reads数为3648639*2=7297278--un-conc-gz的结果与samtools view -F2 SAMPLE_mapped_and_unmapped.bam的结果相同,-F2表示过滤掉所有设置了 FLAG 值为 2 的比对记录(即排除“正确配对的 reads”)。reads1和reads2不一致比对的reads,以及reads1与reads2没有同时比对上的reads,都包含在去宿主的结果里。
3648639*2=7297278
samtools view -F2 SAMPLE_mapped_and_unmapped.bam
-F2
samtools view -b -f 12 -F 256 \ SAMPLE_mapped_and_unmapped.bam \ > SAMPLE_bothReadsUnmapped.bam
-f 12
-F 256
按reads名称 ( -n ) 对 bam 文件排序,使成对reads相邻(2 个并行线程,每个线程使用多达 5G 内存)
samtools sort -n -m 5G -@ 2 SAMPLE_bothReadsUnmapped.bam -o SAMPLE_bothReadsUnmapped_sorted.bam samtools fastq -@ 8 SAMPLE_bothReadsUnmapped_sorted.bam \ -1 SAMPLE_host_removed_R1.fastq.gz \ -2 SAMPLE_host_removed_R2.fastq.gz \ -0 /dev/null -s /dev/null -n
-n