bcftools 工具介绍

最后发布时间 : 2025-04-08 14:34:46 浏览量 :

下载序列NC_012967到文件NC_012967.fa,创建mock.fastq文件

@mock1
TGCTTATTCATTCTGACTGCCGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

此时原序列和修改后的序列如下:

#原序列 (A)GCTT( )TTCATTCTGACTGC(AA)CGGGCAATATGTCTCTGT(G)TGGATTAAAAAAAGAGTGTCTGATAGCAGC
#修改的 (T)GCTT(A)TTCATTCTGACTGC(  )CGGGCAATATGTCTCTGT(A)TGGATTAAAAAAAGAGTGTCTGATAGCAGC

可以看到我们将第1位的A变成了T,将40位的G变成了A,第6位插入一个A,第20位和21位删除两个A

bowtie2-build  NC_012967.fa NC_012967.fa
bowtie2 -x NC_012967.fa \
    -U  mock.fastq\
 --no-unal \
     -S mock.sam \
     --un mock.unmatched.fastq
cat mock.sam
1 reads; of these:
  1 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    1 (100.00%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
100.00% overall alignment rate

@HD	VN:1.5	SO:unsorted	GO:query
@SQ	SN:NC_012967.1	LN:4629812
@PG	ID:bowtie2	PN:bowtie2	VN:2.5.4	CL:"/ssd1/wy/conda/bin/bowtie2-align-s --wrapper basic-0 -x /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa --passthrough -U /ssd1/wy/workspace2/leipu/liver/mock_data/mock.fa"
mock1	0	NC_012967.1	1	0	5M1I14M2D49M	*	0	0	TGCTTATTCATTCTGACTGCCGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC	JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ	AS:i:-31	XN:i:0	XM:i:2	XO:i:2	XG:i:3	NM:i:5	MD:Z:0A18^AA18G30	YT:Z:UU

此时查看bowtie2的结果与预想的一致,CIGAR是5M1I14M2D49M,表示前5个match,接下来一个 insertion,然后是14个match,然后是2个deletion,最后是49个match

接下来我们使用bcftools来进行variant calling,使用bcftools mpileup查看所有reads贴上的基因组中每个位置可能产生的基因型。

bcftools mpileup -Q 0 -q 0  -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa  mock.sam > mock_bcftools_mpileup.vcf
cat mock_bcftools_mpileup.vcf

结果如下

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	1	.	A	T,<*>	0	.	DP=2;I16=0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,1,0;VDB=0.02;SGB=-0.453602;MQ0F=1	PL:AD	7,6,0,7,6,7:0,2,0
...
NC_012967.1	5	.	T	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,4,16,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
NC_012967.1	6	.	T	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,6,36,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
NC_012967.1	7	.	T	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,7,49,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
...
NC_012967.1	19	.	C	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,19,361,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
NC_012967.1	20	.	A	<*>	0	.	DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0	PL:AD	0,0,0:0,0
NC_012967.1	21	.	A	<*>	0	.	DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0	PL:AD	0,0,0:0,0
NC_012967.1	22	.	C	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,20,400,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
...
NC_012967.1	40	.	G	A,<*>	0	.	DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1	PL:AD	4,3,0,4,3,4:0,1,0

我们发现只有第1位的A变成了T,将40位的G变成了A的变异找到,没有找到indel

可以在bcftools mpileup的帮助中找到参数--min-ireads

-m, --min-ireads INT    Minimum number gapped reads for indel candidates [2]

可以看到对于indel最小需要两条reads这里我们将这个参数设置为0

!bcftools mpileup  -m 0      -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa  mock.sam > mock_bcftools_mpileup.vcf
!cat mock_bcftools_mpileup.vcf

此时可以看到indel的结果

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	5	.	T	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,4,16,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
NC_012967.1	5	.	T	TA	0	.	INDEL;IDV=1;IMF=1;DP=1;I16=0,0,1,0,0,0,40,1600,0,0,0,0,0,0,4,16;QS=0,1;SGB=-0.379885;MQ0F=1	PL:AD	4,3,0:0,1
...
NC_012967.1	19	.	C	<*>	0	.	DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,19,361,0,0;QS=1,0;MQ0F=1	PL:AD	0,3,4:1,0
NC_012967.1	19	.	CAA	C	0	.	INDEL;IDV=1;IMF=1;DP=1;I16=0,0,1,0,0,0,60,3600,0,0,0,0,0,0,19,361;QS=0,1;SGB=-0.379885;MQ0F=1	PL:AD	4,3,0:0,1

这里表示的是第5位的T变成了TA,第19位的CAA变成了C

bcftools mpileup得到的是所有reads回帖到基因组中每个位置可能产生的基因型。
接下来,我们需要使用bcftools call筛选有变化的位点。

bcftools call  mock_bcftools_mpileup.vcf -m -o mock_bcftools_mpileup.call.vcf
cat mock_bcftools_mpileup.call.vcf

结果如下:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	1	.	A	.	24.1477	.	DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
...
NC_012967.1	5	.	T	.	33.5884	.	DP=1;MQ0F=1;AN=2;DP4=1,0,0,0;MQ=0	GT:AD	0/0:1
NC_012967.1	5	.	T	.	25.5984	.	INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
...
NC_012967.1	19	.	C	.	33.5884	.	DP=1;MQ0F=1;AN=2;DP4=1,0,0,0;MQ=0	GT:AD	0/0:1
NC_012967.1	19	.	CAA	.	25.5984	.	INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
...
NC_012967.1	40	.	G	.	24.1477	.	DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
-v, --variants-only             Output variant sites only

如果使用-v选项,则没有任何结果输出

如果将mock.fastq文件换成如下,只有在40位的G变成了A

@mock1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

结果如下

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	40	.	G	A	11.7172	.	DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42	GT:PL:AD	1/1:41,3,0:0,1

与之前结果比较

bcftools mpileup的比较

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
before: NC_012967.1	40	.	G	A,<*>	0	.	DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1	PL:AD	4,3,0,4,3,4:0,1,0
after:  NC_012967.1	40	.	G	A,<*>	0	.	DP=1;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=0	PL:AD	41,3,0,41,3,41:0,1,0

bcftools call的比较

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
before: NC_012967.1	40	.	G	.	24.1477	.	DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
after:  NC_012967.1	40	.	G	A	11.7172	.	DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42	GT:PL:AD	1/1:41,3,0:0,1

构建vcf文件

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.21+htslib-1.21
##bcftoolsCommand=mpileup -m 0 -Q 0 -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa mock.sam
##reference=file:///ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa
##contig=<ID=NC_012967.1,length=4629812>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric, http://samtools.github.io/bcftools/rd-SegBias.pdf">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	40	.	G	A,<*>	0	.	DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1	PL:AD	4,3,0,4,3,4:0,1,0
NC_012967.1	40	.	G	A,<*>	0	.	DP=1;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=0	PL:AD	41,3,0,41,3,41:0,1,0

执行bcftools call

!bcftools call  mock_bcftools_mpileup.vcf -m -o mock_bcftools_mpileup.call.vcf
!cat mock_bcftools_mpileup.call.vcf

结果如下:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	mock.sam
NC_012967.1	40	.	G	.	24.1477	.	DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0	GT:AD	0/0:0
NC_012967.1	40	.	G	A	11.7172	.	DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42	GT:PL:AD	1/1:41,3,0:0,1

对于bcftools mpileup的结果,下面我们能解释下INFO列和FORMAT列的含义

生信小木屋

  • DP=1: 表示该位点总体测序深度(总覆盖次数)为 1
  • I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625:这是 bcftools mpileup 生成的关键统计数组,含义如下
    • 位置0-3: 对应 A/C/G/T 的 REF 支持数(正链+负链)
    • 位置4-7: 对应 A/C/G/T 的 ALT 支持数
    • 位置8-11: ALT 碱基质量总和
    • 位置12-15: ALT 的测序位置偏移平方和
  • QS=0,1,0:碱基质量和(Base quality sum),通常表示的是三种基因型的质量评分:QS[0]:纯合参考、QS[1]:杂合、QS[2]:纯合变异
  • SGB=-0.379885: SGB 是 Segregation-based score(分离评分),是变异模型置信度评分的一部分。用于惩罚低复杂度区域,降低假阳性变异。
  • MQ0F=1: 表示 Mapping Quality 为 0 的 reads 比例(MQ=0 的 reads/总 reads)。 MQ0F = 1 一般表示这个位点不可靠,需要过滤掉!
  • PL:4,3,0,4,3,4:
  • AD:0,1,0:

参考