下载序列NC_012967到文件NC_012967.fa,创建mock.fastq文件
NC_012967.fa
@mock1 TGCTTATTCATTCTGACTGCCGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC + JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
此时原序列和修改后的序列如下:
#原序列 (A)GCTT( )TTCATTCTGACTGC(AA)CGGGCAATATGTCTCTGT(G)TGGATTAAAAAAAGAGTGTCTGATAGCAGC #修改的 (T)GCTT(A)TTCATTCTGACTGC( )CGGGCAATATGTCTCTGT(A)TGGATTAAAAAAAGAGTGTCTGATAGCAGC
可以看到我们将第1位的A变成了T,将40位的G变成了A,第6位插入一个A,第20位和21位删除两个A
bowtie2-build NC_012967.fa NC_012967.fa
bowtie2 -x NC_012967.fa \ -U mock.fastq\ --no-unal \ -S mock.sam \ --un mock.unmatched.fastq cat mock.sam
1 reads; of these: 1 (100.00%) were unpaired; of these: 0 (0.00%) aligned 0 times 1 (100.00%) aligned exactly 1 time 0 (0.00%) aligned >1 times 100.00% overall alignment rate @HD VN:1.5 SO:unsorted GO:query @SQ SN:NC_012967.1 LN:4629812 @PG ID:bowtie2 PN:bowtie2 VN:2.5.4 CL:"/ssd1/wy/conda/bin/bowtie2-align-s --wrapper basic-0 -x /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa --passthrough -U /ssd1/wy/workspace2/leipu/liver/mock_data/mock.fa" mock1 0 NC_012967.1 1 0 5M1I14M2D49M * 0 0 TGCTTATTCATTCTGACTGCCGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:-31 XN:i:0 XM:i:2 XO:i:2 XG:i:3 NM:i:5 MD:Z:0A18^AA18G30 YT:Z:UU
此时查看bowtie2的结果与预想的一致,CIGAR是5M1I14M2D49M,表示前5个match,接下来一个 insertion,然后是14个match,然后是2个deletion,最后是49个match
bowtie2
5M1I14M2D49M
接下来我们使用bcftools来进行variant calling,使用bcftools mpileup查看所有reads贴上的基因组中每个位置可能产生的基因型。
variant calling
bcftools mpileup
bcftools mpileup -Q 0 -q 0 -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa mock.sam > mock_bcftools_mpileup.vcf cat mock_bcftools_mpileup.vcf
结果如下
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 1 . A T,<*> 0 . DP=2;I16=0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,1,0;VDB=0.02;SGB=-0.453602;MQ0F=1 PL:AD 7,6,0,7,6,7:0,2,0 ... NC_012967.1 5 . T <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,4,16,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 NC_012967.1 6 . T <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,6,36,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 NC_012967.1 7 . T <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,7,49,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 ... NC_012967.1 19 . C <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,19,361,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 NC_012967.1 20 . A <*> 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0 PL:AD 0,0,0:0,0 NC_012967.1 21 . A <*> 0 . DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0 PL:AD 0,0,0:0,0 NC_012967.1 22 . C <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,20,400,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 ... NC_012967.1 40 . G A,<*> 0 . DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1 PL:AD 4,3,0,4,3,4:0,1,0
我们发现只有第1位的A变成了T,将40位的G变成了A的变异找到,没有找到indel
可以在bcftools mpileup的帮助中找到参数--min-ireads
--min-ireads
-m, --min-ireads INT Minimum number gapped reads for indel candidates [2]
可以看到对于indel最小需要两条reads这里我们将这个参数设置为0
!bcftools mpileup -m 0 -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa mock.sam > mock_bcftools_mpileup.vcf !cat mock_bcftools_mpileup.vcf
此时可以看到indel的结果
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 5 . T <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,4,16,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 NC_012967.1 5 . T TA 0 . INDEL;IDV=1;IMF=1;DP=1;I16=0,0,1,0,0,0,40,1600,0,0,0,0,0,0,4,16;QS=0,1;SGB=-0.379885;MQ0F=1 PL:AD 4,3,0:0,1 ... NC_012967.1 19 . C <*> 0 . DP=1;I16=1,0,0,0,41,1681,0,0,0,0,0,0,19,361,0,0;QS=1,0;MQ0F=1 PL:AD 0,3,4:1,0 NC_012967.1 19 . CAA C 0 . INDEL;IDV=1;IMF=1;DP=1;I16=0,0,1,0,0,0,60,3600,0,0,0,0,0,0,19,361;QS=0,1;SGB=-0.379885;MQ0F=1 PL:AD 4,3,0:0,1
这里表示的是第5位的T变成了TA,第19位的CAA变成了C
bcftools mpileup得到的是所有reads回帖到基因组中每个位置可能产生的基因型。接下来,我们需要使用bcftools call筛选有变化的位点。
bcftools call
bcftools call mock_bcftools_mpileup.vcf -m -o mock_bcftools_mpileup.call.vcf cat mock_bcftools_mpileup.call.vcf
结果如下:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 1 . A . 24.1477 . DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0 ... NC_012967.1 5 . T . 33.5884 . DP=1;MQ0F=1;AN=2;DP4=1,0,0,0;MQ=0 GT:AD 0/0:1 NC_012967.1 5 . T . 25.5984 . INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0 ... NC_012967.1 19 . C . 33.5884 . DP=1;MQ0F=1;AN=2;DP4=1,0,0,0;MQ=0 GT:AD 0/0:1 NC_012967.1 19 . CAA . 25.5984 . INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0 ... NC_012967.1 40 . G . 24.1477 . DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0
-v, --variants-only Output variant sites only
如果使用-v选项,则没有任何结果输出
-v
如果将mock.fastq文件换成如下,只有在40位的G变成了A
mock.fastq
@mock1 AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTATGGATTAAAAAAAGAGTGTCTGATAGCAGC + JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 40 . G A 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42 GT:PL:AD 1/1:41,3,0:0,1
与之前结果比较
bcftools mpileup的比较
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam before: NC_012967.1 40 . G A,<*> 0 . DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1 PL:AD 4,3,0,4,3,4:0,1,0 after: NC_012967.1 40 . G A,<*> 0 . DP=1;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL:AD 41,3,0,41,3,41:0,1,0
bcftools call的比较
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam before: NC_012967.1 40 . G . 24.1477 . DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0 after: NC_012967.1 40 . G A 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42 GT:PL:AD 1/1:41,3,0:0,1
构建vcf文件
##fileformat=VCFv4.2 ##FILTER=<ID=PASS,Description="All filters passed"> ##bcftoolsVersion=1.21+htslib-1.21 ##bcftoolsCommand=mpileup -m 0 -Q 0 -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa mock.sam ##reference=file:///ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa ##contig=<ID=NC_012967.1,length=4629812> ##ALT=<ID=*,Description="Represents allele(s) other than observed."> ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL."> ##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel"> ##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth"> ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3"> ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)"> ##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)"> ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)"> ##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)"> ##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)"> ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric, http://samtools.github.io/bcftools/rd-SegBias.pdf"> ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)"> ##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h"> ##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods"> ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (high-quality bases)"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 40 . G A,<*> 0 . DP=1;I16=0,0,1,0,0,0,41,1681,0,0,0,0,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=1 PL:AD 4,3,0,4,3,4:0,1,0 NC_012967.1 40 . G A,<*> 0 . DP=1;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,25,625;QS=0,1,0;SGB=-0.379885;MQ0F=0 PL:AD 41,3,0,41,3,41:0,1,0
执行bcftools call
!bcftools call mock_bcftools_mpileup.vcf -m -o mock_bcftools_mpileup.call.vcf !cat mock_bcftools_mpileup.call.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 40 . G . 24.1477 . DP=1;SGB=-0.379885;MQ0F=1;AN=2;DP4=0,0,1,0;MQ=0 GT:AD 0/0:0 NC_012967.1 40 . G A 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=42 GT:PL:AD 1/1:41,3,0:0,1
对于bcftools mpileup的结果,下面我们能解释下INFO列和FORMAT列的含义