本文,我们讨论如何获取vcf文件中,变异位置对应的基因,也就是说确定每个SNP或者INDEL落在哪个基因上。
从NCBI下载NC_012967(Escherichia coli B str. REL606)的基因组NC_012967.fa及GTF3注释文件NC_012967.gff3。
NC_012967.fa
NC_012967.gff3
使用下面命令将NC_012967.gff3文件转bed文件
awk '$3 == "gene" { split($9, attr, /[;=]/); gene_id = attr[4]; OFS="\t"; print $1, $4-1, $5, gene_id }' NC_012967.gff3 > genes.bed
使用beadtools获取基因thrA的序列
thrA
grep "thrA" genes.bed > thrA.bed cat thrA.bed # NC_012967.1 335 2798 thrA
bedtools getfasta -fi /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa -bed thrA.bed -fo thrA.fa # samtools faidx genome.fa cat thrA.fa
>NC_012967.1:335-2798 ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGTGGCGAATC....GCGGGCAATGACGTTACAGCTGCCGGTGTCTTTGCCGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGA
可以看到该序列与NCBI的thrA序列是一样的
模拟测序reads文件mock.fastq
mock.fastq
@mock1 ATGCGAGTGTTGAAGTTGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGA + JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
上述序列在18位缺失一个C
与参考基因组进行比对
!bowtie2 -x /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa \ -U /ssd1/wy/workspace2/leipu/liver/mock_data/mock.fa \ --no-unal \ -S mock.sam \ --un mock.unmatched.fastq !cat mock.sam
鉴定变异
!bcftools mpileup -m 0 -Q 0 \ -f /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa mock.sam > mock_bcftools_mpileup.vcf !cat mock_bcftools_mpileup.vcf
!bcftools call mock_bcftools_mpileup.vcf -mv -o mock_bcftools_mpileup.call.vcf !cat mock_bcftools_mpileup.call.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT mock.sam NC_012967.1 352 . TC T 8.13795 . INDEL;IDV=1;IMF=1;DP=1;SGB=-0.379885;MQSBZ=0;BQBZ=0;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=40 GT:PL:AD 1/1:37,3,0:0,1
将 VCF 转换为 BED 格式
!bcftools query -f '%CHROM\t%POS\t%END\n' mock_bcftools_mpileup.call.vcf > variants.bed !cat variants.bed
用 bedtools intersect 注释变异属于哪个基因
!bedtools intersect -a variants.bed -b genes.bed -wa -wb
NC_012967.1 352 353 NC_012967.1 335 2798 thrA
查看确实碱基
!bcftools query -f '%CHROM\t%POS0\t%POS\n' mock_bcftools_mpileup.call.vcf > snps.bed !cat snps.bed
!bedtools getfasta -fi /ssd1/wy/workspace2/leipu/liver/genomics/NC_012967.fa -bed snps.bed -fo thrA.fa !cat thrA.fa
>NC_012967.1:351-352 T