gatk HaplotypeCaller \ -R ref/ref.fasta \ -I bams/mother.bam \ -O output/motherHC.vcf \ -L 20:10,000,000-10,200,000
motherHC.vcf内容如下:
motherHC.vcf
使用HaplotypeCaller call出三个T碱基插入的纯合变异。只有很少的reads支持在这个位点插入,这个有多大的可能是真的变异?
当发现indel相关的怪异时,我们需要在igv中显示soft-clips。有关soft-clips的解释见sam格式
当打开soft-clips时,我们发现许多高亮的区域,这些高亮的区域表示错配的碱基。对于这些reads,比对器(诸如bwa)来说,出现soft-clopping错配的碱基比出现插入一个gap的惩罚小。
在Smith-Waterman比对中,序列可能不会从第一个残基到最后一个残基进行比对。末端的子序列可能被截断。我们引入操作“S”来描述截断对齐。假设剪裁的对齐为:clipped_alignmentREF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTGREAD: gggGTGTAACC-GACTAGgggg其中,在reads序列中,大写的碱基是匹配的,小写的碱基被截断(clipped)。此对齐的CIGAR为:3S8M1D6M4S。解释为3 soft, 8 match, 1 deletion, 6 match and 4 soft
gatk HaplotypeCaller \ -R ref/ref.fasta \ -I bams/mother.bam \ -O output/motherHCdebug.vcf \ -bamout output/motherHCdebug.bam \ -L 20:10,002,000-10,003,000
HaplotypeCaller有一个-bamout参数,它允许我们重新对齐reads。重新对齐reads是HaplotypeCaller进行variant call的reads。可以在igv看到重新排列是否修复了bam中混乱的区域。
-bamout
Call variants per-sample with HaplotypeCaller
Likely haplotypes + candidate variant sites
Phased Sequencing、Phased Genotypes
取Active Regions进行Local realignment via graph assembly,使用Dijkstra's 算法遍历。收集那个位置最可能的单倍型(Haplotype)。将这些Haplotype与基因组对齐,然后用它进行基因分型(genotyping)
Example HaplotypeCaller assembly graph
Graph assembly recovers indels and removes artifactsResolves complexity caused by mapper limitations
Showing 100bp region starting at 10:96,825,862 for NA12878
Determine per-read likelihoods (PairHMM)PairHMM aligns each read to each haplotypeUses base qualities as the estimate of error
Determine most likely combination of allele(s) for each siteBased on allele likelihoods (from PairHMM)Apply Bayes’ theorem with ploidy assumption
从reads的基因型到样本的基因型
PL
蓝色是没有变异的If you have one sample with one variant and 10,000 samples that do not have variant, thios becomes a lot more interesting. and so being able to hold on to that confident reference information.
Once we've combined together all the samples that we're gonna analysis in this round. Then we do our joint genotyping, those reference blocks get evaluated.
Encode DNA from the Reference Sequence as 1-Hot Tensor