展开

GATK4 HaplotypeCaller

最后发布时间 : 2023-08-05 23:39:24 浏览量 :

学习资料

本片文章涉及的工具

  • GATK 4.0

Physical phasing:如果两个杂合变异体,它们彼此非常接近。会存在这个问题,这两种杂合变异在是在同一条链上,还是在相反的方向上。

生信小木屋

生信小木屋

HaplotypeCaller简介

通过局部组装haplotypes call germline的SNP和indels。由于HaplotypeCaller能够在active region通过haplotypes的局部de-novo assembly的同时call SNP和indel。这使得HaplotypeCaller在call 传统上难以call的区域时更加准确,其表现的比UnifiedGenotyper更好。

  • 鉴定有变异信号的区域
  • 对有变异信号的区域组装确定haplotypes
  • 基于reads确定haplotypes的可能性
  • 给样本分配基因型

Call variants with HaplotypeCaller in default VCF mode

gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O output/motherHC.vcf \
    -L 20:10,000,000-10,200,000

motherHC.vcf内容如下:

生信小木屋

motherHC.vcf在igv的展示如下
生信小木屋

生信小木屋

使用bwa比对的bam文件进行call variant存在问题

使用HaplotypeCaller call出三个T碱基插入的纯合变异。只有很少的reads支持在这个位点插入,这个有多大的可能是真的变异?

生信小木屋

当发现indel相关的怪异时,我们需要在igv中显示soft-clips。有关soft-clips的解释见sam格式

生信小木屋

当打开soft-clips时,我们发现许多高亮的区域,这些高亮的区域表示错配的碱基。对于这些reads,比对器(诸如bwa)来说,出现soft-clopping错配的碱基比出现插入一个gap的惩罚小。

在Smith-Waterman比对中,序列可能不会从第一个残基到最后一个残基进行比对。末端的子序列可能被截断。我们引入操作“S”来描述截断对齐。假设剪裁的对齐为:
clipped_alignment
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ: gggGTGTAACC-GACTAGgggg
其中,在reads序列中,大写的碱基是匹配的,小写的碱基被截断(clipped)。此对齐的CIGAR为:3S8M1D6M4S。解释为3 soft, 8 match, 1 deletion, 6 match and 4 soft

生信小木屋

View realigned reads and assembled haplotypes

gatk HaplotypeCaller \
    -R ref/ref.fasta \
    -I bams/mother.bam \
    -O output/motherHCdebug.vcf \
    -bamout output/motherHCdebug.bam \
    -L 20:10,002,000-10,003,000

HaplotypeCaller有一个-bamout参数,它允许我们重新对齐reads。重新对齐reads是HaplotypeCaller进行variant call的reads。可以在igv看到重新排列是否修复了bam中混乱的区域。

生信小木屋

Call variants per-sample with HaplotypeCaller

生信小木屋

Step 1 Identify ActiveRegions

  • Sliding window along the reference
  • Count mismatches, indels and soft-clips
  • measure of entropy

生信小木屋

Step 2: Assemble plausible haplotypes

  • Local realignment via graph assembly
  • Traverse graph to collect most likely haplotypes
  • Align haplotypes to reference using Smith-Waterman

Likely haplotypes + candidate variant sites

  • PGT: phased genotype
  • PID: phase identifier

Phased SequencingPhased Genotypes

生信小木屋

取Active Regions进行Local realignment via graph assembly,使用Dijkstra's 算法遍历。收集那个位置最可能的单倍型(Haplotype)。将这些Haplotype与基因组对齐,然后用它进行基因分型(genotyping)

Example HaplotypeCaller assembly graph

生信小木屋

Graph assembly recovers indels and removes artifacts
Resolves complexity caused by mapper limitations

Showing 100bp region starting at 10:96,825,862 for NA12878
Showing 100bp region starting at 10:96,825,862 for NA12878

Showing 100bp region starting at 10:96,825,862 for NA12878

Step 3: Score haplotypes using PairHMM

Determine per-read likelihoods (PairHMM)
PairHMM aligns each read to each haplotype
Uses base qualities as the estimate of error

PairHMM uses base qualities to score alignments

将单倍型全部组装后,我们想要做的是,给出每个单倍型每条reads的可能性的分。这里我们可以使用PairHMM来估计每一条reads发生的可能性根据我们对错误率和排序误差的了解。

生信小木屋

生信小木屋

A_{ij} = probability \ of \ haplotype-read pair

Transforming support for haplotypes into support for alleles

在每一个活动区域为每一个单倍型生成每条reads的可能性的一个矩阵。接下来,将这个矩阵转化为每个等位基因的可能性。

生信小木屋

生信小木屋

Step 4 : Genotype each sample at each potential variant site

Determine most likely combination of allele(s) for each site
Based on allele likelihoods (from PairHMM)
Apply Bayes’ theorem with ploidy assumption

生信小木屋

后验概率是执果寻因,我们能拿到的是,某一个基因型下reads的概率

生信小木屋

样本基因型概率的计算
生信小木屋

贝叶斯公式是:

P(B|A)=\frac{P(A|B)P(B)}{P(A)}

其中P(B|A)就是后验概率。

生信小木屋

我们的最终目的是求基因型的后验概率,也就是说根据检测到的reads的碱基信息,推测基因型的概率。

P(G_{C/C}|R)=\frac{P(R|G_{C/C})P(G_{C/C})}{P(R)}

根据全概率公式,可以知道P(R)的概率如下:

P(R) = \sum_{k}P(R|G_k)P(G_k) \\ \\ =P(R|G_{C/C})P(G_{C/C}) + P(R|G_{C/A})P(G_{C/A}) + P(R|G_{A/A})P(G_{A/A})

Example PL and GQ calculations

Phred-scaled probability
PL is the normalized Phred-scaled probability of each genotype

生信小木屋

GQ is the Genotype Quality and is the smaller of the 2nd PL and 99
PLs are in pre-set order of possible genotypes, 0/0, 0/1 and 1/1 (for biallelic diploid)
生信小木屋

Joint call GVCF intermediates to produce final VCF

生信小木屋

蓝色是没有变异的
If you have one sample with one variant and 10,000 samples that do not have variant, thios becomes a lot more interesting. and so being able to hold on to that confident reference information.

Once we've combined together all the samples that we're gonna analysis in this round. Then we do our joint genotyping, those reference blocks get evaluated.

Filter Variants

Encode DNA from the Reference Sequence as 1-Hot Tensor

生信小木屋