SNPs + Indels
最后发布时间 : 2023-05-04 17:49:07
浏览量 :
(Single Nucleotide Polymorphism)
利用针对RNA-Seq的比对软件STAR[13]对每个样本的Reads与Unigene序列进行比对,并通过GATK[14]针对RNA-Seq的SNP识别(SNP Calling)流程,识别单核苷酸多态性(Single Nucleotide Polymorphism,SNP)位点。进而可以分析这些SNP位点是否影响了基因的表达水平或者蛋白产物的种类。识别标准如下:
(1) 35bp范围内连续出现的单碱基错配不超过3个;
(2) 经过序列深度标准化的SNP质量值大于2.0。
Mapping to the Reference
Data Cleanup
MergeBamAlignment MarkDuplicates
java -jar $picard MarkDuplicates I=$opdir/$bn"_sorted.bam" O=$opdir/$bn"_dupMarked.bam" M=$opdir/$bn"_dup.metrics" CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT 2>$opdir/$bn.MarkDuplicates.log
https://gist.github.com/PoisonAlien/c6c03539cf4b1ac41cf1
https://www.jianshu.com/p/6b30a0a38729
https://blog.csdn.net/weixin_39653733/article/details/112281827
#picard mark duplicates
gatk MarkDuplicates \
I=output/samtools/220613-95.sorted.bam \
O=output/samtools/220613-95_dupMarked.bam \
M=output/samtools/220613-95_dup.metrics \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=SILENT 2>output/samtools/220613-95.MarkDuplicates.log
gatk CreateSequenceDictionary -R genome/chr1.fasta
#SplitNCigarReads
# 将落在外显子上的reads分离出来,取出N错误碱基,去除内含子区域的reads。这一步太慢了,占用整个流程一半以上运行时间,不知道有没有办法提高速度
gatk SplitNCigarReads \
-R genome/chr1.fasta \
-I output/samtools/220613-95_dupMarked.bam \
-O output/samtools/220613-95_split.bam
# 这一步添加sn样本编号等信息,前面sort如果使用samtools因为没有sn信息会报错。
gatk AddOrReplaceReadGroups \
-I output/samtools/220613-95_split.bam \
-O output/samtools/220613-95_split_added.bam \
-LB 220613-95 -PL ILLUMINA -PU 220613-95 -SM 220613-95
tabix -p vcf /ssd1/wy/workspace/sarek/test_data/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz
# BaseRecalibrator 碱基质量校正第一步
gatk BaseRecalibrator \
-I output/samtools/220613-95_split_added.bam \
-R genome/chr1.fasta \
--known-sites /ssd1/wy/workspace/sarek/test_data/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz \
-O output/samtools/220613-95_recal.table
# ApplyBQSR 应用碱基校正
gatk ApplyBQSR \
--bqsr-recal-file output/samtools/220613-95_recal.table \
-R genome/chr1.fasta \
-I output/samtools/220613-95_split_added.bam \
-O output/samtools/220613-95_recal.bam
cp output/samtools/220613-95_split_added.bam output/samtools/220613-95_recal.bam
# 并行运行HaplotypeCaller #清理拆分生成的interval文件夹:类似temp_0001_of_10,防止重复运行被上次的结果干扰
gatk HaplotypeCaller \
--native-pair-hmm-threads 10 \
-R genome/chr1.fasta \
-I output/samtools/220613-95_recal.bam \
--minimum-mapping-quality 30 \
-ERC GVCF \
-O output/samtools/220613-95.g.vcf
# -D /home/data/gma49/GATK/ref/hg38/var_ref/dbsnp_146.hg38.vcf.gz \
gatk CombineGVCFs \
-R $GENOME \
--variant SRR11618713.g.vcf \
--variant SRR11618726.g.vcf \
--variant SRR11618727.g.vcf \
-O SRR3.all.g.vcf
gatk GenotypeGVCFs \
-R $GENOME \
-V SRR3.all.g.vcf \
-O SRR3.vcf
- https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-
- https://gist.github.com/PoisonAlien/c6c03539cf4b1ac41cf1