(Single Nucleotide Polymorphism)利用针对RNA-Seq的比对软件STAR[13]对每个样本的Reads与Unigene序列进行比对,并通过GATK[14]针对RNA-Seq的SNP识别(SNP Calling)流程,识别单核苷酸多态性(Single Nucleotide Polymorphism,SNP)位点。进而可以分析这些SNP位点是否影响了基因的表达水平或者蛋白产物的种类。识别标准如下:(1) 35bp范围内连续出现的单碱基错配不超过3个;(2) 经过序列深度标准化的SNP质量值大于2.0。
MergeBamAlignment MarkDuplicates
java -jar $picard MarkDuplicates I=$opdir/$bn"_sorted.bam" O=$opdir/$bn"_dupMarked.bam" M=$opdir/$bn"_dup.metrics" CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT 2>$opdir/$bn.MarkDuplicates.log
https://gist.github.com/PoisonAlien/c6c03539cf4b1ac41cf1 https://www.jianshu.com/p/6b30a0a38729 https://blog.csdn.net/weixin_39653733/article/details/112281827 #picard mark duplicates gatk MarkDuplicates \ I=output/samtools/220613-95.sorted.bam \ O=output/samtools/220613-95_dupMarked.bam \ M=output/samtools/220613-95_dup.metrics \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=SILENT 2>output/samtools/220613-95.MarkDuplicates.log gatk CreateSequenceDictionary -R genome/chr1.fasta #SplitNCigarReads # 将落在外显子上的reads分离出来,取出N错误碱基,去除内含子区域的reads。这一步太慢了,占用整个流程一半以上运行时间,不知道有没有办法提高速度 gatk SplitNCigarReads \ -R genome/chr1.fasta \ -I output/samtools/220613-95_dupMarked.bam \ -O output/samtools/220613-95_split.bam # 这一步添加sn样本编号等信息,前面sort如果使用samtools因为没有sn信息会报错。 gatk AddOrReplaceReadGroups \ -I output/samtools/220613-95_split.bam \ -O output/samtools/220613-95_split_added.bam \ -LB 220613-95 -PL ILLUMINA -PU 220613-95 -SM 220613-95 tabix -p vcf /ssd1/wy/workspace/sarek/test_data/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz # BaseRecalibrator 碱基质量校正第一步 gatk BaseRecalibrator \ -I output/samtools/220613-95_split_added.bam \ -R genome/chr1.fasta \ --known-sites /ssd1/wy/workspace/sarek/test_data/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz \ -O output/samtools/220613-95_recal.table # ApplyBQSR 应用碱基校正 gatk ApplyBQSR \ --bqsr-recal-file output/samtools/220613-95_recal.table \ -R genome/chr1.fasta \ -I output/samtools/220613-95_split_added.bam \ -O output/samtools/220613-95_recal.bam cp output/samtools/220613-95_split_added.bam output/samtools/220613-95_recal.bam # 并行运行HaplotypeCaller #清理拆分生成的interval文件夹:类似temp_0001_of_10,防止重复运行被上次的结果干扰 gatk HaplotypeCaller \ --native-pair-hmm-threads 10 \ -R genome/chr1.fasta \ -I output/samtools/220613-95_recal.bam \ --minimum-mapping-quality 30 \ -ERC GVCF \ -O output/samtools/220613-95.g.vcf # -D /home/data/gma49/GATK/ref/hg38/var_ref/dbsnp_146.hg38.vcf.gz \ gatk CombineGVCFs \ -R $GENOME \ --variant SRR11618713.g.vcf \ --variant SRR11618726.g.vcf \ --variant SRR11618727.g.vcf \ -O SRR3.all.g.vcf gatk GenotypeGVCFs \ -R $GENOME \ -V SRR3.all.g.vcf \ -O SRR3.vcf