本文参考Bismark的官方教程给出其使用教程,Bismark映射亚硫酸氢盐转换后的序列,确定胞嘧啶(Cytosine)甲基化的状态。
https://github.com/FelixKrueger/Bismark/tree/master/travis_files
bismark_genome_preparation data2/index
data2/index ├── Bisulfite_Genome │ ├── CT_conversion │ │ ├── BS_CT.1.bt2 │ │ ├── BS_CT.2.bt2 │ │ ├── BS_CT.3.bt2 │ │ ├── BS_CT.4.bt2 │ │ ├── BS_CT.rev.1.bt2 │ │ ├── BS_CT.rev.2.bt2 │ │ └── genome_mfa.CT_conversion.fa │ └── GA_conversion │ ├── BS_GA.1.bt2 │ ├── BS_GA.2.bt2 │ ├── BS_GA.3.bt2 │ ├── BS_GA.4.bt2 │ ├── BS_GA.rev.1.bt2 │ ├── BS_GA.rev.2.bt2 │ └── genome_mfa.GA_conversion.fa └── NC_010473.fa
bismark \ --bowtie2 \ -p 3 \ --non_directional \ --nucleotide_coverage data2/index/ \ -1 trimmed/test_R1_trim.fq.gz \ -2 trimmed/test_R2_trim.fq.gz \ --basename test_MappedOn_NC010473_trim_bismark \ --output_dir mapped 2> mapped/test_MappedOn_NC010473.log
--nucleotide_coverage
mapped ├── test_MappedOn_NC010473_trim_bismark_pe.bam ├── test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt └── test_MappedOn_NC010473_trim_bismark_PE_report.txt
test_MappedOn_NC010473_trim_bismark_PE_report.txt
Bismark report for: trimmed/test_R1_trim.fq.gz and trimmed/test_R2_trim.fq.gz (version: v0.22.1) Bismark was run with Bowtie 2 against the bisulfite genome of /data/wangyang/snakemake-bisulfite/data2/index/ with the specified options: -q --score-min L,0,-0.2 -p 3 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB) Final Alignment report ====================== Sequence pairs analysed in total: 10000 Number of paired-end alignments with a unique best hit: 461 Mapping efficiency: 4.6% Sequence pairs with no alignments under any condition: 9352 Sequence pairs did not map uniquely: 187 Sequence pairs which were discarded because genomic sequence could not be extracted: 0 Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 224 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 237 ((converted) bottom strand) Number of alignments to (merely theoretical) complementary strands being rejected in total: 0 Final Cytosine Methylation Report ================================= Total number of C's analysed: 24900 Total methylated C's in CpG context: 503 Total methylated C's in CHG context: 344 Total methylated C's in CHH context: 1425 Total methylated C's in Unknown context: 1 Total unmethylated C's in CpG context: 279 Total unmethylated C's in CHG context: 4404 Total unmethylated C's in CHH context: 17945 Total unmethylated C's in Unknown context: 15 C methylated in CpG context: 64.3% C methylated in CHG context: 7.2% C methylated in CHH context: 7.4% C methylated in unknown context (CN or CHN): 6.2% Bismark completed in 0d 0h 0m 10s
test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt
(di-)nucleotide count sample percent sample count genomic percent genomic coverage A 222577 24.44 1153640 24.62 0.193 C 231335 25.41 1190880 25.41 0.194 G 230956 25.36 1188801 25.37 0.194 T 225710 24.79 1152814 24.60 0.196 AA 64691 7.18 341159 7.28 0.190 AC 49564 5.50 259124 5.53 0.191 AG 45719 5.07 240634 5.14 0.190 AT 60571 6.72 312723 6.67 0.194 CA 62836 6.97 328231 7.00 0.191 CC 52285 5.80 274063 5.85 0.191 CG 67335 7.47 349543 7.46 0.193 CT 46542 5.16 239041 5.10 0.195 GA 51569 5.72 270253 5.77 0.191 GC 74833 8.30 387532 8.27 0.193 GG 52142 5.78 272860 5.82 0.191 GT 49885 5.53 258156 5.51 0.193 TA 41210 4.57 213996 4.57 0.193 TC 52394 5.81 270161 5.77 0.194 TG 63254 7.02 325763 6.95 0.194 TT 66510 7.38 342893 7.32 0.194
test_MappedOn_NC010473_trim_bismark_pe.bam
Alignments to the original top strand or to the strand complementary to the original top strand (OT and CTOT) will both yield methylation information for cytosines on the top strandAlignments to the original bottom strand or to the strand complementary to the original bottom strand (OB and CTOB) will both yield methylation information for cytosines on the bottom strand
PCR duplicates can be computationally identified following alignment to the reference genome since they erroneously(错误) inflate(夸大) the genome coverages and will occur false positive errors in further analysis.In general, the way we resolve PCR bias is to remove reads that are aligned to the same position on the same strand of the reference genome.
duplicate_bismark \ --paired \ --bam mapped/test_MappedOn_NC010473_trim_bismark_pe.bam \ --output_dir mapped 2> logs/bismark/test_MappedOn_NC010473.deduplication.log
duplicate_bismark的输出文件包括:
duplicate_bismark
test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam
test_MappedOn_NC010473_trim_bismark_pe.deduplicated_report.txt
Total number of alignments analysed in ZF230116-74_1_bismark_bt2_pe.bam: 461 Total number duplicated alignments removed: 0 (0.00%) Duplicated alignments were found at: 0 different position(s) Total count of deduplicated leftover sequences: 461 (100.00% of total)
samtools sort mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam > mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam samtools index mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam