Mapping reads to the reference genome
本文参考Bismark的官方教程给出其使用教程,Bismark映射亚硫酸氢盐转换后的序列,确定胞嘧啶(Cytosine)甲基化的状态。
测试数据准备
https://github.com/FelixKrueger/Bismark/tree/master/travis_files
Bismark build index
bismark_genome_preparation data2/index
data2/index
├── Bisulfite_Genome
│ ├── CT_conversion
│ │ ├── BS_CT.1.bt2
│ │ ├── BS_CT.2.bt2
│ │ ├── BS_CT.3.bt2
│ │ ├── BS_CT.4.bt2
│ │ ├── BS_CT.rev.1.bt2
│ │ ├── BS_CT.rev.2.bt2
│ │ └── genome_mfa.CT_conversion.fa
│ └── GA_conversion
│ ├── BS_GA.1.bt2
│ ├── BS_GA.2.bt2
│ ├── BS_GA.3.bt2
│ ├── BS_GA.4.bt2
│ ├── BS_GA.rev.1.bt2
│ ├── BS_GA.rev.2.bt2
│ └── genome_mfa.GA_conversion.fa
└── NC_010473.fa
Mapping reads to the reference genome
bismark \
--bowtie2 \
-p 3 \
--non_directional \
--nucleotide_coverage data2/index/ \
-1 trimmed/test_R1_trim.fq.gz \
-2 trimmed/test_R2_trim.fq.gz \
--basename test_MappedOn_NC010473_trim_bismark \
--output_dir mapped 2> mapped/test_MappedOn_NC010473.log
--nucleotide_coverage
:
mapped
├── test_MappedOn_NC010473_trim_bismark_pe.bam
├── test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt
└── test_MappedOn_NC010473_trim_bismark_PE_report.txt
输出报告文件
test_MappedOn_NC010473_trim_bismark_PE_report.txt
Bismark report for: trimmed/test_R1_trim.fq.gz and trimmed/test_R2_trim.fq.gz (version: v0.22.1)
Bismark was run with Bowtie 2 against the bisulfite genome of /data/wangyang/snakemake-bisulfite/data2/index/ with the specified options: -q --score-min L,0,-0.2 -p 3 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)
Final Alignment report
======================
Sequence pairs analysed in total: 10000
Number of paired-end alignments with a unique best hit: 461
Mapping efficiency: 4.6%
Sequence pairs with no alignments under any condition: 9352
Sequence pairs did not map uniquely: 187
Sequence pairs which were discarded because genomic sequence could not be extracted: 0
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 224 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 237 ((converted) bottom strand)
Number of alignments to (merely theoretical) complementary strands being rejected in total: 0
Final Cytosine Methylation Report
=================================
Total number of C's analysed: 24900
Total methylated C's in CpG context: 503
Total methylated C's in CHG context: 344
Total methylated C's in CHH context: 1425
Total methylated C's in Unknown context: 1
Total unmethylated C's in CpG context: 279
Total unmethylated C's in CHG context: 4404
Total unmethylated C's in CHH context: 17945
Total unmethylated C's in Unknown context: 15
C methylated in CpG context: 64.3%
C methylated in CHG context: 7.2%
C methylated in CHH context: 7.4%
C methylated in unknown context (CN or CHN): 6.2%
Bismark completed in 0d 0h 0m 10s
- Alignment Stats
- Cytosine Methylation
- Alignment to Individual Bisulfite Strands
其它输出文件
test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt
(di-)nucleotide count sample percent sample count genomic percent genomic coverage
A 222577 24.44 1153640 24.62 0.193
C 231335 25.41 1190880 25.41 0.194
G 230956 25.36 1188801 25.37 0.194
T 225710 24.79 1152814 24.60 0.196
AA 64691 7.18 341159 7.28 0.190
AC 49564 5.50 259124 5.53 0.191
AG 45719 5.07 240634 5.14 0.190
AT 60571 6.72 312723 6.67 0.194
CA 62836 6.97 328231 7.00 0.191
CC 52285 5.80 274063 5.85 0.191
CG 67335 7.47 349543 7.46 0.193
CT 46542 5.16 239041 5.10 0.195
GA 51569 5.72 270253 5.77 0.191
GC 74833 8.30 387532 8.27 0.193
GG 52142 5.78 272860 5.82 0.191
GT 49885 5.53 258156 5.51 0.193
TA 41210 4.57 213996 4.57 0.193
TC 52394 5.81 270161 5.77 0.194
TG 63254 7.02 325763 6.95 0.194
TT 66510 7.38 342893 7.32 0.194
test_MappedOn_NC010473_trim_bismark_pe.bam
QNAME | FLAG | RNAME | Postion | Mapping Quality | CIGAR比对信息 | Mate Reference Name | MPOS | ISIZE | Read sequence | Quality socres | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1_Ecoli_K12:3176760-3177135_R1 | 147 | Ecoli_K12 | 1509655 | 42 | 99M | = | 1509379 | -375 | TACGCGTTGAGTTAGGCGAAATTTATCGTTGATGAATTGATGATCGGGTGTATAAATTTTGCGTTTAGTGGAAAATTTGGTATCGGGAAGAATTATTTA | IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | NM:i:15 | MD:Z:12C9C0C4C7C6C8C12C2C6C0C5C9C0C1C3 | XM:Z:..Z.Z.......x...Z.....hh..Z.x.......x......xZ.......h........Z...x..z......hx.....xZ........hh.h... | XR:Z:GA | XG:Z:CT |
2_Ecoli_K12:2361798-2361918_R1 | 99 | Ecoli_K12 | 2324341 | 42 | 99M | = | 2324363 | 120 | TTAATTTGTTTTGTTTTTTTTTTTCGTCGGTAGGTAATTGTTATGAGATGTGTGTTAAAGTTTATGATATAGGTTGGGGCGGTGTGGTGTTTAAAACGA | IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | NM:i:22 | MD:Z:0C0C2C0C0C3C2C0C1C1C2C4C7C2C2C13C0C4C0C2C2C1C29 | XM:Z:hh..hxz...x..hh.h.h..h..Z.xZ......h..x..h.............hh....hh..z..h.x.........Z................Z.. | XR:Z:CT | XG:Z:CT |
2_Ecoli_K12:2361798-2361918_R1 | 147 | Ecoli_K12 | 2324363 | 42 | 98M | = | 2324341 | -120 | TTCGTCGGTAGGTAATTGTTATGAGATGTGTGTTAAAGTTTACGATATAGGTTGGGGCGGTGTGGTGTTTAAAACGATCGGTTTTTTTATTGTTAACG | IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | NM:i:14 | MD:Z:4C7C2C2C13C0C4C0C5C1C33C8C1C0C4 | XM:Z:..Z.xZ......h..x..h.............hh....hh..Z..h.x.........Z................Z...Z..h........z.hh..Z. | XR:Z:GA | XG:Z:CT |
- If a library is directional, only reads which are (bisulfite converted) versions of the original top strand (OT) or the original bottom strand (OB) will be sequenced. Even though the strands complementary to OT (CTOT) and OB (CTOB) are generated in the BS-PCR step they will not be sequenced as they carry the wrong kind of adapter at their 5’-end. By default, Bismark performs only 2 read alignments to the OT and OB strands, thereby ignoring alignments coming from the complementary strands as they should theoretically not be present in the BS-Seq library in question.
- Alternatively, BS-Seq libraries can be constructed so that all four different strands generated in the BS-PCR can and will end up in the sequencing library with roughly the same likelihood. In this case all four strands (OT, CTOT, OB, CTOB) can produce valid alignments and the library is called non- directional. Specifying --non_directional instructs Bismark to use all four alignment outputs.
Alignments to the original top strand or to the strand complementary to the original top strand (OT and CTOT) will both yield methylation information for cytosines on the top strand
Alignments to the original bottom strand or to the strand complementary to the original bottom strand (OB and CTOB) will both yield methylation information for cytosines on the bottom strand
Deduplication
PCR duplicates can be computationally identified following alignment to the reference genome since they erroneously(错误) inflate(夸大) the genome coverages and will occur false positive errors in further analysis.
In general, the way we resolve PCR bias is to remove reads that are aligned to the same position on the same strand of the reference genome.
duplicate_bismark \
--paired \
--bam mapped/test_MappedOn_NC010473_trim_bismark_pe.bam \
--output_dir mapped 2> logs/bismark/test_MappedOn_NC010473.deduplication.log
duplicate_bismark
的输出文件包括:
test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam
test_MappedOn_NC010473_trim_bismark_pe.deduplicated_report.txt
Total number of alignments analysed in ZF230116-74_1_bismark_bt2_pe.bam: 461
Total number duplicated alignments removed: 0 (0.00%)
Duplicated alignments were found at: 0 different position(s)
Total count of deduplicated leftover sequences: 461 (100.00% of total)
samtools sort mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam > mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam
samtools index mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam