Mapping reads to the reference genome

最后发布时间 : 2023-03-24 14:46:10 浏览量 :

本文参考Bismark的官方教程给出其使用教程，Bismark映射亚硫酸氢盐转换后的序列，确定胞嘧啶（Cytosine）甲基化的状态。

测试数据准备

https://github.com/FelixKrueger/Bismark/tree/master/travis_files

Bismark build index

bismark_genome_preparation data2/index

data2/index
├── Bisulfite_Genome
│   ├── CT_conversion
│   │   ├── BS_CT.1.bt2
│   │   ├── BS_CT.2.bt2
│   │   ├── BS_CT.3.bt2
│   │   ├── BS_CT.4.bt2
│   │   ├── BS_CT.rev.1.bt2
│   │   ├── BS_CT.rev.2.bt2
│   │   └── genome_mfa.CT_conversion.fa
│   └── GA_conversion
│       ├── BS_GA.1.bt2
│       ├── BS_GA.2.bt2
│       ├── BS_GA.3.bt2
│       ├── BS_GA.4.bt2
│       ├── BS_GA.rev.1.bt2
│       ├── BS_GA.rev.2.bt2
│       └── genome_mfa.GA_conversion.fa
└── NC_010473.fa

Mapping reads to the reference genome

 bismark  \
    --bowtie2 \
    -p 3 \
    --non_directional \
    --nucleotide_coverage data2/index/ \
    -1 trimmed/test_R1_trim.fq.gz \
    -2 trimmed/test_R2_trim.fq.gz \
    --basename test_MappedOn_NC010473_trim_bismark \
    --output_dir mapped 2> mapped/test_MappedOn_NC010473.log

--nucleotide_coverage:

mapped
├── test_MappedOn_NC010473_trim_bismark_pe.bam
├── test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt
└── test_MappedOn_NC010473_trim_bismark_PE_report.txt

输出报告文件

test_MappedOn_NC010473_trim_bismark_PE_report.txt

Bismark report for: trimmed/test_R1_trim.fq.gz and trimmed/test_R2_trim.fq.gz (version: v0.22.1)
Bismark was run with Bowtie 2 against the bisulfite genome of /data/wangyang/snakemake-bisulfite/data2/index/ with the specified options: -q --score-min L,0,-0.2 -p 3 --reorder --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--non_directional' specified: alignments to all strands were being performed (OT, OB, CTOT, CTOB)

Final Alignment report
======================
Sequence pairs analysed in total:       10000
Number of paired-end alignments with a unique best hit: 461
Mapping efficiency:     4.6% 
Sequence pairs with no alignments under any condition:  9352
Sequence pairs did not map uniquely:    187
Sequence pairs which were discarded because genomic sequence could not be extracted:    0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:       224     ((converted) top strand)
GA/CT/CT:       0       (complementary to (converted) top strand)
GA/CT/GA:       0       (complementary to (converted) bottom strand)
CT/GA/GA:       237     ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:     0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   24900

Total methylated C's in CpG context:    503
Total methylated C's in CHG context:    344
Total methylated C's in CHH context:    1425
Total methylated C's in Unknown context:        1

Total unmethylated C's in CpG context:  279
Total unmethylated C's in CHG context:  4404
Total unmethylated C's in CHH context:  17945
Total unmethylated C's in Unknown context:      15

C methylated in CpG context:    64.3%
C methylated in CHG context:    7.2%
C methylated in CHH context:    7.4%
C methylated in unknown context (CN or CHN):    6.2%


Bismark completed in 0d 0h 0m 10s

Alignment Stats
Cytosine Methylation
Alignment to Individual Bisulfite Strands

其它输出文件

test_MappedOn_NC010473_trim_bismark_pe.nucleotide_stats.txt

(di-)nucleotide	count sample	percent sample	count genomic	percent genomic	coverage
A	222577	24.44	1153640	24.62	0.193
C	231335	25.41	1190880	25.41	0.194
G	230956	25.36	1188801	25.37	0.194
T	225710	24.79	1152814	24.60	0.196
AA	64691	7.18	341159	7.28	0.190
AC	49564	5.50	259124	5.53	0.191
AG	45719	5.07	240634	5.14	0.190
AT	60571	6.72	312723	6.67	0.194
CA	62836	6.97	328231	7.00	0.191
CC	52285	5.80	274063	5.85	0.191
CG	67335	7.47	349543	7.46	0.193
CT	46542	5.16	239041	5.10	0.195
GA	51569	5.72	270253	5.77	0.191
GC	74833	8.30	387532	8.27	0.193
GG	52142	5.78	272860	5.82	0.191
GT	49885	5.53	258156	5.51	0.193
TA	41210	4.57	213996	4.57	0.193
TC	52394	5.81	270161	5.77	0.194
TG	63254	7.02	325763	6.95	0.194
TT	66510	7.38	342893	7.32	0.194

test_MappedOn_NC010473_trim_bismark_pe.bam

QNAME	FLAG	RNAME	Postion	Mapping Quality	CIGAR比对信息	Mate Reference Name	MPOS	ISIZE	Read sequence	Quality socres
1_Ecoli_K12:3176760-3177135_R1	147	Ecoli_K12	1509655	42	99M	=	1509379	-375	TACGCGTTGAGTTAGGCGAAATTTATCGTTGATGAATTGATGATCGGGTGTATAAATTTTGCGTTTAGTGGAAAATTTGGTATCGGGAAGAATTATTTA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	NM:i:15	MD:Z:12C9C0C4C7C6C8C12C2C6C0C5C9C0C1C3	XM:Z:..Z.Z.......x...Z.....hh..Z.x.......x......xZ.......h........Z...x..z......hx.....xZ........hh.h...	XR:Z:GA	XG:Z:CT
2_Ecoli_K12:2361798-2361918_R1	99	Ecoli_K12	2324341	42	99M	=	2324363	120	TTAATTTGTTTTGTTTTTTTTTTTCGTCGGTAGGTAATTGTTATGAGATGTGTGTTAAAGTTTATGATATAGGTTGGGGCGGTGTGGTGTTTAAAACGA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	NM:i:22	MD:Z:0C0C2C0C0C3C2C0C1C1C2C4C7C2C2C13C0C4C0C2C2C1C29	XM:Z:hh..hxz...x..hh.h.h..h..Z.xZ......h..x..h.............hh....hh..z..h.x.........Z................Z..	XR:Z:CT	XG:Z:CT
2_Ecoli_K12:2361798-2361918_R1	147	Ecoli_K12	2324363	42	98M	=	2324341	-120	TTCGTCGGTAGGTAATTGTTATGAGATGTGTGTTAAAGTTTACGATATAGGTTGGGGCGGTGTGGTGTTTAAAACGATCGGTTTTTTTATTGTTAACG	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII	NM:i:14	MD:Z:4C7C2C2C13C0C4C0C5C1C33C8C1C0C4	XM:Z:..Z.xZ......h..x..h.............hh....hh..Z..h.x.........Z................Z...Z..h........z.hh..Z.	XR:Z:GA	XG:Z:CT

Directional BS-Seq libraries (default)

If a library is directional, only reads which are (bisulfite converted) versions of the original top strand (OT) or the original bottom strand (OB) will be sequenced. Even though the strands complementary to OT (CTOT) and OB (CTOB) are generated in the BS-PCR step they will not be sequenced as they carry the wrong kind of adapter at their 5’-end. By default, Bismark performs only 2 read alignments to the OT and OB strands, thereby ignoring alignments coming from the complementary strands as they should theoretically not be present in the BS-Seq library in question.
Alternatively, BS-Seq libraries can be constructed so that all four different strands generated in the BS-PCR can and will end up in the sequencing library with roughly the same likelihood. In this case all four strands (OT, CTOT, OB, CTOB) can produce valid alignments and the library is called non- directional. Specifying --non_directional instructs Bismark to use all four alignment outputs.

Alignments to the original top strand or to the strand complementary to the original top strand (OT and CTOT) will both yield methylation information for cytosines on the top strand
Alignments to the original bottom strand or to the strand complementary to the original bottom strand (OB and CTOB) will both yield methylation information for cytosines on the bottom strand

Deduplication

PCR duplicates can be computationally identified following alignment to the reference genome since they erroneously(错误) inflate（夸大） the genome coverages and will occur false positive errors in further analysis.
In general, the way we resolve PCR bias is to remove reads that are aligned to the same position on the same strand of the reference genome.

duplicate_bismark \
    --paired \
    --bam mapped/test_MappedOn_NC010473_trim_bismark_pe.bam \
    --output_dir mapped 2> logs/bismark/test_MappedOn_NC010473.deduplication.log

duplicate_bismark的输出文件包括：

test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam
test_MappedOn_NC010473_trim_bismark_pe.deduplicated_report.txt

Total number of alignments analysed in ZF230116-74_1_bismark_bt2_pe.bam:        461
Total number duplicated alignments removed:     0 (0.00%)
Duplicated alignments were found at:    0 different position(s)

Total count of deduplicated leftover sequences: 461 (100.00% of total)

samtools sort mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.bam > mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam
samtools index mapped/test_MappedOn_NC010473_trim_bismark_pe.deduplicated.sorted.bam

甲基化测序 Methylation calling/extraction