比对分析
https://hbctraining.github.io/Intro-to-rnaseq-hpc-orchestra/lessons/07_rnaseq_workflow.html
- A, aligners such as TopHat, STAR or HISAT2 use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq and featureCounts, assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM)), gene expression is modelled using tools such as edgeR, DESeq2 and limma+voom, and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
- B, newer, alignment-free tools, such as Kallisto and Salmon, assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport130 (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
- Alternatively, workflow C begins by aligning the reads (typically performed with TopHat1, although STAR and HISAT can also be used), followed by the use of CuffLinks to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Other tools in common use include StringTie, which assembles a transcriptome model from TopHat (or similar tools) before the results are passed through to RSEM or MMSEQ to estimate transcript abundance, and then to Ballgown to identify differentially expressed genes or transcripts, and SOAPdenovo-trans, which simultaneously aligns and assembles reads for analysis via the path of choice.
reads Mapping
- SAM(Sequence Alignment/Map):reads在基因组的上的位置信息存储文件
- BAM(Binary Alignment Map): SAM的二进制文件
- 操作SAM and BAM的软件:samtools
- 比对软件
- tophat2
- STAR
- HISAT2
参考基因组及注释文件
tophat2
TopHat: discovering splice junctions with RNA-Seq(2009)
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions(2013)
Trapnell C, Pachter L, Salzberg SL. Bioinformatics. 2009
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. Genome Biol. 2013
建立Index
bowtie2-build --threads 20 \
../data/chr1.fa \
hg38_ch1 | tee tophat2_index.log
tophat2 -p 30 \
-G ../data/hg38_refseq.gtf \
-o . \
../tophat_index/hg38_ch1 \
../cutadapt_fastq/test_R1_cutadapt.fq.gz \
../cutadapt_fastq/test_R2_cutadapt.fq.gz | tee > tophat2_mapping.log 2>&1
samtools view -H accepted_hits.bam
Left reads:
Input : 1922017
Mapped : 279857 (14.6% of input)
of these: 3769 ( 1.3%) have multiple alignments (98 have >20)
Right reads:
Input : 1922017
Mapped : 285446 (14.9% of input)
of these: 3923 ( 1.4%) have multiple alignments (101 have >20)
14.7% overall read mapping rate.
Aligned pairs: 254089
of these: 3040 ( 1.2%) have multiple alignments
2552 ( 1.0%) are discordant alignments
13.1% concordant pair alignment rate.
STAR(Spliced Transcripts Alignment to a Reference)
STAR --runThreadN 12 --runMode genomeGenerate \
--genomeDir . \
--genomeFastaFiles ../data/chr1.fa \
--sjdbGTFfile ../data/hg38_refseq.gtf \
--sjdbOverhang 100 | tee star_index.log 2>&1
--sjdbOverhang
reads长度
STAR \
--genomeDir ../star_index \
--runThreadN 30 \
--readFilesIn ../cutadapt_fastq/test_R1_cutadapt.fq.gz ../cutadapt_fastq/test_R2_cutadapt.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix . \
--outSAMtype BAM Unsorted \
--outSAMstrandField intronMotif \
--outSAMattributes All \
--outFilterIntronMotifs RemoveNoncanonical | tee star_mapping.log 2>&1
HISAT2
hisat2-build -p 30 ../data/chr1.fa hg38_ch1 | tee hisat2-index.log 2>&1
hisat2 -p 30 \
-x ../hisat2_index/hg38_ch1 \
-1 ../cutadapt_fastq/test_R1_cutadapt.fq.gz \
-2 ../cutadapt_fastq/test_R2_cutadapt.fq.gz \
-S hisat2.sam |tee hisat2_mapping.log 2>&1
reads模拟
## Pair-end non strand specific
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
-1 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \
-2 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \
-S nonStrandSpecific.sam 2> nonStrandSpecific.txt
samtools view -bS nonStrandSpecific.sam -o nonStrandSpecific.bam
samtools sort nonStrandSpecific.bam -o nonStrandSpecific.sort.bam
samtools index nonStrandSpecific.sort.bam
## Pair-end strand specific
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
--dta --rna-strandness RF \
-1 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \
-2 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \
-S StrandSpecific.sam 2> StrandSpecific.txt
samtools view -bS StrandSpecific.sam -o StrandSpecific.bam
samtools sort StrandSpecific.bam -o StrandSpecific.sort.bam
samtools index StrandSpecific.sort.bam
grep ENSG00000100271 ../genomic/chr22_with_ERCC92.gtf > TTLL1.gtf
grep ENSG00000100304 ../genomic/chr22_with_ERCC92.gtf > TTLL12.gtf
gffread -w TTLL12_transcripts.fa -g ../genomic/chr22_with_ERCC92.fa TTLL12.gtf
# hisat2 --new-summary -p 30 \
# -x ../hisat2_index/chr22_with_ERCC92 \
# --dta --rna-strandness RF \
# -U out.fq \
# | samtools view -bS - | samtools sort - -o out.bam
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
-U out.fq \
-S out.sam 2> out.txt
samtools view -bS out.sam -o out.bam
samtools sort out.bam -o out.sort.bam
samtools index out.sort.bam
seqtk seq -F '#' TTLL12_transcripts.fa > out.fq
htseq-count out.sam ../genomic/chr22_with_ERCC92.gtf > count.txt
seqtk seq -r out.fq > complement.fq
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
-U complement.fq \
-S complement.sam 2> complement.txt
samtools view -bS complement.sam -o complement.bam
samtools sort complement.bam -o complement.sort.bam
samtools index complement.sort.bam
htseq-count complement.bam ../genomic/chr22_with_ERCC92.gtf > complement.txt
cat complement.fq out.fq > all.fq
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
-U all.fq \
-S all.sam 2> all.txt
samtools view -bS all.sam -o all.bam
samtools sort all.bam -o all.sort.bam
samtools index all.sort.bam
htseq-count all.bam ../genomic/chr22_with_ERCC92.gtf > all.txt
hisat2 --new-summary -p 30 \
-x ../hisat2_index/chr22_with_ERCC92 \
--rna-strandness RF \
-U all.fq \
-S all.RF.sam 2> all.RF.txt
samtools view -bS all.RF.sam -o all.RF.bam
samtools sort all.RF.bam -o all.RF.sort.bam
samtools index all.RF.sort.bam
htseq-count all.RF.sam ../genomic/chr22_with_ERCC92.gtf > all.RF.txt
build index
hisat2_extract_splice_sites.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.splice_sites.txt
hisat2_extract_exons.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.exons.txt
mkdir hisat2
hisat2-build \
-p 30 \
--ss Mus_musculus.GRCm39.107.splice_sites.txt \
--exon Mus_musculus.GRCm39.107.exons.txt \
Mus_musculus.GRCm39.dna_sm.toplevel.fa \
hisat2/Mus_musculus.GRCm39.dna_sm.toplevel
reads mapping
INDEX=`find -L ./ -name "*.1.ht2" | sed 's/\.1.ht2$//'`
hisat2 \
-x hisat2/Mus_musculus.GRCm39.dna_sm.toplevel \
-1 M3_1.fastp.fastq.gz \
-2 M3_2.fastp.fastq.gz \
--known-splicesite-infile Mus_musculus.GRCm39.107.splice_sites.txt \
--summary-file M3.hisat2.summary.log \
--threads 16 \
--rg-id M3 --rg SM:M3 \
--no-mixed \
--no-discordant \
--met-stderr --new-summary --dta \
--un-gz M3.unmapped.fastq.gz
| samtools view -bS -F 4 -F 8 -F 256 - > M3.bam
对于双端的链特异性测序加上
--rna-strandness FR
或者--rna-strandness RF
测试数据准备
https://gitee.com/bioinfoFungi/testData/blob/master/RNA-seq/genomic/chr22_with_ERCC92.fa.gz
gzip -dc chr22_with_ERCC92.fa.gz > chr22_with_ERCC92.fa
export testData=~/testData
mkdir hisat2_index
hisat2-build -p 30 \
${testData}/RNA-seq/genomic/chr22_with_ERCC92.fa \
hisat2_index/chr22_with_ERCC92
├── chr22_with_ERCC92.1.ht2 ├── chr22_with_ERCC92.2.ht2 ├── chr22_with_ERCC92.3.ht2 ├── chr22_with_ERCC92.4.ht2 ├── chr22_with_ERCC92.5.ht2 ├── chr22_with_ERCC92.6.ht2 ├── chr22_with_ERCC92.7.ht2 └── chr22_with_ERCC92.8.ht2
reads mapping
hisat2 --new-summary -p 30 \
-x hisat2_index/chr22_with_ERCC92 \
-1 ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \
-2 ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \
-S hisat2.sam 2> summaty.txt
运行hisat2
命令我们可以得到一个sam
文件hisat2.sam
,有关sam
文件的详细介绍见samtools。
zcat ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | wc -l
# 474284
# 474284/4 = 118571
使用循环将所有的样本mapping到参考基因组
mkdir hisat2_mapping
for i in `ls ${testData}/RNA-seq/ | grep "UHR\|HBR" | awk '{split($1,arr,".read"); print arr[1]}' | uniq`;do
read1="${i}.read1.fastq.gz"
read2="${i}.read2.fastq.gz"
echo "$i"
hisat2 --new-summary -p 30 \
-x hisat2_index/chr22_with_ERCC92 \
-1 ${testData}/RNA-seq/${read1} \
-2 ${testData}/RNA-seq/${read2} \
-S hisat2_mapping/${i}.sam 2> hisat2_mapping/${i}.summary
done &
将sam文件转换为bam
mamba install samtools
samtools view -bS hisat2.sam -o hisat2.bam
samtools sort hisat2.bam -o hisat2.sort.bam
查看bam
文件
samtools view hisat2.bam | less -S
for i in `ls hisat2_mapping | grep ".sam" | awk '{split($1,arr,".sam"); print arr[1]}'`;do
samtools sort hisat2_mapping/$i.sam -o hisat2_mapping/$i.sort.bam
done&
nextflow
process HISAT2_EXTRACTSPLICESITES {
tag "$gtf"
publishDir = [
path: { "/ssd2/databases/mouse/" },
mode: 'symlink',
]
conda "/ssd1/wy/.conda/envs/hisat2"
input:
path gtf
output:
path "*.splice_sites.txt", emit: splice_sites
path "*.exons.txt", emit: exons
script:
"""
hisat2_extract_splice_sites.py $gtf > ${gtf.baseName}.splice_sites.txt
hisat2_extract_exons.py $gtf > ${gtf.baseName}.exons.txt
"""
}
process HISAT2_BUILD {
tag "$fasta"
cpus = 30
publishDir = [
path: { "/ssd2/databases/mouse/" },
mode: 'symlink',
]
conda "/ssd1/wy/.conda/envs/hisat2"
input:
path fasta
path splicesites
path exons
output:
path "hisat2" , emit: index
script:
def ss = "--ss $splicesites"
def exon = "--exon $exons"
"""
mkdir hisat2
hisat2-build \\
-p $task.cpus \\
$ss \\
$exon \\
$fasta \\
hisat2/${fasta.baseName}
"""
}
fasta= file("/ssd2/databases/mouse/Mus_musculus.GRCm39.dna_sm.toplevel.fa")
gtf = file("/ssd2/databases/mouse/Mus_musculus.GRCm39.107.gtf")
workflow{
HISAT2_EXTRACTSPLICESITES ( gtf )
splice_sites = HISAT2_EXTRACTSPLICESITES.out.splice_sites
exons = HISAT2_EXTRACTSPLICESITES.out.exons
ch_hisat2_index = HISAT2_BUILD(fasta, splice_sites,exons).index
}
// nf run mian.nf -with-conda true -resume
软件安装
mamba install hisat2
或者创建一个新的环境安装
mamba create -n hisat2 hisat2
conda activate hisat2
-
RNA sequencing: the teenage years
↩