https://hbctraining.github.io/Intro-to-rnaseq-hpc-orchestra/lessons/07_rnaseq_workflow.html
图片alt
Other tools in common use include StringTie, which assembles a transcriptome model from TopHat (or similar tools) before the results are passed through to RSEM or MMSEQ to estimate transcript abundance, and then to Ballgown to identify differentially expressed genes or transcripts, and SOAPdenovo-trans, which simultaneously aligns and assembles reads for analysis via the path of choice.
https://pubmed.ncbi.nlm.nih.gov/31341269/
TopHat: discovering splice junctions with RNA-Seq(2009)TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions(2013)
Trapnell C, Pachter L, Salzberg SL. Bioinformatics. 2009
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. Genome Biol. 2013
建立Index
bowtie2-build --threads 20 \ ../data/chr1.fa \ hg38_ch1 | tee tophat2_index.log
tophat2 -p 30 \ -G ../data/hg38_refseq.gtf \ -o . \ ../tophat_index/hg38_ch1 \ ../cutadapt_fastq/test_R1_cutadapt.fq.gz \ ../cutadapt_fastq/test_R2_cutadapt.fq.gz | tee > tophat2_mapping.log 2>&1
samtools view -H accepted_hits.bam
Left reads: Input : 1922017 Mapped : 279857 (14.6% of input) of these: 3769 ( 1.3%) have multiple alignments (98 have >20) Right reads: Input : 1922017 Mapped : 285446 (14.9% of input) of these: 3923 ( 1.4%) have multiple alignments (101 have >20) 14.7% overall read mapping rate. Aligned pairs: 254089 of these: 3040 ( 1.2%) have multiple alignments 2552 ( 1.0%) are discordant alignments 13.1% concordant pair alignment rate.
STAR --runThreadN 12 --runMode genomeGenerate \ --genomeDir . \ --genomeFastaFiles ../data/chr1.fa \ --sjdbGTFfile ../data/hg38_refseq.gtf \ --sjdbOverhang 100 | tee star_index.log 2>&1
--sjdbOverhang
STAR \ --genomeDir ../star_index \ --runThreadN 30 \ --readFilesIn ../cutadapt_fastq/test_R1_cutadapt.fq.gz ../cutadapt_fastq/test_R2_cutadapt.fq.gz \ --readFilesCommand zcat \ --outFileNamePrefix . \ --outSAMtype BAM Unsorted \ --outSAMstrandField intronMotif \ --outSAMattributes All \ --outFilterIntronMotifs RemoveNoncanonical | tee star_mapping.log 2>&1
hisat2-build -p 30 ../data/chr1.fa hg38_ch1 | tee hisat2-index.log 2>&1
hisat2 -p 30 \ -x ../hisat2_index/hg38_ch1 \ -1 ../cutadapt_fastq/test_R1_cutadapt.fq.gz \ -2 ../cutadapt_fastq/test_R2_cutadapt.fq.gz \ -S hisat2.sam |tee hisat2_mapping.log 2>&1
## Pair-end non strand specific hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ -1 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \ -2 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \ -S nonStrandSpecific.sam 2> nonStrandSpecific.txt samtools view -bS nonStrandSpecific.sam -o nonStrandSpecific.bam samtools sort nonStrandSpecific.bam -o nonStrandSpecific.sort.bam samtools index nonStrandSpecific.sort.bam ## Pair-end strand specific hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ --dta --rna-strandness RF \ -1 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \ -2 ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \ -S StrandSpecific.sam 2> StrandSpecific.txt samtools view -bS StrandSpecific.sam -o StrandSpecific.bam samtools sort StrandSpecific.bam -o StrandSpecific.sort.bam samtools index StrandSpecific.sort.bam grep ENSG00000100271 ../genomic/chr22_with_ERCC92.gtf > TTLL1.gtf grep ENSG00000100304 ../genomic/chr22_with_ERCC92.gtf > TTLL12.gtf gffread -w TTLL12_transcripts.fa -g ../genomic/chr22_with_ERCC92.fa TTLL12.gtf # hisat2 --new-summary -p 30 \ # -x ../hisat2_index/chr22_with_ERCC92 \ # --dta --rna-strandness RF \ # -U out.fq \ # | samtools view -bS - | samtools sort - -o out.bam hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ -U out.fq \ -S out.sam 2> out.txt samtools view -bS out.sam -o out.bam samtools sort out.bam -o out.sort.bam samtools index out.sort.bam seqtk seq -F '#' TTLL12_transcripts.fa > out.fq htseq-count out.sam ../genomic/chr22_with_ERCC92.gtf > count.txt seqtk seq -r out.fq > complement.fq hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ -U complement.fq \ -S complement.sam 2> complement.txt samtools view -bS complement.sam -o complement.bam samtools sort complement.bam -o complement.sort.bam samtools index complement.sort.bam htseq-count complement.bam ../genomic/chr22_with_ERCC92.gtf > complement.txt cat complement.fq out.fq > all.fq hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ -U all.fq \ -S all.sam 2> all.txt samtools view -bS all.sam -o all.bam samtools sort all.bam -o all.sort.bam samtools index all.sort.bam htseq-count all.bam ../genomic/chr22_with_ERCC92.gtf > all.txt hisat2 --new-summary -p 30 \ -x ../hisat2_index/chr22_with_ERCC92 \ --rna-strandness RF \ -U all.fq \ -S all.RF.sam 2> all.RF.txt samtools view -bS all.RF.sam -o all.RF.bam samtools sort all.RF.bam -o all.RF.sort.bam samtools index all.RF.sort.bam htseq-count all.RF.sam ../genomic/chr22_with_ERCC92.gtf > all.RF.txt
build index
hisat2_extract_splice_sites.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.splice_sites.txt hisat2_extract_exons.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.exons.txt
mkdir hisat2 hisat2-build \ -p 30 \ --ss Mus_musculus.GRCm39.107.splice_sites.txt \ --exon Mus_musculus.GRCm39.107.exons.txt \ Mus_musculus.GRCm39.dna_sm.toplevel.fa \ hisat2/Mus_musculus.GRCm39.dna_sm.toplevel
reads mapping
INDEX=`find -L ./ -name "*.1.ht2" | sed 's/\.1.ht2$//'` hisat2 \ -x hisat2/Mus_musculus.GRCm39.dna_sm.toplevel \ -1 M3_1.fastp.fastq.gz \ -2 M3_2.fastp.fastq.gz \ --known-splicesite-infile Mus_musculus.GRCm39.107.splice_sites.txt \ --summary-file M3.hisat2.summary.log \ --threads 16 \ --rg-id M3 --rg SM:M3 \ --no-mixed \ --no-discordant \ --met-stderr --new-summary --dta \ --un-gz M3.unmapped.fastq.gz | samtools view -bS -F 4 -F 8 -F 256 - > M3.bam
对于双端的链特异性测序加上 --rna-strandness FR或者 --rna-strandness RF
--rna-strandness FR
--rna-strandness RF
https://gitee.com/bioinfoFungi/testData/blob/master/RNA-seq/genomic/chr22_with_ERCC92.fa.gz
gzip -dc chr22_with_ERCC92.fa.gz > chr22_with_ERCC92.fa export testData=~/testData
mkdir hisat2_index hisat2-build -p 30 \ ${testData}/RNA-seq/genomic/chr22_with_ERCC92.fa \ hisat2_index/chr22_with_ERCC92
├── chr22_with_ERCC92.1.ht2 ├── chr22_with_ERCC92.2.ht2 ├── chr22_with_ERCC92.3.ht2 ├── chr22_with_ERCC92.4.ht2 ├── chr22_with_ERCC92.5.ht2 ├── chr22_with_ERCC92.6.ht2 ├── chr22_with_ERCC92.7.ht2 └── chr22_with_ERCC92.8.ht2
hisat2 --new-summary -p 30 \ -x hisat2_index/chr22_with_ERCC92 \ -1 ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz \ -2 ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz \ -S hisat2.sam 2> summaty.txt
运行hisat2命令我们可以得到一个sam文件hisat2.sam,有关sam文件的详细介绍见samtools。
hisat2
sam
hisat2.sam
zcat ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | wc -l # 474284 # 474284/4 = 118571
使用循环将所有的样本mapping到参考基因组
mkdir hisat2_mapping for i in `ls ${testData}/RNA-seq/ | grep "UHR\|HBR" | awk '{split($1,arr,".read"); print arr[1]}' | uniq`;do read1="${i}.read1.fastq.gz" read2="${i}.read2.fastq.gz" echo "$i" hisat2 --new-summary -p 30 \ -x hisat2_index/chr22_with_ERCC92 \ -1 ${testData}/RNA-seq/${read1} \ -2 ${testData}/RNA-seq/${read2} \ -S hisat2_mapping/${i}.sam 2> hisat2_mapping/${i}.summary done &
mamba install samtools
samtools view -bS hisat2.sam -o hisat2.bam samtools sort hisat2.bam -o hisat2.sort.bam
查看bam文件
bam
samtools view hisat2.bam | less -S
for i in `ls hisat2_mapping | grep ".sam" | awk '{split($1,arr,".sam"); print arr[1]}'`;do samtools sort hisat2_mapping/$i.sam -o hisat2_mapping/$i.sort.bam done&
process HISAT2_EXTRACTSPLICESITES { tag "$gtf" publishDir = [ path: { "/ssd2/databases/mouse/" }, mode: 'symlink', ] conda "/ssd1/wy/.conda/envs/hisat2" input: path gtf output: path "*.splice_sites.txt", emit: splice_sites path "*.exons.txt", emit: exons script: """ hisat2_extract_splice_sites.py $gtf > ${gtf.baseName}.splice_sites.txt hisat2_extract_exons.py $gtf > ${gtf.baseName}.exons.txt """ } process HISAT2_BUILD { tag "$fasta" cpus = 30 publishDir = [ path: { "/ssd2/databases/mouse/" }, mode: 'symlink', ] conda "/ssd1/wy/.conda/envs/hisat2" input: path fasta path splicesites path exons output: path "hisat2" , emit: index script: def ss = "--ss $splicesites" def exon = "--exon $exons" """ mkdir hisat2 hisat2-build \\ -p $task.cpus \\ $ss \\ $exon \\ $fasta \\ hisat2/${fasta.baseName} """ } fasta= file("/ssd2/databases/mouse/Mus_musculus.GRCm39.dna_sm.toplevel.fa") gtf = file("/ssd2/databases/mouse/Mus_musculus.GRCm39.107.gtf") workflow{ HISAT2_EXTRACTSPLICESITES ( gtf ) splice_sites = HISAT2_EXTRACTSPLICESITES.out.splice_sites exons = HISAT2_EXTRACTSPLICESITES.out.exons ch_hisat2_index = HISAT2_BUILD(fasta, splice_sites,exons).index } // nf run mian.nf -with-conda true -resume
mamba install hisat2
或者创建一个新的环境安装
mamba create -n hisat2 hisat2 conda activate hisat2
RNA sequencing: the teenage years