展开

比对分析

最后发布时间 : 2023-07-22 19:00:14 浏览量 :

https://hbctraining.github.io/Intro-to-rnaseq-hpc-orchestra/lessons/07_rnaseq_workflow.html

图片alt

图片alt

[1]

  • A, aligners such as TopHat, STAR or HISAT2 use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq and featureCounts, assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM)), gene expression is modelled using tools such as edgeR, DESeq2 and limma+voom, and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
  • B, newer, alignment-free tools, such as Kallisto and Salmon, assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport130 (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
  • Alternatively, workflow C begins by aligning the reads (typically performed with TopHat1, although STAR and HISAT can also be used), followed by the use of CuffLinks to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Note

Other tools in common use include StringTie, which assembles a transcriptome model from TopHat (or similar tools) before the results are passed through to RSEM or MMSEQ to estimate transcript abundance, and then to Ballgown to identify differentially expressed genes or transcripts, and SOAPdenovo-trans, which simultaneously aligns and assembles reads for analysis via the path of choice.

生信小木屋

https://pubmed.ncbi.nlm.nih.gov/31341269/

reads Mapping

  • SAM(Sequence Alignment/Map):reads在基因组的上的位置信息存储文件
  • BAM(Binary Alignment Map): SAM的二进制文件
  • 操作SAM and BAM的软件:samtools
  • 比对软件
    • tophat2
    • STAR
    • HISAT2

参考基因组及注释文件

tophat2

TopHat: discovering splice junctions with RNA-Seq(2009)
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions(2013)

图片alt

图片alt

Trapnell C, Pachter L, Salzberg SL. Bioinformatics. 2009

图片alt

图片alt

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. Genome Biol. 2013

建立Index

 bowtie2-build --threads 20  \
 ../data/chr1.fa  \
hg38_ch1  |  tee tophat2_index.log
tophat2 -p 30  \
-G ../data/hg38_refseq.gtf  \
-o . \
../tophat_index/hg38_ch1 \
../cutadapt_fastq/test_R1_cutadapt.fq.gz \
../cutadapt_fastq/test_R2_cutadapt.fq.gz | tee > tophat2_mapping.log 2>&1
samtools view -H accepted_hits.bam
Left reads:
          Input     :   1922017
           Mapped   :    279857 (14.6% of input)
            of these:      3769 ( 1.3%) have multiple alignments (98 have >20)
Right reads:
          Input     :   1922017
           Mapped   :    285446 (14.9% of input)
            of these:      3923 ( 1.4%) have multiple alignments (101 have >20)
14.7% overall read mapping rate.

Aligned pairs:    254089
     of these:      3040 ( 1.2%) have multiple alignments
                    2552 ( 1.0%) are discordant alignments
13.1% concordant pair alignment rate.

STAR(Spliced Transcripts Alignment to a Reference)

STAR --runThreadN 12 --runMode genomeGenerate \
--genomeDir . \
--genomeFastaFiles ../data/chr1.fa \
--sjdbGTFfile ../data/hg38_refseq.gtf \
--sjdbOverhang 100 | tee star_index.log 2>&1
  • --sjdbOverhang reads长度
STAR \
--genomeDir ../star_index \
--runThreadN 30 \
--readFilesIn ../cutadapt_fastq/test_R1_cutadapt.fq.gz  ../cutadapt_fastq/test_R2_cutadapt.fq.gz   \
--readFilesCommand zcat \
--outFileNamePrefix . \
--outSAMtype BAM Unsorted \
--outSAMstrandField intronMotif \
--outSAMattributes All \
--outFilterIntronMotifs RemoveNoncanonical | tee  star_mapping.log 2>&1

HISAT2

hisat2-build -p 30 ../data/chr1.fa hg38_ch1 | tee  hisat2-index.log 2>&1
hisat2 -p 30 \
-x ../hisat2_index/hg38_ch1 \
-1 ../cutadapt_fastq/test_R1_cutadapt.fq.gz  \
-2 ../cutadapt_fastq/test_R2_cutadapt.fq.gz  \
-S hisat2.sam |tee hisat2_mapping.log 2>&1

reads模拟

## Pair-end non strand specific
hisat2 --new-summary -p 30 \
    -x  ../hisat2_index/chr22_with_ERCC92 \
    -1  ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz  \
    -2  ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz  \
    -S nonStrandSpecific.sam 2> nonStrandSpecific.txt
samtools view -bS nonStrandSpecific.sam -o nonStrandSpecific.bam
samtools sort nonStrandSpecific.bam -o nonStrandSpecific.sort.bam
samtools index nonStrandSpecific.sort.bam

## Pair-end strand specific
hisat2 --new-summary -p 30 \
    -x ../hisat2_index/chr22_with_ERCC92 \
    --dta --rna-strandness RF \
    -1  ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz  \
    -2  ../reads/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz  \
    -S StrandSpecific.sam 2> StrandSpecific.txt
samtools view -bS StrandSpecific.sam -o StrandSpecific.bam
samtools sort StrandSpecific.bam -o StrandSpecific.sort.bam
samtools index StrandSpecific.sort.bam



grep ENSG00000100271 ../genomic/chr22_with_ERCC92.gtf > TTLL1.gtf
grep ENSG00000100304 ../genomic/chr22_with_ERCC92.gtf > TTLL12.gtf
gffread -w TTLL12_transcripts.fa -g ../genomic/chr22_with_ERCC92.fa TTLL12.gtf
# hisat2 --new-summary -p 30 \
#     -x ../hisat2_index/chr22_with_ERCC92 \
#     --dta --rna-strandness RF \
#     -U out.fq \
#     | samtools view -bS - | samtools sort - -o out.bam  
hisat2 --new-summary -p 30 \
    -x ../hisat2_index/chr22_with_ERCC92 \
    -U out.fq \
    -S out.sam 2> out.txt
samtools view -bS out.sam -o out.bam
samtools sort out.bam -o out.sort.bam
samtools index out.sort.bam
seqtk seq -F '#'  TTLL12_transcripts.fa  > out.fq

htseq-count  out.sam  ../genomic/chr22_with_ERCC92.gtf  > count.txt

seqtk seq -r out.fq > complement.fq



hisat2 --new-summary -p 30 \
    -x ../hisat2_index/chr22_with_ERCC92 \
    -U complement.fq \
    -S complement.sam 2> complement.txt
samtools view -bS complement.sam -o complement.bam
samtools sort complement.bam -o complement.sort.bam
samtools index complement.sort.bam

htseq-count  complement.bam ../genomic/chr22_with_ERCC92.gtf  > complement.txt


cat complement.fq out.fq > all.fq
hisat2 --new-summary -p 30 \
    -x ../hisat2_index/chr22_with_ERCC92 \
    -U all.fq \
    -S all.sam 2> all.txt
samtools view -bS all.sam -o all.bam
samtools sort all.bam -o all.sort.bam
samtools index all.sort.bam

htseq-count all.bam ../genomic/chr22_with_ERCC92.gtf  > all.txt



hisat2 --new-summary -p 30 \
    -x ../hisat2_index/chr22_with_ERCC92 \
    --rna-strandness RF \
    -U all.fq \
    -S all.RF.sam 2> all.RF.txt
samtools view -bS all.RF.sam -o all.RF.bam
samtools sort all.RF.bam -o all.RF.sort.bam
samtools index all.RF.sort.bam
htseq-count all.RF.sam ../genomic/chr22_with_ERCC92.gtf  > all.RF.txt

build index

hisat2_extract_splice_sites.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.splice_sites.txt
hisat2_extract_exons.py Mus_musculus.GRCm39.107.gtf > Mus_musculus.GRCm39.107.exons.txt
mkdir hisat2
hisat2-build \
    -p 30 \
    --ss Mus_musculus.GRCm39.107.splice_sites.txt \
    --exon Mus_musculus.GRCm39.107.exons.txt \
    Mus_musculus.GRCm39.dna_sm.toplevel.fa \
    hisat2/Mus_musculus.GRCm39.dna_sm.toplevel

reads mapping

INDEX=`find -L ./ -name "*.1.ht2" | sed 's/\.1.ht2$//'`
hisat2 \
    -x  hisat2/Mus_musculus.GRCm39.dna_sm.toplevel \
    -1 M3_1.fastp.fastq.gz \
    -2 M3_2.fastp.fastq.gz \
    --known-splicesite-infile Mus_musculus.GRCm39.107.splice_sites.txt \
    --summary-file M3.hisat2.summary.log \
    --threads 16 \
    --rg-id M3 --rg SM:M3 \
    --no-mixed \
    --no-discordant \
    --met-stderr --new-summary --dta \
	--un-gz M3.unmapped.fastq.gz
    | samtools view -bS -F 4 -F 8 -F 256 - > M3.bam

对于双端的链特异性测序加上 --rna-strandness FR或者 --rna-strandness RF

测试数据准备

https://gitee.com/bioinfoFungi/testData/blob/master/RNA-seq/genomic/chr22_with_ERCC92.fa.gz

gzip -dc chr22_with_ERCC92.fa.gz > chr22_with_ERCC92.fa
export testData=~/testData
mkdir hisat2_index

hisat2-build -p 30 \
    ${testData}/RNA-seq/genomic/chr22_with_ERCC92.fa \
    hisat2_index/chr22_with_ERCC92 
output
├── chr22_with_ERCC92.1.ht2
├── chr22_with_ERCC92.2.ht2
├── chr22_with_ERCC92.3.ht2
├── chr22_with_ERCC92.4.ht2
├── chr22_with_ERCC92.5.ht2
├── chr22_with_ERCC92.6.ht2
├── chr22_with_ERCC92.7.ht2
└── chr22_with_ERCC92.8.ht2

reads mapping

hisat2 --new-summary -p 30 \
    -x  hisat2_index/chr22_with_ERCC92 \
    -1  ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz  \
    -2  ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz  \
    -S hisat2.sam 2> summaty.txt

运行hisat2命令我们可以得到一个sam文件hisat2.sam,有关sam文件的详细介绍见samtools

zcat ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | wc -l
# 474284
# 474284/4 = 118571

生信小木屋

使用循环将所有的样本mapping到参考基因组

mkdir hisat2_mapping
for i in `ls ${testData}/RNA-seq/ | grep  "UHR\|HBR" | awk  '{split($1,arr,".read"); print arr[1]}' | uniq`;do
    read1="${i}.read1.fastq.gz"
    read2="${i}.read2.fastq.gz"
    echo "$i"
    hisat2 --new-summary -p 30 \
        -x  hisat2_index/chr22_with_ERCC92 \
        -1  ${testData}/RNA-seq/${read1}  \
        -2  ${testData}/RNA-seq/${read2}  \
        -S hisat2_mapping/${i}.sam  2> hisat2_mapping/${i}.summary
done &

将sam文件转换为bam

mamba install samtools
samtools view -bS hisat2.sam -o hisat2.bam
samtools sort hisat2.bam -o hisat2.sort.bam

查看bam文件

samtools view hisat2.bam  | less -S
for i in `ls hisat2_mapping | grep ".sam" |  awk  '{split($1,arr,".sam"); print arr[1]}'`;do
    samtools sort hisat2_mapping/$i.sam -o hisat2_mapping/$i.sort.bam
done&

nextflow

process HISAT2_EXTRACTSPLICESITES {
    tag "$gtf"
    publishDir = [
        path: { "/ssd2/databases/mouse/" },
        mode: 'symlink',
    ]

    conda "/ssd1/wy/.conda/envs/hisat2"
    
    input:
    path gtf

    output:
    path "*.splice_sites.txt", emit: splice_sites
    path "*.exons.txt", emit: exons

    script:
    """
    hisat2_extract_splice_sites.py $gtf > ${gtf.baseName}.splice_sites.txt
    hisat2_extract_exons.py $gtf > ${gtf.baseName}.exons.txt
    """
}

process HISAT2_BUILD {
    tag "$fasta"
    cpus = 30
    publishDir = [
        path: { "/ssd2/databases/mouse/" },
        mode: 'symlink',
    ]
    conda "/ssd1/wy/.conda/envs/hisat2"

    input:
    path fasta
    path splicesites
    path exons

    output:
    path "hisat2"       , emit: index

    script:

    def ss = "--ss $splicesites"
    def exon = "--exon $exons"
    """
    mkdir hisat2
    hisat2-build \\
        -p $task.cpus \\
        $ss \\
        $exon \\
        $fasta \\
        hisat2/${fasta.baseName}
    """
}
fasta= file("/ssd2/databases/mouse/Mus_musculus.GRCm39.dna_sm.toplevel.fa")
gtf = file("/ssd2/databases/mouse/Mus_musculus.GRCm39.107.gtf")
workflow{
    HISAT2_EXTRACTSPLICESITES ( gtf )
    splice_sites = HISAT2_EXTRACTSPLICESITES.out.splice_sites
    exons = HISAT2_EXTRACTSPLICESITES.out.exons

    ch_hisat2_index = HISAT2_BUILD(fasta, splice_sites,exons).index 
}

// nf run mian.nf  -with-conda true -resume

软件安装

mamba install  hisat2

或者创建一个新的环境安装

mamba create -n hisat2  hisat2
conda activate hisat2

  1. RNA sequencing: the teenage years