定量分析
一旦确定了reads的质量,我们就可以在转录本水平量化基因表达。这一步骤的目标是鉴定每一个reads来自哪一个转录本,以及与每个转录本相关的reads总数。
既然是在转录本水平量化基因的表达,因此首选是要获取转录本序列的fasta文件。获取转录本的fasta文件可以从像ensembl这样网站下载全基因组的DNA序列文件以及基因组的注释gtf文件,使用工具gffread
即可获取转录本的序列文件。或者使用stringtie
这样的工具从头组装(de-novo assembly)得到转录本序列。
用于进行转录本水平表达量化的工具有:
- RSEM
- Salmon
https://www.mun.ca/biology/scarr/iGen3_05-03.html
A "gene" occurs over a particular physical region (locus) of a double-stranded dsDNA molecule. It includes an RNA-coding region, where transcription occurs from the DNA template strand so as to synthesize an RNA transcript in the 5'3' direction, running between the initiation and termination sites.
转录本水平定量
However, allocating reads to specific isoforms using short reads requires an estimation step, as many reads will not span splice junctions and therefore cannot be unambiguously assigned to a specific isoform.
从BAM文件到reads count
- 质控:比对情况
- RNASeqQC
- reads计数
- FeatureCount
- HTSeq-count
FeatureCount
Subread package: high-performance read alignment, quantification and mutation discovery
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
featureCounts -t exon\
-g gene_id \
-Q 10 \
--primary \
-s 0 \ # Perform strand-specific read counting
-p \ # paired-end reads; single-end reads
-T 1 \ # Number of the threads
-a hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf \ # Name of an annotation file
-o ./count_result/test_count.featureCounts \ # Name of output file including read counts
./bam/test_hisat2.sort.bam \
./bam/test_hisat2.sort.2.bam
./count_result/test_count.featureCounts.log 2>&1 &
cut -f 1,7,8 test_count.featureCounts
HTSeq-count
High-throughput sequence
只能是基因层面计数,不能做转录本层面计数
https://htseq.readthedocs.io/en/master/htseqcount.html
htseq-count -f bam \ # {sam,bam,auto} Type of <alignment_file> data
-r pos \ # {pos,name}
--max-reads-in-buffer 1000000 \ # 当<alignment_file>按位置成对结束排序时,只允许如此多的读取保留在内存中,直到找到匹配项(提高此数字将使用更多内存)
--stranded no \ # {yes,no,reverse} Whether the data is from a strand-specific assay.
--minaqual 10 \ # 跳过MAPQ对齐质量低于给定最小值(默认值:10)的所有reads
--type exon \ #要使用的功能类型(GTF文件中的第三列),其他类型的所有功能都将被忽略(默认值,适用于Ensembl GTF文件:exon)
--idattr gene_id \ # 要用作特征ID的GTF属性(默认值,适用于Ensembl GTF文件:gene_ID)
--mode union \ # {union,intersection-strict,intersection-nonempty}处理多个特征重叠的读取的模式
--nonunique none \ # {none,all,fraction,random}
--secondary-alignments ignore \ # {score,ignore}
--supplementary-alignments ignore \ # {score,ignore}
--counts_output ./count_result/test_count.tsv \ # 将计数输出到而不是标准输出的文件名
--nprocesses 1 \ # Number of parallel CPU processes to use
./bam/test_hisat2.sort.bam \ #
./reference/gtf/hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf > test_count.HTSeq.log 2>&1 &
sam文件说明
- sam flag explain
- HD:说明VN的版本以及比对有无排列顺序,这个例子没有排序。
- SQ:参考序列目录。SN:参考序列名字。LN:参考序列长度。
- PG:使用的比对程序名
samtools sort -O BAM \
-o output.sort.bam \ # Write final output to FILE rather than standard output
-@ 6 \ #Number of additional threads to use
-m 2G \ # Set maximum memory per thread
-T output.sort.bam.temp # Write temporary files to PREFIX
igv展示
箭头向右表示该基因在染色体的负链上
基因表达定量
- 标准化
- cufflink,cuffnorm
- DESeq2
- esgeR
- limma
- 差异表达
- cuffdiff
- DESeq2
- edgeR
- limma-voom
Status 220613-95.sorted.bam
Assigned 124
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 22
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 102
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 2
> featureCounts[,7] |> colSums()
220613-95.sorted.bam
124
124 + 22 +102+2 = 250
HISAT2 summary stats:
Total pairs: 9926
Aligned concordantly or discordantly 0 time: 9801 (98.74%)
Aligned concordantly 1 time: 114 (1.15%)
Aligned concordantly >1 times: 11 (0.11%)
Aligned discordantly 1 time: 0 (0.00%)
Total unpaired reads: 19602
Aligned 0 time: 19602 (100.00%)
Aligned 1 time: 0 (0.00%)
Aligned >1 times: 0 (0.00%)
Overall alignment rate: 1.26%
Total Reads 1634
Total Tags 2107
Total Assigned Tags 1947
=====================================================================
Group Total_bases Tag_count Tags/Kb
CDS_Exons 5896210 969 0.16
5'UTR_Exons 1145918 16 0.01
3'UTR_Exons 2590152 411 0.16
Introns 105096973 386 0.00
TSS_up_1kb 3101801 1 0.00
TSS_up_5kb 13536555 15 0.00
TSS_up_10kb 23882788 27 0.00
TES_down_1kb 3067005 43 0.01
TES_down_5kb 12827481 125 0.01
TES_down_10kb 22176000 138 0.01
=====================================================================
(subread) wy@master:~/workspace/RNA-seq/workspace$ featureCounts -t exon \
> -g gene_id \
> --primary \
> -s 0 \
> -p \
> -T 10 \
> -a resources/genome.gtf \
> /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam -o test.txt
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v2.0.3
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| ||
|| hisat2.bam ||
|| ||
|| Output file : test.txt ||
|| Summary : test.txt.summary ||
|| Paired-end : yes ||
|| Count read pairs : no ||
|| Annotation : genome.gtf (GTF) ||
|| Dir for temp files : ./ ||
|| ||
|| Threads : 10 ||
|| Level : meta-feature level ||
|| Multimapping reads : not counted ||
|| Multiple alignments : primary alignment only ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file genome.gtf ... ||
|| Features : 526640 ||
|| Meta-features : 30560 ||
|| Chromosomes/contigs : 48 ||
|| ||
|| Process BAM file hisat2.bam... ||
|| Paired-end reads are included. ||
|| The reads are assigned on the single-end mode. ||
|| Total alignments : 23750177 ||
|| Successfully assigned alignments : 12709161 (53.5%) ||
|| Running time : 0.09 minutes ||
|| ||
|| Write the final count table. ||
|| Write the read assignment summary. ||
|| ||
|| Summary of counting results can be found in file "test.txt.summary" ||
|| ||
\\============================================================================//
(subread) wy@master:~/workspace/RNA-seq/workspace$ samtools view /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam | wc -l
23750177