展开

定量分析

最后发布时间 : 2023-04-26 22:43:44 浏览量 :

一旦确定了reads的质量,我们就可以在转录本水平量化基因表达。这一步骤的目标是鉴定每一个reads来自哪一个转录本,以及与每个转录本相关的reads总数。

既然是在转录本水平量化基因的表达,因此首选是要获取转录本序列的fasta文件。获取转录本的fasta文件可以从像ensembl这样网站下载全基因组的DNA序列文件以及基因组的注释gtf文件,使用工具gffread即可获取转录本的序列文件。或者使用stringtie这样的工具从头组装(de-novo assembly)得到转录本序列。

用于进行转录本水平表达量化的工具有:

生信小木屋

https://www.mun.ca/biology/scarr/iGen3_05-03.html
A "gene" occurs over a particular physical region (locus) of a double-stranded dsDNA molecule. It includes an RNA-coding region, where transcription occurs from the DNA template strand so as to synthesize an RNA transcript in the 5'3' direction, running between the initiation and termination sites.

转录本水平定量

However, allocating reads to specific isoforms using short reads requires an estimation step, as many reads will not span splice junctions and therefore cannot be unambiguously assigned to a specific isoform.

从BAM文件到reads count

  • 质控:比对情况
    • RNASeqQC
  • reads计数
    • FeatureCount
    • HTSeq-count

FeatureCount

Subread package: high-performance read alignment, quantification and mutation discovery

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

featureCounts -t exon\
-g gene_id \
-Q 10 \
--primary \
-s 0 \ # Perform strand-specific read counting
-p \ # paired-end reads; single-end reads
-T 1 \ # Number of the threads
-a hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf \ # Name of an annotation file
-o ./count_result/test_count.featureCounts \ # Name of output file including read counts
./bam/test_hisat2.sort.bam \
./bam/test_hisat2.sort.2.bam
 ./count_result/test_count.featureCounts.log 2>&1 &
cut -f 1,7,8 test_count.featureCounts

HTSeq-count

High-throughput sequence
只能是基因层面计数,不能做转录本层面计数

图片alt

图片alt

https://htseq.readthedocs.io/en/master/htseqcount.html

htseq-count -f bam \ # {sam,bam,auto} Type of <alignment_file> data
-r pos \ # {pos,name}
--max-reads-in-buffer 1000000 \ # 当<alignment_file>按位置成对结束排序时,只允许如此多的读取保留在内存中,直到找到匹配项(提高此数字将使用更多内存)
--stranded no \ # {yes,no,reverse} Whether the data is from a strand-specific assay.
--minaqual 10 \ # 跳过MAPQ对齐质量低于给定最小值(默认值:10)的所有reads
--type exon \ #要使用的功能类型(GTF文件中的第三列),其他类型的所有功能都将被忽略(默认值,适用于Ensembl GTF文件:exon)
--idattr gene_id \ # 要用作特征ID的GTF属性(默认值,适用于Ensembl GTF文件:gene_ID)
--mode union \ # {union,intersection-strict,intersection-nonempty}处理多个特征重叠的读取的模式
--nonunique none \ # {none,all,fraction,random}
--secondary-alignments ignore \ # {score,ignore}
--supplementary-alignments ignore \ #  {score,ignore}
--counts_output ./count_result/test_count.tsv \ # 将计数输出到而不是标准输出的文件名
--nprocesses 1 \ # Number of parallel CPU processes to use 
./bam/test_hisat2.sort.bam \ # 
./reference/gtf/hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf >  test_count.HTSeq.log 2>&1 &

sam文件说明

图片alt

图片alt


图片alt

图片alt

  • sam flag explain
  • HD:说明VN的版本以及比对有无排列顺序,这个例子没有排序。
  • SQ:参考序列目录。SN:参考序列名字。LN:参考序列长度。
  • PG:使用的比对程序名

sam文件格式简介

samtools sort -O BAM \
	-o output.sort.bam \ #  Write final output to FILE rather than standard output
	-@ 6 \ #Number of additional threads to use
	-m 2G \ # Set maximum memory per thread
	-T  output.sort.bam.temp # Write temporary files to PREFIX

igv展示

图片alt

图片alt


图片alt

图片alt


箭头向右表示该基因在染色体的负链上
图片alt

图片alt

基因表达定量

  • 标准化
    • cufflink,cuffnorm
    • DESeq2
    • esgeR
    • limma
  • 差异表达
    • cuffdiff
    • DESeq2
    • edgeR
    • limma-voom
Status	220613-95.sorted.bam
Assigned	124
Unassigned_Unmapped	0
Unassigned_Read_Type	0
Unassigned_Singleton	0
Unassigned_MappingQuality	0
Unassigned_Chimera	0
Unassigned_FragmentLength	0
Unassigned_Duplicate	0
Unassigned_MultiMapping	22
Unassigned_Secondary	0
Unassigned_NonSplit	0
Unassigned_NoFeatures	102
Unassigned_Overlapping_Length	0
Unassigned_Ambiguity	2
> featureCounts[,7] |> colSums()
220613-95.sorted.bam 
                 124 

生信小木屋

124 + 22 +102+2 = 250

HISAT2 summary stats:
	Total pairs: 9926
		Aligned concordantly or discordantly 0 time: 9801 (98.74%)
		Aligned concordantly 1 time: 114 (1.15%)
		Aligned concordantly >1 times: 11 (0.11%)
		Aligned discordantly 1 time: 0 (0.00%)
	Total unpaired reads: 19602
		Aligned 0 time: 19602 (100.00%)
		Aligned 1 time: 0 (0.00%)
		Aligned >1 times: 0 (0.00%)
	Overall alignment rate: 1.26%

生信小木屋

Total Reads                   1634
Total Tags                    2107
Total Assigned Tags           1947
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb             
CDS_Exons           5896210             969                 0.16              
5'UTR_Exons         1145918             16                  0.01              
3'UTR_Exons         2590152             411                 0.16              
Introns             105096973           386                 0.00              
TSS_up_1kb          3101801             1                   0.00              
TSS_up_5kb          13536555            15                  0.00              
TSS_up_10kb         23882788            27                  0.00              
TES_down_1kb        3067005             43                  0.01              
TES_down_5kb        12827481            125                 0.01              
TES_down_10kb       22176000            138                 0.01              
=====================================================================

(subread) wy@master:~/workspace/RNA-seq/workspace$ featureCounts -t exon \
>       -g gene_id \
>       --primary \
>       -s 0 \
>       -p \
>       -T 10 \
>       -a  resources/genome.gtf \
>       /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam -o test.txt

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
          v2.0.3

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           hisat2.bam                                       ||
||                                                                            ||
||             Output file : test.txt                                         ||
||                 Summary : test.txt.summary                                 ||
||              Paired-end : yes                                              ||
||        Count read pairs : no                                               ||
||              Annotation : genome.gtf (GTF)                                 ||
||      Dir for temp files : ./                                               ||
||                                                                            ||
||                 Threads : 10                                               ||
||                   Level : meta-feature level                               ||
||      Multimapping reads : not counted                                      ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file genome.gtf ...                                        ||
||    Features : 526640                                                       ||
||    Meta-features : 30560                                                   ||
||    Chromosomes/contigs : 48                                                ||
||                                                                            ||
|| Process BAM file hisat2.bam...                                             ||
||    Paired-end reads are included.                                          ||
||    The reads are assigned on the single-end mode.                          ||
||    Total alignments : 23750177                                             ||
||    Successfully assigned alignments : 12709161 (53.5%)                     ||
||    Running time : 0.09 minutes                                             ||
||                                                                            ||
|| Write the final count table.                                               ||
|| Write the read assignment summary.                                         ||
||                                                                            ||
|| Summary of counting results can be found in file "test.txt.summary"        ||
||                                                                            ||
\\============================================================================//
(subread) wy@master:~/workspace/RNA-seq/workspace$ samtools view /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam | wc -l
23750177