一旦确定了reads的质量,我们就可以在转录本水平量化基因表达。这一步骤的目标是鉴定每一个reads来自哪一个转录本,以及与每个转录本相关的reads总数。
既然是在转录本水平量化基因的表达,因此首选是要获取转录本序列的fasta文件。获取转录本的fasta文件可以从像ensembl这样网站下载全基因组的DNA序列文件以及基因组的注释gtf文件,使用工具gffread即可获取转录本的序列文件。或者使用stringtie这样的工具从头组装(de-novo assembly)得到转录本序列。
gffread
stringtie
用于进行转录本水平表达量化的工具有:
https://www.mun.ca/biology/scarr/iGen3_05-03.htmlA "gene" occurs over a particular physical region (locus) of a double-stranded dsDNA molecule. It includes an RNA-coding region, where transcription occurs from the DNA template strand so as to synthesize an RNA transcript in the 5'3' direction, running between the initiation and termination sites.
However, allocating reads to specific isoforms using short reads requires an estimation step, as many reads will not span splice junctions and therefore cannot be unambiguously assigned to a specific isoform.
Subread package: high-performance read alignment, quantification and mutation discovery
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... featureCounts -t exon\ -g gene_id \ -Q 10 \ --primary \ -s 0 \ # Perform strand-specific read counting -p \ # paired-end reads; single-end reads -T 1 \ # Number of the threads -a hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf \ # Name of an annotation file -o ./count_result/test_count.featureCounts \ # Name of output file including read counts ./bam/test_hisat2.sort.bam \ ./bam/test_hisat2.sort.2.bam ./count_result/test_count.featureCounts.log 2>&1 &
cut -f 1,7,8 test_count.featureCounts
High-throughput sequence只能是基因层面计数,不能做转录本层面计数
图片alt
https://htseq.readthedocs.io/en/master/htseqcount.html
htseq-count -f bam \ # {sam,bam,auto} Type of <alignment_file> data -r pos \ # {pos,name} --max-reads-in-buffer 1000000 \ # 当<alignment_file>按位置成对结束排序时,只允许如此多的读取保留在内存中,直到找到匹配项(提高此数字将使用更多内存) --stranded no \ # {yes,no,reverse} Whether the data is from a strand-specific assay. --minaqual 10 \ # 跳过MAPQ对齐质量低于给定最小值(默认值:10)的所有reads --type exon \ #要使用的功能类型(GTF文件中的第三列),其他类型的所有功能都将被忽略(默认值,适用于Ensembl GTF文件:exon) --idattr gene_id \ # 要用作特征ID的GTF属性(默认值,适用于Ensembl GTF文件:gene_ID) --mode union \ # {union,intersection-strict,intersection-nonempty}处理多个特征重叠的读取的模式 --nonunique none \ # {none,all,fraction,random} --secondary-alignments ignore \ # {score,ignore} --supplementary-alignments ignore \ # {score,ignore} --counts_output ./count_result/test_count.tsv \ # 将计数输出到而不是标准输出的文件名 --nprocesses 1 \ # Number of parallel CPU processes to use ./bam/test_hisat2.sort.bam \ # ./reference/gtf/hg38_refseq_from_ucsc.rm_XM_XR.fix_name.gtf > test_count.HTSeq.log 2>&1 &
sam文件格式简介
samtools sort -O BAM \ -o output.sort.bam \ # Write final output to FILE rather than standard output -@ 6 \ #Number of additional threads to use -m 2G \ # Set maximum memory per thread -T output.sort.bam.temp # Write temporary files to PREFIX
Status 220613-95.sorted.bam Assigned 124 Unassigned_Unmapped 0 Unassigned_Read_Type 0 Unassigned_Singleton 0 Unassigned_MappingQuality 0 Unassigned_Chimera 0 Unassigned_FragmentLength 0 Unassigned_Duplicate 0 Unassigned_MultiMapping 22 Unassigned_Secondary 0 Unassigned_NonSplit 0 Unassigned_NoFeatures 102 Unassigned_Overlapping_Length 0 Unassigned_Ambiguity 2
> featureCounts[,7] |> colSums() 220613-95.sorted.bam 124
HISAT2 summary stats: Total pairs: 9926 Aligned concordantly or discordantly 0 time: 9801 (98.74%) Aligned concordantly 1 time: 114 (1.15%) Aligned concordantly >1 times: 11 (0.11%) Aligned discordantly 1 time: 0 (0.00%) Total unpaired reads: 19602 Aligned 0 time: 19602 (100.00%) Aligned 1 time: 0 (0.00%) Aligned >1 times: 0 (0.00%) Overall alignment rate: 1.26%
Total Reads 1634 Total Tags 2107 Total Assigned Tags 1947 ===================================================================== Group Total_bases Tag_count Tags/Kb CDS_Exons 5896210 969 0.16 5'UTR_Exons 1145918 16 0.01 3'UTR_Exons 2590152 411 0.16 Introns 105096973 386 0.00 TSS_up_1kb 3101801 1 0.00 TSS_up_5kb 13536555 15 0.00 TSS_up_10kb 23882788 27 0.00 TES_down_1kb 3067005 43 0.01 TES_down_5kb 12827481 125 0.01 TES_down_10kb 22176000 138 0.01 =====================================================================
(subread) wy@master:~/workspace/RNA-seq/workspace$ featureCounts -t exon \ > -g gene_id \ > --primary \ > -s 0 \ > -p \ > -T 10 \ > -a resources/genome.gtf \ > /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam -o test.txt ========== _____ _ _ ____ _____ ______ _____ ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ v2.0.3 //========================== featureCounts setting ===========================\\ || || || Input files : 1 BAM file || || || || hisat2.bam || || || || Output file : test.txt || || Summary : test.txt.summary || || Paired-end : yes || || Count read pairs : no || || Annotation : genome.gtf (GTF) || || Dir for temp files : ./ || || || || Threads : 10 || || Level : meta-feature level || || Multimapping reads : not counted || || Multiple alignments : primary alignment only || || Multi-overlapping reads : not counted || || Min overlapping bases : 1 || || || \\============================================================================// //================================= Running ==================================\\ || || || Load annotation file genome.gtf ... || || Features : 526640 || || Meta-features : 30560 || || Chromosomes/contigs : 48 || || || || Process BAM file hisat2.bam... || || Paired-end reads are included. || || The reads are assigned on the single-end mode. || || Total alignments : 23750177 || || Successfully assigned alignments : 12709161 (53.5%) || || Running time : 0.09 minutes || || || || Write the final count table. || || Write the read assignment summary. || || || || Summary of counting results can be found in file "test.txt.summary" || || || \\============================================================================//
(subread) wy@master:~/workspace/RNA-seq/workspace$ samtools view /ssd1/wy/workspace/RNA-seq/workspace/results/hisat2/gwat_ABX_OLZ_18/hisat2.bam | wc -l 23750177