RSeQC
conda install rseqc
RSeQC处理4种文件格式
- BED 格式:Tab 分割,12列的表示基因模型的纯文本文件。
- SAM 或BAM 格式: 用来存储reads比对结果信息,SAM是可读的纯文本文件,然而BAM是SAM的二进制文本,一个压缩的可索引的reads比对文件。
- 染色体大小文件: 只有两列的纯文本文件,在“生物信息学文本处理大杂烩(一)”里已经讲过。hg19.chrom_24.sizes是人基因组hg19版本的size文件,是使用UCSC 的fetchChromSizes下载的。
- Fasta文件。
bam2fq.py
将BAM或SAM转换为FastQ
#pair-end
bam2fq.py -i test_PairedEnd_StrandSpecific_hg19.sam -o bam2fq_out1
Convert BAM/SAM file into fastq format ... Done
read_1 count: 649507
read_2 count: 350495
#single-end BAM file
bam2fq.py -i test_SingleEnd_StrandSpecific_hg19.bam -s -o bam2fq_out2
Convert BAM/SAM file into fastq format ... Done
read count: 1000000
bam2wig.py
将BAM文件转换为wig/bigWig格式
必须使用SAMtools对BAM文件进行正确排序和索引。以下示例显示如何使用samTools对BAM文件进行排序和索引
# sort and index BAM files
samtools sort -m 1000000000 input.bam input.sorted.bam
samtools index input.sorted.bam
bam2wig.py -s hg19.chrom.sizes -i sample.bam -o out -u
bam_stat.py
汇总BAM或SAM文件的映射统计信息
samtools view -bS hisat2.sam -o hisat2.bam
samtools sort hisat2.bam -o hisat2.sort.bam
samtools index hisat2.bam
bam_stat.py -i hisat2.bam
#==================================================
#All numbers are READ count
#==================================================
Total records: 132342474
QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 13407558
Unmapped reads: 18177341
mapq < mapq_cut (non-unique): 8412191
mapq >= mapq_cut (unique): 92345384
Read-1: 46209851
Read-2: 46135533
Reads map to '+': 46170757
Reads map to '-': 46174627
Non-splice reads: 45358434
Splice reads: 46986950
Reads mapped in proper pairs: 80309530
Proper-paired reads map to different chrom:0
clipping_profile.py
计算clipped nucleotides在reads之间的分布
$ clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -s "PE" -o out
$ ls -l out.clipping_profile.* |awk '{print $9}'
out.clipping_profile.r
out.clipping_profile.R1.pdf
out.clipping_profile.R2.pdf
out.clipping_profile.xls
deletion_profile.py
计算不同reads之间deletions分布
$ python2.7 deletion_profile.py -i sample.bam -l 101 -o out
Process BAM file ... Total reads used: 58310
$ ls -l out.deletion_profile.* |awk '{print $9}'
out.deletion_profile.pdf
out.deletion_profile.r
out.deletion_profile.txt
divide_bam.py
分割bam文件
将BAM文件(m个对齐)等分为n个部分。每个部分包含大约m/n 个alignments,这些alignments是从总alignments中随机采样的。
python2.7 divide_bam.py -n 3 -i SingleEnd_StrandSpecific_50mer_Human_hg19.bam -o output
Dividing SingleEnd_StrandSpecific_50mer_Human_hg19.bam ... Done
output_0.bam 22179548
output_1.bam 22185659
output_2.bam 22187574
FPKM_count.py
计算BED文件的FPM、FPKM、
FPM (fragment per million) and FPKM (fragment per million mapped reads per kilobase exon)
FPKM-UQ.py
计算TCGA的count、FPKM和FPKM-UQ值
FPKM-UQ.py --bam 8e5c811c-132a-4f27-b7ef-1a4a644f9079_gdc_realn_rehead.bam --gtf gencode.v22.annotation.gtf --info gencode.gene.info.v22.tsv -o output
geneBody_coverage.py
计算RNA-seq reads覆盖gene body
如果提供了3个或更多BAM文件。这个程序生成了一个lineGraph和一个热图。如果提供的BAM文件少于3个,则仅生成lineGraph。请参阅以下示例。
生成热图时,样本按覆盖范围的“偏度”进行排名:覆盖范围最好(最差)的样本将显示在热图的顶部(底部)。
覆盖偏斜度通过Pearson偏斜度系数进行测量
仅输入排序和索引的BAM文件。
mRNA长度<100的基因/转录物将被跳过(指定为“-l”的数字不能<100)
A single BAM file
geneBody_coverage.py -r hg19.housekeeping.bed -i test.bam -o output
A list of BAM files separated by “,”
geneBody_coverage.py -r hg19.housekeeping.bed -i test1.bam,test2.bam,test3.bam -o output
Plain text file containing the paths of BAM files.
geneBody_coverage.py -r hg19.housekeeping.bed -i bam_path.txt -o output
cat bam_path.txt
/data/alignment/test1.bam
/data/alignment/test2.bam
/data/alignment/test3.bam
A directory containing BAM files.
geneBody_coverage.py -r hg19.housekeeping.bed -i /data/alignment/ -o output
geneBody_coverage2.py
infer_experiment.py
- 该程序用于“猜测”RNA-seq测序是如何配置的, 特别是链特异性RNA-seq数据的读数是在哪一条链,通过比较strandness of reads和standness of transcripts。
- strandness of reads决定比对;standness of transcripts决定注释
- 对于非链特异性的RNA-seq数据,strandness of reads和standness of transcripts是独立的
- 对于链特异性的RNA-seq数据,strandness of reads在很大程度上取决于standness of transcripts,有关详细信息,请参见以下3个示例。
- 在将读数映射到参考基因组之前,您不需要知道RNA测序方案。使用非链特异性比对RNA-seq数据,这个脚本可以“猜测”RNA-seq是哪一种链特异性。
对于双端RNA-seq,有两种不同的strand reads 方式(such as Illumina ScriptSeq protocol)
- 1++,1–,2+-,2-+
- read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
- read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
- 1+-,1-+,2++,2–
- read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
- read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
对于单端RNA-seq,还有两种不同的链读方式:
- ++,–
- read mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
- +-,-+
- read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
Pair-end non strand specific
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam
This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925
1.72%的reads映射到两个位置(基因组区域正链和负链都有基因);剩余 98.28% (1 - 0.0172 = 0.9828) 的reads,一半可以用1++,1–,2+-,2-+
解释,一半可以用1+-,1-+,2++,2–
解释。最终得出结论,这不是一个链特异性的数据集,因为strandness of reads
独立于standness of transcripts
Pair-end strand specific
infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam
This is PairEnd Data
Fraction of reads failed to determine: 0.0072
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487
0.72%的reads映射到两个位置(基因组区域正链和负链都有基因);剩余 99.28% (1 - 0.0072 = 0.9928)的reads,绝大多数可以用1++,1–,2+-,2-+
解释。最终得出结论,这不是一个链特异性的数据集,因为strandness of reads
独立于standness of transcripts
,因此表明是链特异性的数据集。
Single-end strand specific
infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam
This is SingleEnd Data
Fraction of reads failed to determine: 0.0170
Fraction of reads explained by "++,--": 0.9669
Fraction of reads explained by "+-,-+": 0.0161
inner_distance.py
Calculate inner distance between read pairs.
该模块用于计算两个配对RNA reads之间的内部距离(或插入物大小),我们首先确定两个配对reads之间的基因组(DNA)大小
- D_size = read2_start - read1_end
- 如果两个成对的reads映射到同一外显子: inner distance = D_size
- 如果两个成对的reads映射到不同的外显子:inner distance = D_size - intron_size
- 如果两个成对的reads映射非外显子区域(如内含子和基因间区域:inner distance = D_size
- 如果两个片段重叠,则inner_distance可能为负值。
注:并非所有reads对都用于估计内部距离分布。跳过了那些低质量、PCR重复、多重映射reads。
insertion_profile.py
junction_annotation.py
junction_saturation.py
mismatch_profile.py
normalize_bigwig.py
overlay_bigwig.py
read_distribution.py
比对区域分布统计
将比对到基因组上的Reads分布情况进行统计,定位区域分为CDS( 编码区)、Intron( 内含子) 、Intergenic( 基因间区)和UTR( 5'和3'非翻译区) 。在基因组注释较为完全的物种中,通常比对到CDS( 编码区) 的reads含量最高比对到Intron( 内含子)区域的reads来源于pre-mRNA的残留或由可变剪切过程中发生的内含子保留事件导致的,而比对到Intergenic( 基因间区)的Reads则可能转录自新基因或新的非编码RNA。
获取bed文件
mamba install -c conda-forge -c bioconda ucsc-gtftogenepred ucsc-genepredtobed
cat genome.gtf | \
gtfToGenePred /dev/stdin /dev/stdout | \
genePredToBed /dev/stdin /dev/stdout > genome.bed
https://bioinformatics.stackexchange.com/questions/7094/making-a-bed-file-for-rseqc
read_distribution.py \
-i hisat2.bam \
-r genome.bed
Total Reads 100757575
Total Tags 163279102
Total Assigned Tags 156662725
=====================================================================
Group Total_bases Tag_count Tags/Kb
CDS_Exons 36791753 119159928 3238.77
5'UTR_Exons 36551703 2753517 75.33
3'UTR_Exons 61032544 31236198 511.80
Introns 1096995615 2574747 2.35
TSS_up_1kb 34946104 36227 1.04
TSS_up_5kb 158419554 172100 1.09
TSS_up_10kb 284020649 260890 0.92
TES_down_1kb 37387697 336922 9.01
TES_down_5kb 162180245 506298 3.12
TES_down_10kb 286126600 677445 2.37
=====================================================================
read_duplication.py
read_GC.py
read_hexamer.py
read_NVC.py
read_quality.py
sc_bamStat.py -h
sc_seqLogo.py -h
sc_seqQual.py
sc_editMatrix.py
RNA_fragment_size.py
RPKM_count.py
RPKM_saturation.py
任何样本统计(RPKM)的精度都受到样本量(测序深度)的影响;“resampling”或“jacknifing”是一种通过使用可用数据子集来估计样本统计精度的方法。该模块将从总RNA读数中重新采样一系列子集,然后使用每个子集计算RPKM值。通过这样做,我们能够检查当前测序深度在基因表达估计方面是否饱和(或者RPKM值是否稳定)。如果测序深度饱和,估计的RPKM值将是固定的或可重复的。默认情况下,该模块将为每个转录物计算20个RPKM值(使用总读数的5%、10%、…、95%、100%)。
在输出图中,Y轴是“相对误差百分比”或“误差百分比”,用于测量从读数子集(即RPKMobs)估计的RPKM如何偏离真实表达水平(即RPKMreal)。然而,在实践中,人们无法知道RPKMreal。作为代理,我们使用从总读取量估计的RPKM来近似RPKMreal。
RPKM_saturation.py -r hg19.refseq.bed12 -d '1++,1--,2+-,2-+' -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output
说明:Q1,Q2,Q3,Q4是按照转录本表达量4分位分开的.Q1表示的是表达量低于25%的转录本,以此类推.可以看出:随着样本量升高, 与实际值的偏差也在降低.而且转录本表达量越高这种趋势越明显(Q4最明显).