RSeQC

最后发布时间:2023-04-20 16:31:32 浏览量:

conda install rseqc

RSeQC处理4种文件格式

BED 格式：Tab 分割，12列的表示基因模型的纯文本文件。
SAM 或BAM 格式：用来存储reads比对结果信息，SAM是可读的纯文本文件，然而BAM是SAM的二进制文本，一个压缩的可索引的reads比对文件。
染色体大小文件：只有两列的纯文本文件，在“生物信息学文本处理大杂烩(一)”里已经讲过。hg19.chrom_24.sizes是人基因组hg19版本的size文件，是使用UCSC 的fetchChromSizes下载的。
Fasta文件。

bam2fq.py

将BAM或SAM转换为FastQ

#pair-end
bam2fq.py -i test_PairedEnd_StrandSpecific_hg19.sam  -o bam2fq_out1
Convert BAM/SAM file into fastq format ...  Done
read_1 count: 649507
read_2 count: 350495

#single-end BAM file
bam2fq.py -i test_SingleEnd_StrandSpecific_hg19.bam -s -o bam2fq_out2
Convert BAM/SAM file into fastq format ...  Done
read count: 1000000

bam2wig.py

将BAM文件转换为wig/bigWig格式
必须使用SAMtools对BAM文件进行正确排序和索引。以下示例显示如何使用samTools对BAM文件进行排序和索引

# sort and index BAM files
samtools sort -m 1000000000  input.bam input.sorted.bam
samtools index input.sorted.bam

bam2wig.py -s hg19.chrom.sizes -i sample.bam -o out -u

bam_stat.py

汇总BAM或SAM文件的映射统计信息

samtools view -bS hisat2.sam -o hisat2.bam
samtools sort hisat2.bam -o hisat2.sort.bam
samtools index hisat2.bam

bam_stat.py -i hisat2.bam


#==================================================
#All numbers are READ count
#==================================================

Total records:                          132342474

QC failed:                              0
Optical/PCR duplicate:                  0
Non primary hits                        13407558
Unmapped reads:                         18177341
mapq < mapq_cut (non-unique):           8412191

mapq >= mapq_cut (unique):              92345384
Read-1:                                 46209851
Read-2:                                 46135533
Reads map to '+':                       46170757
Reads map to '-':                       46174627
Non-splice reads:                       45358434
Splice reads:                           46986950
Reads mapped in proper pairs:           80309530
Proper-paired reads map to different chrom:0

clipping_profile.py

计算clipped nucleotides在reads之间的分布

$ clipping_profile.py -i Pairend_StrandSpecific_51mer_Human_hg19.bam -s "PE" -o out

$ ls -l out.clipping_profile.* |awk '{print $9}'
out.clipping_profile.r
out.clipping_profile.R1.pdf
out.clipping_profile.R2.pdf
out.clipping_profile.xls

deletion_profile.py

计算不同reads之间deletions分布

$ python2.7  deletion_profile.py -i sample.bam -l 101 -o out
Process BAM file ...  Total reads used: 58310

$ ls -l out.deletion_profile.* |awk '{print $9}'
out.deletion_profile.pdf
out.deletion_profile.r
out.deletion_profile.txt

divide_bam.py

分割bam文件
将BAM文件（m个对齐）等分为n个部分。每个部分包含大约m/n 个alignments，这些alignments是从总alignments中随机采样的。

python2.7 divide_bam.py -n 3 -i SingleEnd_StrandSpecific_50mer_Human_hg19.bam -o output
Dividing SingleEnd_StrandSpecific_50mer_Human_hg19.bam ... Done
output_0.bam                                           22179548
output_1.bam                                           22185659
output_2.bam                                           22187574

FPKM_count.py

计算BED文件的FPM、FPKM、
FPM (fragment per million) and FPKM (fragment per million mapped reads per kilobase exon)

FPKM-UQ.py

计算TCGA的count、FPKM和FPKM-UQ值

FPKM-UQ.py --bam 8e5c811c-132a-4f27-b7ef-1a4a644f9079_gdc_realn_rehead.bam --gtf gencode.v22.annotation.gtf --info gencode.gene.info.v22.tsv -o output

geneBody_coverage.py

计算RNA-seq reads覆盖gene body
如果提供了3个或更多BAM文件。这个程序生成了一个lineGraph和一个热图。如果提供的BAM文件少于3个，则仅生成lineGraph。请参阅以下示例。
生成热图时，样本按覆盖范围的“偏度”进行排名：覆盖范围最好（最差）的样本将显示在热图的顶部（底部）。
覆盖偏斜度通过Pearson偏斜度系数进行测量

仅输入排序和索引的BAM文件。
mRNA长度<100的基因/转录物将被跳过（指定为“-l”的数字不能<100）

A single BAM file

geneBody_coverage.py -r hg19.housekeeping.bed -i test.bam  -o output

A list of BAM files separated by “,”

geneBody_coverage.py -r hg19.housekeeping.bed -i test1.bam,test2.bam,test3.bam  -o output

Plain text file containing the paths of BAM files.

geneBody_coverage.py -r hg19.housekeeping.bed -i bam_path.txt  -o output
cat  bam_path.txt
/data/alignment/test1.bam
/data/alignment/test2.bam
/data/alignment/test3.bam

A directory containing BAM files.

geneBody_coverage.py -r hg19.housekeeping.bed -i /data/alignment/  -o output

geneBody_coverage2.py

infer_experiment.py

该程序用于“猜测”RNA-seq测序是如何配置的，特别是链特异性RNA-seq数据的读数是在哪一条链，通过比较strandness of reads和standness of transcripts。
strandness of reads决定比对；standness of transcripts决定注释
对于非链特异性的RNA-seq数据，strandness of reads和standness of transcripts是独立的
对于链特异性的RNA-seq数据，strandness of reads在很大程度上取决于standness of transcripts，有关详细信息，请参见以下3个示例。
在将读数映射到参考基因组之前，您不需要知道RNA测序方案。使用非链特异性比对RNA-seq数据，这个脚本可以“猜测”RNA-seq是哪一种链特异性。

对于双端RNA-seq，有两种不同的strand reads 方式(such as Illumina ScriptSeq protocol)

1++,1–,2+-,2-+
- read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
- read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
1+-,1-+,2++,2–
- read1 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read1 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
- read2 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read2 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand

对于单端RNA-seq，还有两种不同的链读方式：

++,–
- read mapped to ‘+’ strand indicates parental gene on ‘+’ strand
- read mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
+-,-+
- read mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
- read mapped to ‘-‘ strand indicates parental gene on ‘+’ strand

gtf转bed文件

Pair-end non strand specific

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_nonStrandSpecific_36mer_Human_hg19.bam

This is PairEnd Data
Fraction of reads failed to determine: 0.0172
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4903
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4925

1.72%的reads映射到两个位置(基因组区域正链和负链都有基因)；剩余 98.28% (1 - 0.0172 = 0.9828) 的reads，一半可以用1++,1–,2+-,2-+解释，一半可以用1+-,1-+,2++,2–解释。最终得出结论，这不是一个链特异性的数据集，因为strandness of reads独立于standness of transcripts

Pair-end strand specific

infer_experiment.py -r hg19.refseq.bed12 -i Pairend_StrandSpecific_51mer_Human_hg19.bam

This is PairEnd Data
Fraction of reads failed to determine: 0.0072
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9441
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0487

0.72%的reads映射到两个位置(基因组区域正链和负链都有基因)；剩余 99.28% (1 - 0.0072 = 0.9928)的reads，绝大多数可以用1++,1–,2+-,2-+解释。最终得出结论，这不是一个链特异性的数据集，因为strandness of reads独立于standness of transcripts，因此表明是链特异性的数据集。

Single-end strand specific

infer_experiment.py -r hg19.refseq.bed12 -i SingleEnd_StrandSpecific_36mer_Human_hg19.bam

This is SingleEnd Data
Fraction of reads failed to determine: 0.0170
Fraction of reads explained by "++,--": 0.9669
Fraction of reads explained by "+-,-+": 0.0161

inner_distance.py

Calculate inner distance between read pairs.

该模块用于计算两个配对RNA reads之间的内部距离（或插入物大小）,我们首先确定两个配对reads之间的基因组（DNA）大小

D_size = read2_start - read1_end
- 如果两个成对的reads映射到同一外显子： inner distance = D_size
- 如果两个成对的reads映射到不同的外显子：inner distance = D_size - intron_size
- 如果两个成对的reads映射非外显子区域（如内含子和基因间区域：inner distance = D_size
- 如果两个片段重叠，则inner_distance可能为负值。

注：并非所有reads对都用于估计内部距离分布。跳过了那些低质量、PCR重复、多重映射reads。

insertion_profile.py

junction_annotation.py

junction_saturation.py

mismatch_profile.py

normalize_bigwig.py

overlay_bigwig.py

read_distribution.py

比对区域分布统计
将比对到基因组上的Reads分布情况进行统计，定位区域分为CDS（编码区）、Intron（内含子）、Intergenic（基因间区）和UTR（ 5'和3'非翻译区）。在基因组注释较为完全的物种中，通常比对到CDS（编码区）的reads含量最高比对到Intron（内含子）区域的reads来源于pre-mRNA的残留或由可变剪切过程中发生的内含子保留事件导致的，而比对到Intergenic（基因间区）的Reads则可能转录自新基因或新的非编码RNA。

获取bed文件

mamba install -c conda-forge -c bioconda ucsc-gtftogenepred ucsc-genepredtobed
cat genome.gtf | \
	gtfToGenePred /dev/stdin /dev/stdout | \
	genePredToBed /dev/stdin /dev/stdout >   genome.bed

https://bioinformatics.stackexchange.com/questions/7094/making-a-bed-file-for-rseqc

bed文件从官网下载

read_distribution.py \
    -i hisat2.bam \
    -r genome.bed

Total Reads                   100757575
Total Tags                    163279102
Total Assigned Tags           156662725
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb             
CDS_Exons           36791753            119159928           3238.77           
5'UTR_Exons         36551703            2753517             75.33             
3'UTR_Exons         61032544            31236198            511.80            
Introns             1096995615          2574747             2.35              
TSS_up_1kb          34946104            36227               1.04              
TSS_up_5kb          158419554           172100              1.09              
TSS_up_10kb         284020649           260890              0.92              
TES_down_1kb        37387697            336922              9.01              
TES_down_5kb        162180245           506298              3.12              
TES_down_10kb       286126600           677445              2.37              
=====================================================================

read_duplication.py

read_GC.py

read_hexamer.py

read_NVC.py

read_quality.py

sc_bamStat.py -h

sc_seqLogo.py -h

sc_seqQual.py

sc_editMatrix.py

RNA_fragment_size.py

RPKM_count.py

RPKM_saturation.py

任何样本统计（RPKM）的精度都受到样本量（测序深度）的影响；“resampling”或“jacknifing”是一种通过使用可用数据子集来估计样本统计精度的方法。该模块将从总RNA读数中重新采样一系列子集，然后使用每个子集计算RPKM值。通过这样做，我们能够检查当前测序深度在基因表达估计方面是否饱和（或者RPKM值是否稳定）。如果测序深度饱和，估计的RPKM值将是固定的或可重复的。默认情况下，该模块将为每个转录物计算20个RPKM值（使用总读数的5%、10%、…、95%、100%）。

在输出图中，Y轴是“相对误差百分比”或“误差百分比”，用于测量从读数子集（即RPKMobs）估计的RPKM如何偏离真实表达水平（即RPKMreal）。然而，在实践中，人们无法知道RPKMreal。作为代理，我们使用从总读取量估计的RPKM来近似RPKMreal。

RPKM_saturation.py -r hg19.refseq.bed12 -d '1++,1--,2+-,2-+' -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output

说明:Q1,Q2,Q3,Q4是按照转录本表达量4分位分开的.Q1表示的是表达量低于25%的转录本,以此类推.可以看出:随着样本量升高, 与实际值的偏差也在降低.而且转录本表达量越高这种趋势越明显(Q4最明显).