展开

MATS

最后发布时间 : 2023-05-06 11:02:46 浏览量 :

rMATS是一款对RNA-Seq数据进行差异可变剪切分析的软件。其通过rMATS统计模型对不同样本(有生物学重复的)进行可变剪切事件的表达定量,然后以likelihood-ratio test计算P value来表示两组样品在IncLevel(Inclusion Level)水平上的差异(从公式上来看,IncLevel跟PSI的定义也是类似的),lncLevel并利用Benjamini Hochberg算法对p value进行校正得FDR值。
https://github.com/Xinglab/rmats-turbo/blob/v4.1.2/README.md

rMATS在处理具有生物学重复的样本时,并不是将所有样本合并后,再进行可变剪切的差异统计检验的。因为
pooling RNAs or merging RNA-Seq data from multiple replicates is not an effective approach to account for variability, and the results is particularity sensitive to outliner.

rMATS的结果包含两种类型, using reads mapped to splice junctions only or using reads mapped to both splice junctions and exon body.These two types of results are comparable. Therefore, we used results using reads mapped to both splice junctions and exon body in our study. PMC7498679

R package maser

可变剪接定量指标PSI

rMATS采用exon inclusion level来定义样本中可变剪切事件的表达量,以外显子跳跃(Skipped Exon)为例,正常的转录本称之为Exon Inclusion Isofrom,发生了外显子跳跃的转录本则称之为Exon Skipping Isofrom

生信小木屋

I表示比对到Exon Inclusion Isofrom上的reads,S表示比对到Exon Skipping Isofrom上的reads, 则该外显子跳跃的可变剪切事件比例可以表示为:
生信小木屋

可以看到,exon inclusion level实际上是inclusion isofrom所占的比例,计算时,用长度校正了原始的reads数。其他类型的可变剪切事件也可以划分成上述两种isoform, 示意图如下

生信小木屋

基于RNA-Seq数据的可变剪接定量值PSI计算流程

生信小木屋

差异分析与统计检验

rMATS的使用

安装

mamba create -n rmats rmats
conda activate rmats

使用

测试数据
该数据中已经通过hisat得到bam文件

RNA-seq/hisat2_mapping
├── HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam
├── HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam
├── HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam
├── UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam
├── UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam
└──  UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam
cd RNA-seq
ls hisat2_mapping | grep ".bam"  | xargs | sed 's/ /,/g' > all.txt
ls hisat2_mapping | grep ".bam" | grep "HBR" | xargs | sed 's/ /,/g' > HBR.txt
ls hisat2_mapping | grep ".bam" | grep "UHR" | xargs | sed 's/ /,/g' > UHR.txt

HBR

HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam,HBR_Rep2_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam,HBR_Rep3_ERCC-Mix2_Build37-ErccTranscripts-chr22.sort.bam

UHR

UHR_Rep1_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam,UHR_Rep2_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam,UHR_Rep3_ERCC-Mix1_Build37-ErccTranscripts-chr22.sort.bam

rMATS的分析有两个步骤,preppost
prep步骤中,处理输入文件,并将summary保存到--tmp目录中的.rmats文件中。rmats文件根据输入.txt文件中指定的BAM的完整路径分别跟踪每个BAM的信息。
post步骤中,读取.rmats文件并创建最终输出文件。

rmats.py \
    --b1 all.txt \
    --gtf genomic/chr22_with_ERCC92.gtf \
    -t paired \
    --readLength 100 \
    --nthread 4 \
    --od rmats_prep \
    --tmp rmats_tmp \
    --task prep
results/hisat2/hypo_CON_VEH_68/hisat2.bam
150
147
"ENSRNOG00000029582";7,8,10;8,8,10;8,10,11
"ENSRNOG00000037658";4,6,10;5,6,10;6,6,10;6,10,11
"ENSRNOG00000067650";5,6,13;6,6,13;6,13,14
"ENSRNOG00000003782";0,0,4;0,4,5
"ENSRNOG00000007688";5,6,8;6,6,8;6,8,9

单独运行统计模型

rMATS统计模型需要一个事件定义文件fromGTF.[AS].txt和计数文件{JC,JCEC}.raw.input.[AS].txt作为输入。通常这些文件是由post步骤创建的,该步骤还运行统计模型以创建最终输出文件[AS].MATS.{JC,JCEC}.txt

python rmats.py --od /path/to/dir_with_existing_files --tmp /path/to/tmp_dir --task stat

--taskstat的一个用例是当有两个以上的组要比较时。例如,如果有3个样本组,则可以将每个样本组与其他两个样本组(1至2、1至3、2至3)进行比较。

统计reads的长度

zcat ${testData}/RNA-seq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz | head -n 2  | awk -F "" '{print NF}'

结果说明

每一个可变剪切的类型都对应一组输出文件,这里使用[AS_Event]代替5种可变剪切,说明rmats生成文件的含义

  • SE (skipped exon),
  • MXE (mutually exclusive exons)
  • A3SS (alternative 3' splice site)
  • A5SS (alternative 5' splice site)
  • RI (retained intron)

--od包含了post step(--task post)中的最终生成文件

  • [AS_Event].MATS.JC.txt:(Junction Counts)Final output that contains the list of events and read counts. Only splice junction reads are counted.
  • [AS_Event].MATS.JCEC.txt:(Junction Counts Exon Counts)Final output including both reads that span junctions defined by rmats (Junction Counts) and reads that do not cross an exon boundary (Exon Counts)
  • fromGTF.[AS_Event].txt:鉴定到来自GTF和RNA的可变剪切事件
  • fromGTF.novelJunction.[AS_Event].txt
  • fromGTF.novelSpliceSite.[AS_Event].txt
  • JC.raw.input.[AS_Event].txt
  • JCEC.raw.input.[AS_Event].txt:(Junction Counts Exon Counts)

[AS_Event].MATS.JC.txt: contains the list of events and read counts.

rMATS event id Event specific columns(event coordinates) rMATS event id
IDGeneIDgeneSymbolchrstrandlongExonStart_0baselongExonEndshortESshortEEflankingESflankingEEIDIJC_SAMPLE_1SJC_SAMPLE_1IJC_SAMPLE_2SJC_SAMPLE_2IncFormLenSkipFormLenPValueFDRIncLevel1IncLevel2IncLevelDifference
1"ENSRNOG00000000661""Hps4"chr12-442669834426760744266983442670924427154644272315115,8,13,00,0,0,019,18,1,321,0,0,029814911.01.0,1.0,1.0,NA0.905,1.0,1.0,1.00.024
7"ENSRNOG00000025612""Sez6l"chr12+44215856442159714421585944215971442137734421380672,0,0,02,0,0,00,0,0,01,0,0,015214911.00.495,NA,NA,NA0.0,NA,NA,NA0.495
12"ENSRNOG00000047896""Kctd10"chr12+422434214224351142243424422435114224217042242340121,2,0,023,8,5,12,0,0,022,18,0,3815214911.00.041,0.197,0.0,0.00.082,0.0,NA,0.00.032
16"ENSRNOG00000000888""Sbds"chr12+26428904264296492642929926429649264265642642672916349,164,82,250,0,0,0245,349,59,3900,1,0,029814911.01.0,1.0,1.0,1.01.0,0.994,1.0,1.00.002
18"ENSRNOG00000000994""Stxbp2"chr12+169934016995011699343169950116991801699266182,0,0,022,7,5,21,2,0,014,30,6,2715214911.00.082,0.0,0.0,0.00.065,0.061,0.0,0.0-0.011
19"ENSRNOG00000049604""Mvk"chr12-421526824215286342152682421528274215666442156812190,0,0,112,5,2,00,0,0,09,8,1,1318514911.00.0,0.0,0.0,1.00.0,0.0,0.0,0.00.25
22"ENSRNOG00000001032""Ccz1b"chr12+1062118810622312106221531062231210615067106151792212,17,3,10,0,0,06,9,2,190,1,0,029814911.01.0,1.0,1.0,1.01.0,0.818,1.0,1.00.045
23"ENSRNOG00000001299""Sun1"chr12-1540619415406306154061941540630315407327154073712317,4,3,014,6,0,29,19,2,3611,10,2,201521490.6565926156691.00.543,0.395,1.0,0.00.445,0.651,0.495,0.638-0.073
24"ENSRNOG00000001317""Zfp68"chr12+160293431603450916034382160345091602873516028848244,1,0,13,1,0,00,0,1,11,1,0,129814911.00.4,0.333,NA,1.00.0,0.0,1.0,0.3330.244
  • PValue: 两个样本组之间拼接差异的显著性。
  • IncLevel1: Inclusion level for sample 1. Replicates are comma separated. Calculated from normalized counts
  • IncLevel2:Inclusion level for sample 2. Replicates are comma separated. Calculated from normalized counts
  • IncLevelDifference: average(IncLevel1) - average(IncLevel2)
rmats_output/
├── MXE.MATS.JC.txt
├── RI.MATS.JC.txt
├── SE.MATS.JC.txt
├── A3SS.MATS.JC.txt
└── A5SS.MATS.JC.txt

├── MXE.MATS.JCEC.txt
├── RI.MATS.JCEC.txt
├── SE.MATS.JCEC.txt
├── A3SS.MATS.JCEC.txt
└── A5SS.MATS.JCEC.txt

├── fromGTF.A3SS.txt
├── fromGTF.A5SS.txt
├── fromGTF.MXE.txt
├── fromGTF.RI.txt
└── fromGTF.SE.txt

├── fromGTF.novelJunction.A3SS.txt
├── fromGTF.novelJunction.A5SS.txt
├── fromGTF.novelJunction.MXE.txt
├── fromGTF.novelJunction.RI.txt
└── fromGTF.novelJunction.SE.txt

├── fromGTF.novelSpliceSite.A3SS.txt
├── fromGTF.novelSpliceSite.A5SS.txt
├── fromGTF.novelSpliceSite.MXE.txt
├── fromGTF.novelSpliceSite.RI.txt
└── fromGTF.novelSpliceSite.SE.txt

├── JCEC.raw.input.A3SS.txt
├── JCEC.raw.input.A5SS.txt
├── JCEC.raw.input.MXE.txt
├── JCEC.raw.input.RI.txt
└── JCEC.raw.input.SE.txt

├── JC.raw.input.A3SS.txt
├── JC.raw.input.A5SS.txt
├── JC.raw.input.MXE.txt
├── JC.raw.input.RI.txt
└── JC.raw.input.SE.txt

└── summary.txt

JC和JCEC区分

https://www.jianshu.com/p/d09b95a98c64
rMATS中,JC是Junction Counts的缩写,表示跨越剪切位点的reads(暂且叫为JC reads)数量,JCEC是Junction Counts和Exon Counts的缩写合并,Exon Counts表示不跨越剪切位点的reads数量,JCEC可以理解为所有比对上的reads(暂且叫为JCEC reads)。他们的关系见下图:

生信小木屋

注意:参数ftype指示要导入哪些rMATS输出文件。较新的rMATS版本(>4.0.1)使用JCEC或JC命名法,而ReadsOnTargetedJunction或JunctionCountOnly在rMATS 3.2.5或更低版本中使用。看见用于参数描述的脉泽。

差异分析与统计检验

rmats 在差异分析时,比较的就是两组样本中inclusion level的差异,给定阈值c, 判断两个样本中对应inclusion level 的是否发生了变化,公式如下

生信小木屋

c这个阈值通过--cstat参数自定义,取值范围为0-1,代表的是两个样本中inclusion level的差值,0.1表示两个样本中该可变剪切事件的inclusion level相差10%。当然,实际计算过程是非常繁琐的,需要考虑数据的分布,对应的统计模型等各种因素,最终会给出每个可变剪切事件的p值和多重假设检验校正后的FDR值。

ф(exon inclusion level)用来定量可变剪切,即包含可变剪切事件区域的转录本在包含和跳过可变剪切事件区域的转录本中的百分比。△ф (△ф = | ф1- ф2 |)和 FDR 值用来确认两组样品间是否存在差异可变剪切,ф1 和 ф2 分别为两组样品的 exon inclusion level,当满足 △ф >5% 且 FDR <=1% 时,认为两组样品在该剪切位点发生了差异可变剪切。下图是已知转录本和新转录本中 rMATS 识别到的5种可变剪切数量的统计图。

rmats安装存在问题

error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory
sudo apt-get install libgsl-dev
sudo ln /lib/x86_64-linux-gnu/libgsl.so /lib/x86_64-linux-gnu/libgsl.so.25

参考

https://www.jianshu.com/p/9987e897c133