Part 1/2
Hi,
Although it's probably too late, and I'm only just starting with rMATS myself, I think I can try to explain those columns:
One row in the SE file contains information about one exon that was skipped at least once in one of the two sample groups (either in condition 1 or in condition 2).
upstreamES and upstreamEE
The column upstreamES stands for upstreamExonStart and the column upstreamEE stands for upstreamExonEnd; the same applies to the downstream exon. These columns hold the chromosomal positions of the nucleotides at the upstream end (ExonStart) and the downstream end (ExonEnd) of the flanking exons.
Figure: positions of the different exon borders. The exon in the middle is the exon described by the respective row of the SE file of the rMATS output.
IC_SAMPLE_1 and SC_SAMPLE_1
The column IC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to inclusion events, meaning that the respective exon would be present in the final processed mRNA transcript after splicing.
The column SC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to events where the respective exon appears to have been skipped.
If you have replicates of your sample 1, the respective read counts will be separated by commas in those columns. This is described in the rMATS paper with the following image.
Figure 1 of the rMATS paper (http://www.pnas.org/content/111/51/E5593.full.pdf?with-ds=yes)
where I stands for reads that are counted as inclusion events and S stands for reads that are counted as skipping events.
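To make the replicate format concrete, here is a minimal Python sketch of how such a comma-separated count field could be split into one number per replicate. The helper name parse_counts and the count values are made up for illustration; they are not taken from a real rMATS output.

```python
def parse_counts(field):
    """Split a comma-separated count field into a list of ints, one per replicate."""
    return [int(x) for x in field.split(",") if x.strip()]

# Hypothetical values for one SE row with three replicates of sample 1
ic_sample_1 = "120,98,134"  # inclusion reads (I) per replicate
sc_sample_1 = "30,41,27"    # skipping reads (S) per replicate

inclusion = parse_counts(ic_sample_1)
skipping = parse_counts(sc_sample_1)

for rep, (i, s) in enumerate(zip(inclusion, skipping), start=1):
    print(f"replicate {rep}: I={i}, S={s}")
```

The same splitting should work for SC_SAMPLE_1 and for the sample 2 columns, as long as the fields only contain integers.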
IncFormLen and SkipFormLen
According to the "SI Materials and Methods" section of the rMATS paper (http://www.pnas.org/content/suppl/2014/10/14/1415762111.DCSupplemental/pnas.1415762111.sapp.pdf), these two columns are only used to normalize the isoform-specific read counts, meaning they are used to calculate the columns IncLevel1 and IncLevel2 like this:
ψ = (I/LI) / (I/LI + S/LS)
where: ψ = inclusion level (IncLevel), I = number of reads mapped to the exon inclusion isoform (IC_SAMPLE_1), S = number of reads mapped to the exon skipping isoform (SC_SAMPLE_1), LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)
The columns are calculated like this:
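As a small worked example, the ψ formula above can be written out in Python like this. It is only a sketch of the formula as quoted here, not the rMATS code itself; the function name inclusion_level and the input numbers are made up.

```python
def inclusion_level(i, s, li, ls):
    """psi = (I/LI) / (I/LI + S/LS), the length-normalized inclusion level."""
    inc = i / li    # inclusion reads per effective inclusion length
    skip = s / ls   # skipping reads per effective skipping length
    total = inc + skip
    if total == 0:
        return None  # no reads at all, psi is undefined
    return inc / total

# Hypothetical row: I=120, S=30, IncFormLen=196, SkipFormLen=98
psi = inclusion_level(120, 30, 196, 98)
print(round(psi, 3))
```

Note that without the length normalization the same counts would give 120 / (120 + 30) = 0.8, while the normalized value is 2/3, because the inclusion isoform has twice the effective length in this example.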
LI = 2( j - r + 1 )
LS = j - r + 1
where: j = junction length, r = read length of your rMATS experiment, LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)
What's the junction length j though? According to Shen Shihao, it's "the overall junction region covered by reads across junctions, affected by read length, anchor length (by default 8 bps in both upstream and downstream exons) and exon length."
j is calculated as (read length - anchor)*2
"If the exon is shorter than (read length - anchor), the junction length will be reduced."
An example calculation of the columns IncFormLen and SkipFormLen can be found in the following Google Group conversation:
https://groups.google.com/forum/#!topic/rmats-user-group/d7rzUBKXF1U
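Putting the formulas for j, LI and LS from above together, a rough Python sketch could look like this. It assumes the default anchor of 8 bp from the quote and a made-up read length of 100, and it deliberately ignores the reduction for short exons mentioned in the quote, so it only covers the simple case. The function names are my own.

```python
def junction_length(read_length, anchor=8):
    """j = (read length - anchor) * 2, as quoted above."""
    return (read_length - anchor) * 2

def effective_lengths(read_length, anchor=8):
    """Return (LI, LS) for the simple case where the exon is not short."""
    j = junction_length(read_length, anchor)
    ls = j - read_length + 1  # SkipFormLen: LS = j - r + 1
    li = 2 * ls               # IncFormLen:  LI = 2(j - r + 1)
    return li, ls

# Hypothetical experiment with 100 bp reads and the default anchor of 8
li, ls = effective_lengths(100)
print(f"j={junction_length(100)}, IncFormLen={li}, SkipFormLen={ls}")
```

With these assumptions j is 184, SkipFormLen is 85 and IncFormLen is 170. If your output shows other values, the exon is probably shorter than (read length - anchor) or your rMATS version calculates the lengths differently.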
PValue and FDR
The PValue column is described in the rMATS paper like this:
"rMATS uses a likelihood-ratio test to calculate the P value that the difference in the mean ψ values between two sample groups exceeds a given threshold"
The rMATS documentation states that the statistical module of rMATS calculates the P-value (PValue) and the false discovery rate (FDR) that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold.
This means that a row with, e.g., a PValue of 0.0001 and an FDR close to zero shows a statistically highly significant difference between the columns IncLevel1 and IncLevel2 of that row.
Be aware that in some cases Excel or, e.g., RStudio will show you an entry of zero when the respective value in those columns is lower than 2.2e-16.
In a recent post I found in their Google group, it seems they made some changes to the calculation of the effective length since 3.2.X in order to cope with reads spanning multiple exons. The new definition seems to be read_length-1. https://groups.google.com/forum/#!searchin/rmats-user-group/read$20length$203.2.X%7Csort:date/rmats-user-group/DeTfsq3Llbw/J7vUBe2DAAAJ
I am using rMATS 4.0.2 and the header of the results has changed. Does this explanation also apply to the updated version? I just want to understand these column names: longExonStart_0base, longExonEnd, shortES, shortEE, flankingES, flankingEE.
According to the above explanation, the skipped exon position should lie between upstreamES and downstreamES, but I didn't find this type of pattern, as in this example: ENSG00000141480 ARRB2 chr17 + 4715150 4715336 4715199 4715336 4715012 4715043
This question was about the SE file, which contains information about differences in exon skipping. I only know the columns you named from the ASS file, which concerns alternative splice site usage. By definition, alternative splice site usage leads to either a longer or a shorter exon, depending on the position of the alternatively used splice site.
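To illustrate this with the row posted above, here is a small Python sketch that checks whether the short exon lies inside the long exon and how many bases the alternative splice site adds. I am assuming that the six numbers follow the column order named in the question (longExonStart_0base, longExonEnd, shortES, shortEE, flankingES, flankingEE) and that both start columns use the same coordinate convention; the function names are my own.

```python
# Row from the question, with the assumed column order
row = {
    "strand": "+",
    "longExonStart_0base": 4715150,
    "longExonEnd": 4715336,
    "shortES": 4715199,
    "shortEE": 4715336,
    "flankingES": 4715012,
    "flankingEE": 4715043,
}

def short_within_long(r):
    """True if the short exon lies completely inside the long exon."""
    return (r["longExonStart_0base"] <= r["shortES"]
            and r["shortEE"] <= r["longExonEnd"])

def extra_length(r):
    """Number of bases by which the long exon exceeds the short exon."""
    long_len = r["longExonEnd"] - r["longExonStart_0base"]
    short_len = r["shortEE"] - r["shortES"]
    return long_len - short_len

print(short_within_long(row))  # short exon is part of the long exon
print(extra_length(row))       # bases added by the alternative splice site
print(row["flankingEE"] < row["longExonStart_0base"])  # flanking exon lies before the long exon
```

For this row both exons share the same end (4715336) and differ only in their start, and the flanking exon lies before them on the chromosome. So this is not a skipped exon between two flanking exons, but one exon with two alternative start positions.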