I'm trying to investigate splicing events in RNA-seq experiments over multiple libraries.
STAR produces an SJ.out.tab output file listing the splicing events in one library. After alignment I have one SJ.out.tab per library.
I would like to compare splicing event counts between libraries, but the libraries do not have the same number of mapped reads, which makes a direct comparison impossible.
Is there a way to normalize this kind of count, other than dividing by the library size?
Would running featureCounts and then DESeq2 to get a sizeFactor to apply to my splicing event counts work?
I'm aware of several biases involved in RNA-seq experiments:
Gene length: I want to compare gene1 against gene1 in different conditions, so it's OK
Gene GC composition: I want to compare gene1 against gene1 in different conditions, so it's OK
RNA population composition for each condition: Biologically, I expect no variation in the amount of expressed genes, only more or fewer splicing events
Batch effect: The design is a bit messy, but I'll just ignore it for now (I'm in the exploration step)
What remains:
Library size: Experiments took place in different ships, so the library sizes are very different
Thanks !
Edit: I'm not looking for gene isoforms. I'm more interested in gene recombination (around 170 000 bp), as I'm working with B cells.
I would add that you also need to normalize by gene expression. Otherwise you'll see an increase in all splicing events of an upregulated gene. A common approach is to normalize by the reads across an exon, for example by computing the PSI (percent spliced in).
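As a rough illustration of the PSI idea, here is a minimal Python sketch. It assumes the simplest definition (inclusion reads divided by inclusion plus exclusion reads); the function name and the read counts are made up for the example.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI = inclusion / (inclusion + exclusion), as a fraction in [0, 1]."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None  # no reads cover the event, so PSI is undefined
    return inclusion_reads / total


# Hypothetical counts: the gene is 3x more expressed in the second library,
# so the raw counts triple, but the PSI stays the same.
psi_low_expression = percent_spliced_in(30, 70)
psi_high_expression = percent_spliced_in(90, 210)
print(psi_low_expression, psi_high_expression)
```

Because the ratio is taken within each library, a change in overall gene expression cancels out and only a shift in splicing moves the value.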
My biologists claim that there should not be any variation in gene expression across conditions, only a change in splicing events. I'm not looking for gene isoforms; I'm interested in recombination of large genes (immune cells).
Do you want to investigate the actual splicing events or just the numbers of events per gene per condition?
If you aim for the latter, you could run a Fisher's exact test per gene and then correct for multiple hypothesis testing. The contingency table would be something like: number of events in the two conditions x {gene1, background}.
I don't want to investigate the actual splicing events, but I don't want a general splicing count either.
I have a very small region of interest, let's say 200 000 bp on a specific chromosome. Using my GTF file I know all the genes in this region (maybe 20 genes). I want to be able to say "In condition1 I've got more gene recombination events merging gene A and gene B than in condition2".
I filtered my count table (the SJ.out.tab STAR output) to keep only my area of interest. I'm looking for a normalization of these read counts.
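For readers who want to reproduce this kind of filtering, here is a minimal Python sketch. It relies on the standard SJ.out.tab layout (tab-separated, column 1 chromosome, columns 2 and 3 intron start and end, column 7 uniquely mapping reads crossing the junction). The function name, the region coordinates and the two example lines are hypothetical.

```python
import io


def junctions_in_region(lines, chrom, start, end):
    """Yield (intron_start, intron_end, unique_reads) for junctions inside the region."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        # SJ.out.tab columns (1-based): 1 chromosome, 2 intron start,
        # 3 intron end, 7 uniquely mapping reads crossing the junction.
        sj_chrom, sj_start, sj_end = fields[0], int(fields[1]), int(fields[2])
        unique_reads = int(fields[6])
        if sj_chrom == chrom and sj_start >= start and sj_end <= end:
            yield sj_start, sj_end, unique_reads


# Hypothetical two-line excerpt; with a real file, pass open("SJ.out.tab").
example = io.StringIO(
    "chr12\t1000\t5000\t1\t1\t1\t150\t3\t40\n"
    "chr1\t200\t900\t2\t1\t0\t12\t0\t25\n"
)
kept = list(junctions_in_region(example, "chr12", 1, 200000))
region_total = sum(reads for _, _, reads in kept)
print(kept, region_total)
```

The per-library `region_total` is one candidate for the background against which individual junction counts can be compared.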
Let's say you have a total of 1000 junction reads in condition A and 1500 in condition B. For the Gene1 to Gene2 junctions, you have 150 in condition A and 20 in condition B.
In case you have 200 instead of 20 reads in condition B:
Using this approach no normalisation is needed.
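The two comparisons can be sketched in Python as follows. This is only an illustration under stated assumptions: the background row is taken to be all other junction reads (total minus the Gene1 to Gene2 reads), and the test is written out with the standard library so it runs without extra packages (`scipy.stats.fisher_exact` would be the usual library route). The function names are mine.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def weight(x):
        # Number of tables with x in the top-left cell, all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / denom


total_a, total_b = 1000, 1500  # total junction reads per condition


def junction_vs_background(junction_a, junction_b):
    # Rows: Gene1-Gene2 junction, all other junctions (assumed background).
    # Columns: condition A, condition B.
    return (junction_a, junction_b, total_a - junction_a, total_b - junction_b)


p_150_vs_20 = fisher_exact_two_sided(*junction_vs_background(150, 20))
p_150_vs_200 = fisher_exact_two_sided(*junction_vs_background(150, 200))
print(p_150_vs_20, p_150_vs_200)
```

With 150 vs 20 reads the p-value is extremely small, while 150 vs 200 is not significant at the 0.05 level, because 150/1000 and 200/1500 are similar fractions of their respective totals. The totals in the table are what account for the different library sizes.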
Thanks, that's clearer! The number of Gene1 to Gene2 junctions could be very different, so I'll need to normalize one way or another.
Which kind of normalization should I apply to this type of data? Would a linear regression fit my data?
Also, what should I use as the factor: the number of total reads? The number of mapped reads? The number of splice junctions?
For the Fisher's exact test, you need counts. The normalisation is given by the background data. In your case it could be the total number of spliced reads in the 170 kbp area.
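To make the role of the background concrete, here is a small Python sketch using the numbers from the earlier example (1000 and 1500 junction reads in total, 150 and 20 for Gene1 to Gene2). Expressing each junction as a fraction of the background is my own illustration of the idea, not a prescribed method, and the function name is hypothetical.

```python
def fraction_of_background(junction_reads, background_reads):
    """Share of the background junction reads that support one junction."""
    return junction_reads / background_reads


frac_a = fraction_of_background(150, 1000)  # condition A
frac_b = fraction_of_background(20, 1500)   # condition B
fold_change = frac_a / frac_b
print(frac_a, frac_b, fold_change)
```

The fractions are comparable between libraries of different depth, which is the same information the contingency table feeds into the test.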
I've got counts, but Fisher's exact test is a test to see whether I need to normalize my counts. If I get a p-value < 2.2e-16, how can I normalize those counts?
This p-value indicates if the differences in number of reads spanning a certain junction are likely to be a "true" difference due to the condition or merely random.
Using this approach would give a p-value for each fusion/recombination event, like in a differential expression analysis. I thought this was what you need.
Nah, let's say I have 2 conditions :
The forced splicing event only occurs between these two genes; everywhere else, my modified mouse should have the same number of splicing events.
So I expect that the huge number of splice events for the modified mouse at this location (geneA/geneB) is due to my condition2 (here I could use your Fisher's test to prove that).
But if, to begin with, I have more reads in condition2 than in condition1, I will also, statistically, get more splicing events, which is not due to my condition. It is due to the total number of reads output by the sequencing.
I want to normalize this bias! I don't know if I'm clear enough.