I have fractionated a single RNA sample into size bins (1-2 kb, 2-3 kb, etc.) prior to preparing sequencing libraries and carrying out RNA-seq. The bins all span the same 1 kb increment.
Is there a way of probabilistically scaling the bins so that they are comparable at a gene-by-gene level? Say gene X has 3 annotated isoforms, of length 200, 1000 and 2000 nt, and I want to establish their ratios in the sample: can I use (Bayesian?) multivariate linear regression, or some other approach, to do this?
I can't find any literature where people have done this (but then we usually don't fractionate prior to sequencing). Does anyone have any ideas?
It's a little unclear what you're trying to accomplish. If your concern is that your quantitation can be biased by fragment lengths (I assume you're using an Ion Torrent or something like that, since feeding a 3 kb fragment into an Illumina machine will not end well), presumably due to longer fragments yielding better discriminative power, then you can use any of the standard EM methods (RSEM, eXpress, or newer methods like salmon and kallisto). The bias is then minimized as much as it reasonably can be.
Hi Devon - thanks for the reply!
The challenge is that I have several (for simplicity's sake, let's reduce it to two) different long-read sequencing libraries, made from the exact same RNA tube, from RNA that was gel-separated by size into 1-2 kb and 5-6 kb fractions. I am trying to scale my data between these bins. Take a particular gene X which (again for simplicity's sake) has two isoforms, A and B, of different lengths: let's say A is 1500 nt long and B is 5500 nt long.
I observe 100 reads from A and 2 reads from B in bin1_2, and 5 reads from A and 20 reads from B in bin5_6. What I want to model/figure out is what is the approximate ratio of A:B in the original sample?
The bias I am worried about is that most (annotated) genes are actually on the short end of the spectrum, so in the longer bins (e.g. 5-6 kb) I am enriching for a smaller number of genes for which I will see reads, and I'll think the minor (longer) isoforms are more prevalent than they really are. Latest human GENCODE, for example:
NB: if it helps, the bins are of equal size and equally spaced, and they cover the entire dynamic range I'm investigating.
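To make the arithmetic concrete, here is a minimal sketch in Python of one way the A:B question could be framed. It is not an established method, and it rests on several assumptions. It assumes each long read corresponds to one molecule (so no per-read length normalization is applied), that the bins together cover the whole size range, and, crucially, that a per-bin scaling factor is available that says how many molecules of the original sample each read in that bin stands for. The `molecules_per_read` values below are hypothetical placeholders; in practice they would have to come from something measured, such as the RNA yield of each gel fraction or spike-ins added before fractionation. The function name `pooled_abundance` is likewise made up for illustration.

```python
# Read counts from the example above: bin -> isoform -> observed reads.
counts = {
    "bin1_2": {"A": 100, "B": 2},
    "bin5_6": {"A": 5, "B": 20},
}

# HYPOTHETICAL scaling factors (an assumption, not measured data):
# how many molecules of the original sample one read in each bin represents.
molecules_per_read = {"bin1_2": 10.0, "bin5_6": 1.0}


def pooled_abundance(counts, molecules_per_read):
    """Sum scaled read counts across bins to estimate per-isoform abundance."""
    totals = {}
    for bin_name, isoform_counts in counts.items():
        scale = molecules_per_read[bin_name]
        for isoform, n_reads in isoform_counts.items():
            totals[isoform] = totals.get(isoform, 0.0) + scale * n_reads
    return totals


totals = pooled_abundance(counts, molecules_per_read)
ratio_a_to_b = totals["A"] / totals["B"]
print(totals)        # estimated relative abundance per isoform
print(ratio_a_to_b)  # estimated A:B ratio in the original sample
```

With these placeholder factors the estimate is 1005 for A and 40 for B, an A:B ratio of about 25:1. With equal factors for both bins it would be 105:22, roughly 4.8:1, which shows how strongly the answer depends on knowing the relative size of each fraction. Without some independent measure of that, the original ratio does not appear to be identifiable from the read counts alone.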