Dear biofellas, I have a philosophical question here. While I am aware that the title question can be either supported or opposed by each of us, I am interested in gathering your opinions to reach a consensus and make a choice for my project :)
I have five RNASeq libraries from five different species, and the insert size distributions seem "normal" (normal both in the commonsense and the Gaussian sense), but I want to improve mapping accuracy a bit. I am mapping onto the transcriptome of one of the five species, so I am not using splice-aware methods. The fact that the species are different obviously alters the insert size distribution a bit by itself.
Do you think it makes sense to restrict the insert size range to the values that my bell-shaped plots show? Or should I instead make use of everything I find, given the difference in species between reads and reference?
Disclaimer: I am 1% more confident in not considering the insert size.
It doesn't seem to me that restricting the insert size range would be beneficial, but I can certainly see places where it would cause problems, particularly with short transcripts.
Please see my reply to @Friederike! :)
Mapping programs already take the pair distance into account and try to place them as close as possible.
But if you don't provide one, what scale do they use to define a "good" versus a "bad" mapping distance?
which aligner are you using? have you checked the respective documentation?
bwa estimates the insert size range from the first X number of reads. BBMap maintains a running average of the observed insert sizes and uses that. Other programs probably use similar strategies...
Bowtie2 doesn't do this, as far as I know, unless you set -I and -X!
what about kallisto or salmon?
The point is that I tried different aligners and in the end chose Bowtie2 for other reasons. I map 100,000 read pairs, estimate the insert size by extracting the TLEN field from the SAM file, and plot it to see the bell-shaped curve and where the peak is.
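For reference, the TLEN-extraction step can be sketched like this (a minimal sketch, not the poster's actual script; it keeps only the mate with a positive TLEN so each pair is counted once, and the SAM records below are made-up toy data):

```python
import statistics

def insert_sizes(sam_lines):
    """Collect TLEN values (SAM column 9) for properly paired reads,
    keeping only the mate with a positive TLEN so each pair is
    counted exactly once."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        if flag & 0x2 and tlen > 0:       # properly paired, forward mate
            sizes.append(tlen)
    return sizes

# Toy example with hypothetical alignment records (one pair of mates
# with TLEN +/-300, plus one forward mate with TLEN 250):
sam = [
    "@HD\tVN:1.6",
    "r1\t99\ttx1\t100\t42\t50M\t=\t350\t300\t" + "A" * 50 + "\t" + "I" * 50,
    "r1\t147\ttx1\t350\t42\t50M\t=\t100\t-300\t" + "A" * 50 + "\t" + "I" * 50,
    "r2\t99\ttx1\t10\t42\t50M\t=\t210\t250\t" + "A" * 50 + "\t" + "I" * 50,
]
sizes = insert_sizes(sam)
print(sizes)                    # [300, 250]
print(statistics.mean(sizes))   # 275
```

The resulting list is what you would histogram to see the bell-shaped curve and its peak.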
Basically, that is what BWA does by itself. My point is more: is this needed or is it biasing my analysis from a biological point of view (as Brian said, regarding the small transcripts)?
Yes, it biases your analysis.
For DNA mapping, where presumably the length of chromosomes is much greater than insert size, you can substantially increase accuracy by deprioritizing potential mapping locations in which pairs have an unreasonable insert size. On the upper end only! There will always be short inserts so it does not really make sense to ban them. That said, if the average insert size is 400, and you have 2 different possible pair mapping locations, one with an apparent insert size of 400 and the other with an apparent insert size of 100, the one with an apparent insert size of 400 is much more likely to be correct. That does not mean it IS correct, it's just more likely.
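The "more likely, not certainly correct" point can be illustrated with a toy Gaussian fragment-length model (the mean, standard deviation, and candidate insert sizes below are all made-up numbers, not values from any real aligner):

```python
import math

def insert_likelihood(observed, mean=400.0, sd=60.0):
    """Gaussian density of an observed insert size under a toy
    fragment-length model (mean and sd are hypothetical)."""
    z = (observed - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Two candidate pair placements, apparent insert sizes 400 vs 100:
l400 = insert_likelihood(400)
l100 = insert_likelihood(100)
print(l400 > l100)      # True: the 400 bp placement is favoured
print(l400 / l100)      # ratio is e**12.5, roughly 2.7e5, under these toy numbers
```

An aligner using this kind of prior would deprioritize (not forbid) the 100 bp placement, which is exactly the distinction between "more likely" and "is correct".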
Insert sizes from mapping are inherently untrustworthy in eukaryotic RNA-seq due to the presence of introns. BBMap (against the genome) will generally correctly calculate insert sizes for reads that are fully overlapping, but for reads that are not overlapping, where introns might randomly appear in the unsequenced portion, it's impossible. Even when mapping to the transcriptome, with differential splicing, it's just not possible to determine the insert size accurately (it should be fine on an unspliced prokaryote though). So if you are dealing with a euk, it is definitely a bad idea to cap insert size ranges. For a prok, it still is a bad idea, but it's actually OK to bound it on the upper side. For example, you can say "don't allow pairs that are farther apart than 2000bp" since those cannot occur with Illumina sequencing of an unspliced transcript, from what I understand about the limits of bridge-amplification. But it is never wise to exclude short inserts in RNA-seq.
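An upper-bound-only filter like the prokaryote example above (cap at 2000 bp, never reject short inserts) would look something like this sketch, with the cut-off being the example value from the discussion rather than a recommended constant:

```python
def plausible_pair(tlen, upper=2000):
    """Upper-bound-only filter: short inserts always pass; only pairs
    farther apart than `upper` bp are rejected (2000 is the example
    cut-off from the discussion above)."""
    return abs(tlen) <= upper

# Short and medium inserts pass; only the implausibly long pair is dropped:
print([t for t in (50, 300, 1800, 2500) if plausible_pair(t)])  # [50, 300, 1800]
```

With Bowtie2, the equivalent would be setting -X 2000 while leaving -I at its default of 0, so no lower bound is imposed.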
Of course, any limitations will incur bias. For example, if there is a long deletion event in the unsequenced middle, setting bounds will eliminate read pairs that encompass it. For mappers other than BBMap, it may also eliminate pairs that encounter the event within a read. Ultimately your choice of mapper and parameters depends on what you want to accomplish, which would be somewhat useful to note in this thread.
ok, I'm slowly understanding what you're getting at. but another question: why do you think that your current alignment has suboptimal precision? and why do you think that tweaking the default settings (max. fragment length 500 and no minimum fragment length threshold) will help with that?
As mentioned in my post, I am not eager to set upper and lower boundaries for the insert size. However, since with DNA reads this is a crucial step to avoid misalignments, I was asking whether in your experience it also helps when mapping RNA reads onto a transcriptome.
If it comes to me, I wouldn't really consider it, but I am second guessing myself, hence I ask :D
I didn't assume you were eager to set boundaries, but since you said you wanted to improve precision and hence thought about playing with the fragment sizes, I wondered what led you to think that. If you want to collect arguments, it's helpful to know which ones you're already aware of.
Brian has given you a comprehensive answer that I absolutely second.
Yep, that convinced me. :) thanks guys!
maybe I'm missing it, but I do not understand why you would want to restrict the insert size to begin with.
EDIT: I guess your rationale is described here: Does it make sense to limit insert size when mapping on transcriptome?
I am mapping RNASeq paired-end reads onto a transcriptomic reference, so the distance between the mates in a fragment is information I could make use of, to avoid the two mates mapping far apart when the fragment length tells me they are closer.