If we had an insert length of 426bp and an adapter size of 65bp, then the length of the central region between two 100bp reads would be 96bp (426bp - 130bp - 100bp - 100bp).
So, provided the mean read length in the library is in fact 100bp, the average distance between the reads for that library would be 96bp.
As Istvan noted (How does the insert size parameter change after trimming (MATS tool)), if you were to trim off a fixed number of bases, resulting for example in an average read length of 98bp, then presumably this distance would increase to 100bp (426bp - 130bp - 98bp - 98bp).
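To make the arithmetic above easy to re-run with other numbers, here is a minimal Python sketch. The function name `inner_distance` and its parameters are made up for illustration, and it assumes the convention used in this question, where the 426bp insert length includes one 65bp adapter on each end.

```python
def inner_distance(insert_len, adapter_len, mean_read_len):
    """Inner distance between two mates.

    Assumption (from the question): insert_len includes one adapter
    on each end, so both adapters are subtracted first.
    """
    fragment_len = insert_len - 2 * adapter_len
    return fragment_len - 2 * mean_read_len


# Numbers from the question: untrimmed 100bp reads, then reads trimmed to 98bp.
print(inner_distance(426, 65, 100))  # 96
print(inner_distance(426, 65, 98))   # 100
```

Each base trimmed from a read adds one base to the inner distance, so trimming both mates by 2bp shifts the value by 4bp.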
If this increase in distance is correct, then how is it any different if we assume those removed bases were actually errors?
The 100bp read length is being reduced to 98bp, so you would assume that the reads would map to a reference genome with 4bp more between them. Should I therefore find the average post-trimming read length of the library and use this to calculate the inner distance, or am I missing something here?
Thanks
I believe you have a good point, and you should recalculate the insert size for any software that is sensitive to it. However, this issue only matters for poor-quality or long reads, where you trim something like 20-50bp. For example, with MiSeq 300bp paired-end reads, the last 50bp are rarely of good quality, so there it matters a lot.
In other words, if the "-r" option in TopHat (for example) was set to 92bp instead of the original 96bp, then this would not make a large enough difference to worry about? (Whereas -100bp would clearly make a big difference).
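If you do want to recalculate, the first step is the mean read length after trimming. A rough Python sketch is below; `mean_read_length` is a hypothetical helper, and it assumes plain, uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines).

```python
import io


def mean_read_length(handle):
    """Mean sequence length over a FASTQ stream.

    Assumes strict 4-line records, so the sequence is always
    the second line of each record.
    """
    total_bases = 0
    n_reads = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence line of each record
            total_bases += len(line.strip())
            n_reads += 1
    if n_reads == 0:
        raise ValueError("no reads found")
    return total_bases / n_reads


# Tiny in-memory example: one 10bp read and one 8bp read.
example = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
)
print(mean_read_length(example))  # 9.0
```

For a real file you would pass an open file handle instead of the `io.StringIO` object, once for each mate file.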
For Illumina short-insert reads, tools rarely use inner distance because it is not a well-defined number. External distance makes much more sense. Experimentally, it is the length of the DNA fragment subjected to sequencing. This length is not affected by read lengths, low-quality bases at the tail, or 3'-end adapters, which occur much more frequently than 5'-end adapters.
That makes sense, but in TopHat the "-r" parameter "is the expected (mean) inner distance between mate pairs. For, example, for paired end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp." Should I be concerned about the removal of these few bp?
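As a quick check of the manual's example, and of how a small trim moves the value, here is a short Python sketch. The function name `tophat_mate_inner_dist` is hypothetical (it is not part of TopHat), and it follows the manual's convention, where the fragment length excludes adapters.

```python
def tophat_mate_inner_dist(fragment_len, mean_len_mate1, mean_len_mate2):
    """Value one might pass to TopHat's -r option: fragment length
    minus the mean length of each mate, rounded to an integer."""
    return round(fragment_len - mean_len_mate1 - mean_len_mate2)


# The manual's example: 300bp fragments, 50bp reads at each end.
print(tophat_mate_inner_dist(300, 50, 50))  # 200

# Same fragments after trimming 2bp from each mate on average.
print(tophat_mate_inner_dist(300, 48, 48))  # 204
```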