I'm a bit confused after running an aligner (BWA, the Burrows-Wheeler Aligner) on my RNA-seq data to get an estimate of the mate inner distance for use in downstream analysis. The total RNA libraries were prepared with universal Illumina adapters and sequenced as 150 bp paired-end (PE). BWA gave me an average insert size of ~250 +/- 60 bp, and the sequencing company gave me a target fragment size of 394 bp. The adapters are 34 bp long.
Here is how I understand things so far, please correct me if I'm wrong.
1) Does the fragment size target value of 394 include the 3' and 5' adapters? I would think so.
2) Does the read length include adapters? I would say no, because quality checks on the clean data (i.e. adapters trimmed) give a read length of exactly 150 bp. If the adapters are 34 bp long and were included in the read length, shouldn't it be 150 - 2x34 = 82 bp after they are removed?
3) Given the two previous points, that would give an insert size of 394 - 2x34 = 326 bp, which is different from the BWA estimate. Is it usual to have a large difference between the target fragment size and the actual size of the fragments that are sequenced? And if I have an insert size of ~250 bp and PE reads of 150 bp, does that mean that the left and right reads overlap in the middle?
Thanks for your help! Antoine
1) Yes. What they gave you is probably the result of a Bioanalyzer gel electrophoresis, which measures the full length of the DNA fragment that is pipetted onto the flow cell: that is, the actual genomic fragment + the Illumina adapters + the P5/P7 sequences that are required for flow cell binding (total adapter content should therefore be roughly 120 bp). Check this picture for an idea of how this looks.
2) Read length is simply the number of base calls that the sequencer performs, so it sequences 150 bp into the fragment. If your fragment (the actual genomic fragment) is shorter than that, parts of the adapter will also be sequenced, requiring adapter trimming. You can check that with e.g. FastQC, followed by trimming with e.g. bbduk, skewer, or cutadapt.
3) The insert size that aligners report is based solely on the distance between the alignments of the two mates on the reference genome. Adapter sizes and read lengths play no role here (given that you properly trimmed all adapter content).
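To make the arithmetic in the three points concrete, here is a rough Python sketch. The function names are made up for illustration, the ~120 bp total adapter content is only the rough figure from point 1 (not a measured value), and the last helper assumes a standard SAM file where column 2 is the FLAG and column 9 is the signed template length (TLEN). Treat it as a back-of-the-envelope check, not as what BWA does internally.

```python
from statistics import mean, stdev


def insert_from_fragment(fragment_bp, adapter_total_bp):
    """Insert size = Bioanalyzer fragment size minus total adapter content."""
    return fragment_bp - adapter_total_bp


def mate_inner_distance(insert_bp, read_len):
    """Gap between the two mates; a negative value means the mates overlap."""
    return insert_bp - 2 * read_len


def adapter_bases_in_read(insert_bp, read_len):
    """Bases of adapter read-through when the insert is shorter than the read."""
    return max(0, read_len - insert_bp)


def summarize_tlen(sam_lines):
    """Mean and standard deviation of TLEN over properly paired reads.

    Only positive TLEN values are kept so that each pair is counted once.
    """
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        if flag & 0x2 and tlen > 0:  # 0x2 = read mapped in a proper pair
            tlens.append(tlen)
    if len(tlens) < 2:
        raise ValueError("need at least two properly paired reads")
    return mean(tlens), stdev(tlens)


if __name__ == "__main__":
    # Assumed numbers: 394 bp fragment, ~120 bp total adapter content.
    print(insert_from_fragment(394, 120))   # 274
    # BWA estimate of ~250 bp with 150 bp reads.
    print(mate_inner_distance(250, 150))    # -50
    print(adapter_bases_in_read(250, 150))  # 0
```

With these assumed numbers the expected insert is about 274 bp, which sits well within the ~250 +/- 60 bp that BWA reported, and the inner distance of -50 means the two 150 bp mates overlap by about 50 bp in the middle. `summarize_tlen` can be fed an open SAM file handle (for example the output of `samtools view -h`) if you want to recompute the mean and spread yourself.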