I used magic-BLAST to map SRA paired reads (Illumina 150 bp) to the the corresponding whole genome assembly (WGA). I expected that paired end reads would map near each other on a scaffold; however, as shown below the paired reads typically overlap. Here is an example from each magic-blast output file that shows the mapping of a read pair (query) from four related libraries to assembled contigs (refID) of the WGA:
queryID refID %_ident q_start q_end r_start r_end
SRR_1.sra.6388.1 S107.1 100 27 127 80397 80497
SRR_1.sra.6388.2 S107.1 100 1 101 80497 80397
SRR_2.sra.576423.1 S007.1 100 1 151 297238 297388
SRR_2.sra.576423.2 S007.1 100 58 151 297455 297362
SRR_3.sra.4219.1 S516.1 99.0654 45 151 40745 40639
SRR_3.sra.4219.2 S516.1 99.1379 1 116 40630 40745
SRR_4.sra.3159.1 S557.1 99.3333 1 150 37510 37659
SRR_4.sra.3159.2 S557.1 100 1 151 37706 37556
As seen, the SRR_1 and SRR_3 read pairs are mapping to (nearly) the same (reversed) coordinates of the corresponding contig. The mapped coordinates of the SRR_2 and SRR_4 read pairs are off-set, but still overlap (for other read pairs, the overlap is as extensive as for the SRR_1 and SRR_3 pairs).
I suppose this mapping of read pairs is possible if the library fragments used to create the paired end library were about the same size as the read lengths (150 bp), but I would expect that longer library fragments (300-600 bp) would have been made. Unfortunately, the SRA metadata doesn't include these details about library construction. The SRA metadata does note that two of the four libraries are paired-end and the other two are mate pair, which is even more perplexing. If the read pairs were from a mate pair library, I would expect them to map to the same scaffold but hundreds of base pairs from each other.
My working hypothesis is that all four libraries were made with very small fragments, but perhaps there is another likely explanation. Or perhaps my expectations about the mapping of pair end and mate pair reads are wrong. Any insights?
Thanks for taking the time to respond to my question and confirming my expectations about mapping of paired end and mate paired reads.
I tried your suggestion to use bbmerge.sh, working with a test set of 24 read pairs, and as one would expect from the magicblast data posted originally, >80% of the reads were merged. (I also double-checked one read pair with BLASTN, which confirmed the magicblast mapping.)
At this point, the evidence seems to support my initial hypothesis that all four libraries were made with very small fragments, but I'll try the "Target fragment size versus final insert size" instructions that you pointed me to when I have some more compute power available.