I find myself confused about insert size and pairing of reads. How are read pairs paired? As in, how does the aligning software know that they belong together? How does the sequencing machine? And, will the aligner know the exact distance between reads in a pair, so as to build a scaffold?
Or can only mate pairs do the latter? Are mate pairs still a thing? How does the aligner know the distance between two mates?
I realize these are quite basic questions, and apologize in advance.
Mated pairs are a type of paired-end reads where the distance between the reads and their orientation are different.
Paired-end reads came to describe the Illumina sequencing protocol where the reads point towards one another, the read lengths are about 150 bp, and the distance between the ends is a few hundred base pairs:
==150==> <==150==
|------- 400 ------|
Mated pair libraries used to mean some sort of circularization method during library preparation, where, after sequencing, the reads point in the same direction, and the distance is a few thousand base pairs.
==150==> ==150==>
|------- 2000 ------|
Note how the aligner can immediately tell what the distance and orientation of the read pairs are, and thus identify the protocol.
Mated pairs are typically used for assembly, as they allow ordering more distant pieces of DNA even when the intermediate sequences are missing.
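To make the point about distance and orientation concrete, here is a toy sketch of how both could be read off a pair's alignment coordinates. The function name `describe_pair`, its arguments, and the "inward"/"same"/"outward" labels are made up for illustration and are not taken from any aligner. It also assumes both reads have the same length and align without gaps; real aligners work from many pairs at once rather than a single one.

```python
def describe_pair(pos1, strand1, pos2, strand2, read_len):
    """Classify one aligned read pair (hypothetical helper, not an aligner API).

    pos1, pos2       -- 0-based leftmost reference coordinate of each read
    strand1, strand2 -- "+" or "-" for the strand each read aligned to
    read_len         -- read length, assumed identical for both reads
    Returns (orientation, span), where span is the outer distance in bp.
    """
    # Order the two reads by their leftmost coordinate on the reference.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    # Outer distance: from the start of the left read to the end of the right read.
    span = right_pos + read_len - left_pos

    if left_strand == right_strand:
        orientation = "same"      # ==> ==>  (the mated-pair picture above)
    elif left_strand == "+":
        orientation = "inward"    # ==> <==  (the paired-end picture above)
    else:
        orientation = "outward"   # <== ==>
    return orientation, span


# The two diagrams above, expressed as coordinates:
paired_end = describe_pair(0, "+", 250, "-", 150)    # 400 bp apart, facing each other
mated_pair = describe_pair(0, "+", 1850, "+", 150)   # 2000 bp apart, same direction
```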
Thank you Istvan for your reply!
You say "Mated pair libraries used to mean...". Are you implying this is no longer the case? Are they still being used?
I have not seen data produced with this technology for many years now, hence I am not quite sure if it is still in use and whether the terminology is still the same.
I suspect that long-read technology like PacBio has turned mated pairs into a somewhat obsolete technology.
See @IstvanAlbert's answer for the difference between paired-end and mate-pair.
For your other question:
The sequencing machine knows pairs belong together because they reside at identical locations on the flow cell. Basically, the two ends of a fragment of DNA have different primers on them. A run of the machine is done using the read1 primer first, the fluorescence at each coordinate on the flow cell is recorded at each base cycle, and then the results are stripped off. The process is then repeated using the read2 primer. As read1 and read2 are reads from the same physical piece of DNA, they will be in the same location on the flow cell.
The aligner knows that two reads belong together because of the order in the fastq file. The first read in the read1 fastq is the pair of the first read in the read2 fastq, and the 600th read in the read1 fastq is the pair of the 600th read in the read2 fastq. This is why it is important not to change the order of reads in fastq files without taking account of pairing.
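As a rough illustration of pairing by file order, the sketch below walks two FASTQ streams in lockstep and checks that the read names still agree. The helper names (`read_fastq`, `base_name`, `iter_pairs`) are my own, and the code assumes plain four-line FASTQ records whose names differ only by a trailing `/1` or `/2` or by text after the first space.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual  # drop the leading '@'


def base_name(name):
    """Strip mate-specific parts of a read name (assumed naming conventions)."""
    name = name.split()[0]               # drop anything after the first space
    if name.endswith(("/1", "/2")):      # drop an old-style /1 or /2 suffix
        name = name[:-2]
    return name


def iter_pairs(r1_handle, r2_handle):
    """Yield (read1, read2) records matched purely by their position in the files."""
    for rec1, rec2 in zip_longest(read_fastq(r1_handle), read_fastq(r2_handle)):
        if rec1 is None or rec2 is None:
            raise ValueError("the two files contain different numbers of reads")
        if base_name(rec1[0]) != base_name(rec2[0]):
            raise ValueError(f"files out of sync: {rec1[0]} vs {rec2[0]}")
        yield rec1, rec2


# Tiny in-memory example standing in for the read1 and read2 fastq files.
r1_text = "@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCC\n+\nIIII\n"
r2_text = "@frag1/2\nTTAA\n+\nIIII\n@frag2/2\nCCGG\n+\nIIII\n"
pairs = list(iter_pairs(io.StringIO(r1_text), io.StringIO(r2_text)))
```

If one of the files were sorted or filtered on its own, the name check would raise an error instead of silently pairing unrelated reads.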
Thank you for your helpful reply! I understand now, that was a great explanation. Thank you so much :-]
Do you happen to also know how the analogous process works for mate pairs?
I believe that if you reverse complement the second read (before aligning) the mated-pairs will be in the same orientation as a "regular" paired-end would.
Thus workflows that need that orientation would work with it.
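A minimal sketch of that reverse-complement step, in case it helps. The function names are mine, and it only handles A, C, G, T and N. Note that the quality string has to be reversed along with the sequence so each score stays attached to its base.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def flip_read(seq, qual):
    """Reverse-complement a read and reverse its qualities to match."""
    return reverse_complement(seq), qual[::-1]


flipped_seq, flipped_qual = flip_read("AACG", "IHGF")
```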
I'm afraid I don't. I've not handled mate-pair reads before, and my impression is that they have mostly been replaced by long read sequencing, but I might be wrong.
Difference Between "Mate Pair" And "Pair-End"
http://seqanswers.com/forums/showthread.php?t=15626