I am designing some bioinformatics software, but have little working experience with FASTQ data.
The data I wish to compute over is paired-end data, which I understand consists of "mate sequences" (left and right mates) that correspond to sequencing the same region of the genome in the reverse and forward orientations.
My question is about how this data is returned to the user after sequencing. Is the researcher given separate files, each containing only forward or only reverse orientation sequences? Or is the data mixed together?
This basically comes down to how I process data in the software. If separate orientations are given in separate files, then I can allow the user to specify the orientation at the command line; otherwise, I will have to read every sequence ID to determine the orientation.
Kind regards, Izaak
Before you start reading too much into "reverse" and "forward", note that the pairs are just sequencing different ends of the same original fragment. Which of the two will end up being "forward" after alignment is essentially random and can't be determined from read IDs.
Yeah, I've read that there is no real concept of which is forward or reverse; it was just easier to express ;) Also, out of interest, I've seen that often one mate of the pair is sequenced, then the other. However, are they also sometimes sequenced in parallel, using multiplex-capable primers? Or is multiplexing mainly used to differentiate samples?
In Illumina technology only one read happens at a time. The order is generally [Read 1 --> Index 1 (if present) --> Index 2 (if present) --> Read 2]. Multiplexing is only used to differentiate samples.
Keep in mind that the sequence is always reported in 5'-->3' orientation, whether it comes from the forward or the reverse read. For Illumina data, a convention indicates whether a record comes from the forward or the reverse read (it may be more appropriate to think of them as the first and second read). That information is encoded in the FASTQ header.
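To illustrate the header convention, here is a minimal Python sketch that pulls the read number out of a FASTQ header line. It assumes the two Illumina header styles I am aware of: the newer one, where a space-separated field begins with `1` or `2` (e.g. `... 2:N:0:ATCCGA`), and the older one, where the ID ends in `/1` or `/2`. The function name `read_number` and the example headers are made up for illustration, and headers produced by other tools or platforms may not match either pattern.

```python
import re

# Newer Illumina style (assumed): "@<id> <read>:<filtered Y/N>:<control>:<index>"
_NEW_STYLE = re.compile(r"^@\S+\s+([12]):[YN]:\d+")
# Older Illumina style (assumed): "@<id>/1" or "@<id>/2"
_OLD_STYLE = re.compile(r"^@\S+/([12])$")


def read_number(header):
    """Return 1 or 2 (first or second read), or None if the header is not recognised."""
    header = header.strip()
    for pattern in (_NEW_STYLE, _OLD_STYLE):
        match = pattern.match(header)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    # Hypothetical example headers, one per style.
    for h in ("@SIM:1:FCX:1:15:6329:1045 2:N:0:ATCCGA",
              "@HWUSI-EAS100R:6:73:941:1973#0/1"):
        print(h, "->", read_number(h))
```

Note that this only tells you first versus second read; as mentioned in the comments, which of the two ends up on the forward strand is only known after alignment.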