I know there are a few posts here that raise specific questions about RNA-seq library prep protocols, but I was curious whether there's a comprehensive catalog of exactly what protocols exist and exactly what type of data they produce (i.e. to what do the reads in the final FASTA/FASTQ files correspond). I'm somewhat confused by all of the different aspects of a protocol and how they compose: for example, whether a protocol is stranded or not, the relative orientation of the reads, and which strand each read (i.e. /1 and /2) comes from. Are the reads that end up in a FASTA/Q file always reported 5' to 3' with respect to the strand from which they are derived? Are mates pointing toward each other reported with respect to the same strand, opposite strands, or both depending on the protocol? What do people mean when they say reads are 'reversed': does this mean reverse complemented with respect to the mate, or that the read is actually reported in the 3' to 5' direction (i.e. reversed but not complemented)?
Basically, I'm curious how all of these different variables interact with each other to produce the read sequences that will be used for downstream analysis. If I want to be able to communicate (to another person or, perhaps as importantly, to a piece of software) all of the details and constraints about which reads should map to which strands in which orientations, what is the most parsimonious way to do so? What is the minimum amount of information I need to convey? Is there a standard language or specification for representing this information? I'm sorry to ask such a broad question, but I'm a bit overwhelmed and trying to gain a comprehensive understanding of what, exactly, the reads in a file represent in light of the protocol under which they were prepared.
It's often difficult to tell what you actually mean in your diagrams. A read will never run 3'->5'. Never. It might look like that if you reverse complement it and look at the resulting alignment in + strand coordinates, but you'd do yourself a favor by thinking of the biology and not what an alignment visually looks like.
You have one example of 5'=====------->====<-------====3', which is a chimeric read. Sure, that's possible. You could also see that in a mate-pair library or other less common library types.
A strand is an orientation. The two are one and the same since direction of sequencing is always the same given how polymerases work (at least with everything in use these days, the energetics of going in the other direction are terrible).
I apologize for any confusion. Given that I chose to use a graphical depiction of different scenarios, I should have provided a legend for what my graphics actually mean. In all of the graphics I drew, the coding strand sits atop the template strand. The '=' character is used to represent unsequenced nucleotides and each arrow '---->' or '<----' represents a read. The assumption is that the "head" ('>') of the arrow always appears at the end of the corresponding read in the FASTA/Q file and the "tail" appears at the beginning of the corresponding read in the FASTA/Q file. So the reads are reported "tail-to-head" in the output.
You mention that "A strand is an orientation. The two are one and the same . . . ". However, I was asking about the relative orientation of the reads, and how that interacts with stranded protocols. My understanding is that it's perfectly possible to have an unstranded protocol (where, e.g., read 1 could come from either strand) where you nonetheless know that the reads are oriented toward each other (or inward). Here you don't know the strand a priori for either of the two reads, but you know that if one read comes from the coding strand, the other must come from the template strand and vice-versa (assuming, as you suggest, that a read will never run 3' -> 5').
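The situation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the fragment sequence, the read length, and the function names `revcomp` and `paired_reads` are made up for the example, and each read is simply taken from the 5' end of its own strand and written 5'->3'.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq: str) -> str:
    """Return the reverse complement, i.e. the opposite strand written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_reads(fragment_top: str, read_len: int, read1_from_top: bool):
    """Return (read1, read2) for an inward-facing pair.

    fragment_top is the fragment's top (coding) strand written 5'->3'.
    Each read starts at the 5' end of its own strand and is written 5'->3'.
    """
    top_read = fragment_top[:read_len]              # 5' end of the top strand
    bottom_read = revcomp(fragment_top)[:read_len]  # 5' end of the bottom strand
    return (top_read, bottom_read) if read1_from_top else (bottom_read, top_read)

fragment = "ATGGCCAAGTTTCCGGAT"  # hypothetical fragment, top strand 5'->3'

# Unstranded: read 1 may come from either strand, but the two mates
# always come from opposite strands and point toward each other.
for read1_from_top in (True, False):
    r1, r2 = paired_reads(fragment, 6, read1_from_top)
    print(read1_from_top, r1, r2)
```

In both cases each read is written 5'->3' on its own strand; the mate from the bottom strand only looks "backwards" once it is reverse complemented into top-strand coordinates.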
Thanks, Phil. That's a ton of different protocols! Unfortunately, I'm still a bit confused about what is possible in terms of what appears in the FASTQ file. For example, this post mentions reads oriented toward each other but coming from the same strand, implying that one of them is written 3' -> 5' in the FASTQ file, which I previously thought was not possible. The way I see it, there are 3 variables: (1) relative orientation of the reads, (2) strandedness of the protocol and (3) whether the reads come from the same strand or opposite strands. The question is, are all combinations of these variables possible?
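For example, I know I can get a pair in which read 1 comes from the coding strand and read 2 from the template strand, and even the mirror image, with read 1 from the template strand and read 2 from the coding strand. Both of these have reads with a relative orientation of "toward". If read 1 could originate on either strand, then this is an unstranded protocol and I could see both of these situations in the same read library. If the protocol is stranded, then read 1 will always originate from either the coding or the template strand, and only one or the other of these (correspondingly) will ever appear in my output (ideally). Likewise, it seems that there are protocols that generate reads facing in the same orientation and coming from the same strand, with both reads on either the coding strand or the template strand.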
Likewise, the protocol could be unstranded, in which case the read library would contain both these cases, or it could be a stranded library, in which case only one or the other of these would occur in the FASTQ file.
Now, what's unclear to me is whether arrangements are possible in which a read is written 3'->5' relative to the strand it came from.
In these cases, I'm not so much concerned about whether they are stranded or unstranded (as above, that seems to be independent of the relative orientation). What matters is that one or more of the reads is not reported in 5'->3' orientation relative to the strand from which it was derived (regardless of which strand that is). Are libraries like this possible? Do any current protocols produce such libraries? If not, then one may have to specify which strand a read comes from, but never the orientation of the read relative to that strand. If so, however, then it seems one would have to communicate the relative orientation of the reads, whether or not the protocol is stranded, and, because of possible 3'->5' reads, the orientation of each read relative to its strand of origin.
Presumably, if this last scenario is possible, there might even be cases where the orientation relative to the strand of origin is unknown. In such a case, a read could map in up to 4 different ways: (1) forward wrt + strand, (2) reverse wrt + strand, (3) forward wrt - strand = rev-comp wrt + strand, and (4) reverse wrt - strand. Otherwise, only (1) and (3) are possible. Anyway, thanks for the pointers, and I hope that this painfully long description makes my question a bit clearer.
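To make the four mapping possibilities concrete, here is a rough Python sketch. It uses exact substring matching against a short, made-up + strand sequence as a stand-in for real alignment, and the names `candidate_forms` and `classify` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def candidate_forms(read: str) -> dict:
    """For each hypothesis, the sequence that would have to appear on the + strand."""
    return {
        "forward wrt +": read,
        "reverse wrt +": read[::-1],
        "forward wrt - (rev-comp wrt +)": read.translate(COMPLEMENT)[::-1],
        "reverse wrt - (complement wrt +)": read.translate(COMPLEMENT),
    }

def classify(read: str, plus_strand: str) -> list:
    """Return every hypothesis under which the read matches the + strand exactly."""
    return [label for label, form in candidate_forms(read).items()
            if form in plus_strand]

plus_strand = "GATTACAGGCTCAATC"  # hypothetical reference, + strand written 5'->3'

for read in ("TTACAG", "TGAGCC", "TCGGAC"):
    print(read, classify(read, plus_strand))
```

Only the first and third hypotheses describe a read written 5'->3' on its strand of origin; a hit under the second or fourth would mean the sequence was written 3'->5'.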
The relative orientation of a pair of reads will be the same unless you're using mate-pairs or the like. This is true regardless of whether you have a stranded/directional library or not.
Edit: I should add that when this isn't the case with a standard paired-end library then either (A) an error occurred when the library was made (PCR or otherwise), or (B) the sample doesn't match the reference there (i.e., you have a variant), or (C) it's a case of incorrect mapping, or (D) something really strange is going on (perhaps the nature of the experiment would make this likely).
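As a rough illustration of how one might flag pairs that deviate from the expected relative orientation, the sketch below classifies a pair from alignment coordinates. The `AlignedMate` record and `pair_orientation` function are hypothetical; the `reverse` field plays the role of the reverse-strand bit (0x10) in a SAM record's FLAG, and the inward/outward rule is a simplification.

```python
from dataclasses import dataclass

@dataclass
class AlignedMate:
    start: int     # leftmost aligned position in + strand coordinates (0-based)
    end: int       # one past the rightmost aligned position
    reverse: bool  # True if the read aligned as its reverse complement

def pair_orientation(a: AlignedMate, b: AlignedMate) -> str:
    """Classify a pair as 'inward', 'outward', or 'same' (both mates on one strand)."""
    if a.reverse == b.reverse:
        return "same"
    fwd, rev = (a, b) if not a.reverse else (b, a)
    # The 5' end of the reverse-strand mate sits at rev.end in + coordinates.
    # If the forward mate starts before that point, the reads face each other.
    return "inward" if fwd.start < rev.end else "outward"

print(pair_orientation(AlignedMate(100, 150, False), AlignedMate(300, 350, True)))
print(pair_orientation(AlignedMate(300, 350, False), AlignedMate(100, 150, True)))
print(pair_orientation(AlignedMate(100, 150, False), AlignedMate(300, 350, False)))
```

For a standard paired-end library, anything other than "inward" would fall under cases (A) through (D) above.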
Excellent! So just to reiterate. With a standard paired-end protocol, one should never expect a read to be reported in the 3'->5' direction relative to the strand from which it was generated. This means that if I see a read aligned to the reverse (not reverse-complement) of the template or coding strand, it's either an incorrect mapping or something amiss/wrong with the prep? Thanks again for the clear answer and explanation!
Correct. 5'->3' alignment to the reverse complement would be normal, but alignment to the reverse (i.e., 3'->5') suggests something odd or interesting happened.
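If reads are always written 5'->3', then whether the mates share a strand follows from their relative orientation, and a library can be described with three pieces of information. The sketch below is one hypothetical way to encode that; the class and field names are invented for illustration and are not an existing standard.

```python
from dataclasses import dataclass
from typing import Optional

OPPOSITE = {"sense": "antisense", "antisense": "sense"}

@dataclass(frozen=True)
class LibraryType:
    orientation: str                    # "inward", "outward", or "same"
    stranded: bool
    read1_strand: Optional[str] = None  # "sense" or "antisense"; required if stranded

    def __post_init__(self):
        if self.orientation not in ("inward", "outward", "same"):
            raise ValueError("unknown orientation: " + self.orientation)
        if self.stranded and self.read1_strand not in OPPOSITE:
            raise ValueError("a stranded library needs read1_strand")

    def allowed_strand_pairs(self) -> set:
        """Return the (read 1 strand, read 2 strand) pairs this library can yield."""
        starts = [self.read1_strand] if self.stranded else ["sense", "antisense"]
        pairs = set()
        for s1 in starts:
            # Reads facing the same way share a strand; otherwise they are on opposite strands.
            s2 = s1 if self.orientation == "same" else OPPOSITE[s1]
            pairs.add((s1, s2))
        return pairs

print(LibraryType("inward", stranded=False).allowed_strand_pairs())
print(LibraryType("inward", stranded=True, read1_strand="antisense").allowed_strand_pairs())
```

An unstranded inward library allows both strand assignments, while a stranded one pins read 1 to a single strand and so allows only one.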
When it comes to using Illumina machines you may want to check out this review from Illumina which has all possible library preps etc. (also with papers where those are used) and shows what they can be used for. If you want to have the same (with much less information though) you can check out this poster, also published by Illumina themselves. Be aware that reading the whole review might take some time ;)
These are standard stranded and unstranded methods. Which read is from the 5' end of the strand depends on the method (how many times it's copied in the protocol).
So if you read the papers you will see that different methods are better at processing e.g. low input libraries like you get from clinical samples. Others are good at not introducing bias (e.g. being more likely to sequence high GC or long reads). Some are cheaper. Some get rid of rRNA better.
In addition, there is DGE, which is a method that only sequences reads at the 3' end of the transcript, and cap sequencing (I don't know the right term), which sequences the 5' end. In these you usually throw away a read.
There is also RNA TagSeq, which we developed here at the Broad Tech Labs (and you can buy from us), which adds a barcode to the reads early in sequencing so you can pool 36 samples together for most of the library prep and do a massive experiment really quickly. You have to trim off a barcode on that one (and I think the SMART protocol as well, but check that). Sometimes you have to trim off barcodes, and since they are always changing the method, I don't usually know where the barcode is until they tell me.
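As a minimal sketch of the trimming step, assuming the barcode sits at the very start of the read and has a known, fixed length (both of which depend on the protocol version, as noted above), something like the following would split it off. The function name `trim_barcode` and the 8-base length are arbitrary choices for the example.

```python
def trim_barcode(sequence: str, quality: str, barcode_len: int):
    """Split a fixed-length 5' barcode off a read, keeping qualities in sync.

    Returns (barcode, trimmed_sequence, trimmed_quality).
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    if barcode_len > len(sequence):
        raise ValueError("barcode is longer than the read")
    barcode = sequence[:barcode_len]
    return barcode, sequence[barcode_len:], quality[barcode_len:]

# Hypothetical read with an 8-base barcode at its 5' end.
barcode, seq, qual = trim_barcode("ACGTACGTTTGGCCAA", "IIIIIIIIFFFFFFFF", 8)
print(barcode, seq, qual)
```

The returned barcode can then be used to assign the read to its sample before mapping.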
Moved comment to an "answer" b/c of the size.
Don't do that. Make it a reply on the appropriate answer.
Sorry; thanks for fixing this for me!