RNA-Seq library types
3
1
10.4 years ago
compucurious ▴ 10

Hi All,

I know there are a few posts here that raise specific questions about RNA-seq library prep protocols, but I was curious whether there's a comprehensive catalog of exactly what protocols exist and exactly what type of data they produce (i.e. to what do the reads in the final FASTA/FASTQ files correspond). Basically, I'm somewhat confused by all of the different aspects of a protocol and how they compose: for example, whether a protocol is stranded or not, the relative orientation of the reads, and which strand each read (i.e. /1 and /2) comes from. Are the reads that end up in a FASTA/Q file always reported 5' to 3' with respect to the strand from which they are derived? Are mates pointing toward each other reported with respect to the same strand, opposite strands, or both depending on the protocol? What do people mean when they say reads are 'reversed': does this mean reverse complemented with respect to the mate, or that the read is actually reported in the 3' to 5' direction (i.e. reversed but not complemented)?

Basically, I'm curious how all of these different variables interact with each other to produce the read sequences that will be used for downstream analysis. If I want to be able to communicate (to another person, or, perhaps as importantly, to a piece of software) all of the details / constraints about which reads should map to which strands in which orientations, what is the most parsimonious way to do so? What is the minimum amount of information I need to convey? Is there a standard language / specification for representing this information? I'm sorry to ask such a broad question, but I'm a bit overwhelmed and trying to gain a comprehensive understanding of what, exactly, the reads in a file represent in light of the protocol under which they were prepared.

Thanks!

sequence alignment RNA-Seq • 3.6k views
ADD COMMENT
0

Moved comment to an "answer" b/c of the size.

ADD REPLY
1

It's often difficult to tell what you actually mean in your diagrams. A read will never run 3'->5'. Never. It might look like that if you reverse complement it and look at the resulting alignment in + strand coordinates, but you'd do yourself a favor by thinking of the biology and not what an alignment visually looks like.

You have one example of 5'=====------->====<-------====3', which is a chimeric read. Sure, that's possible. You could also see that in a mate-pair library or other less common library types.

A strand is an orientation. The two are one and the same since direction of sequencing is always the same given how polymerases work (at least with everything in use these days, the energetics of going in the other direction are terrible).

ADD REPLY
0

I apologize for any confusion. Given that I chose to use a graphical depiction of different scenarios, I should have provided a legend for what my graphics actually mean. In all of the graphics I drew, the coding strand sits atop the template strand. The '=' character is used to represent unsequenced nucleotides and each arrow '---->' or '<----' represents a read. The assumption is that the "head" ('>') of the arrow always appears at the end of the corresponding read in the FASTA/Q file and the "tail" appears at the beginning of the corresponding read in the FASTA/Q file. So the reads are reported "tail-to-head" in the output.

You mention that "A strand is an orientation. The two are one and the same . . . ". However, I was asking about the relative orientation of the reads, and how that interacts with stranded protocols. My understanding is that it's perfectly possible to have an unstranded protocol (where, e.g., read 1 could come from either strand) where you nonetheless know that the reads are oriented toward each other (or inward). Here you don't know the strand a priori for either of the two reads, but you know that if one read comes from the coding strand, the other must come from the template strand and vice-versa (assuming, as you suggest, that a read will never run 3' -> 5').
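
To make the distinction concrete, here is a minimal Python sketch of how relative orientation and strandedness could be recorded as two separate pieces of information. The class name `LibraryLayout`, its fields, and the labels "inward", "outward", and "matching" are all hypothetical, chosen for illustration; this is not a standard specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryLayout:
    """Hypothetical description of a paired-end library (illustrative only)."""
    orientation: str                    # "inward", "outward", or "matching"
    stranded: bool                      # True if read 1's strand is fixed by the protocol
    read1_strand: Optional[str] = None  # "+" or "-", required only when stranded

    def __post_init__(self):
        if self.orientation not in ("inward", "outward", "matching"):
            raise ValueError("unknown orientation: " + self.orientation)
        if self.stranded and self.read1_strand not in ("+", "-"):
            raise ValueError("a stranded layout needs read1_strand '+' or '-'")


def allowed_strand_pairs(layout):
    """Return the set of (read1 strand, read2 strand) combinations the layout permits."""
    read1_options = [layout.read1_strand] if layout.stranded else ["+", "-"]
    pairs = set()
    for s1 in read1_options:
        if layout.orientation == "matching":
            s2 = s1                           # both mates come from the same strand
        else:
            s2 = "-" if s1 == "+" else "+"    # inward/outward mates come from opposite strands
        pairs.add((s1, s2))
    return pairs


# Unstranded, inward: read 1 may come from either strand, but the mates always differ.
print(allowed_strand_pairs(LibraryLayout("inward", stranded=False)))
# Stranded, inward, read 1 from "+": only one combination remains.
print(allowed_strand_pairs(LibraryLayout("inward", stranded=True, read1_strand="+")))
```

Note that the strand pairs alone cannot tell "inward" from "outward"; that also depends on where each mate aligns, which is why the orientation is kept as its own field.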

ADD REPLY
0

Don't do that. Make it a reply on the appropriate answer.

ADD REPLY
0

Sorry; thanks for fixing this for me!

ADD REPLY
0

Thanks, Phil. That's a ton of different protocols! Unfortunately, I'm still a bit confused about what is possible in terms of what appears in the FASTQ file. For example, this post mentions reads oriented toward each other but coming from the same strand, implying that one of them is written 3' -> 5' in the FASTQ file, which I previously thought was not possible. The way I see it, there are 3 variables: (1) relative orientation of the reads, (2) strandedness of the protocol, and (3) whether the reads come from the same strand or opposite strands. The question is, are all combinations of these variables possible?

For example, I know I can get the following:

5'====------->============3' (read 1)
3'==========<-----------====5' (read 2)

and even

5'=============<--------==3' (read2)
3'======------->=========5' (read 1)

Both of these have reads with a relative orientation of "toward". If read 1 could originate on either strand, then this is an unstranded protocol and I could see both of these situations in the same read library. If the protocol is stranded, then read 1 will always either originate from the coding or template strand and only one or the other of these (correspondingly) will ever appear in my output (ideally). Likewise, it seems that there are protocols that generate reads facing in the same orientation coming from the same strand like:

5'=====------->====------->====3'
3'=======================5'

or

5'=======================3'
3'=====<-------=====<-------===5'

Likewise, the protocol could be unstranded, in which case the read library would contain both these cases, or it could be a stranded library, in which case only one or the other of these would occur in the FASTQ file.

Now, what's unclear to me is whether things like the following are possible:

5'=====------->====<-------====3'
3'=======================5'

or

5'=====<----------============3'
3'===============-------->===5'

or

5'=====<-------====<------======3'
3'=========================5'

or

3'=========================5'
5'======------>=====------->====3'

In these cases, I'm not so much concerned about if they are stranded or unstranded (as above, that seems to be independent of the relative orientation). However, what happens in situations like this is that one or more of the reads is not reported in 5'->3' orientation relative to the strand from which it was derived (regardless of which strand that is). Are libraries like this possible? Do any current protocols produce such libraries? If not, then one may have to specify which strand a read comes from, but never the orientation of the read relative to that strand. If so, however, then it seems one would have to communicate the relative orientation of the read, whether or not the protocol is stranded, and b/c of possible 3'->5' reads what the orientation of the reads are relative to their strand of origin. Presumably, if this last scenario is possible, there might even be cases where the orientation relative to the strand of origin is unknown. In such a case, a read could map in up to 4 different ways; (1) forward wrt + strand, (2) reverse wrt + strand, (3) fwd wrt - strand = rev-comp wrt + strand, and (4) rev wrt - strand. Otherwise, only (1) and (3) are possible. Anyway, thanks for the pointers and I hope that this painfully long description makes my question a bit more clear.

ADD REPLY
2
10.4 years ago

The relative orientation of a pair of reads will be the same unless you're using mate-pairs or the like. This is true regardless of whether you have a stranded/directional library or not.

Edit: I should add that when this isn't the case with a standard paired-end library then either (A) an error occurred when the library was made (PCR or otherwise), or (B) the sample doesn't match the reference there (i.e., you have a variant), or (C) it's a case of incorrect mapping, or (D) something really strange is going on (perhaps the nature of the experiment would make this likely).
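
One way to check this on real data is to tally the relative orientation of aligned pairs and look at the ones that disagree with the majority. The sketch below is only an illustration under stated assumptions: each mate is given as a leftmost reference position plus a strand ("+" if the read aligns as written, "-" if its reverse complement aligns), both mates are on the same reference sequence, and the function names are hypothetical.

```python
from collections import Counter


def pair_orientation(pos1, strand1, pos2, strand2):
    """Classify a mapped pair as 'inward', 'outward', or 'matching'.

    Assumes pos1/pos2 are leftmost alignment positions on the same reference
    and strand1/strand2 are '+' or '-'.
    """
    if strand1 == strand2:
        return "matching"
    # Opposite strands: work out which mate is the forward-strand one.
    fwd_pos, rev_pos = (pos1, pos2) if strand1 == "+" else (pos2, pos1)
    # Forward mate upstream of the reverse mate means they point at each other.
    return "inward" if fwd_pos <= rev_pos else "outward"


def summarize(pairs):
    """Count orientations over an iterable of (pos1, strand1, pos2, strand2)."""
    return Counter(pair_orientation(*p) for p in pairs)


example_pairs = [
    (100, "+", 300, "-"),   # typical paired-end fragment
    (500, "-", 350, "+"),   # same geometry, read 1 on the other strand
    (10, "+", 40, "+"),     # both mates on the same strand: worth a closer look
]
print(summarize(example_pairs))
```

In a standard paired-end library nearly every pair would be expected to fall in one class; the few that do not are candidates for the explanations (A) to (D) above.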

ADD COMMENT
0

Excellent! So just to reiterate. With a standard paired-end protocol, one should never expect a read to be reported in the 3'->5' direction relative to the strand from which it was generated. This means that if I see a read aligned to the reverse (not reverse-complement) of the template or coding strand, it's either an incorrect mapping or something amiss/wrong with the prep? Thanks again for the clear answer and explanation!

ADD REPLY
1

Correct. 5'->3' alignment to the reverse complement would be normal, but alignment to the reverse (i.e., 3'->5') suggests something odd or interesting happened.
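
The difference between "reverse complement" and plain "reverse" is easy to see on a toy sequence. The following Python sketch uses exact substring matching against a made-up reference, which is an assumption for illustration only (real aligners tolerate mismatches and use indexes); the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement, i.e. the opposite strand read 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def how_read_matches(read, reference):
    """List the ways `read` occurs exactly in `reference` (given as the + strand, 5'->3')."""
    modes = []
    if read in reference:
        modes.append("forward")             # 5'->3' on the + strand
    if reverse_complement(read) in reference:
        modes.append("reverse-complement")  # 5'->3' on the - strand (normal)
    if read[::-1] in reference:
        modes.append("reverse only")        # 3'->5' on the + strand (odd)
    if read.translate(_COMPLEMENT) in reference:
        modes.append("complement only")     # 3'->5' on the - strand (odd)
    return modes


reference = "ATGGCGTACGTTAGC"                   # toy + strand, invented for this example
print(how_read_matches("GCGTAC", reference))    # read taken from the + strand
print(how_read_matches("GCTAAC", reference))    # read taken from the - strand
print(how_read_matches("CATGCG", reference))    # + strand written backwards
```

Only the first two modes correspond to a read reported 5'->3' relative to its strand of origin; a hit in either of the last two is the "something odd or interesting" case.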

ADD REPLY
0

Thanks so much! Is there any way that you could move your comments (relevant) to an answer so that I could accept them as such?

ADD REPLY
0

Sure, I've moved most of this thread to an answer.

ADD REPLY
2
10.4 years ago
Phil S. ▴ 700

Hi,

When it comes to using Illumina machines you may want to check out this review from Illumina which has all possible library preps etc. (also with papers where those are used) and shows what they can be used for. If you want to have the same (with much less information though) you can check out this poster, also published by Illumina themselves. Be aware that reading the whole review might take some time ;)

Hope that helps a bit!

Phil

ADD COMMENT
1
10.4 years ago
Michele Busby ★ 2.2k

For a more comprehensive answer to this question I would look into Joshua Levin's work as he has done several RNA Seq methods comparison papers.

Strand specific methods: http://www.nature.com/nmeth/journal/v7/n9/abs/nmeth.1491.html

Low input and poor quality RNA (more recent): http://www.nature.com/nmeth/journal/v10/n7/nmeth.2483/metrics/blogs

These are standard stranded and unstranded methods. Which read is from the 5' end of the strand depends on the method (how many times it's copied in the protocol).

So if you read the papers you will see that different methods are better at processing e.g. low input libraries like you get from clinical samples. Others are good at not introducing bias (e.g. being more likely to sequence high GC or long reads). Some are cheaper. Some get rid of rRNA better.

In addition, there is DGE, which is a method that only sequences reads at the 3' end of the transcript, and cap sequencing (I don't know the right term), which sequences the 5' end. In these you usually throw away a read.

There is also RNA TagSeq which we developed here at the Broad Tech Labs (and you can buy from us) which adds a barcode to the reads early in sequencing so you can pool 36 samples together for most of the library prep and do a massive experiment really quickly. You have to trim off a barcode on that one (and I think the SMART protocol as well but check that). Sometimes you have to trim off barcodes and since they are always changing the method I don't usually know where it is until they tell me.
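
As a rough illustration of what "trimming off a barcode" means at the read level, here is a small Python sketch. The barcode length and its position (start or end of the read) are placeholders that depend entirely on the protocol version, so treat every value below as an assumption; the function name is hypothetical.

```python
def split_barcode(seq, qual, barcode_len, from_start=True):
    """Split a read into (barcode, trimmed sequence, trimmed qualities).

    barcode_len and from_start are protocol-specific assumptions: ask whoever
    prepared the library where the barcode sits and how long it is.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if not 0 <= barcode_len <= len(seq):
        raise ValueError("barcode_len is outside the read")
    if from_start:
        return seq[:barcode_len], seq[barcode_len:], qual[barcode_len:]
    cut = len(seq) - barcode_len
    return seq[cut:], seq[:cut], qual[:cut]


# Made-up 10 bp read with an assumed 4 bp barcode at its start.
barcode, seq, qual = split_barcode("ACGTTTGGCC", "IIIIHHHHGG", barcode_len=4)
print(barcode, seq, qual)
```

Keeping the barcode as a separate return value makes it possible to assign the read to its sample before discarding those bases.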

ADD COMMENT
0

Thanks, Michele! I'll definitely take a look at these.

ADD REPLY
