Last update of the drawing 2019/08 (Cf. GUESSmyLT):
Original post:
I tried to review the meanings of the different RNA-seq library types because it's often confusing and hard to understand/know what they are despite their importance in downstream analyses.
I would be glad to get comments, corrections, criticism and complementary information, to make this resource as exhaustive as possible. Given the numerous questions about RNA-seq library types I found on the internet, I'm sure this small contribution could be useful not only for me but for a broad audience.
Here are nice resources amongst others I found by googling:
- https://bioinformatics.uconn.edu/reference-based-rna-seq-data-analysis/#
- https://github.com/igordot/genomics/blob/master/notes/rna-seq-strand.md
- https://chipster.csc.fi/manual/library-type-summary.html
- https://galaxyproject.org/tutorials/rb_rnaseq/
- https://dbrg77.wordpress.com/2015/03/20/library-type-option-in-the-tuxedo-suite/
- http://onetipperday.sterding.com/2012/07/how-to-tell-which-library-type-to-use.html
- https://sailfish.readthedocs.io/en/master/library_type.html
- https://rnaseq.uoregon.edu
- http://seqanswers.com/forums/showthread.php?t=6317
I summarised the result in these figures:
And here is the little information I found about which technologies produce them:
- --fr orientation is produced using the Illumina paired-end protocol.
- --fr-firststrand: dUTP, NSR, NNSR
- --fr-secondstrand: Ligation, Standard SOLiD
- --rf orientation is produced using the Illumina mate-pair protocol.
- --ff orientation is produced using the SOLiD mate-pair protocol. This is also the case for Roche 454 paired-end libraries (these are called paired-end, but are based on the same principles as mate-pair libraries).
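To make the practical difference between the two stranded types concrete, here is a minimal Python sketch of how the transcript strand could be inferred from one aligned read. The function name `transcript_strand` and its arguments are hypothetical (they belong to no tool); the sketch assumes a paired-end library in which read 1 matches the transcript strand for fr-secondstrand and the opposite strand for fr-firststrand, with read 2 behaving the other way round.

```python
def transcript_strand(library_type, read_number, alignment_strand):
    """Infer the strand of the originating transcript from one aligned read.

    library_type: "fr-firststrand", "fr-secondstrand" (anything else is
                  treated as unstranded here)
    read_number: 1 or 2 (which mate of the pair)
    alignment_strand: "+" or "-" (strand of the genome the read aligned to)
    """
    if library_type not in ("fr-firststrand", "fr-secondstrand"):
        return None  # unstranded: the read carries no strand information
    if read_number not in (1, 2) or alignment_strand not in ("+", "-"):
        raise ValueError("read_number must be 1 or 2, strand '+' or '-'")

    flip = {"+": "-", "-": "+"}
    # fr-secondstrand: read 1 has the same strand as the transcript.
    # fr-firststrand:  read 2 has the same strand as the transcript.
    same_as_transcript = (read_number == 1) == (library_type == "fr-secondstrand")
    return alignment_strand if same_as_transcript else flip[alignment_strand]


# dUTP library: read 1 aligned to "+" implies a transcript on "-"
print(transcript_strand("fr-firststrand", 1, "+"))
```

This is also roughly what one sees in a genome browser: with a dUTP library, the read 1 alignments of a gene on the + strand appear on the - strand.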
Extra information not necessarily obvious for everybody:
f means forward and r means reverse. Consequently --fr means forward-reverse and --rf means reverse-forward.
Trinity doesn't use the same frame of reference (it uses the DNA), so RF corresponds to fr-firststrand and FR to fr-secondstrand.
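The naming conventions above can be written down as a small lookup, sketched here in Python. The names `ORIENTATION`, `TRINITY_TO_TUXEDO` and `describe` are made up for illustration; only the two Trinity correspondences stated above are included.

```python
# f = forward, r = reverse
ORIENTATION = {"f": "forward", "r": "reverse"}

# Trinity names (DNA as reference) -> Tuxedo-style names
TRINITY_TO_TUXEDO = {
    "RF": "fr-firststrand",
    "FR": "fr-secondstrand",
}


def describe(library_type):
    """Spell out the read orientations encoded in a library type name.

    Accepts names such as "--fr", "--rf", "--ff" or "fr-firststrand";
    only the leading orientation letters are decoded.
    """
    letters = library_type.lstrip("-").split("-")[0]
    return [ORIENTATION[letter] for letter in letters]


print(describe("--fr"))          # orientation of read 1, then read 2
print(TRINITY_TO_TUXEDO["RF"])   # Trinity RF in Tuxedo-style naming
```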
@igor has a nice summary here: https://github.com/igordot/genomics/blob/master/notes/rna-seq-strand.md
That too is a good start; it just does not explain what one sees in, say, IGV.
Wondering why these schemes (Igor's) end with a PCR reaction. If you do a PCR before loading the sequences into a massively parallel sequencer, you lose the strand information, so I believe these are rather confusing.
In addition, I believe Igor's schemes are not right. The dUTP scheme ends with a fragment that is orange on the left and blue on the right, when the method actually reproduces the same original sequence.
I am happy that you took on this, but in my opinion, the explanation is still a bit too complicated. Too many things going on at once.
I would explain this from the point of view of the transcript alone. Your first image is the DNA, that complicates the concept in my opinion. The second problem is that you are also discussing it as a paired-end protocol, this also adds to the complexity.
The DNA complicates the concept, but at the same time it avoids having to add extra explanation about how the resulting reads map when we align them, because this information (maybe not obvious) is already present.
This is a nice summary. Here are a few suggestions:
The AAAAAA is relevant because of the fragmentation, right? If there was no fragmentation you'd never hit that region (based on short reads). But you bring up a great point: the explanations above appear to suggest that all reads will map to the start of the transcript. Because of fragmentation, this is not how it works.
It is hard to explain this properly actually - and I am happy to see this effort to clarify the terminology.
Yes, what I actually meant is that you will not (usually) reverse-transcribe the full mRNA from the 5' end to the 3' end. So sometimes you will have the poly-A in your fragment, but not always. This is much clearer in the revision now.