I've been provided with more than a billion reads of RNAseq data for a poorly annotated nematode species. They appear to be 2x100 paired-end Illumina reads – I currently know frustratingly little about the RNAseq protocol used, but need to perform assemblies using Trinity.
Trinity demands that I specify whether or not the reads are strand-specific, and also which strand is which, through the --SS_lib_type parameter, which needs to be either FR or RF.
For each tissue sample, I have been given paired fwd and rev FASTQ files. How can I tell i) whether the data is indeed strand-specific, and ii) which strand is which, so that I know whether to use FR or RF with Trinity?
Any thoughts much appreciated. Here are the top four lines from two corresponding FASTQ files I've been given:
head -n 4 Tmuris_adult_R4*
==> Tmuris_adult_R4_fwd.fastq <==
@HS23_6814:1:1101:1592:2250#4/1
GCGGTATCAGTTGGTAAACCCTGCAGGCGCTCGCATAACGGTCGAAGGCTTTTTGCGGATCGTCGTCATTGTCGTTGACCTCAGCATCGCNCACCTCCTC
+
B3:64JGADLBACJHH3EACD@DJAHLJDIENFEKIJJ6LE-HFJH57H7L9=BAFI8@FK>,GBDH764,5,4A='+G+,+,*E++@+2!+:+1>1=+4
==> Tmuris_adult_R4_rev.fastq <==
@HS23_6814:1:1101:1592:2250#4/2
CGAACCCNGTATNTTTGCGCTACTNTGTCTCCTACGCCTTTGTCTGTCTTGCCTGCATGGCTAACACTGCCCTGTTGGTTCAAGTGTCGTCTGCCGGAAG
+
:ABEGGH!G8EJ!8EJE6IEFBIH!HF8EKDD66FFAMDCKE/5>D5LD?E=?AHG>=AE5@E5I@CGB<KK@GG<B2E:H@2I9ICI?C@HC2@2:0@2
Thank you, matted. I had thought that might be the case; I suppose I hoped there might be something in the FASTQ headers that gave away strand specificity. I'm building a SAM now and will use IGV for viewing.
I agree it does not bode well! I am awaiting a reply from our collaborators in Cambridge (UK, not MA!), who I imagine will be very helpful.
Best of luck. Just a clarification: strand specificity for RNA-seq is achieved through using a specific library preparation protocol, so it's totally decoupled from the sequencing step (see e.g. this paper). That is, the sequencer (which is producing the read names and the fastq file) has no way of knowing if the DNA molecules it's sequencing come from a strand-specific library or not. The only (rare) exception might be if the sequencing facility manually changed the read names to something that reflected the library status.
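Since the answer has to come from the alignments rather than the read names, the check boils down to comparing each mate's alignment strand with the strand of the gene it overlaps. Here is a rough Python sketch of that comparison, working from SAM flag values. The function names are made up for illustration, and it assumes you can get a trustworthy gene strand from somewhere (for a poorly annotated species, perhaps a handful of conserved genes whose orientation you have checked by hand). It also assumes Trinity's convention as I understand it: FR means read /1 is sense and read /2 is antisense, and RF is the opposite.

```python
from collections import Counter

# SAM flag bits, as defined in the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
REVERSE = 0x10
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_read(flag, gene_strand):
    """Return 'FR', 'RF' or None for one aligned mate (hypothetical helper).

    flag        -- integer FLAG field from the SAM record
    gene_strand -- '+' or '-' for the gene the read overlaps (assumed known)
    """
    if gene_strand not in ("+", "-"):
        return None
    # Skip unpaired, unmapped, secondary and supplementary records
    if not flag & PAIRED or flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return None
    read_strand = "-" if flag & REVERSE else "+"
    is_sense = read_strand == gene_strand
    if flag & FIRST_IN_PAIR:
        # Read /1 sense -> FR; read /1 antisense -> RF
        return "FR" if is_sense else "RF"
    if flag & SECOND_IN_PAIR:
        # Read /2 sense -> RF; read /2 antisense -> FR
        return "RF" if is_sense else "FR"
    return None


def summarise(observations):
    """Fraction of informative reads supporting FR and RF.

    observations -- iterable of (flag, gene_strand) tuples
    """
    counts = Counter(classify_read(flag, strand) for flag, strand in observations)
    counts.pop(None, None)  # drop uninformative reads
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("FR", "RF")}
```

If one of the two fractions comes out close to 1, that is the value to try for --SS_lib_type. If they sit near 0.5 each, the library is most likely not strand-specific. Looking at the same genes in IGV with reads coloured by first-of-pair strand should show the same picture.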
There is now a nice compilation of all the different variations of this question: Read pair orientation : Illumina TruSeq Stranded mRNA library