Question

How To Determine If Paired–End Illumina Rnaseq Reads Are Strand–Specific

21

Entering edit mode

11.8 years ago

bedeabc ▴ 190

I've been provided with more than a billion reads of RNAseq data for a poorly annotated nematode species. They appear to be 2x100 paired-end Illumina reads – I currently know frustratingly little about the RNAseq protocol used, but need to perform assemblies using Trinity.

Trinity demands that I specify whether or not the reads are strand–specific, and also which strand is which through the --SS_lib_type parameter, which needs to be either FR or RF.

For each tissue sample, I have been given paired fwd and rev FASTQ files. How can I tell i) whether the data is indeed strand-specific, and ii) which strand is which, so that I know whether to use FR or RF with Trinity.

Any thoughts much appreciated. Here are the top four lines from two corresponding FASTQ files I've been given:

head -n 4 Tmuris_adult_R4*
==> Tmuris_adult_R4_fwd.fastq <==
@HS23_6814:1:1101:1592:2250#4/1
GCGGTATCAGTTGGTAAACCCTGCAGGCGCTCGCATAACGGTCGAAGGCTTTTTGCGGATCGTCGTCATTGTCGTTGACCTCAGCATCGCNCACCTCCTC
+
B3:64JGADLBACJHH3EACD@DJAHLJDIENFEKIJJ6LE-HFJH57H7L9=BAFI8@FK>,GBDH764,5,4A='+G+,+,*E++@+2!+:+1>1=+4

==> Tmuris_adult_R4_rev.fastq <==
@HS23_6814:1:1101:1592:2250#4/2
CGAACCCNGTATNTTTGCGCTACTNTGTCTCCTACGCCTTTGTCTGTCTTGCCTGCATGGCTAACACTGCCCTGTTGGTTCAAGTGTCGTCTGCCGGAAG
+
:ABEGGH!G8EJ!8EJE6IEFBIH!HF8EKDD66FFAMDCKE/5>D5LD?E=?AHG>=AE5@E5I@CGB<KK@GG<B2E:H@2I9ICI?C@HC2@2:0@2

illumina paired-end strand trinity • 38k views

ADD COMMENT • link updated 4.1 years ago by andrzej.koczut • 0 • written 11.8 years ago by bedeabc ▴ 190

0

Entering edit mode

Thank you matted. I had thought that might be the case – I suppose I hoped there might something in fastq headers that gave away strand specificity. I'm building a SAM now and will use IGV for viewing

I agree it does not bode well! I am awaiting a reply from our collaborators in Cambridge (UK, not MA!) who I do imagine will be very helpful.

ADD REPLY • link 11.8 years ago by bedeabc ▴ 190

1

Entering edit mode

Best of luck. Just a clarification: strand specificity for RNA-seq is achieved through using a specific library preparation protocol, so it's totally decoupled from the sequencing step (see e.g. this paper). That is, the sequencer (which is producing the read names and the fastq file) has no way of knowing if the DNA molecules it's sequencing come from a strand-specific library or not. The only (rare) exception might be if the sequencing facility manually changed the read names to something that reflected the library status.

ADD REPLY • link 11.8 years ago by matted 7.8k

0

Entering edit mode

There is now a nice compilation of all the different variations of this question: Read pair orientation : Illumina TruSeq Stranded mRNA library

ADD REPLY • link 6.2 years ago by igor 13k

score 20 · Answer 1 · 2013-03-16

20

Entering edit mode

11.8 years ago

matted 7.8k

I'll first say that it doesn't bode well for the rest of the project if your collaboration is so distant that you can't get a basic question like this answered from the providers of the data.

There's no way to figure out the strandedness from looking at raw reads. I would align the reads to a reference genome (or a closely related species, if you don't have a perfect reference). Then load the alignments in a browser (e.g. IGV) and look for expressed transcripts by eye. You could look at known gene locations, depending on the quality of the annotations. If the reads are typically observed from both strands on the exons, you can assume it's not strand-specific. If each exon seems to have a clear strand preference, you would assume it's strand-specific.

ADD COMMENT • link 11.8 years ago by matted 7.8k

10

Entering edit mode

In IGV, best is right-click on the reads, then color alignments by first-of-pair strand. Ideally, you should predominantly see one color within a transcript. From that, you can figure out whether read1 has the same direction as the gene, or the opposite.

ADD REPLY • link 11.8 years ago by Christof Winter ★ 1.1k

0

Entering edit mode

I now know that when I am looking at the gene that is marked as (+) in UCSC browser, then left read is marked as second in pair, whereas right read is marked as first in pair. Does this mean that I am dealing with FR or RF data? Thanks a lot. It is indeed true that when I check color alignments by first-of-pair strand every transcript gets only one color.

ADD REPLY • link 11.2 years ago by Biomonika (Noolean) 3.2k

Ram · Answer 2 · 2015-03-13

To tell whether RNA-seq reads are strand-specific, you must first perform an alignment. Following alignment you can use an automated tool like infer_experiment.py in the RSeQC package that will tell you whether the data is single-end or paired-end, strand-specific or unstranded, and if strand-specific, what type (first strand or second strand).

infer_experiment.py samples a few hundred thousand reads from your bam/sam and tells you the portion that would be explained by each type of strandedness, e.g.:

$ infer_experiment.py -i accepted_hits.bam -r hg19.refseq.bed -s 400000
Reading reference gene model hg19.refseq.bed ... Done
Loading SAM/BAM file ...  Total 400000 usable reads were sampled

This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0209
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9791
Fraction of reads explained by other combinations: 0.0000

When you get your output you can refer here for help with interpretation.

score 1 · Answer 3 · 2013-03-18

1

Entering edit mode

11.8 years ago

Ryan Thompson ★ 3.6k

If you don't have a reference that you can use, you could try doing a de novo assembly and mapping the reads to it. The assembly doesn't have to be great to give you enough information to tell if the reads are stranded.

ADD COMMENT • link 11.8 years ago by Ryan Thompson ★ 3.6k

score 0 · Answer 4 · 2020-11-27

https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html

Hands-on: Determining the library strandness

Convert GTF to BED12 tool to convert the GTF file to BED:

param-file “GTF File to convert”: Drosophila_melanogaster.BDGP6.87.gtf

Infer Experiment tool to determine the library strandness with:

param-files “Input .bam file”: mapped.bam files (outputs of RNA STAR tool)

param-file “Reference gene model”: BED12 file (output Convert GTF to BED12 tool)

“Number of reads sampled from SAM/BAM file (default = 200000)”: 200000

Infer Experiment tool tool generates one file with information on:

Paired-end or single-end library

Fraction of reads failed to determine

2 lines

For single-end

Fraction of reads explained by "++,--": the fraction of reads that assigned to forward strand

Fraction of reads explained by "+-,-+": the fraction of reads that assigned to reverse strand

For paired-end

Fraction of reads explained by "1++,1--,2+-,2-+": the fraction of reads that assigned to forward strand

Fraction of reads explained by "1+-,1-+,2++,2--": the fraction of reads that assigned to reverse strand *strong text* If the two “Fraction of reads explained by” numbers are close to each other, we conclude that the library is not a strand-specific dataset (or unstranded).