I believe a similar question has been asked before (Question: Is Star Managing Strand-Specific Data ?
), but in the answer to that, it was mentioned that we don't need to do anything extra in STAR alignment if we have strand-aware RNA-seq data. But at another place (Are any additional parameters for STAR to hand the stranded RNA-Seq data?), the author of STAR mentions that presently STAR doesn't use strand information for mapping, and that if we use a certain option (--quantMode), it provides output for 3 kinds of strandedness).
So which is it? Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts, or if not, then what is the best way to achieve that? I am working with Tru-Seq Stranded Total RNA library, which apparently maps to the opposite strand of the genome (opposite to what you would normally expect).
I am primarily interested in finding novel non-coding transcripts in my data
There is no such thing. There are strand-specific library preparations, though, which generate strand-specific (relative to the mRNA) reads, which is probably what you mean.
Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts
You can't get transcripts, sense or antisense - you can get mappings and counts from STAR. What are the downstream analyses you intend to perform?
STAR maps reads to an indexed (most commonly annotated) reference genome, and it doesn't care about transcript strandedness. From these mappings, it can output many different things, which can be used for several downstream analyses. The three most commonly used outputs from STAR are:
read mappings to the genome, in BAM format. By default, strandedness has no importance: STAR maps the reads to the best location, regardless of strand, gene location or whatever. However, the parameter --outSAMstrandField intronMotif - which should be used only if you want to analyse downstream with Cufflinks, and have unstranded RNAseq dara - alters the default behaviour and adds XS strand attribute. Read the section Compatibility with Cufflinks/Cuffdiff from STAR's manual. If you have stranded data, you don't have to change anything in STAR.
read mappings to the transcriptome, in BAM format. Those are to be used with RSEM, mostly.
counts of reads mapping to genes, similar to HTSeq or featureCounts. Here STAR outputs three columns of counts, one for unstranded, one for stranded, and one for reverse stranded.
Thanks a lot for the detailed answer. I had a question -
What are the downstream analyses you intend to perform?
I basically want to find novel non-coding transcripts in my data. For this, I use STAR for mapping, then use Stringtie for transcript assembly (since it can generate novel transcripts nicely), and then use Tximport that takes the abundance values from Stringtie and creates counts that can be fed to DESeq2 (chose tximport because it provides a way to integrate Stringtie results with DESeq2).
So, now that you mention that STAR mapping itself is strand-agnostic, when/how can I use the strand info so as to be able to do better counting? (Note that my focus is less on genes and more on (novel) non-coding transcripts). Thanks a lot!!
Use StringTie quantification. Or use featureCounts with the StringTie GTF to quantify know and novel transcripts. Or extract the StringTie transcriptome to fasta and quantify with Salmon or kallisto.
Counting number of reads per gene. With --quantMode GeneCounts option
STAR will count number reads per gene while mapping. A read is counted
if it overlaps (1nt or more) one and only one gene. Both ends of the
pairedend read are checked for overlaps. The counts coincide with
those produced by htseq-count with default parameters. This option
requires annotations (GTF or GFF with –sjdbGTFfile option) used at the
genome generation step, or at the mapping step. STAR outputs read
counts per gene into ReadsPerGene.out.tab file with 4 columns which
correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count
option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count
option -s reverse)
Select the output according to the strandedness of your data. Note,
that if you have stranded data and choose one of the columns 3 or 4,
the other column (4 or 3) will give you the count of antisense reads.
With --quantMode TranscriptomeSAM GeneCounts, and get both the
Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.
You should look at column 4 for your counts (i.e. anti-sense reads to the given genomic feature).
Thanks a lot! I am interested in finding novel non-coding transcripts. In that case, is it still column 4 that has all the relevant counts? I plan to use Stringtie with the BAM files to do transcript assembly, followed by tximport for generating counts for DESeq2. In that case, where, in this pipeline do I need to do anything to be able to use the extra info that a strand-specific library provides? Thanks a lot!
If you are searching for novel non-coding transcripts than doing this type analysis is not the right way to do it.
You would need to assemble a transcriptome. If you want to search for novel anti-sense transcripts then you could take a look at the non-reverse strand reads (column 3). If you are simply searching for non-coding RNA that are already annotated than you should stick with column 4.
That was very helpful, thank you! Why do you think this type of analysis is not the right way to go about it? You mention assembling a transcriptome, and I am basically using Stringtie to do exactly that. Then, once we have a unified transcriptome that represents all the samples, Stringtie gives abundance values for all those transcripts in each sample (https://www.nature.com/articles/nprot.2016.095#procedure), and then I use tximport to export those abundance values to counts that DESeq2 can use.
I specifically used Strigtie because it can assemble novel transcripts too while doing the transcript reconstruction. Is there anything wrong in this method?
One quick question - If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?
STAR does not use strand information for mapping and it will output gene counts for 3 kinds of strandedness. There is no contradiction here. Just run it and look for yourself.
Thanks a lot Mr. Barnes! I am interested in finding novel non-coding transcripts in my data. Given that STAR doesn't use strand info for mapping, is there a way to use that info to better quantify these novel non-coding transcripts? I just thought using the strand info might improve this quantification
If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?
Just did, found that strand information of the read is stored in one of the bitwise flags. Thank you for that! The reason I was asking is that some mappers do map in a strand-aware fashion (tophat2 for example), whereas STAR doesn't. So I was wondering that if STAR doesn't care about the strand of the read, how does it still report the correct strand in the BAM file (I can understand how an aligner that is strand-aware would be able to do that). But maybe I can ask another question for this, I don't want to unnecessarily trouble you with too many comment questions!
Do you have clear in your mind the difference between the strand STAR reports and the strand from strand-specific library preparations?
STAR is reporting the strand the read is mapped onto the reference genome.
Stranded library preparations select a specific strand from the mRNA to be sequenced (and, if performing paired-end sequencing, the second read will be on the opposite strand from the first read).
One is different from the other. One can use the strand STAR reports together with the information from which strand (in relation to the reference genome) the gene is transcribed to quantify strand-specific expression - this is exactly what featureCounts and HTSeq do.
Thanks a lot for the detailed answer. I had a question -
I basically want to find novel non-coding transcripts in my data. For this, I use STAR for mapping, then use Stringtie for transcript assembly (since it can generate novel transcripts nicely), and then use Tximport that takes the abundance values from Stringtie and creates counts that can be fed to DESeq2 (chose tximport because it provides a way to integrate Stringtie results with DESeq2).
So, now that you mention that STAR mapping itself is strand-agnostic, when/how can I use the strand info so as to be able to do better counting? (Note that my focus is less on genes and more on (novel) non-coding transcripts). Thanks a lot!!
Use StringTie quantification. Or use featureCounts with the StringTie GTF to quantify know and novel transcripts. Or extract the StringTie transcriptome to fasta and quantify with Salmon or kallisto.
The following thread might be of interest: Unspecific strand.