Question

How to perform strand-aware (RNA-Seq) mapping using STAR?

2

Entering edit mode

5.7 years ago

c_u ▴ 530

Hi,

I believe a similar question has been asked before (Question: Is Star Managing Strand-Specific Data ? ), but in the answer to that, it was mentioned that we don't need to do anything extra in STAR alignment if we have strand-aware RNA-seq data. But at another place (Are any additional parameters for STAR to hand the stranded RNA-Seq data?), the author of STAR mentions that presently STAR doesn't use strand information for mapping, and that if we use a certain option (--quantMode), it provides output for 3 kinds of strandedness).

So which is it? Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts, or if not, then what is the best way to achieve that? I am working with Tru-Seq Stranded Total RNA library, which apparently maps to the opposite strand of the genome (opposite to what you would normally expect).

I am primarily interested in finding novel non-coding transcripts in my data

RNA-Seq • 18k views

ADD COMMENT • link 5.7 years ago by c_u ▴ 530

3

Entering edit mode

5.7 years ago

benformatics 4.1k

The mapping is independent of the quantification.

The quantification is stranded.

Counting number of reads per gene. With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the pairedend read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step. STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.

You should look at column 4 for your counts (i.e. anti-sense reads to the given genomic feature).

ADD COMMENT • link 5.7 years ago by benformatics 4.1k

0

Entering edit mode

Thanks a lot! I am interested in finding novel non-coding transcripts. In that case, is it still column 4 that has all the relevant counts? I plan to use Stringtie with the BAM files to do transcript assembly, followed by tximport for generating counts for DESeq2. In that case, where, in this pipeline do I need to do anything to be able to use the extra info that a strand-specific library provides? Thanks a lot!

ADD REPLY • link 5.7 years ago by c_u ▴ 530

1

Entering edit mode

If you are searching for novel non-coding transcripts than doing this type analysis is not the right way to do it.

You would need to assemble a transcriptome. If you want to search for novel anti-sense transcripts then you could take a look at the non-reverse strand reads (column 3). If you are simply searching for non-coding RNA that are already annotated than you should stick with column 4.

ADD REPLY • link 5.7 years ago by benformatics 4.1k

0

Entering edit mode

That was very helpful, thank you! Why do you think this type of analysis is not the right way to go about it? You mention assembling a transcriptome, and I am basically using Stringtie to do exactly that. Then, once we have a unified transcriptome that represents all the samples, Stringtie gives abundance values for all those transcripts in each sample (https://www.nature.com/articles/nprot.2016.095#procedure), and then I use tximport to export those abundance values to counts that DESeq2 can use.

I specifically used Strigtie because it can assemble novel transcripts too while doing the transcript reconstruction. Is there anything wrong in this method?

ADD REPLY • link 5.7 years ago by c_u ▴ 530

0

Entering edit mode

I was assuming your reference was a standard one (e.g. UCSC/Ensembl) but if you did the assembly correctly then it should work.

ADD REPLY • link 5.7 years ago by benformatics 4.1k

0

Entering edit mode

One quick question - If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?

ADD REPLY • link 5.7 years ago by c_u ▴ 530

2

Entering edit mode

5.7 years ago

swbarnes2 15k

STAR does not use strand information for mapping and it will output gene counts for 3 kinds of strandedness. There is no contradiction here. Just run it and look for yourself.

ADD COMMENT • link 5.7 years ago by swbarnes2 15k

0

Entering edit mode

Thanks a lot Mr. Barnes! I am interested in finding novel non-coding transcripts in my data. Given that STAR doesn't use strand info for mapping, is there a way to use that info to better quantify these novel non-coding transcripts? I just thought using the strand info might improve this quantification

ADD REPLY • link 5.7 years ago by c_u ▴ 530

0

Entering edit mode

If STAR doesn't use stand information for mapping, then am I right to assume that it somehow stores the information about the strand of the mapped read, somewhere in the bam file, which can then pass on this information to the quantifier (like Stringtie/featurecounts), which can then do the quantification based on strand information?

ADD REPLY • link 5.7 years ago by c_u ▴ 530

0

Entering edit mode

Have you really not read the sam format specifications?

ADD REPLY • link 5.7 years ago by swbarnes2 15k

0

Entering edit mode

Just did, found that strand information of the read is stored in one of the bitwise flags. Thank you for that! The reason I was asking is that some mappers do map in a strand-aware fashion (tophat2 for example), whereas STAR doesn't. So I was wondering that if STAR doesn't care about the strand of the read, how does it still report the correct strand in the BAM file (I can understand how an aligner that is strand-aware would be able to do that). But maybe I can ask another question for this, I don't want to unnecessarily trouble you with too many comment questions!

ADD REPLY • link 5.7 years ago by c_u ▴ 530

2

Entering edit mode

Do you have clear in your mind the difference between the strand STAR reports and the strand from strand-specific library preparations?

STAR is reporting the strand the read is mapped onto the reference genome.

Stranded library preparations select a specific strand from the mRNA to be sequenced (and, if performing paired-end sequencing, the second read will be on the opposite strand from the first read).

One is different from the other. One can use the strand STAR reports together with the information from which strand (in relation to the reference genome) the gene is transcribed to quantify strand-specific expression - this is exactly what featureCounts and HTSeq do.

edit: I think the post Antisense transcription from strand specific RNA-seq may be very pertinent to you.

ADD REPLY • link 5.7 years ago by h.mon 35k

score 5 · Accepted Answer · 2019-11-13

strand-aware RNA-seq data

There is no such thing. There are strand-specific library preparations, though, which generate strand-specific (relative to the mRNA) reads, which is probably what you mean.

Does STAR perform strand-aware mapping, which we can use downstream to get antisense transcripts

You can't get transcripts, sense or antisense - you can get mappings and counts from STAR. What are the downstream analyses you intend to perform?

STAR maps reads to an indexed (most commonly annotated) reference genome, and it doesn't care about transcript strandedness. From these mappings, it can output many different things, which can be used for several downstream analyses. The three most commonly used outputs from STAR are:

read mappings to the genome, in BAM format. By default, strandedness has no importance: STAR maps the reads to the best location, regardless of strand, gene location or whatever. However, the parameter --outSAMstrandField intronMotif - which should be used only if you want to analyse downstream with Cufflinks, and have unstranded RNAseq dara - alters the default behaviour and adds XS strand attribute. Read the section Compatibility with Cufflinks/Cuffdiff from STAR's manual. If you have stranded data, you don't have to change anything in STAR.
read mappings to the transcriptome, in BAM format. Those are to be used with RSEM, mostly.
counts of reads mapping to genes, similar to HTSeq or featureCounts. Here STAR outputs three columns of counts, one for unstranded, one for stranded, and one for reverse stranded.

Most of these points were already clearly made by Alex Dobin in the thread Are any additional parameters for STAR to hand the stranded RNA-Seq data?, and the STAR manual.