Question

RNA-seq: Explain STAR quantMode geneCounts values

40

Entering edit mode

8.5 years ago

williantafsilva ▴ 400

Background: I mapped my reads against the reference genome using the STAR command:

star
--runThreadN 10
--runMode alignReads
--readFilesCommand zcat
--outFilterMultimapNmax 1
--outFilterMatchNmin 35
--outSAMtype BAM Unsorted SortedByCoordinate
--quantMode GeneCounts
--twopassMode Basic
--outFileNamePrefix "$newfilename"
--genomeDir $indexedreference
--sjdbGTFfile $gtfreference
--readFilesIn $inputfile1 $inputfile2

Output: As explained on the STAR manual (here), I get a file called *ReadsPerGene.out.tab with the following results:

N_unmapped  20460635    20460635    20460635
N_multimapping  0   0   0
N_noFeature 23097482    39371367    23239579
N_ambiguous 497097  12313   293179
ENSDARG00000009262  139 0   139
ENSDARG00000018618  2372    5   2367
ENSDARG00000045874  206 4   303
ENSDARG00000057468  13  101 13
ENSDARG00000045873  273 2   271
ENSDARG00000079143  72  0   72
ENSDARG00000057714  206 31  175

Explanation from the manual:

Counting number of reads per gene. With --quantMode GeneCounts option STAR will count number reads per gene while mapping. A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the pairedend read are checked for overlaps. The counts coincide with those produced by htseq-count with default parameters. This option requires annotations (GTF or GFF with –sjdbGTFfile option) used at the genome generation step, or at the mapping step. STAR outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:

column 1: gene ID

column 2: counts for unstranded RNA-seq

column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)

column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

Select the output according to the strandedness of your data. Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads. With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.

Questions:

Which column should I use for differential expression analysis? Why?
Should the sum of columns 3 and 4 equal column 2? Why or why not? What does it mean when they are not equal?
Why for some transcripts the value in column 3 is higher than the value in column 4, while for most of the transcripts the value in column 4 is higher than the value in column 3?
What does it mean when the value in column 3 or 4 is higher than the value in column 2?

Thank you!

RNA-Seq STAR Differential expression • 69k views

ADD COMMENT • link updated 7.0 years ago by Tao ▴ 540 • written 8.5 years ago by williantafsilva ▴ 400

1

Entering edit mode

Is this a homework assignment?

ADD REPLY • link 7.6 years ago by h.mon 35k

0

Entering edit mode

Thanks for the clear explanations. If I need FPKM data, do I use these read counts ?

ADD REPLY • link 7.6 years ago by neokao • 0

score 9 · Answer 1 · 2016-10-26

9

Entering edit mode

8.5 years ago

Rohit ★ 1.5k

The column you need might be the 2nd, if you have non-strand specific data. Unless you have specifically designed a strand-specific protocol, usually this is not the case. But for downstram analysis, it is encouraged to run htseq-count or another read-counting program again.
$2 is for unstranded hits, but those overlapping on opposite strand of features are considered ambiguous. $3 reports hits based on the strand you have given in your gff annotation, and $4 in the reverse direction of your features in gff (for PE-data the 5'3'-direction is also considered). Refer to -s option of htseq-count
Some of the reads might have a splice-form expressed higher in the preceeding exon, this can be the reason for lesser reads aligning. Splicing is usually the process which leads to differences in mapping coverages over the read. Usually what you need would be based on the analysis you need to do - gene-specific, transcript-specific or something else altogether.
If $3>$2 or $4>$2, it means that unstranded data has lesser hits than the cases of directions, happens when reads are assigned as ambiguous. Reads overlapping on the opposite strand (of the gff feature) are considered ambiguous for unstranded but are assigned to genes for $3 or $4. This is the reason for the difference, this was refered here on STAR discussion

Sorry for the confusion.

ADD COMMENT • link 8.5 years ago by Rohit ★ 1.5k

1

Entering edit mode

Thanks for the explanations!

Here is my understanding: column 3 and column 4 are corresponding to different strand-specific library preparation protocol. Please see HTSeq -s/--stranded description as follows. This blog explains the read orientations in different strand-specific/un-stranded protocols.

-s <yes no="" reverse="">, --stranded=<yes no="" reverse=""> whether the data is from a strand-specific assay (default: yes) For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

So, ideally, under the condition that we have the totally correct gene annotation and no library errors, there should be one column(column 3 or 4) with all zeros (no reads would be aligned under the wrong protocol specification) if it is strand-specific RNAseq. But the fact is, even under strand-specific protocol, there always read counts in both column 3 and column 4.

I think that's because 1) There might be some small amount of un-stranded reads left during strand-specific library preparation (not pretty sure). 2) The genome annotation is incomplete, there might be some novel anti-sense gene expression, and those RNA reads were incorrectly mapped to that gene.

So, if we analyze strand specific RNA-seq data, we should choose right column (3 or 4) based on the strand specific protocol during library preparation.

Please point out if my understanding is not correct! Thanks!

ADD REPLY • link 7.0 years ago by Tao ▴ 540

0

Entering edit mode

Since I have different values for columns 3 and 4, does that automatically mean that I have strand-specific data? If I that's the case, should I use column 3 or 4 for differential expression? And why?
Why is $2=$3+$4 NOT the case for many of my transcripts? For example, in the extract above I have 206=4+303 and 13=101+13, which are obviously wrong.
Wouldn't that bias my differential expression analysis if I use either column? And why should I use column x in particular, instead of the other?
In my original question, I show some examples in which $3>$2 and $4>$2, so it is possible. That means that there's something wrong with my data, there's something wrong with the calculations by STAR or there's something wrong in our understanding of those values.

Thank you for the quick response!

ADD REPLY • link 8.5 years ago by williantafsilva ▴ 400

0

Entering edit mode

I have edited the answer, I did not consider ambiguity of mapping before, sorry for that

ADD REPLY • link 8.5 years ago by Rohit ★ 1.5k

0

Entering edit mode

Can someone please explain point 1. Why is it advised to run htseq-count again if STAR is performing it with quantMode option?thanks

ADD REPLY • link 5.5 years ago by ajajoo ▴ 10

0

Entering edit mode

Although one can certainly do it, I don't think it is advisable, and both results should be largely the same.

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

Good explanation. Really helped with the clearing out the doubts. Thanks,

ADD REPLY • link 7.8 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Sorry for digging in this old thread, but I have an understanding problem:

1). I have around 4000 genes (of 14000 overall) where $2 < $3 + $4 but I have no overlapping genes on the +/- strand.
So how can this be explained?

2). For most reads I have $4 > $3, but for ~1000 $3 > $4. This I see in all replicated and all samples. So when I have GeneX with $3 > $4 than for GeneX I have $3 > $4 in all other samples/replicates. All these genes are equally distributed to +/- strand.
Does anyone have an explanation for this?

ADD REPLY • link 6.5 years ago by pentium3-user ▴ 30