STAR Aligner --quantMode GeneCounts
3.5 years ago
Seigfried ▴ 80

Hello,

I have searched for an answer to my question, as it is by no means a new one. It was discussed in "RNA-seq: Explain STAR quantMode geneCounts values". I read that thread but did not understand it.

I have paired-end human RNA-Seq data. The files are named like this: A1_1.fq.gz, A1_2.fq.gz

I have run the STAR aligner with the option --quantMode GeneCounts.

The resulting ReadsPerGene.out.tab has 4 columns:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
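To make the column layout concrete, here is a minimal Python sketch that reads one ReadsPerGene.out.tab file. It assumes the file is tab-separated with the four columns listed above, and that rows whose ID starts with "N_" (N_unmapped, N_multimapping, N_noFeature, N_ambiguous) are summary rows rather than genes. The function name read_gene_counts is made up for illustration; it is not part of STAR.

```python
import csv

# Names for columns 2-4 of ReadsPerGene.out.tab, in file order.
COLUMNS = ("unstranded", "forward", "reverse")


def read_gene_counts(path):
    """Return (summary, genes); each maps an ID to a tuple of three counts.

    Assumption: rows whose ID starts with "N_" are STAR's summary rows,
    everything else is a gene.
    """
    summary = {}
    genes = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 4:
                continue  # skip blank or malformed lines
            counts = tuple(int(value) for value in row[1:4])
            if row[0].startswith("N_"):
                summary[row[0]] = counts
            else:
                genes[row[0]] = counts
    return summary, genes
```

Each gene gets three alternative counts of the same alignments, one per strandedness assumption. They are not per-file counts for A1_1.fq.gz and A1_2.fq.gz, so only one of the three columns should be carried forward.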

So I see that I have RNA Seq counts for both R1 and R2 strands. I assume that "1st read strand" refers to A1_1.fq.gz file while the "2nd read strand" refers to A1_2.fq.gz.

Since I have multiple files like this, I want to ultimately find differential gene expression by comparing 2 samples.

What do I do with this output? Do I sum the R1 and R2 strand output counts together?

I know the strandedness of a gene (from the .GFF or .GTF file). For example, if I know the gene is located on the + strand only, I assume that I should not consider the - strand counts at all. So I believe that I should split my GTF into 2 parts, + and -, and take only the counts that match each gene's strand.

Is this correct?

As a side question, is --quantMode GeneCounts comparable to htseq-count? Or is htseq-count better?

RNA-Seq STAR • 2.4k views
3.5 years ago

So I see that I have RNA Seq counts for both R1 and R2 strands. I assume that "1st read strand" refers to A1_1.fq.gz file while the "2nd read strand" refers to A1_2.fq.gz.

Nope. That's not what that means at all. If you run STAR with just R1, you'll get the same format of output, and the numbers won't be so different either.

Since I have multiple files like this, I want to ultimately find differential gene expression by comparing 2 samples.

Two? Just two? No biological replicates? Just use Excel. Other software uses clever math to understand the samples more statistically using replicate information, but you don't have that.

I know the strandedness of a gene (from the .GFF or .GTF file). For example, if I know the gene is located on the + strand only, I assume that I should not consider the - strand counts at all.

Wrong. You need to stop and learn what it is you are doing. Pushing nonsense through analysis software is only going to give you grief.

is --quantMode GeneCounts comparable to htseq-count?

They are supposed to be the same. Using RSEM would be better, or starting over with Kallisto or Salmon on transcriptome references; all of those options will handle ambiguous gene assignments better.


Hey, thank you for your comments. I really appreciate you answering. I am trying to understand this, and I am sorry if my question irritated you.

I have 3 replicates per sample and roughly 48 samples, since you cannot do RNA-Seq without replicates, but that is a later step which I will do with DESeq2.
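For that later step, a genes-by-samples table of raw counts is the usual starting point. The sketch below, in Python, takes one column from each sample's ReadsPerGene.out.tab and combines them. The function names, the sample names and the choice of column are placeholders for illustration; which column is correct depends on the library type.

```python
import csv


def build_count_matrix(sample_files, column):
    """Combine one count column from several ReadsPerGene.out.tab files.

    sample_files: dict mapping a sample name to a file path (hypothetical names).
    column: 2, 3 or 4, numbered as in the file (1 is the gene ID).
    Returns (samples, matrix), where matrix maps gene ID -> list of counts,
    one per sample, in the order given by samples.
    """
    if column not in (2, 3, 4):
        raise ValueError("column must be 2, 3 or 4")
    samples = list(sample_files)
    matrix = {}
    for index, sample in enumerate(samples):
        with open(sample_files[sample], newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                # Assumption: "N_" rows are summary rows, not genes.
                if len(row) < 4 or row[0].startswith("N_"):
                    continue
                counts = matrix.setdefault(row[0], [0] * len(samples))
                counts[index] = int(row[column - 1])
    return samples, matrix


def write_count_matrix(samples, matrix, out_path):
    """Write the matrix as a tab-separated table with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["gene_id"] + samples)
        for gene_id, counts in matrix.items():
            writer.writerow([gene_id] + counts)
```

The same column is used for every sample, and the counts are left as raw integers rather than summed across columns or normalized.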

Now, I do know that a gene can be present on either the sense or the antisense strand, but maybe this doesn't matter. What I really need to find out (I guess) is whether:

R1 maps on the coding strand
R1 maps on the complementary strand
R1 maps randomly (non-stranded library)


You need to talk to the person who did the library prep, find out what library prep kit was used; different kits work different ways.
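As a cross-check on what the kit documentation says, a common heuristic is to compare the gene totals in columns 3 and 4 of ReadsPerGene.out.tab. In an unstranded library the two totals tend to be similar, while in a stranded library one of them holds nearly all the counts. The Python sketch below is only a rough guess, not a replacement for the library prep information; the function names and the 0.8 / 0.2 / 0.4-0.6 cut-offs are arbitrary assumptions.

```python
import csv


def column_totals(path):
    """Sum columns 2-4 over gene rows, skipping the "N_" summary rows."""
    totals = [0, 0, 0]
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 4 or row[0].startswith("N_"):
                continue
            for i in range(3):
                totals[i] += int(row[i + 1])
    return tuple(totals)


def guess_strandedness(forward, reverse):
    """Guess the library type from the column 3 and column 4 totals.

    The thresholds are illustrative assumptions, not established values.
    """
    total = forward + reverse
    if total == 0:
        return "undetermined"
    fraction_forward = forward / total
    if fraction_forward >= 0.8:
        return "forward"      # use column 3 (htseq-count -s yes)
    if fraction_forward <= 0.2:
        return "reverse"      # use column 4 (htseq-count -s reverse)
    if 0.4 <= fraction_forward <= 0.6:
        return "unstranded"   # use column 2
    return "ambiguous"        # inspect the data before choosing
```

For example, `_, fwd, rev = column_totals("A1ReadsPerGene.out.tab")` followed by `guess_strandedness(fwd, rev)` gives a guess for one sample (the file name here is hypothetical). All samples from the same library prep should agree.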
