Which column to use from STAR output for Differential Gene Expression Analysis?
2
0
Entering edit mode
6.7 years ago
caggtaagtat ★ 1.9k

Hi,

during mapping with STAR I generated my count matrix for the differential gene expression analysis with the parameter --quantMode GeneCounts, however, I don't now which column from the output I should use for the analysis.

The help site says, it depends on how my data is stranded, however I'm not sure how to determine that in my data.

In this google group, Mr. Dobin suggests to take the 4th column, if the read counts in in the 4th column are generally higher than in 3rd column, which is the case in my data. However he also states, that the 4th column represents the output from ht-seq with the parameter -s reverse, which is described in the manual of ht-seq like a setting only for paired end reads (I only have singel end reads).

For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

So should I now use the 4th column of the STAR output or the 2nd (nonstranded)?

RNA-Seq STAR • 5.7k views
ADD COMMENT
0
Entering edit mode

The real question is what was the library prep procedure for your reads? The choice of column depends on the answer to this question.

Hint: for Illumina standard stranded kits, you should use the 4th column.

ADD REPLY
0
Entering edit mode

You're right, i was just wondering, if I have to try to get the "strand information" about the lib prep or if I can maybe see it in the read distribution. I will consider my data reverse stranded for now and also try to get to know, if my data is stranded at all :)

ADD REPLY
0
Entering edit mode

I just wanted to update, that I found the protocol for the RNA-Seq lib preperation and it was a stranded library :)

ADD REPLY
5
Entering edit mode
6.7 years ago

that the 4th column represents the output from ht-seq with the parameter -s reverse, which is described in the manual of ht-seq like a setting only for paired end reads

I don't think that's quite what the manual means. My lab almost never does paired end RNAseq, but I always pick the 4th column for counts, because the library prep we used makes all the reads align in the reverse.

ADD COMMENT
0
Entering edit mode

Ok thanks, I will assume my data is reverse stranded for now and try to find out more information about the lib prep kit used for this RNA-Seq

ADD REPLY
2
Entering edit mode
6.7 years ago

The strandness of your data depends on the protocol you used.

You can find the strandness information using RSeQC, module : infer_experiment.py.

That will tell you if your data are stranded or not.

If it is, you have two cases :

  • "++,--" has a majority, your data are first read forward
  • "+-,-+" has a majority, your data are first read reverse

( Related to this post : Infer-experiment.py, is strand-specific? and this RSeQC documentation : http://rseqc.sourceforge.net/#infer-experiment-py )

Related to this post A: Count the number of mapped and annotated reads in bam files (h.mon answer), you have 3 counts columns in your Star output:

If your data are unstranded you take the first one

If your data are stranded first read forward ("++,--" in RSeQC) you take the second one

If your data are stranded first read reverse ("+-,-+" in RSeQC) you take the third one

In any case if you know that your data are stranded but you don't know if first read is forward or reverse, just add all counts by column and check at the result. You will find a huge difference of total count in your columns. Take the bigger one.

( I'm not an expert, so if someone can confirm this, I would be thankful :) )

ADD COMMENT
0
Entering edit mode

Thanks, I will give that tool a try than.

ADD REPLY

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6