Question

[RNA-Seq] STAR vs Subread and TPM?

4.9 years ago

Payton Yau ▴ 10

Hi all,

I have read many discussions about RNA-Seq. Many of them use STAR + featureCounts for their "standard" analysis and suggest using "TPM" for count data normalisation. My questions are:

(1) Is there any reason why STAR is better than SUBREAD?

(2) I read the post "Differential expression analysis starting from TPM data", in which Gordon Smyth said TPM is not good for DE analysis. Is there any reason why we still use TPM for downstream analysis, and when do we need to use TPM?

(3) Many posts also suggest using Salmon or kallisto and normalising the output to TPM. Would that differ much from running featureCounts and converting its counts to TPM?

Thank you.

score 8 · Answer 1 · 2020-09-23

Hi,

(1) I think STAR handles reads that span splice junctions quite well (I'm not saying that Subread doesn't), and it is quite fast. I'm not very familiar with either tool, and sometimes it is just a matter of taste.

(2) Yes. At least edgeR and DESeq2 require raw (not transformed or normalized) read counts, and each package applies its own normalization internally. For DE analysis you shouldn't use TPM; raw counts are preferable. TPM normalizes within a sample, not across samples, so it is not the preferred input for either package. Also read this: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts
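To make the "within sample only" point concrete, here is a minimal Python sketch of the usual TPM calculation (reads per kilobase, rescaled so each sample sums to one million). The gene lengths and counts are made-up numbers for illustration, not data from any real experiment.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for a single sample."""
    # Step 1: reads per kilobase corrects each feature for its length.
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    # Step 2: rescale so the values of this one sample sum to 1e6.
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

# Hypothetical lengths (bp) and counts for three genes.
lengths = [1000, 2000, 500]
sample_a = [100, 200, 50]
sample_b = [100, 200, 500]  # only gene 3 differs from sample A

tpm_a = tpm(sample_a, lengths)
tpm_b = tpm(sample_b, lengths)

# Gene 1 has the same count in both samples, yet its TPM differs,
# because every sample is forced to sum to one million.
print(round(tpm_a[0]), round(tpm_b[0]))
```

Because the total is fixed per sample, a change in one highly expressed gene shifts the TPM of every other gene in that sample. That is fine for ranking genes within a sample, but it is one reason the count-based packages want the raw counts instead.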

(3) Yes, at least Salmon gives you TPM values. You can use the tximport R package to import Salmon's output into R and pass it directly to DESeq2. In this case tximport will estimate the counts according to the countsFromAbundance parameter you give it, and it can also summarize the results at gene level. As for whether you should work with TPM: no. You should always prefer raw counts, unless a particular tool specifically requires TPM values for some reason. You can also use featureCounts and then convert its counts to TPM. The conversion itself makes no difference. If you use Salmon or STAR + featureCounts and compare the TPMs obtained with both, they can be different, but that difference is due to the mapping/summarisation methods.

I hope this helps,

António

score 7 · Answer 2 · 2020-09-24

I have read many discussions about RNA-Seq. Many of them use STAR + featureCounts for their "standard" analysis and suggest using "TPM" for count data normalisation.

I find it hard to believe people were suggesting TPM normalization and STAR + featureCounts for the same analysis. TPM means "transcripts per million" and is calculated for transcripts. It is meant as a within-sample normalization, to allow comparing the expression of different transcripts within a sample. The combo STAR + featureCounts (or STAR with --quantMode GeneCounts) is generally used to output counts of reads mapping to genes, in order to perform differential gene expression analysis between samples.

As for your questions:

1) I don't remember seeing a recent independent comparison of STAR and Subread, but I expect both are pretty accurate and will produce similar results. STAR is probably faster, but uses a lot more memory than Subread (although the recent Rsubread paper claims Rsubread is faster than STAR, I will only believe this once I see an independent comparison).

A few years ago (which means, many, many versions ago), I compared STAR with --quantMode GeneCounts against Subread + featureCounts - the results were very similar, and the downstream gene expression analysis largely yielded the same results.

2) Depending on the workflow you choose, you may never use TPMs. TPMs would make sense for some visualization analyses (but there are other transformations, for example, edgeR uses CPMs, and DESeq2 uses rlog or vst transformations).
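For comparison with TPM, here is a rough Python sketch of plain counts per million. It is only the basic idea: it ignores the normalization factors that edgeR can fold into the library size, and the pseudocount of 1 in the log step is an assumption made here for illustration, not what any particular package does.

```python
import math

def cpm(counts):
    """Counts per million: scale by library size only, no length correction."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def log2_cpm(counts, pseudocount=1.0):
    # The pseudocount (an illustrative choice) avoids log2(0) for unexpressed genes.
    return [math.log2(v + pseudocount) for v in cpm(counts)]

# Hypothetical counts for three genes in one sample.
counts = [100, 300, 600]
cpm_values = cpm(counts)
log_values = log2_cpm(counts)
print(cpm_values)
```

Unlike TPM, CPM never looks at feature length, so it is suited to comparing the same gene across samples in a plot rather than different genes within one sample.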

Of course, TPMs make sense for tools designed to be used with TPMs, but as I always used edgeR or DESeq2, I don't have the knowledge to comment further on this.

3) Salmon and kallisto already output TPM (along with estimated transcript counts), and these would be more accurate than TPMs estimated from featureCounts output. To estimate TPMs from featureCounts output you either (a) use featureCounts to summarize over transcripts (I don't even know if featureCounts allows this, but, if it does, the counts will be bogus, because by default featureCounts discards reads mapping to multiple features), or (b) estimate a gene-level TPM from the featureCounts results, but then, what length do you use? Total gene length? Average of transcript lengths?
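The "what length to use?" problem can be shown with a small Python sketch. All numbers are hypothetical: two genes with identical counts, where the only thing that changes between the two runs is the gene length definition.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for a single sample."""
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

# Hypothetical gene-level counts, e.g. as produced by a read counter.
gene_counts = [300, 300]  # geneA, geneB

# Hypothetical transcript lengths (bp) per gene.
transcript_lengths = {"geneA": [1000, 3000], "geneB": [2000, 2000]}

# Length choice 1: average of transcript lengths (2000 bp for both genes).
mean_lengths = [sum(t) / len(t) for t in transcript_lengths.values()]

# Length choice 2: total (union) exon length, assumed here to be
# 3500 bp for geneA and 2000 bp for geneB.
union_lengths = [3500, 2000]

tpm_mean = tpm(gene_counts, mean_lengths)
tpm_union = tpm(gene_counts, union_lengths)

print([round(v) for v in tpm_mean])
print([round(v) for v in tpm_union])
```

With the average transcript length the two genes look equally expressed, while with the union length geneB looks clearly higher. Same counts, different answer, which is why a gene-level TPM derived this way is only as meaningful as the length you plug in.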

score 4 · Answer 3 · 2020-09-24

To answer the question "Is there any reason why we still use TPM for downstream analysis? and when do we need to use TPM?", there are plenty of reasons to use TPM in downstream analysis. Differential gene expression is only one type of downstream analysis. There are others: GSEA, correlation between genes, machine learning, etc. TPM is even preferred in certain cases because it actually normalizes for transcript length (so, within a single sample, you can figure out whether transcript A or transcript B is more abundant). TPMs aren't going away anytime soon.

I personally recommend using pseudoalignment (e.g. Kallisto) for most of your RNA-seq purposes. You can get estimated counts and TPMs in a fast and accurate way and you can do downstream DE analysis with sleuth or DESeq2.