Question

same transcript id with STAR quant mode

7.0 years ago

grant.hovhannisyan ★ 2.6k

Hi Biostars,

My GTF file (which I got by converting a GFF file with gffread) contains records like these:

ctro_c_1    CGOB    exon    27749    28666    .    +    .    transcript_id "ctro_CGOB_00001_mRNA"; gene_id "ctro_CGOB_00001"; gene_name "ctro_CGOB_00001";
ctro_c_1    CGOB    CDS    27749    28666    .    +    0    transcript_id "ctro_CGOB_00001_mRNA"; gene_id "ctro_CGOB_00001"; gene_name "ctro_CGOB_00001";
ctro_c_1    CGOB    exon    770839    771455    .    -    .    transcript_id "ctro_CGOB_00002_mRNA"; gene_id "ctro_CGOB_00002"; gene_name "ctro_CGOB_00002";
ctro_c_1    CGOB    exon    771521    771554    .    -    .    transcript_id "ctro_CGOB_00002_mRNA"; gene_id "ctro_CGOB_00002"; gene_name "ctro_CGOB_00002";
ctro_c_1    CGOB    CDS    770839    771455    .    -    2    transcript_id "ctro_CGOB_00002_mRNA"; gene_id "ctro_CGOB_00002"; gene_name "ctro_CGOB_00002";
ctro_c_1    CGOB    CDS    771521    771554    .    -    0    transcript_id "ctro_CGOB_00002_mRNA"; gene_id "ctro_CGOB_00002"; gene_name "ctro_CGOB_00002";

ctro_CGOB_00002 has two exons, and both carry the same transcript_id, ctro_CGOB_00002_mRNA. If I use the --quantMode TranscriptomeSAM GeneCounts option with STAR, it will sum up the counts from both exons, right?

Thank you very much

RNA-Seq STAR gtf • 2.3k views

updated 5.8 years ago by manuel.belmadani ★ 1.4k • written 7.0 years ago by grant.hovhannisyan ★ 2.6k

score 0 · Answer 1 · 2019-01-28

Yes, that's right.

From the author of STAR:

Read counting (e.g. htseq-count, featureCounts or STAR --quantMode GeneCounts) simply counts the number of uniquely mapped reads that overlap exons of each gene.
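
To make that concrete for the GTF in the question, here is a minimal sketch of this kind of gene-level counting. It is not STAR's actual implementation: the read coordinates are made up, strand is ignored, and the function name count_reads_per_gene is only illustrative. A read is counted once for a gene if it overlaps any exon of that gene and no exon of another gene.

```python
from collections import defaultdict

# Exon rows taken from the GTF in the question: (gene_id, start, end),
# 1-based inclusive coordinates on ctro_c_1.
EXONS = [
    ("ctro_CGOB_00001", 27749, 28666),
    ("ctro_CGOB_00002", 770839, 771455),
    ("ctro_CGOB_00002", 771521, 771554),
]

# Hypothetical uniquely mapped reads as (start, end); these positions are
# invented for illustration and do not come from real data.
READS = [
    (27800, 27899),    # inside the single exon of ctro_CGOB_00001
    (770900, 770999),  # inside the first exon of ctro_CGOB_00002
    (771400, 771499),  # overlaps the end of the first exon of ctro_CGOB_00002
    (771530, 771554),  # inside the second exon of ctro_CGOB_00002
    (500000, 500099),  # overlaps no exon, so it is not counted
]


def count_reads_per_gene(exons, reads):
    """Count each read once for the single gene whose exons it overlaps."""
    counts = defaultdict(int)
    for r_start, r_end in reads:
        # Collect every gene that has at least one exon overlapping the read.
        hit_genes = {
            gene_id
            for gene_id, e_start, e_end in exons
            if r_start <= e_end and e_start <= r_end
        }
        # Reads hitting no gene, or more than one gene, are left uncounted.
        if len(hit_genes) == 1:
            counts[hit_genes.pop()] += 1
    return dict(counts)


gene_counts = count_reads_per_gene(EXONS, READS)
print(gene_counts)
```

The counts are keyed by gene_id, so reads landing in either exon of ctro_CGOB_00002 end up in the same total.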

In the same thread Lior Pachter also mentions an important caveat with gene counts:

The main problem with htseq or featurecounts is that reads are not disambiguated between isoforms of genes, and when these isoforms have different lengths, the naïve counting methods can be very inaccurate. This is not an alignment issue but a quantification issue. In other words, simple counting is wrong because the total gene "counts" obtained by aggregating all reads that map to a gene locus is not, in general, going to be proportional to the gene abundance.
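
A toy calculation illustrates the point. This is only a sketch under simplifying assumptions: a hypothetical gene with two isoforms of 1 kb and 3 kb, reads sampled uniformly along each transcript, and an arbitrary rate of 5 reads per kb per transcript copy. None of these numbers come from real data.

```python
# Assumed sampling rate: reads per kilobase per transcript copy (arbitrary).
READS_PER_KB_PER_COPY = 5


def expected_gene_reads(isoforms):
    """Sum expected reads over isoforms given as (copies, length_kb) pairs."""
    return sum(
        copies * length_kb * READS_PER_KB_PER_COPY
        for copies, length_kb in isoforms
    )


# Same gene abundance (10 transcript copies) in both conditions,
# but a different isoform is expressed in each.
condition_a = [(10, 1), (0, 3)]  # only the short 1 kb isoform
condition_b = [(0, 1), (10, 3)]  # only the long 3 kb isoform

reads_a = expected_gene_reads(condition_a)
reads_b = expected_gene_reads(condition_b)
print(reads_a, reads_b)
```

In this toy model the gene-level read total triples between conditions even though the number of transcript copies is unchanged, which is the sense in which summed counts are not proportional to gene abundance.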

I would recommend looking at RSEM, which is a pretty popular quantifier. It provides an "expected count", which I believe normalizes for the portion of the gene mapped, and it also provides FPKM and TPM in addition to counts. (See this thread for more info on expected count vs. raw count.) It supports STAR directly too.