Question

Gene IDs in StringTie output

1

6.4 years ago

blooming.daisy333 ▴ 110

Dear All,

I'm using StringTie to assemble transcripts, supplying my genome annotation file with the -G flag. However, StringTie assigns its own IDs, such as MSTRG.1 and MSTRG.2 for genes and MSTRG.1.1 and MSTRG.2.1 for transcripts, despite the annotation file being provided, and I am unable to get the same gene IDs as in the genome annotation file. I need those IDs for subsequent functional analysis. Can anyone suggest how to get the same IDs in the StringTie output as in the genome annotation file?

Thanks in anticipation.

rna-seq • 5.7k views

ADD COMMENT • link updated 4.2 years ago by Kristoffer Vitting-Seerup ★ 4.1k • written 6.4 years ago by blooming.daisy333 ▴ 110

0

Hello blooming.daisy333,

Don't forget to follow up on your threads. Please give some feedback to the answers/comments on your last questions:

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

0

Dear finswimmer, I really appreciate your kind and quick help, and I'm extremely sorry for the delay, but I'm still working on those questions; they are interconnected in my analysis. I will surely give you feedback as I have before. Please give me some time. Also, for some posts that solved my problem, I could not see any upvote/accept button to click on. That's why they are not marked.

ADD REPLY • link 6.4 years ago by blooming.daisy333 ▴ 110

0

One way is to intersect the coordinates of each MSTRG gene with the known transcriptome GTF, @blooming.daisy333.
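As a rough illustration of that intersection idea, here is a minimal Python sketch. It assumes both files are standard 9-column GTFs with a `gene_id` attribute; the function names `gene_spans` and `overlapping_reference_genes` are made up for this example and are not part of StringTie. It collapses each gene to a single span and reports which reference genes overlap each MSTRG gene on a compatible strand.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_spans(gtf_lines):
    """Collapse GTF records into one (chrom, strand, start, end) span per gene_id."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        gene_id = dict(ATTR_RE.findall(fields[8])).get("gene_id")
        if gene_id is None:
            continue
        if gene_id in spans:
            # Extend the existing span so it covers every record of the gene.
            _, _, old_start, old_end = spans[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene_id] = (chrom, strand, start, end)
    return spans

def overlapping_reference_genes(assembled, reference):
    """Map each assembled gene to the sorted reference gene_ids it overlaps."""
    hits = {}
    for mstrg_id, (chrom, strand, start, end) in assembled.items():
        matches = []
        for ref_id, (r_chrom, r_strand, r_start, r_end) in reference.items():
            # An unstranded feature (".") is treated as compatible with either strand.
            same_strand = strand == r_strand or "." in (strand, r_strand)
            if chrom == r_chrom and same_strand and start <= r_end and r_start <= end:
                matches.append(ref_id)
        hits[mstrg_id] = sorted(matches)
    return hits
```

Usage would look like `overlapping_reference_genes(gene_spans(open("stringtie.gtf")), gene_spans(open("reference.gtf")))`, where both file names are placeholders. The all-against-all loop is only meant to show the logic; for a whole genome, a dedicated tool such as `bedtools intersect` or an interval tree is likely a better fit. Note that whole-gene spans also overlap through introns, so a hit is a candidate to check, not a confirmed match.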

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Hello, I am having the same issue: I am getting MSTRG IDs instead of gene names. Were you able to solve this? If so, could you please let me know how you did it?

Many thanks

ADD REPLY • link 6.2 years ago by arshad1292 ▴ 110

0

Please don't ask questions in the space reserved for answers; use the ADD COMMENT button instead.

ADD REPLY • link 6.2 years ago by h.mon 35k

0

Sorry about that. I am new and didn't realize this.

ADD REPLY • link 6.2 years ago by arshad1292 ▴ 110

score 0 · Answer 1 · 2020-09-21

The missing gene_names from StringTie can originate from 3 different sources:

1) It is a novel transcript in a known gene.
2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap.
3) It is a novel gene, meaning no genomic overlap with any feature in the reference you are using.
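To get a first impression of how the MSTRG genes in a given file split across these cases, something like the following Python sketch may help. It is not how IsoformSwitchAnalyzeR works internally. It assumes that the StringTie GTF carries a `ref_gene_id` attribute on transcripts that matched a reference transcript (which StringTie normally writes when run with -G); the function name and category labels are invented for this example.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def classify_stringtie_genes(gtf_lines):
    """Label each gene_id by how many distinct ref_gene_id values its transcripts carry."""
    ref_ids = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only transcript records are inspected; exon records are skipped.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Register the gene even when this transcript has no reference match.
        refs = ref_ids.setdefault(gene_id, set())
        if "ref_gene_id" in attrs:
            refs.add(attrs["ref_gene_id"])

    result = {}
    for gene_id, refs in ref_ids.items():
        if len(refs) == 1:
            label = "single_known_gene"      # candidate for source 1
        elif len(refs) > 1:
            label = "multi_gene_cluster"     # candidate for source 2
        else:
            label = "no_reference_match"     # source 3, or source 1 with no matched transcript
        result[gene_id] = (label, sorted(refs))
    return result
```

A gene labelled `no_reference_match` is not necessarily a novel gene: it only means that none of its transcripts matched a reference transcript, so a coordinate overlap check against the reference is still needed to separate true novel loci from novel isoforms of known genes.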

From my experience with StringTie data, there are typically tens of thousands of missing gene_names, and ~50% of them are due to problems 1 and 2. To solve this, I have just released an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where it is fixable and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer