How to deal with MSTRG tag without relevant gene name?
7.0 years ago
stcatpang ▴ 60

Hi, I used the HISAT2-StringTie pipeline to process RNA-seq data and got results with MSTRG tags. Some of them have a gene name, which is convenient for functional annotation afterwards. But about 1/3 of my rows have only the MSTRG tag, like this:

chr6    StringTie   transcript  72101340    72101890    1000    -   .   gene_id "MSTRG.58117"; transcript_id "MSTRG.58117.1";
chr6    StringTie   exon    72101340    72101890    1000    -   .   gene_id "MSTRG.58117"; transcript_id "MSTRG.58117.1"; exon_number "1";

Are there any suggestions on how to deal with them?

Thanks! Aoi

rna-seq • 20k views
7.0 years ago

MSTRG IDs are the default names that stringtie assigns when merging transcript GTFs. The naming convention for MSTRG is explained here by the StringTie devs. According to the manual, you can change the default prefix of the transcript names with the -l option of stringtie --merge. Was a reference GTF provided while merging?

-l <label>  name prefix for output transcripts (default: MSTRG)

However, these MSTRG tags are arbitrary identifiers, so on their own they are not useful for comparing results across samples.

Copied from the developers' suggestion:

"you cannot rely on MSTRG.gene# identifiers but instead I'd suggest converting those gene IDs into locations on the genome (or some common reference annotation gene IDs/symbols, though such will not be available for "novel" genes)."

TL;DR: By default, stringtie uses the MSTRG prefix if no name is given.
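
To illustrate the suggestion of converting gene IDs into genome locations, here is a minimal Python sketch. It assumes a standard tab-delimited GTF (the records quoted in the question appear with spaces only because of how the page renders them), and the function name `gene_locations` is made up for this example. It is not part of StringTie.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_locations(gtf_lines):
    """Map each gene_id to 'chrom:start-end(strand)', spanning all of its transcripts.

    Hypothetical helper for illustration; assumes tab-delimited GTF records.
    """
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only transcript records are needed to get the gene span.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        gene_id = dict(ATTR_RE.findall(fields[8])).get("gene_id")
        if gene_id is None:
            continue
        if gene_id in spans:
            # Extend the span to cover every transcript of the gene.
            c, s, e, st = spans[gene_id]
            spans[gene_id] = (c, min(s, start), max(e, end), st)
        else:
            spans[gene_id] = (chrom, start, end, strand)
    return {g: f"{c}:{s}-{e}({st})" for g, (c, s, e, st) in spans.items()}


# The transcript record from the question, with tabs restored.
example = [
    'chr6\tStringTie\ttranscript\t72101340\t72101890\t1000\t-\t.\t'
    'gene_id "MSTRG.58117"; transcript_id "MSTRG.58117.1";'
]
print(gene_locations(example))  # {'MSTRG.58117': 'chr6:72101340-72101890(-)'}
```

The resulting location strings can then be used as row labels in place of the MSTRG identifiers.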


I really appreciate your reply. I used the GTF file from Ensembl. The transcripts listed above have location information but no reference annotation gene IDs. So is it appropriate to drop them and keep only those with gene symbols for further analysis? Thanks!


It depends on the end goal of the study. If you are interested only in standard transcripts/genes (i.e., Ensembl, all or targeted), it is okay to exclude MSTRG transcripts/genes from downstream analysis. But do not throw those genes/transcripts away. Try to analyze these coordinates with care: they might be partial and/or novel transcripts/genes, or they may be annotated in other databases.
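
One way to set the unnamed transcripts aside without discarding them is sketched below in Python. It assumes the merged GTF is tab-delimited and that transcripts matched to the reference carry a `gene_name` or `ref_gene_id` attribute; check your own file, since that depends on how the merge was run. The function name `split_transcripts` is hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def split_transcripts(gtf_lines, name_keys=("gene_name", "ref_gene_id")):
    """Split transcript records into (annotated, novel) lists.

    Each entry is (transcript_id, chrom, start, end, strand).
    The attribute names in name_keys are an assumption about the merged GTF.
    """
    annotated, novel = [], []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        record = (attrs.get("transcript_id"), fields[0],
                  int(fields[3]), int(fields[4]), fields[6])
        # A transcript counts as annotated if any reference-derived key is present.
        if any(key in attrs for key in name_keys):
            annotated.append(record)
        else:
            novel.append(record)
    return annotated, novel
```

The `novel` list keeps the coordinates, so those regions can still be checked against other databases later.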


Hello everyone, I have the same problem: in my case Cuffdiff gives gene IDs, but the specific gene names are missing. I used the reference.gtf file at every step. I also tried to look up the gene names using their chromosome locus coordinates, but found no results, and I ran BLAST as well. I could not get any information from the databases. Please advise on what steps I should take to find the gene names; I need them for downstream analysis.


Hi, divya. I think you should check whether the reference.gtf matches your data. If there are no gene names for any sequence, one possible reason is that the reference.gtf and the genome used for your bowtie index come from different builds (hg38 and hg19, for example).
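
A cheap sanity check along these lines can be sketched in Python. It assumes you have a FASTA index (.fai, as produced by samtools faidx) for the genome behind your alignment index, and a tab-delimited GTF. The function names are made up for this example. Passing the check does not prove that the builds match; it only catches obvious mismatches such as different chromosome naming ("chr6" vs "6") or coordinates beyond the end of a chromosome.

```python
def read_fai(fai_lines):
    """Read contig lengths from .fai lines (name and length are columns 1 and 2)."""
    lengths = {}
    for line in fai_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            lengths[parts[0]] = int(parts[1])
    return lengths


def check_gtf_against_genome(gtf_lines, contig_lengths):
    """Return (missing_contigs, out_of_range) for a GTF versus the genome.

    missing_contigs: contig names used in the GTF but absent from the genome.
    out_of_range: (contig, feature_end, contig_length) for features past the contig end.
    """
    missing = set()
    out_of_range = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        chrom, end = fields[0], int(fields[4])
        if chrom not in contig_lengths:
            missing.add(chrom)
        elif end > contig_lengths[chrom]:
            out_of_range.append((chrom, end, contig_lengths[chrom]))
    return missing, out_of_range
```

If `missing` contains most of your chromosomes, the naming schemes differ; if `out_of_range` is non-empty, the annotation was probably made for a different assembly.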

4.2 years ago

The missing gene_names from StringTie can originate from 3 different sources:

1) It is a novel transcript in a known gene.
2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap.
3) It is a novel gene, meaning no genomic overlap with any feature in the reference you are using.
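
The three cases can be told apart roughly by counting how many reference genes an unnamed transcript overlaps. The Python sketch below is only an illustration of that idea, not what StringTie or IsoformSwitchAnalyzeR actually do: it ignores strand and exon structure, uses 1-based inclusive coordinates as in GTF, and the function name and gene names are hypothetical.

```python
def classify_transcript(transcript, ref_genes):
    """Classify an unnamed transcript by its overlap with reference genes.

    transcript: (chrom, start, end)
    ref_genes: iterable of (gene_name, chrom, start, end)
    Returns (category, sorted list of overlapping gene names).
    """
    chrom, start, end = transcript
    hits = sorted({
        name
        for name, g_chrom, g_start, g_end in ref_genes
        # Closed intervals overlap when each one starts before the other ends.
        if g_chrom == chrom and start <= g_end and g_start <= end
    })
    if not hits:
        return "novel gene", hits
    if len(hits) == 1:
        return "novel transcript in known gene", hits
    return "novel transcript in gene cluster", hits


# Hypothetical reference genes for illustration only.
reference = [("GENE_A", "chr1", 100, 500), ("GENE_B", "chr1", 450, 900)]
print(classify_transcript(("chr1", 120, 300), reference))
print(classify_transcript(("chr1", 400, 600), reference))
print(classify_transcript(("chr6", 72101340, 72101890), reference))
```
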

From my experience with StringTie data there are typically tens of thousands of missing gene_names, and ~50% of them are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where it is fixable and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences using IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer


Hi, I know this is a very old post. I'm very new to bioinformatics, so please bear with me. I'm trying to create a heatmap from TPM to compare gene counts between samples. I used importRdata() in IsoformSwitchAnalyzeR, and my question is: how do I obtain unnormalized TPM? These are my warnings:

Warning messages:
1: In importRdata(isoformCountMatrix = stringTieQuant$counts, isoformRepExpression = stringTieQuant$abundance, : Using row.names as 'isoform_id' for 'isoformCountMatrix'. If not suitable you must add them manually.
2: In importRdata(isoformCountMatrix = stringTieQuant$counts, isoformRepExpression = stringTieQuant$abundance, : We found 933 (10.58%) unstranded transcripts. These were removed as unstranded transcripts cannot be analysed
3: In importRdata(isoformCountMatrix = stringTieQuant$counts, isoformRepExpression = stringTieQuant$abundance, : No CDS annotation was found in the GTF files meaning ORFs could not be annotated. (But ORFs can still be predicted with the analyzeORF() function)
