Hi there,
I've used the HISAT-StringTie-Ballgown pipeline with the mouse release 91 GTF file from Ensembl, and part of the StringTie output is labelled with MSTRG IDs.
The number of these MSTRG entries can be high, and I'm not sure they are real: that seems like too many new transcripts, or they may be unassembled transcripts.
Is there a different way to do this that avoids what is probably a technical issue? Using a different assembler?
Or a different gtf file?
This issue has been raised before, but no good answer was provided:
Gene names in Ballgown differential expression analysis
How to deal with MSTRG tag without relevant gene name?
Converting MSTRG from stringtie with gene name
https://stackoverflow.com/questions/47621574/search-and-replace-between-two-files-post2
Thank you.
I guess you should be using better annotations (gtf file).
I used this one: ftp://ftp.ensembl.org/pub/release-91/gtf/mus_musculus/Mus_musculus.GRCm38.91.gtf.gz
Are you suggesting that this is not good enough?
I am not sure whether the GTF you mentioned above includes the full transcriptome annotation. If you would like to restrict the output to the transcripts annotated in the GTF you supplied with -G, use the -e option (from http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual) when running stringtie. Try the -C option as well. It writes out the reference transcripts that are fully covered by reads, which helps to tell well-supported annotated transcripts apart from novel (MSTRG) ones, if there are any.
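To make the suggestion concrete, here is a minimal sketch of how such a run could be assembled from Python. The flags (-e, -G, -o, -p, -C) are the ones listed in the StringTie manual; the helper name `build_stringtie_cmd` and all file names are made up for illustration. One thing worth checking: if -G points at the merged GTF from `stringtie --merge` rather than at the Ensembl file, MSTRG gene IDs will probably still show up, because the merge step assigns them.

```python
import shlex

def build_stringtie_cmd(bam, ref_gtf, out_gtf, cov_refs=None, threads=1):
    """Return the argument list for a StringTie run restricted to -G transcripts.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = [
        "stringtie", bam,
        "-e",                 # only estimate transcripts given with -G
        "-G", ref_gtf,        # reference annotation
        "-o", out_gtf,        # output GTF
        "-p", str(threads),   # number of threads
    ]
    if cov_refs is not None:
        # -C writes the reference transcripts fully covered by reads
        cmd += ["-C", cov_refs]
    return cmd

# File names below are placeholders, not files from this thread.
cmd = build_stringtie_cmd(
    "sample1.bam",
    "Mus_musculus.GRCm38.91.gtf",
    "sample1.gtf",
    cov_refs="sample1_cov_refs.gtf",
    threads=4,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.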
I did use this option, as described in the Pertea 2016 HISAT/StringTie/Ballgown paper. The MSTRG entries are still there.
These MSTRG IDs are a nightmare. I've also tried the Python script here:
https://gist.github.com/gpertea/b83f1b32435e166afa92a2d388527f4b
but in the end without success.
Any update on this issue?
Thank you
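For the renaming problem discussed in this thread, one possible workaround is to build a lookup table from the merged GTF. This is only a sketch and not the gist linked above. It assumes that transcript lines in the `stringtie --merge` output carry `gene_name` and/or `ref_gene_id` attributes when they match a reference transcript; please check that against your own file. The function name and the example lines (including the IDs) are invented.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def mstrg_to_gene_names(gtf_lines):
    """Map each MSTRG gene_id to the reference names seen on its transcripts."""
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are inspected
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id", "")
        # Prefer the gene symbol, fall back to the reference gene ID
        name = attrs.get("gene_name") or attrs.get("ref_gene_id")
        if gene_id.startswith("MSTRG") and name:
            mapping[gene_id].add(name)
    return {gene: sorted(names) for gene, names in mapping.items()}

# Invented example lines; real input would come from open("stringtie_merged.gtf").
example = [
    '1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "REF_TX_1"; gene_name "GeneA"; ref_gene_id "REF_GENE_1";',
    '1\tStringTie\ttranscript\t2000\t2900\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]
names = mstrg_to_gene_names(example)
print(names)
```

MSTRG IDs with no entry in the table have no matching reference transcript, so they are candidates for genuinely novel loci. An ID that maps to several names means StringTie grouped more than one annotated gene into a single locus.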