Hi, I have been trying to run an RNA-seq analysis on some paired-end data. I aligned with HISAT2, then ran StringTie, StringTie merge, and then StringTie again. For the analysis I am using:
grch38_tran.tar.gz - https://ccb.jhu.edu/software/hisat2/index.shtml
Homo_sapiens.GRCh38.84.gtf - ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
My issue is that, despite running StringTie again after the merge to remove some of the MSTRG IDs, I am still getting a large number of them in my data set. More alarmingly, the MSTRG genes that do exist have the highest counts in my sample.HISAT2-2.1.0.aligned.sorted.StringTie.1.3.3.gene_count_matrix.
Number of each: 24801 mstrg / 33970 ensg
Fraction of total: .42199 mstrg / .57800 ensg
Sum of counts for each: 78615368 mstrg / 778402 ensg
Fraction of counts: .99019 mstrg / .00980 ensg
So while MSTRG IDs make up only ~42% of the gene IDs, they account for 99% of what has been counted. I have the minimum coverage set to 5, and I have -G set, as well as -e to restrict to the given reference.
Is there any way to further optimize this? Have I missed an important step?
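For anyone wanting to reproduce the tally above, here is a minimal Python sketch of how the per-prefix numbers could be computed. It assumes the gene count matrix is a comma-separated file with a header row, the gene ID in the first column, and integer counts in the remaining columns. The function names and the toy data are hypothetical, not taken from the original post.

```python
import csv
import io
from collections import defaultdict


def summarize_by_prefix(csv_text, prefixes=("MSTRG", "ENSG")):
    """Count gene IDs and sum read counts per ID prefix.

    Assumes a header row, gene ID in column 1, integer counts after it.
    IDs matching none of the prefixes are grouped under "other".
    """
    n_genes = defaultdict(int)
    totals = defaultdict(int)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        gene_id = row[0]
        label = next((p for p in prefixes if gene_id.startswith(p)), "other")
        n_genes[label] += 1
        totals[label] += sum(int(x) for x in row[1:])
    return dict(n_genes), dict(totals)


def fractions(values):
    """Convert a {label: number} mapping into {label: share of the total}."""
    total = sum(values.values())
    return {k: v / total for k, v in values.items()} if total else {}


# Toy data for illustration only (not the poster's matrix).
demo = (
    "gene_id,sampleA,sampleB\n"
    "MSTRG.1,100,50\n"
    "ENSG00000000003,10,5\n"
    "MSTRG.2|TSPAN6,20,15\n"
)
n_genes, totals = summarize_by_prefix(demo)
print("genes per prefix:", n_genes, fractions(n_genes))
print("counts per prefix:", totals, fractions(totals))
```

To run it on a real file, read the file contents into a string first (for example with `open(path).read()`) and pass that to `summarize_by_prefix`.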
Do you need to run stringtie? Do you expect new transcripts, and does your project require dealing with them? Why don't you quantify against the reference transcriptome/GTF with tools like featureCounts, or use transcript quantifiers like salmon or kallisto?