Hi All, I am currently using StringTie to identify novel transcripts (MSTRG ID transcripts) from RNAseq data in mouse. I have the gene.abund.tab output files from StringTie, which give the following info: Gene ID, Gene Name, Reference, Strand, Start, End, Coverage, FPKM, TPM. So from this file, I have the genomic loci that the assembled transcripts map to, but I don't know where to find the actual transcript sequence.
Also, an additional thing that is confusing me is that when I search the genomic location in the mouse genome using IGV, I'm finding that MSTRG transcripts are mapping to the full length known genes. How do I tell what is novel about them compared to the known annotated transcript? I'm guessing that if I know the actual sequence I can find this out.
-Jen
Hi Jen! It would be helpful if you posted the commands you used for building the transcriptome. I'm assuming you assembled your samples separately and then merged them (using the reference annotation) with stringtie --merge. You are correct in assuming MSTRG are novel transcripts. However, that does not mean that they are novel loci. It is unclear to me whether you are searching for novel genes in the transcriptome or you just want novel transcripts that might be in already known genes.
If you want to get the sequences, you can run gffread on your GTF output file.
Sorry, I should've been much more specific. I mapped the reads using STAR and then ran the following:
I did the above for all samples, then ran the following:
I am currently looking at those MSTRG IDs that have no associated Gene ID, and thus should be novel transcripts. From what I understand (I could be wrong), they could have no Gene ID because 1) they are new splice variants of a known gene (I would think these would have a Gene ID, so this may not be a valid reason), 2) they are a combination of 2 or more known loci, or 3) they map to a completely new locus.
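As a first pass at separating these cases, here is a rough sketch that lists every MSTRG transcript in a merged GTF and flags whether its locus also contains a reference transcript. It assumes the usual `stringtie --merge` layout, where `gene_id` and `transcript_id` come first in the attribute column and reference-derived transcripts carry a `ref_gene_id` attribute. The function name and the file name in the usage line are placeholders, not something from this thread.

```shell
# Sketch only: list MSTRG transcripts from a merged GTF and report whether
# any transcript in the same MSTRG gene carries a ref_gene_id attribute.
# Assumption: attribute names are those written by `stringtie --merge`.
list_mstrg_transcripts() {
    # $1 = merged GTF
    awk -F'\t' '
        $3 == "transcript" {
            match($9, /gene_id "[^"]+"/)
            gid = substr($9, RSTART + 9, RLENGTH - 10)
            match($9, /transcript_id "[^"]+"/)
            tid = substr($9, RSTART + 15, RLENGTH - 16)
            # Remember loci that contain at least one reference transcript
            if ($9 ~ /ref_gene_id "/) has_ref[gid] = 1
            # Keep assembled (MSTRG) transcripts in file order
            if (tid ~ /^MSTRG/) { n++; t[n] = tid; g[n] = gid }
        }
        END {
            for (i = 1; i <= n; i++)
                print t[i] "\t" ((g[i] in has_ref) ? "known_locus" : "no_reference_gene")
        }' "$1"
}

# Usage (placeholder file name):
#   list_mstrg_transcripts merged.gtf
```

Transcripts flagged `known_locus` are likely new isoforms within an annotated gene, while `no_reference_gene` marks candidates for new loci. Fusions of several known loci would need a closer look, since this sketch does not separate them.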
I went and looked into one of my old GTF files, and what I gathered was that your locus will only have the "gencode.vM25.primary_assembly.annotation.gtf" gene ID if that locus does not contain new transcripts from your data.
One easy way to check for new loci would be running gffcompare:
This will output a copy of your_merged_transcriptome.gtf with an additional attribute (gene_name) in the last column wherever there is an intersecting reference gene. Hope this helps.
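For anyone following along, a minimal sketch of what such a run could look like, not necessarily the exact command used above. `gffcompare -r <reference> -o <prefix> <query>` writes `<prefix>.annotated.gtf`, in which each transcript gets a `class_code` attribute: `u` marks intergenic (unknown) transcripts, `j` marks multi-exon transcripts sharing at least one junction with a reference transcript, and `=` marks exact intron-chain matches. The function names, the `cmp` prefix, and the file names in the usage lines are placeholders.

```shell
# Sketch only: compare a merged assembly with the reference annotation.
run_gffcompare() {
    # $1 = reference GTF, $2 = merged StringTie GTF, $3 = output prefix
    gffcompare -r "$1" -o "$3" "$2"
}

# Print the IDs of transcripts with a given class code from <prefix>.annotated.gtf.
transcripts_with_class() {
    # $1 = annotated GTF, $2 = class code (for example u, j or =)
    awk -F'\t' -v code="$2" '
        $3 == "transcript" && index($9, "class_code \"" code "\"") {
            match($9, /transcript_id "[^"]+"/)
            print substr($9, RSTART + 15, RLENGTH - 16)
        }' "$1"
}

# Usage (placeholder file names):
#   run_gffcompare gencode.vM25.primary_assembly.annotation.gtf your_merged_transcriptome.gtf cmp
#   transcripts_with_class cmp.annotated.gtf u   # candidate new loci
#   transcripts_with_class cmp.annotated.gtf j   # candidate new isoforms of known genes
```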
I'm a little confused about what files to use as input to run the gffread on my output .gtf file. I'm looking at the GFF utilities page at the code to "Extract transcripts sequences". Not sure if this is even the correct code to run....
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
If it is, which files would I use? It seems that genome.fa would probably be the mouse genome file I used for the alignment, but unsure of the rest.
Your command looks fine to me. You have to use the same mouse genome that was used for the alignment and transcriptome assembly. What that command does is basically look at the coordinates in your GTF file and extract the corresponding bases from the genome file.
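Once gffread has written the FASTA, a quick sanity check is to count the records and pull out a single transcript by ID. The sketch below assumes the first word of each FASTA header is the transcript ID, which is how `gffread -w` normally writes it. The function names and the example ID are made up for illustration.

```shell
# Sketch only: count the sequences in the FASTA written by `gffread -w`.
count_fasta_records() {
    # $1 = FASTA file
    grep -c '^>' "$1"
}

# Print the FASTA record whose header ID matches exactly.
get_fasta_record() {
    # $1 = FASTA file, $2 = transcript ID
    awk -v id="$2" '
        /^>/ { split(substr($0, 2), h, " "); keep = (h[1] == id) }
        keep' "$1"
}

# Usage (MSTRG.1.2 is a made-up ID):
#   count_fasta_records transcripts.fa
#   get_fasta_record transcripts.fa MSTRG.1.2
```

The record count should equal the number of transcripts in the GTF that was passed to gffread.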
-w transcripts.fa <- is this the output file from gffread containing the transcript sequences I'm looking for?
-g /path/to/genome.fa = I know this is the genome file used for alignment and assembly
transcripts.gtf <- is this the output .gtf from StringTie?
Sorry if this is obvious....
Yes, you are correct in that assumption.